1. Introduction
Just upload a picture and an audio clip, and the subject in the picture can speak or sing. The Wan2.2 S2V model can generate continuous video of over 15 seconds in a single run. I optimized the workflow and the results are excellent. The workflow I built is in the attachment; everyone is welcome to download it and try it out.
2. Model Links
You can find the models in our repo: diffusion_models, audio_encoders, vae, and text_encoders.
ComfyUI/
├───📂 models/
│   ├───📂 diffusion_models/
│   │   ├─── wan2.2_s2v_14B_fp8_scaled.safetensors
│   │   └─── wan2.2_s2v_14B_bf16.safetensors
│   ├───📂 text_encoders/
│   │   └─── umt5_xxl_fp8_e4m3fn_scaled.safetensors
│   ├───📂 audio_encoders/   # Create this folder if it doesn't exist
│   │   └─── wav2vec2_large_english_fp16.safetensors
│   └───📂 vae/
│       └─── wan_2.1_vae.safetensors
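
If you are not sure the files landed in the right folders, a quick check like the one below can catch a misplaced model before a failed run. This is a minimal sketch, assuming ComfyUI is installed in the current directory and you downloaded the fp8 diffusion model; adjust the base path and file names for your setup.

import os

# Assumed install location; change this to wherever your ComfyUI lives.
BASE = "ComfyUI/models"

expected = [
    "diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors",  # or the bf16 variant
    "text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "audio_encoders/wav2vec2_large_english_fp16.safetensors",
    "vae/wan_2.1_vae.safetensors",
]

for rel in expected:
    path = os.path.join(BASE, rel)
    print(("OK     " if os.path.isfile(path) else "MISSING"), path)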
3. How to Use
Load Image: Upload your reference image.
LoadAudio: Upload your own audio clip.
Set the duration: The length of the generated video, in seconds. If this node is bypassed, the video length follows the duration of the entire audio clip.
Set the resolution: The default is 400. This value is a total pixel budget in thousands of pixels, so 400 means about 400,000 pixels, which corresponds to 480p. To generate a 720p video, change it to 920 (1280 × 720 = 921,600 pixels); see the first sketch after these steps.
Set the frame rate: The default is 16 frames per second.
Set the chunk length: The default is 77 frames. Together with the duration and frame rate, this determines how many chunks are generated; see the second sketch after these steps.
Enter your prompt.
Press Ctrl+Enter or click the Run button to execute the workflow.
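
To make the resolution value concrete, here is a small arithmetic sketch. It assumes the node scales the reference image to roughly value × 1,000 total pixels while preserving its aspect ratio and snapping each side to a multiple of 16; the exact rounding inside the node may differ.

import math

def dims_from_total(total_kpixels: int, aspect: float, multiple: int = 16):
    # total_kpixels: the node's resolution value (thousands of pixels).
    # aspect: width / height of the reference image.
    total = total_kpixels * 1000
    width = math.sqrt(total * aspect)
    height = width / aspect

    def snap(v: float) -> int:
        # Round each side to a multiple of 16, a common latent-size constraint.
        return max(multiple, round(v / multiple) * multiple)

    return snap(width), snap(height)

print(dims_from_total(400, 16 / 9))  # (848, 480)  -> roughly 480p
print(dims_from_total(920, 16 / 9))  # (1280, 720) -> exactly 720p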
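
Duration, frame rate, and chunk length interact in a simple way: total frames = duration × frame rate, and the video is generated chunk by chunk. A quick back-of-the-envelope with the defaults above, assuming the chunk length is counted in frames:

import math

fps = 16        # default frame rate
chunk = 77      # default chunk length, in frames (assumed)
duration = 15   # target video length in seconds

total_frames = duration * fps             # 15 * 16 = 240 frames
chunks = math.ceil(total_frames / chunk)  # ceil(240 / 77) = 4 chunks
print(f"{total_frames} frames generated in {chunks} chunks")

At these defaults, each chunk covers about 4.8 seconds of video (77 / 16).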
Follow me! If you have any questions, you can leave me a message:
bilibili @AI_小兵哥
YouTube @AIXBG_fp8


