1. Introduction
Just upload a picture and an audio clip, and the subject in the picture can speak or sing. The Wan2.2 S2V model can generate continuous video of over 15 seconds in a single run. I optimized the workflow and the results are excellent. The workflow I built is in the attachment; everyone is welcome to download it and try it out.
2. Model Links
You can find the models in our repo: diffusion_models, audio_encoders, vae, and text_encoders.
ComfyUI/
├───📂 models/
│   ├───📂 diffusion_models/
│   │   ├─── wan2.2_s2v_14B_fp8_scaled.safetensors
│   │   └─── wan2.2_s2v_14B_bf16.safetensors
│   ├───📂 text_encoders/
│   │   └─── umt5_xxl_fp8_e4m3fn_scaled.safetensors
│   ├───📂 audio_encoders/   # Create this folder if it doesn't exist
│   │   └─── wav2vec2_large_english_fp16.safetensors
│   └───📂 vae/
│       └─── wan_2.1_vae.safetensors
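
If you are not sure the files landed in the right folders, a quick check like the one below can catch a misplaced model before a failed run. This is a minimal sketch, assuming ComfyUI is installed in the current directory and you downloaded the fp8 diffusion model; adjust the base path and file names for your setup.

import os

# Assumed install location; change this to wherever your ComfyUI lives.
BASE = "ComfyUI/models"

expected = [
    "diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors",  # or the bf16 variant
    "text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "audio_encoders/wav2vec2_large_english_fp16.safetensors",
    "vae/wan_2.1_vae.safetensors",
]

for rel in expected:
    path = os.path.join(BASE, rel)
    print(("OK     " if os.path.isfile(path) else "MISSING"), path)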
3. How to Use
Load Image: Upload your reference image.
LoadAudio: Upload your own audio clip.
Set the duration: The length of the generated video, in seconds. If this node is bypassed, the video length follows the duration of the entire audio clip.
Set the resolution: The default is 400. This value is a total pixel budget in thousands of pixels, so 400 means about 400,000 pixels, which corresponds to 480p. To generate a 720p video, change it to 920 (1280 × 720 = 921,600 pixels); see the first sketch after these steps.
Set the frame rate: The default is 16 frames per second.
Set the chunk length: The default is 77 frames. Together with the duration and frame rate, this determines how many chunks are generated; see the second sketch after these steps.
Enter your prompt.
Press Ctrl+Enter or click the Run button to execute the workflow.
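
To make the resolution value concrete, here is a small arithmetic sketch. It assumes the node scales the reference image to roughly value × 1,000 total pixels while preserving its aspect ratio and snapping each side to a multiple of 16; the exact rounding inside the node may differ.

import math

def dims_from_total(total_kpixels: int, aspect: float, multiple: int = 16):
    # total_kpixels: the node's resolution value (thousands of pixels).
    # aspect: width / height of the reference image.
    total = total_kpixels * 1000
    width = math.sqrt(total * aspect)
    height = width / aspect

    def snap(v: float) -> int:
        # Round each side to a multiple of 16, a common latent-size constraint.
        return max(multiple, round(v / multiple) * multiple)

    return snap(width), snap(height)

print(dims_from_total(400, 16 / 9))  # (848, 480)  -> roughly 480p
print(dims_from_total(920, 16 / 9))  # (1280, 720) -> exactly 720p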
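
Duration, frame rate, and chunk length interact in a simple way: total frames = duration × frame rate, and the video is generated chunk by chunk. A quick back-of-the-envelope with the defaults above, assuming the chunk length is counted in frames:

import math

fps = 16        # default frame rate
chunk = 77      # default chunk length, in frames (assumed)
duration = 15   # target video length in seconds

total_frames = duration * fps             # 15 * 16 = 240 frames
chunks = math.ceil(total_frames / chunk)  # ceil(240 / 77) = 4 chunks
print(f"{total_frames} frames generated in {chunks} chunks")

At these defaults, each chunk covers about 4.8 seconds of video (77 / 16).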
Follow me! If you have any questions, you can leave me a message:
bilibili @AI_小兵哥
YouTube @AIXBG_fp8


