
Advanced audio generation ultra-long video workflow


1. Introduction

Just upload a picture and an audio clip, and she can speak or sing. The Wan2.2 S2V model can generate continuous videos of over 15 seconds in a single run. I optimized the workflow and the results are excellent. The workflow I built is in the attachment; everyone is welcome to download and try it.

2. Model Links

You can find the models in our repo: diffusion_models, audio_encoders, vae, text_encoders.

ComfyUI/
├───📂 models/
│   ├───📂 diffusion_models/
│   │   ├─── wan2.2_s2v_14B_fp8_scaled.safetensors
│   │   └─── wan2.2_s2v_14B_bf16.safetensors
│   ├───📂 text_encoders/
│   │   └─── umt5_xxl_fp8_e4m3fn_scaled.safetensors
│   ├───📂 audio_encoders/   # Create one if you can't find this folder
│   │   └─── wav2vec2_large_english_fp16.safetensors
│   └───📂 vae/
│       └─── wan_2.1_vae.safetensors
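
If you prefer to script the downloads rather than fetch each file by hand, here is a minimal sketch using huggingface_hub that drops every file into the matching ComfyUI folder from the tree above. The REPO_ID value is a placeholder (I'm assuming the files sit in a single Hugging Face repository); replace it with the repository linked on the model page. Downloading through a browser and moving the files manually works just as well — ComfyUI only cares that they end up in those folders.

```python
# Minimal download sketch. REPO_ID is a placeholder, not the actual repository name.
from huggingface_hub import hf_hub_download

REPO_ID = "<your-model-repo>"  # placeholder: set this to the repo that hosts the files

# Map each ComfyUI model subfolder to the file it should contain.
files = {
    "models/diffusion_models": "wan2.2_s2v_14B_fp8_scaled.safetensors",
    "models/text_encoders": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "models/audio_encoders": "wav2vec2_large_english_fp16.safetensors",
    "models/vae": "wan_2.1_vae.safetensors",
}

for subdir, filename in files.items():
    # Download each file directly into the matching ComfyUI model folder.
    hf_hub_download(
        repo_id=REPO_ID,
        filename=filename,
        local_dir=f"ComfyUI/{subdir}",
    )
```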

3. Usage

  1. Load Image: Upload a reference image.

  2. LoadAudio: Upload your own audio clip.

  3. Set the Duration: The length of the generated video, in seconds.

    If this node is bypassed, the video is generated to match the duration of the entire audio clip.

  4. Set the Resolution: The default is 400, which is a total pixel budget in thousands (about 400,000 pixels, roughly 480p).

    To generate a 720p video, change it to 920. See the sketch after this list for how this value maps to width and height.

  5. Set the Frame Rate: The default is 16 frames per second.

  6. Set the Chunk Length: The default is 77 frames per chunk.

  7. Enter a prompt.

  8. Use Ctrl-Enter or click the Run button to execute the workflow.
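
To make the numbers in steps 3–6 concrete, here is a small Python sketch (my own helper, not a node in the workflow) that converts the total-pixel setting into a width/height pair and estimates how many chunks a given duration needs. The 16:9 default aspect ratio and the snapping to multiples of 16 are my assumptions; the function names are illustrative only.

```python
import math

def dims_from_pixel_budget(budget_k: int, aspect_w: int = 16, aspect_h: int = 9) -> tuple[int, int]:
    """Turn a total-pixel setting (in thousands) into a width/height pair for a given aspect ratio."""
    total_pixels = budget_k * 1000
    # height = sqrt(total * h/w); width then follows from the aspect ratio
    height = math.sqrt(total_pixels * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h
    # snap to multiples of 16, which video models typically expect (assumption)
    return round(width / 16) * 16, round(height / 16) * 16

def chunk_count(duration_s: float, fps: int = 16, chunk_len: int = 77) -> int:
    """Number of chunks needed to cover the requested duration."""
    total_frames = math.ceil(duration_s * fps)
    return math.ceil(total_frames / chunk_len)

print(dims_from_pixel_budget(400))            # -> (848, 480), roughly 480p
print(dims_from_pixel_budget(920))            # -> (1280, 720), i.e. 720p
print(chunk_count(15, fps=16, chunk_len=77))  # 15 s * 16 fps = 240 frames -> 4 chunks
```

So with the defaults (16 fps, chunk length 77), a 15-second video is 240 frames and is generated in 4 chunks, which is how the workflow reaches clips longer than 15 seconds in one run.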

Follow me. If you have any questions, you can leave me a message:

bilibili @AI_小兵哥
YouTube @AIXBG_fp8
