Updated: Aug 29, 2025
Introduction
Welcome to a powerhouse ComfyUI workflow designed to unlock the incredible potential of the Wan 2.2 14B Sound-to-Video (S2V) model. This isn't just a simple implementation; it's a comprehensive suite that addresses two critical needs for AI video generation: accessibility and speed.
This all-in-one workflow provides two parallel generation pipelines:
⚡ Lightning Fast (4-Step) Pipeline: Utilizes a specialized LoRA to generate videos in a fraction of the time, perfect for rapid prototyping and iteration.
🎨 High Fidelity (20-Step) Pipeline: The classic, high-quality generation process for when you demand the utmost visual fidelity from your outputs.
Crucially, both versions are configured to run using GGUF-quantized models, dramatically reducing VRAM requirements and making this massive 14B parameter model accessible to users with consumer-grade hardware.
✨ Key Features & Highlights
Dual Mode Operation: Choose between speed and quality with two self-contained workflows in one JSON file. Easily enable/disable either section.
GGUF Quantization Support: Run the massive Wan 2.2 model without needing a professional GPU. Leverages the LoaderGGUF and ClipLoaderGGUF nodes.
Extended Video Generation: The workflow includes built-in "Video S2V Extend" subgraphs, each of which adds 77 frames. The template is pre-configured with two extenders, resulting in a ~5-second video at 16 FPS. Want a longer video? Simply copy and paste more extender nodes!
Audio-Driven Animation: Faithfully implements the S2V model's core function: animating a reference image in sync with an uploaded audio file (e.g., music, speech).
Smart First-Frame Fix: Includes a clever hack to correct the first frame, which is often "overbaked" by the VAE decoder.
Detailed Documentation: The workflow itself is filled with informative notes and markdown nodes explaining crucial settings like batch size and chunk length.
🧩 How It Works (The Magic Behind the Scenes)
The workflow is logically grouped into clear steps:
Load Models (GGUF): The LoaderGGUF and ClipLoaderGGUF nodes load the quantized UMT5 text encoder and the main UNet model, drastically reducing VRAM load compared to full-precision models.
Upload Inputs: You provide two key ingredients:
ref_image: The starting image you want to animate (e.g., a character portrait).
audio: The sound file that will drive the motion and pacing of the animation.
Encode Prompts & Audio: Your positive and negative prompts are processed, and the audio file is encoded into a format the model understands using the Wav2Vec2 encoder.
Base Generation (WanSoundImageToVideo): The core node takes your image, audio, and prompts to generate the first latent video sequence.
Extend the Video (Video S2V Extend subgraphs): This is where the length comes from. The latent output from the previous step is fed into a sampler (KSampler), alongside the audio context again, to generate the next chunk of frames. These chunks are concatenated together.
Decode & Compile: The final latent representation is decoded into images by the VAE, and the CreateVideo node stitches all the frames together with the original audio to produce your final MP4 file.
⚙️ Instructions & Usage
Prerequisite: Download Models
You must download the following model files and place them in your ComfyUI models directory. The workflow includes handy markdown notes with direct download links.
Essential Models:
umt5-xxl-encoder-q4_k_m.gguf → Place in /models/clip/
Wan2.2-S2V-14B-Q5_0.gguf → Place in /models/unet/ (or /models/diffusion_models/)
wav2vec2_large_english_fp16.safetensors → Place in /models/audio_encoders/
wan_2.1_vae.safetensors → Place in /models/vae/
For the 4-Step Lightning Pipeline:
Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors → Place in /models/loras/
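If you want to sanity-check your setup before loading the workflow, a small script like the one below can confirm that every file is where ComfyUI expects it. It is only a convenience sketch: adjust COMFYUI_DIR to your own install path, and edit the file names or folders if you downloaded different quantizations.

```python
from pathlib import Path

# Adjust this to your own ComfyUI installation directory.
COMFYUI_DIR = Path.home() / "ComfyUI"

# File -> subfolder mapping taken from the list above.
REQUIRED = {
    "umt5-xxl-encoder-q4_k_m.gguf": "models/clip",
    "Wan2.2-S2V-14B-Q5_0.gguf": "models/unet",
    "wav2vec2_large_english_fp16.safetensors": "models/audio_encoders",
    "wan_2.1_vae.safetensors": "models/vae",
    # Only needed for the 4-step Lightning pipeline:
    "Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors": "models/loras",
}

for filename, subdir in REQUIRED.items():
    path = COMFYUI_DIR / subdir / filename
    status = "OK" if path.is_file() else "MISSING"
    print(f"[{status}] {path}")
```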
Loading the Workflow
Download the provided video_wan2_2_14B_s2v.json file.
In ComfyUI, drag and drop the JSON file into the window or use the Load button.
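Before (or instead of) opening ComfyUI, you can also peek inside the JSON to see which node types the workflow uses; this is handy for spotting nodes that require extra custom-node packs (such as the GGUF loaders). The snippet below assumes the standard ComfyUI UI-export format, i.e. a top-level "nodes" array whose entries carry a "type" field.

```python
import json
from collections import Counter

# Count the node types used in the workflow file referenced in this post.
with open("video_wan2_2_14B_s2v.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

node_types = Counter(node.get("type", "<unknown>") for node in workflow.get("nodes", []))
for node_type, count in node_types.most_common():
    print(f"{count:2d} x {node_type}")
```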
Running the Workflow
Upload Your Media:
In the "LoadImage" node, upload your starting reference image.
In the "LoadAudio" node, upload your music or audio file.
Enter Your Prompt:
Modify the text in the "CLIP Text Encode (Positive Prompt)" node.
The negative prompt is already filled with a robust, standard negative.
Choose Your Pipeline:
To use the 4-Step Lightning pipeline (Fast): Ensure the LoraLoaderModelOnly node is correctly pointed to your Lightning LoRA file. The Steps primitive node for this section is already set to 4 and CFG to 1.
To use the 20-Step pipeline (High Quality): The lower section of the workflow is already configured, with Steps set to 20 and CFG to 6.0. If you wish to run only this pipeline, box-select the entire 4-step section and press Ctrl+B to bypass it.
Queue Prompt! Watch as your image comes to life, driven by your audio.
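If you would rather queue runs from a script (for example, to sweep seeds overnight), ComfyUI also exposes an HTTP endpoint for this. The sketch below assumes a default local instance on port 8188 and a workflow exported in API format (use ComfyUI's export-API option; the regular UI JSON is not accepted by this endpoint). The file name video_wan2_2_14B_s2v_api.json is just a placeholder for your own export.

```python
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188"  # default local ComfyUI address

# Load the workflow exported in API format (placeholder file name).
with open("video_wan2_2_14B_s2v_api.json", "r", encoding="utf-8") as f:
    prompt_graph = json.load(f)

# POST the graph to ComfyUI's /prompt endpoint to queue one generation.
payload = json.dumps({"prompt": prompt_graph}).encode("utf-8")
req = urllib.request.Request(
    f"{COMFYUI_URL}/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # includes the prompt_id of the queued job
```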
⚠️ Important Notes & Tips
Batch Size Setting: The "Batch sizes" value (3 by default) is not a traditional batch size. It must be set to 1 + [number of Video S2V Extend subgraphs]. This workflow has 2 extenders, so the value is 3. If you add another extender, set it to 4 (see the helper snippet after this list).
Chunk Length: The default is 77 frames. This is a requirement of the model and should not be changed unless you know what you're doing.
Lightning LoRA Trade-off: The 4-step LoRA is incredibly fast but may result in a slight drop in coherence and quality compared to the 20-step generation. It's the perfect tool for finding the right seed and composition quickly.
GGUF vs. Safetensors: This workflow uses GGUF for the text and UNet models to save VRAM. You can replace the LoaderGGUF and ClipLoaderGGUF nodes with standard UNETLoader and CLIPLoader nodes if you have the VRAM to use the full .safetensors models, which may offer slightly better quality.
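To make the batch-size rule above harder to get wrong, here is a tiny helper that encodes it: pass the number of Video S2V Extend subgraphs in your copy of the workflow and it returns the value to type into the "Batch sizes" widget, plus the resulting number of 77-frame chunks (one base generation plus one per extender).

```python
def s2v_settings(num_extenders: int, chunk_len: int = 77) -> dict:
    """Derive the 'Batch sizes' value from the number of Video S2V Extend
    subgraphs, per the rule described above (1 base generation + N extends)."""
    return {
        "batch_sizes": 1 + num_extenders,   # value for the "Batch sizes" widget
        "chunks": 1 + num_extenders,        # base chunk + one per extender
        "frames_per_chunk": chunk_len,      # model requirement, leave at 77
    }

# The template ships with 2 extenders -> "Batch sizes" must be 3.
print(s2v_settings(2))  # {'batch_sizes': 3, 'chunks': 3, 'frames_per_chunk': 77}
# Add one more extender -> set it to 4.
print(s2v_settings(3))  # {'batch_sizes': 4, 'chunks': 4, 'frames_per_chunk': 77}
```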
🎬 Example Results
Prompt: "The man is playing the guitar. He looks down at his hands playing the guitar and sings affectionately and gently."
Audio: A gentle acoustic guitar track.
(You would embed a short video example generated by this workflow here)
🔗 Download & Links
Download this Workflow JSON:
[Link to your uploaded JSON file]
Official Wan 2.2 Model Repo: HuggingFace - Comfy-Org/Wan_2.2_ComfyUI_Repackaged
Required GGUF Models: Search for Wan2.2-S2V-14B-Q5_0.gguf and umt5-xxl-encoder-q4_k_m.gguf on Hugging Face.
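If you prefer scripted downloads over the browser, the huggingface_hub library can fetch the files directly into your ComfyUI folders. The repo IDs below are placeholders: substitute the actual repositories you find when searching Hugging Face for these file names, and adjust the target directories to your install.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo IDs -- replace them with the repositories hosting the GGUF
# quantizations you found on Hugging Face.
DOWNLOADS = [
    ("your-namespace/Wan2.2-S2V-14B-GGUF", "Wan2.2-S2V-14B-Q5_0.gguf", "ComfyUI/models/unet"),
    ("your-namespace/umt5-xxl-encoder-GGUF", "umt5-xxl-encoder-q4_k_m.gguf", "ComfyUI/models/clip"),
]

for repo_id, filename, target_dir in DOWNLOADS:
    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=target_dir)
    print(f"Downloaded {filename} to {path}")
```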
🎉 Conclusion
This workflow demystifies the process of running the formidable Wan 2.2 S2V model. By integrating GGUF support and a dual-pipeline approach, it empowers users with limited hardware to experiment and create stunning, audio-synchronized animations. Whether you're quickly iterating with the Lightning LoRA or crafting a masterpiece with the full 20-step process, this suite has you covered.
Happy generating! Feel free to leave a comment with your amazing creations or any questions.