Wan 2.2 14B S2V Ultimate Suite: GGUF & Lightning Speed with Extended Video Generation

Updated: Aug 29, 2025
Type: Workflows
Published: Aug 29, 2025
Base Model: Wan Video 2.2 I2V-A14B
Hash (AutoV2): 9FD2D9125A
Creator: zardozai

🎬 Introduction

Welcome to a powerhouse ComfyUI workflow designed to unlock the incredible potential of the Wan 2.2 14B Sound-to-Video (S2V) model. This isn't just a simple implementation; it's a comprehensive suite that addresses two critical needs for AI video generation: accessibility and speed.

This all-in-one workflow provides two parallel generation pipelines:

  1. ⚡ Lightning Fast (4-Step) Pipeline: Utilizes a specialized LoRA to generate videos in a fraction of the time, perfect for rapid prototyping and iteration.

  2. 🎨 High Fidelity (20-Step) Pipeline: The classic, high-quality generation process for when you demand the utmost visual fidelity from your outputs.

Crucially, both versions are configured to run using GGUF-quantized models, dramatically reducing VRAM requirements and making this massive 14B parameter model accessible to users with consumer-grade hardware.


✨ Key Features & Highlights

  • Dual Mode Operation: Choose between speed and quality with two self-contained workflows in one JSON file. Easily enable/disable either section.

  • GGUF Quantization Support: Run the massive Wan 2.2 model without needing a professional GPU. Leverages LoaderGGUF and ClipLoaderGGUF nodes.

  • Extended Video Generation: The workflow includes built-in "Video S2V Extend" subgraphs, each adding 77 frames (roughly 4.8 seconds at 16 FPS). The template is pre-configured with two extenders, so the base chunk plus both extensions comes to roughly 14 seconds of video (see the length sketch after this list). Want a longer video? Simply copy and paste more extender nodes!

  • Audio-Driven Animation: Faithfully implements the S2V model's core function: animating a reference image in sync with an uploaded audio file (e.g., music, speech).

  • Smart First-Frame Fix: Includes a clever hack to correct the first frame, which is often "overbaked" by the VAE decoder.

  • Detailed Documentation: The workflow itself is filled with informative notes and markdown nodes explaining crucial settings like batch size and chunk length.
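
For quick planning, here is a minimal sketch of the length arithmetic implied above (77 frames per chunk, 16 FPS, one base chunk plus one chunk per extender). The function name is illustrative, not part of the workflow:

```python
# Length arithmetic quoted in this post: 77 frames per chunk at 16 FPS,
# with one base chunk plus one chunk per "Video S2V Extend" subgraph.
CHUNK_FRAMES = 77
FPS = 16

def video_seconds(num_extenders: int) -> float:
    """Approximate output length for a given number of extender subgraphs."""
    return CHUNK_FRAMES * (1 + num_extenders) / FPS

for n in range(4):
    print(f"{n} extenders -> {video_seconds(n):.1f} s")
# 0 -> 4.8 s, 1 -> 9.6 s, 2 -> 14.4 s (stock template), 3 -> 19.2 s
```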


🧩 How It Works (The Magic Behind the Scenes)

The workflow is logically grouped into clear steps:

  1. Load Models (GGUF): The LoaderGGUF and ClipLoaderGGUF nodes load the quantized UMT5 text encoder and the main UNet model, drastically reducing VRAM load compared to full precision models.

  2. Upload Inputs: You provide two key ingredients:

    • ref_image: The starting image you want to animate (e.g., a character portrait).

    • audio: The sound file that will drive the motion and pacing of the animation.

  3. Encode Prompts & Audio: Your positive and negative prompts are processed, and the audio file is encoded into a format the model understands using the Wav2Vec2 encoder.

  4. Base Generation (WanSoundImageToVideo): The core node takes your image, audio, and prompts to generate the first latent video sequence.

  5. Extend the Video (Video S2V Extend Subgraphs): This is where the length comes from. The latent output from the previous step is fed into a sampler (KSampler) alongside the audio context again to generate the next chunk of frames. These chunks are concatenated together.

  6. Decode & Compile: The final latent representation is decoded into images by the VAE, and the CreateVideo node stitches all the frames together with the original audio to produce your final MP4 file. (A rough sketch of this chunked flow follows below.)
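
To make the data flow concrete, here is a rough Python sketch of that chunked loop. It is not the workflow's real API: the functions are stand-ins for the nodes named above (WanSoundImageToVideo, the Video S2V Extend subgraph's KSampler pass, VAE decode plus CreateVideo), and NumPy arrays play the role of latents so the script runs on its own:

```python
# Conceptual sketch of the chunked S2V flow described above. Function names are
# illustrative stand-ins for ComfyUI nodes, not a real node API.
import numpy as np

CHUNK_FRAMES = 77          # fixed chunk length the model expects
NUM_EXTENDERS = 2          # the stock template ships with two extend subgraphs

def base_generation(ref_image, audio, prompt):
    """Stand-in for WanSoundImageToVideo: first latent chunk from image + audio + prompt."""
    return np.zeros((CHUNK_FRAMES, 16, 64, 64))   # (frames, channels, h, w) placeholder

def extend_chunk(prev_latent, audio):
    """Stand-in for a Video S2V Extend subgraph: a KSampler pass conditioned on
    the previous latent plus the next window of audio context."""
    return np.zeros_like(prev_latent)

def decode_and_compile(latent_video, audio):
    """Stand-in for VAE decode + CreateVideo: frames stitched with the original audio."""
    return {"frames": latent_video.shape[0], "audio": audio}

ref_image, audio, prompt = "portrait.png", "song.mp3", "A man sings and plays guitar."
chunks = [base_generation(ref_image, audio, prompt)]
for _ in range(NUM_EXTENDERS):
    chunks.append(extend_chunk(chunks[-1], audio))
latent_video = np.concatenate(chunks)              # 3 x 77 = 231 latent frames
print(decode_and_compile(latent_video, audio))     # {'frames': 231, 'audio': 'song.mp3'}
```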


βš™οΈ Instructions & Usage

Prerequisite: Download Models

You must download the following model files and place them in your ComfyUI models directory. The workflow includes handy markdown notes with direct download links, and a quick placement check is sketched after the list.

Essential Models:

  • umt5-xxl-encoder-q4_k_m.gguf → Place in /models/clip/

  • Wan2.2-S2V-14B-Q5_0.gguf → Place in /models/unet/ (or /models/diffusion_models/)

  • wav2vec2_large_english_fp16.safetensors → Place in /models/audio_encoders/

  • wan_2.1_vae.safetensors → Place in /models/vae/

For the 4-Step Lightning Pipeline:

  • Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors → Place in /models/loras/
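
If it helps, the snippet below checks that the files listed above are in the expected folders. The ComfyUI root path is an assumption (adjust it to your install), and the q4_k_m / Q5_0 quantization levels are simply the ones named in this post; other quants work the same way:

```python
# Quick sanity check that the models listed above are where ComfyUI expects them.
from pathlib import Path

COMFYUI_ROOT = Path("~/ComfyUI").expanduser()   # assumption: default install location

REQUIRED = {
    "models/clip/umt5-xxl-encoder-q4_k_m.gguf": "GGUF UMT5 text encoder",
    "models/unet/Wan2.2-S2V-14B-Q5_0.gguf": "GGUF S2V UNet",
    "models/audio_encoders/wav2vec2_large_english_fp16.safetensors": "audio encoder",
    "models/vae/wan_2.1_vae.safetensors": "VAE",
    "models/loras/Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors":
        "Lightning LoRA (4-step pipeline only)",
}

for rel_path, label in REQUIRED.items():
    status = "OK     " if (COMFYUI_ROOT / rel_path).is_file() else "MISSING"
    print(f"[{status}] {label}: {rel_path}")
```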

Loading the Workflow

  1. Download the provided video_wan2_2_14B_s2v.json file.

  2. In ComfyUI, drag and drop the JSON file into the window or use the Load button.

Running the Workflow

  1. Upload Your Media:

    • In the "LoadImage" node, upload your starting reference image.

    • In the "LoadAudio" node, upload your music or audio file.

  2. Enter Your Prompt:

    • Modify the text in the "CLIP Text Encode (Positive Prompt)" node.

    • The negative prompt is already filled with a robust, standard negative.

  3. Choose Your Pipeline (the settings for both are summarized in the sketch after these steps):

    • To use the 4-Step Lightning pipeline (Fast): Ensure the LoraLoaderModelOnly node is correctly pointed to your Lightning LoRA file. The Steps primitive node for this section is already set to 4 and CFG to 1.

    • To use the 20-Step pipeline (High Quality): The lower section of the workflow is already configured. The Steps are set to 20 and CFG to 6.0. If you only want to run this pipeline, box-select the entire 4-Step section and press Ctrl+B to bypass (disable) it.

  4. Queue Prompt! Watch as your image comes to life, driven by your audio.
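
For quick reference, here is a small, purely informational summary of the two pipelines' settings described in step 3 (values quoted from this post; the workflow itself does not read this dict):

```python
# Quick-reference for the two pipelines; everything not listed here stays at
# the workflow's shipped defaults.
PIPELINES = {
    "lightning_4_step": {
        "steps": 4,
        "cfg": 1.0,
        "lora": "Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors",
        "best_for": "rapid prototyping, seed and composition hunting",
    },
    "high_fidelity_20_step": {
        "steps": 20,
        "cfg": 6.0,
        "lora": None,
        "best_for": "final, highest-quality renders",
    },
}
```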


⚠️ Important Notes & Tips

  • Batch Size Setting: The "Batch sizes" value (3 by default) is not a traditional batch size. It must be set to 1 + [number of Video S2V Extend subgraphs]. This workflow has 2 extenders, so the value is 3; if you add another extender, set it to 4 (see the helper after this list).

  • Chunk Length: The default is 77 frames. This is a requirement of the model and should not be changed unless you know what you're doing.

  • Lightning LoRA Trade-off: The 4-step LoRA is incredibly fast but may result in a slight drop in coherence and quality compared to the 20-step generation. It's the perfect tool for finding the right seed and composition quickly.

  • GGUF vs. Safetensors: This workflow uses GGUF for the text and UNet models to save VRAM. You can replace the LoaderGGUF and ClipLoaderGGUF nodes with standard UNETLoader and CLIPLoader nodes if you have the VRAM to use the full .safetensors models, which may offer slightly better quality.
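
And a tiny helper for the batch-size rule above, in case you script around the extender count:

```python
# "Batch sizes" must equal 1 (for the base chunk) + one per "Video S2V Extend" subgraph.
def required_batch_size(num_extenders: int) -> int:
    return 1 + num_extenders

print(required_batch_size(2))  # stock template: two extenders -> 3
print(required_batch_size(3))  # add one more extender -> 4
```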


🎭 Example Results

Prompt: "The man is playing the guitar. He looks down at his hands playing the guitar and sings affectionately and gently."
Audio: A gentle acoustic guitar track.

(You would embed a short video example generated by this workflow here)



💎 Conclusion

This workflow demystifies the process of running the formidable Wan 2.2 S2V model. By integrating GGUF support and a dual-pipeline approach, it empowers users with limited hardware to experiment and create stunning, audio-synchronized animations. Whether you're quickly iterating with the Lightning LoRA or crafting a masterpiece with the full 20-step process, this suite has you covered.

Happy generating! Feel free to leave a comment with your amazing creations or any questions.