Updated: Sep 23, 2025
⚠️ INSTALL RES4LYF! This workflow requires the RES4LYF custom node pack; install it before loading the workflow.
A highly optimized ComfyUI workflow designed to generate long, dynamic videos with pronounced motion using the Wan2.2-S2V-14B model, all in just 4 sampling steps.
This workflow is the next evolution in the "Motion Forge" series, pushing the boundaries of efficiency and length. It leverages a sophisticated chaining mechanism to extend video clips sequentially, allowing for the creation of significantly longer videos from a single reference image and audio file while maintaining high motion quality and coherence.
Model Used: Wan2.2-S2V-14B-Q8_0.gguf
📖 Description
Tired of short, low-motion clips? This workflow is your solution. It's engineered for users who want to create expressive, music-video-style animations that are longer than the standard output. By utilizing a powerful "Video S2V Extend x5" group node, the workflow takes an initial video latent and progressively builds upon it over five stages.
The key innovation here is the extremely low step count (4 steps) combined with a high CFG (6) and a specialized sampler/scheduler pairing (`uni_pc` / `beta57`), which prioritizes fast, creative, high-energy motion generation. It's perfect for animating to music, creating dynamic scenes, or any application where fluid, exaggerated movement is preferred over photorealistic stillness.
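For quick reference, here are the headline sampler settings as a plain Python summary. This is only a sketch of the values described above; the field names are illustrative, and the real values live inside the workflow's KSampler node:

```python
# Headline sampling parameters from this workflow (illustrative summary;
# edit the actual values inside the KSampler node of the workflow JSON).
SAMPLER_SETTINGS = {
    "steps": 4,              # ultra-low step count for speed
    "cfg": 6.0,              # high CFG to push motion and energy
    "sampler_name": "uni_pc",
    "scheduler": "beta57",   # scheduler shipped with RES4LYF (likely why the pack is required)
}
```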
✨ Features & Highlights
🔥 Ultra-Fast Generation: Only 4 steps per sampling pass makes the generation process remarkably quick for the video length achieved.
💥 High-Motion Output: Deliberately configured with a high CFG scale and specific sampler/scheduler to maximize movement and dynamism in the final video.
🎬 Long-Format Video: The core "Video S2V Extend x5" node chains five sequential generations, turning a base clip into a much longer sequence.
🎵 Sound-to-Video (S2V): Fully integrates audio analysis via a Wav2Vec encoder, synchronizing the visual motion to the input audio track (`DEXTER_JUSTICE.wav` in the example).
🧹 Built-in Memory Management: Includes `easy cleanGpuUsed`, `VRAMCleanup`, and `RAMCleanup` nodes to ensure stability during long generation runs.
🔧 Smart Pre-processing: Automatically resizes and prepares the reference image (`ComfyUI_02140_.png`) for optimal compatibility.
🎯 Quality-of-Life Fixes: Incorporates a "stupid hack" (as noted in the workflow) that fixes the first frame being "overbaked" by the VAE: the frame is duplicated before decoding and the duplicate is removed afterwards (see the sketch below).
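In code terms, that first-frame hack is just a sacrificial duplicate. A minimal sketch of the idea, assuming a `[B, C, T, H, W]` latent layout and a generic `vae_decode` callable; both names are illustrative, not the workflow's actual node API:

```python
import torch

def decode_without_overbaked_first_frame(vae_decode, latent: torch.Tensor) -> torch.Tensor:
    # Duplicate frame 0 so the VAE's "overbaked" artifact lands on a
    # throwaway copy (time axis assumed to be dim 2).
    padded = torch.cat([latent[:, :, :1], latent], dim=2)
    frames = vae_decode(padded)   # decode latents to pixel frames
    return frames[:, :, 1:]       # drop the sacrificial duplicate
```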
🛠️ Technical Details
Workflow: ComfyUI (JSON included)
Primary Model: Wan2.2-S2V-14B-Q8_0.gguf
CLIP Model: cow-umt5xxl-q4_0.gguf
VAE: Wan2_1_VAE_fp32.safetensors
Audio Encoder: wav2vec2_large_english_fp8_e4m3fn.safetensors
LoRA: lightx2v_I2V_14B_480p_cfg_step_distill_rank128_bf16.safetensors (strength: 1.38)
🚀 Usage Instructions
Load the Workflow: Import the provided JSON file into ComfyUI.
Check Model Paths: Ensure the paths to the required model files (`Wan2.2-S2V-14B-Q8_0.gguf`, etc.) in the `LoaderGGUF`, `ClipLoaderGGUF`, and `VaeGGUF` nodes point to the correct locations on your system.
Input Your Media:
- Reference Image: Replace the `LoadImage` node's path with your own starting image.
- Audio File: Replace the `LoadAudio` node's path with your own audio file (e.g., a song, dialogue, or soundscape).
Adjust Prompts: Modify the text in the `CLIP Text Encode` nodes (Positive and Negative) to describe your desired scene and exclude unwanted elements.
Queue Prompt: Run the workflow! The result will be a video file combined with your audio, saved to your ComfyUI output directory.
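If you prefer running headless, ComfyUI can also queue the same workflow over its HTTP API. A minimal sketch, assuming the workflow was exported via "Save (API Format)", the server runs on the default port, and the filename here is a placeholder:

```python
import json
import urllib.request

# Load an API-format export of the workflow (placeholder filename).
with open("wan22_s2v_motion_forge_api.json") as f:
    workflow = json.load(f)

# POST it to ComfyUI's /prompt endpoint to queue a generation.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```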
💡 Workflow Breakdown (The "Magic" Sauce)
The workflow is logically grouped for clarity:
Step 1 - Load Models: Loads the core Wan models, VAE, and applies a specialized LoRA for enhanced performance.
Step 2 - Upload Audio & Ref Image: Feeds your source media into the pipeline.
Step 3 - Batch Settings: Sets the global parameters like batch size, chunk length, and sampling steps.
Step 4 - Prompt: Where you define the visual style and content.
Basic Sampling: The initial `WanSoundImageToVideo` and `KSampler` nodes, which create the first short video latent from your image and audio.
Video S2V Extend x5 (The Core): This custom group node is the engine of the workflow. It takes the initial video and runs it through five separate extension cycles, each with a different seed, effectively "dreaming" the video forward in time while staying conditioned on the original image and audio (see the sketch after this list).
Fix Overbaked First Frame: A post-processing chain that decodes the final latent video, corrects a visual artifact on the first frame, and extracts the final video frames.
Final Combine: The `VHS_VideoCombine` node takes all the generated frames and the original audio file to render the final MP4 video.
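To make the chaining concrete, here is a conceptual sketch of the extension loop. `sample_stage` stands in for one `WanSoundImageToVideo` + `KSampler` pass, and the tensor layout is an assumption; none of these names are actual node APIs:

```python
import torch

def video_s2v_extend(base_latent: torch.Tensor, sample_stage, stages: int = 5,
                     base_seed: int = 1234) -> torch.Tensor:
    # Conceptual sketch of the "Video S2V Extend x5" chain. The real group
    # node also conditions each pass on the reference image and the matching
    # audio window, which is what keeps the extended clip coherent.
    clips = [base_latent]
    for i in range(stages):
        torch.manual_seed(base_seed + i)    # each extension cycle uses its own seed
        tail = clips[-1][:, :, -1:]         # hand the previous clip's tail to the next stage
        clips.append(sample_stage(tail))
    return torch.cat(clips, dim=2)          # concatenate all chunks along the time axis
```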
📝 Example Prompt from the Workflow
Positive Prompt:"Professional male driver in car interior, NYC night cityscape through windows, neon lights reflection on face, subtle head turn toward passenger camera, contemplative expression, cinematic bokeh from street lights, dashboard illumination, urban atmosphere, smooth camera movement, noir aesthetic, moody lighting with blue and orange tones, 4K quality"
Negative Prompt:"色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
(Translation: garish colors, overexposed, static, blurry details, subtitles, style, artwork, painting, still frame, motionless, overall gray tint, worst quality, low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, frozen frame, cluttered background, three legs, crowded background, walking backwards.)
⚙️ Recommended Settings
For even longer videos: Increase the "Chunk Length" or add more extension blocks by replicating the `Video S2V Extend` group.
For different motion styles: Experiment with the CFG scale. Lower values (3-5) may produce subtler motion, while higher values (7-10) can create even more dramatic effects.
If coherence breaks: Try a slightly higher step count (6-8) in the `PrimitiveInt` node titled "Steps".
⚠️ Limitations & Notes
Coherence Decay: As with any video extension technique, coherence with the original reference image may decrease the longer the video gets.
High VRAM Usage: Generating long videos can be VRAM-intensive. The memory cleanup nodes are crucial for stability.
Artistic, Not Photorealistic: The 4-step approach is optimized for expressive motion, not for achieving perfect, stable, photorealistic frames. Embrace the abstract and dynamic nature of the output!
Tags: ComfyUI, Workflow, Wan2.2, Sound2Video, S2V, Video Generation, AI Video, Long Video, High Motion, AnimateDiff, AI Animation