Type | Workflows |
Published | May 3, 2025 |
Hash | AutoV2 1457A7BB83 |
This is a workflow I posted earlier on Reddit/Github:
https://www.reddit.com/r/StableDiffusion/comments/1k83h9e/seamlessly_extending_and_joining_existing_videos/
It exposes a somewhat understated feature of WAN VACE: temporal extension. It is underwhelmingly described as "first clip extension", but it can actually auto-fill pretty much any missing footage in a video, whether that's full frames missing between existing clips or masked-out content (faces, objects).
It's better than Image-to-Video / Start-End Frame because it maintains the motion from the existing footage (and also connects it to the motion in later clips).
Watch this video to see how the source video (left) and mask video (right) look. The missing footage (gray) appears in multiple places, including a masked-out face, and all of it is filled in by VACE in one shot.
This is built on top of Kijai's WAN VACE workflow; I added the temporal extension part as a 4th grouping in the lower right (so credit to Kijai for the original workflow).
It takes in two videos: your source video, with the missing frames/content in gray, and a black-and-white mask video (the same clip with the gray content recolored to white). I usually make the mask video by dropping the brightness to -999 (or something to that effect) on the original while recoloring the gray areas to white.
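If you'd rather script the mask than make it in an editor, here is a minimal sketch (my own helper, not part of the workflow) that assumes OpenCV is installed and that the missing regions are painted close to #7F7F7F; it writes a video that is white wherever the source is gray and black everywhere else. The filenames are placeholders.

```python
# Sketch: generate the black-and-white mask video from the gray-filled source video.
# Assumes OpenCV (pip install opencv-python) and gray fill near #7F7F7F.
import cv2
import numpy as np

SRC = "src_video_with_gray.mp4"   # placeholder: source video with missing content painted gray
DST = "mask_video.mp4"            # placeholder: output mask, white where gray, black elsewhere

cap = cv2.VideoCapture(SRC)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter(DST, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Treat pixels near (127, 127, 127) as "missing"; the tolerance absorbs
    # small shifts introduced by video compression.
    is_gray = np.all(np.abs(frame.astype(np.int16) - 127) <= 6, axis=2)
    mask = np.where(is_gray, 255, 0).astype(np.uint8)
    out.write(cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR))

cap.release()
out.release()
```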
Make sure to keep it at about 5 seconds to match Wan's default output length (81 frames at 16 fps, or the equivalent frame count if the FPS is different). You can download VACE's example clip here for the exact length and gray color (#7F7F7F) to use in the source video: https://huggingface.co/datasets/ali-vilab/VACE-Benchmark/blob/main/assets/examples/firstframe/src_video.mp4
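As a quick sanity check on length, here is a small sketch (again just an assumption-laden helper, not part of the workflow) that uses OpenCV to report a clip's frame count and FPS against Wan's default of 81 frames at 16 fps. The filenames are placeholders.

```python
# Sketch: compare a clip's length against Wan's default of 81 frames at 16 fps (~5 s).
import cv2

def check_clip(path, target_frames=81, target_fps=16):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    target_seconds = target_frames / target_fps       # ~5.06 s
    equivalent = round(target_seconds * fps)          # same duration at this clip's FPS
    print(f"{path}: {frames} frames @ {fps:.2f} fps "
          f"(want ~{equivalent} frames for a {target_seconds:.2f} s clip)")

check_clip("src_video_with_gray.mp4")  # placeholder filenames
check_clip("mask_video.mp4")
```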
In the workflow itself, I recommend setting Shift to 1 and CFG to around 2-3 so that it focuses primarily on smoothly connecting the existing footage. I found that higher values sometimes introduced artifacts.
Models to download (one way to fetch and place them is sketched after this list):
models/diffusion_models: Wan 2.1 T2V (Pick 1):
FP16: https://huggingface.co/IntervitensInc/Wan2.1-T2V-1.3B-FP16/blob/main/diffusion_pytorch_model.safetensors
BF16: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1-T2V-1_3B_bf16.safetensors
FP8: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1-T2V-1_3B_fp8_e4m3fn.safetensors

models/diffusion_models: VACE Preview: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1_VACE_1_3B_preview_bf16.safetensors
models/text_encoders: umt5-xxl-enc (Pick 1):
BF16: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/umt5-xxl-enc-bf16.safetensors
FP8: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/umt5-xxl-enc-fp8_e4m3fn.safetensors

models/vae: Wan 2.1 VAE (any version): https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1_VAE_bf16.safetensors
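If you prefer scripting the downloads, here is a rough sketch using huggingface_hub that grabs one option from each "Pick 1" group (the FP8 variants, purely as an example) plus the VACE preview and the VAE, and copies them into the ComfyUI model folders named above. The ComfyUI path and the FP8 choices are assumptions; swap in whichever precision you actually want.

```python
# Sketch: fetch the models listed above and place them in the ComfyUI model folders.
# FP8 variants are picked here only as an example; COMFYUI_DIR is an assumption.
from pathlib import Path
import shutil
from huggingface_hub import hf_hub_download

COMFYUI_DIR = Path("ComfyUI")  # adjust to your install

DOWNLOADS = [
    # (repo_id, filename in repo, target subfolder under ComfyUI/models)
    ("Kijai/WanVideo_comfy", "Wan2_1-T2V-1_3B_fp8_e4m3fn.safetensors", "diffusion_models"),
    ("Kijai/WanVideo_comfy", "Wan2_1_VACE_1_3B_preview_bf16.safetensors", "diffusion_models"),
    ("Kijai/WanVideo_comfy", "umt5-xxl-enc-fp8_e4m3fn.safetensors", "text_encoders"),
    ("Kijai/WanVideo_comfy", "Wan2_1_VAE_bf16.safetensors", "vae"),
]

for repo_id, filename, subfolder in DOWNLOADS:
    cached = hf_hub_download(repo_id=repo_id, filename=filename)
    target = COMFYUI_DIR / "models" / subfolder / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, target)
    print(f"Placed {filename} -> {target}")
```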
Here is an additional video showing what it looks like to load in the video inputs.