I've been asked a lot about my workflows/process lately so I decided to upload them. I'm not an expert by any means, just figuring this stuff out as I go, but my results are getting pretty decent now, so I think the workflows have some merit to them. Feel free to play with them and suggest enhancements.
I'm also a purist. I avoid adding unnecessary "QoL" node packages to my workflows, so any nodes you are prompted to download are there because I couldn't find a way to do the same thing with the base nodes. Feel free to add your own QoL nodes.
Dependencies
comfyui-frame-interpolation: Used to convert Wan's 15-16FPS videos into smoother 30FPS+ videos.
comfyui-videohelpersuite: Primarily used for saving video files as mp4s and loading image batches from folders.
comfyui_ultimatesdupscale: The core of my upscaling workflow.
All of them can be acquired with the ComfyUI Manager, which is a must to deal with the mess of dependencies some of these workflows can become (https://github.com/Comfy-Org/ComfyUI-Manager).
Workflow Convention
Flow goes in node groups from left to right.
Within a node group, columns go left to right.
Within a column, nodes go top to bottom.
Green groups are parameters that you are likely to normally tinker with.
Blue groups do processing.
Orange groups generate output.
Green nodes are nodes you are likely to tweak often.
Red nodes can be changed, but with the understanding that their values either have strong implications or need to be chosen carefully.
Grey nodes are pretty much set-it-and-forget-it nodes.
Explanation
Process
I use workflow WAN2.1 I2V v1.1.7 to generate my base video. I save both an interpolated video and the individual frames as files. The video lets me see the final results; the frame files are used for the upscaling process if I choose to upscale. I strongly recommend not upscaling from video files, as merely creating them introduces color shifts in the output even if you aren't deliberately compressing anything. This bugged me for some time until I realized I should be working with image files end to end and only convert to video in the final step.
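If you want to see the color shift for yourself, here's a rough sketch of the kind of check I mean. The ffmpeg flags, file names, and folder layout are just my assumptions for illustration, not part of any workflow:

```python
# Rough sketch: round-trip some PNG frames through an mp4 and measure the color drift.
# Assumes ffmpeg is on your PATH and frames are named frames/frame_00001.png, etc.
import os
import subprocess
import numpy as np
from PIL import Image

os.makedirs("decoded", exist_ok=True)

# Encode the frames to H.264 (yuv420p, which is what most players expect).
subprocess.run([
    "ffmpeg", "-y", "-framerate", "16", "-i", "frames/frame_%05d.png",
    "-c:v", "libx264", "-pix_fmt", "yuv420p", "roundtrip.mp4",
], check=True)

# Decode the video back out to PNGs.
subprocess.run([
    "ffmpeg", "-y", "-i", "roundtrip.mp4", "decoded/frame_%05d.png",
], check=True)

# Compare the first original frame against its decoded copy.
orig = np.asarray(Image.open("frames/frame_00001.png").convert("RGB"), dtype=np.float32)
back = np.asarray(Image.open("decoded/frame_00001.png").convert("RGB"), dtype=np.float32)
print("mean per-channel shift (R, G, B):", (back - orig).mean(axis=(0, 1)))
```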
Once I get a video generation I'm happy with, I put the image frames of that video in a folder and run the folder through the USDU Image Batch v1.1 workflow. This does upscaling with denoising so you're not just blowing up the image.
Once that completes, I run the upscaled images through the Interpolate Image Batch v1.0 workflow. This last workflow is very simple and almost seems like it should be part of the upscaling workflow but upscaling is very resource-intensive and Comfy likes to keep node output in memory. The combination of upscaled frames and interpolated frames causes me to go OOM even on my best computer (16GB VRAM, 96GB system RAM) so I found I had to break them up into separate workflows. No biggie.
Workflows in Depth
WAN2.1 I2V v1.1.7
Pretty standard Wan2.1 I2V workflow, with the only real tweak being that I made it a dual-sampler setup. I mostly just use the first sampler and set the second sampler column's nodes to bypass. I enable the second sampler if LightX2V is causing the content LoRAs I'm using to misbehave. I have found that partially rendering with LightX2V and partially rendering with no boosting LoRAs lets me keep some of the benefits of LightX2V while preserving the content LoRAs' original behavior.
If I'm rendering with just LightX2V (which I try to when I can), I set the "Steps" node to 6, set "First pass end" to the same value, 6, and call it a day. If I'm using both samplers, I set "Steps" to 16 and "First pass end" to 3. This causes the first sampler to do 3 steps with LightX2V and then the second sampler does the remaining 13 steps without LightX2V. If using one sampler, set the first KSampler's "return_with_leftover_noise" to disable; if using both samplers, set it to enable. Theoretically, you should be able to keep this value disabled and just have the second sampler add its own noise, but my testing showed the results were not as good. Might need more testing, though.
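If the step arithmetic is confusing, here's roughly how I understand it mapping onto the two Advanced KSamplers. This is just a sketch of the values involved (the field names match what I see in the nodes), not something you paste into Comfy:

```python
# Sketch of the dual-sampler step split. The numbers mirror the "Steps" and
# "First pass end" nodes; the dicts mirror the KSampler (Advanced) fields.
total_steps = 16      # "Steps" node
first_pass_end = 3    # "First pass end" node

# First KSampler (Advanced): runs with LightX2V, adds the initial noise, and
# returns leftover noise so the second sampler can continue where it stopped.
first_sampler = {
    "add_noise": "enable",
    "start_at_step": 0,
    "end_at_step": first_pass_end,
    "return_with_leftover_noise": "enable",  # disable this if you only use one sampler
}

# Second KSampler (Advanced): no LightX2V, picks up at step 3 and finishes the rest.
second_sampler = {
    "add_noise": "disable",
    "start_at_step": first_pass_end,
    "end_at_step": total_steps,
    "return_with_leftover_noise": "disable",
}

print(f"first pass:  steps 0-{first_pass_end} with LightX2V ({first_pass_end} steps)")
print(f"second pass: steps {first_pass_end}-{total_steps} without LightX2V "
      f"({total_steps - first_pass_end} steps)")
```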
Note on system performance:
I have two machines I use for AI generation. One has 16GB of VRAM and 96GB of RAM. The other has 12GB of VRAM and 32GB of RAM. I can run the workflow on both systems, with the caveat that the 12/32GB system really struggles with the models since it ends up hitting the page file often. For that system, I get much better performance (i.e., minimal page file hits) by changing the following...
Load CLIP: set to umt5_xxl_fp8_e4m3fn_scaled
Load Diffusion Model: set to wan2.1_i2v_480p_14B_fp8_e4m3fn
If your system is struggling with the default models, try using the ones above. If you're still getting OOM errors or unreasonably slow performance, you may need to drop your video resolution or the number of frames you generate. I typically do 480x640 or 480x720 at 81 frames on my secondary system, or 113 frames on my main system. Sometimes I go higher in dimensions, but I've found some LoRAs don't like that. I can push up to 720x960 on my main system, but I'd rather squeeze in faster generations at a lower size and upscale later once I pick the good ones.
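For picking frame counts and resolutions, here's a little sanity check I find handy. It reflects my understanding of Wan's latent layout (8x spatial and 4x temporal compression, with frame counts of the form 4n+1), so take the details with a grain of salt:

```python
# Quick sanity check for resolution/length choices. Based on my understanding that
# Wan's VAE compresses 8x spatially and 4x temporally, so frame counts should be 4n+1.
def check_settings(width: int, height: int, frames: int) -> None:
    if (frames - 1) % 4 != 0:
        lower = ((frames - 1) // 4) * 4 + 1
        print(f"{frames} frames is not 4n+1; try {lower} or {lower + 4} instead")
        return
    latent_frames = (frames - 1) // 4 + 1
    print(f"{width}x{height} x {frames} frames -> latent "
          f"{width // 8}x{height // 8} x {latent_frames}")

check_settings(480, 640, 81)    # what I run on the 12GB/32GB system
check_settings(480, 640, 113)   # what I run on the 16GB/96GB system
check_settings(720, 960, 113)   # about as far as I can push the bigger card
```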
USDU Image Batch v1.1
The salient points about this workflow are as follows:
You don't need a negative prompt as you will be denoising at 1.0 CFG, which ignores the negative prompt.
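If you're curious why: classifier-free guidance blends the negative and positive predictions, and at a scale of exactly 1.0 the negative term cancels out. A toy illustration, with scalars standing in for the model's predictions:

```python
# Why CFG 1.0 ignores the negative prompt: the guidance blend reduces to the
# positive prediction when the scale is exactly 1.0.
def cfg_mix(positive: float, negative: float, cfg: float) -> float:
    return negative + cfg * (positive - negative)

print(cfg_mix(positive=0.8, negative=0.3, cfg=1.0))  # 0.8 -- the negative drops out
print(cfg_mix(positive=0.8, negative=0.3, cfg=6.0))  # 3.3 -- now the negative matters
```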
Feel free to experiment with or without the FusionX LoRA. I tend to denoise at low values to retain the look and feel of the base image but, if you want to let the upscaler get more creative, you can denoise at higher values and use FusionX to mess with the look and feel.
I have found that a Shift of 2.0 or higher is needed; if you keep Shift at 1.0 or don't use a Shift node (same as keeping Shift at 1.0), there's a lot of jittering between frames. Shift smooths that out. This workflow is a variant of one I found on Reddit, and it defaulted to a Shift of 10, but my testing showed it tended to yield blurrier results than I'd like. In a nutshell, you want the lowest possible Shift that retains sharpness while minimizing jitter. 2.0 seems to work fine for me, but I encourage running your own tests.
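For context, here's my loose understanding of what Shift actually does under the hood (based on how the flow-matching shift is usually described; don't quote me on the exact formula): it remaps the noise schedule so more of the denoising happens at the high-noise end, which smooths frame-to-frame differences but also softens detail.

```python
# My rough understanding of the Shift remapping: each sigma in the schedule gets
# pushed up as shift * s / (1 + (shift - 1) * s). Higher shift keeps sigmas higher
# for longer, which smooths jitter between frames but also blurs fine detail.
def shifted(sigma: float, shift: float) -> float:
    return shift * sigma / (1 + (shift - 1) * sigma)

for s in (0.25, 0.50, 0.75):
    print(f"sigma {s:.2f}: shift 1.0 -> {shifted(s, 1.0):.3f}, "
          f"shift 2.0 -> {shifted(s, 2.0):.3f}, shift 10 -> {shifted(s, 10):.3f}")
```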
On the upscaler itself, the main values to consider are "upscale_by", "denoise", "tile_width", and "tile_height". For the tiles, set them as high as you can without getting OOM errors. When I downloaded this workflow, the tile sizes were set to 1024, which gave me OOM errors. I halved them to 512 and it worked. Then I kept bumping the values up and found I could pull off up to 720. Find the highest value that still lets your system work. "upscale_by" is set to 2.0 by default. The more you upscale, the more you need to denoise to make up for the extra blurriness and lack of detail. 2x upscale sounds neat on paper, and most of my recent upscaled videos are at 2x, but I'm on the fence as to whether I'll continue doing 2x or maybe just do 1.5-1.6x instead. For "denoise", you can go between 0.1 and 0.35 before you really start to lose the original video and lots of extra elements start getting added in. I typically stay between 0.1 and 0.2. Again, the more you upscale by, the more you will have to denoise.
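To get a feel for why tile size and upscale factor matter so much, here's some back-of-the-envelope tile math. It ignores USDU's tile padding and seam-fix passes, so treat the counts as rough:

```python
# Rough tile math for Ultimate SD Upscale: how many tiles each frame needs at a
# given upscale factor and tile size (ignoring tile padding and seam fixes).
import math

def tile_count(width: int, height: int, upscale_by: float,
               tile_width: int, tile_height: int) -> int:
    out_w, out_h = int(width * upscale_by), int(height * upscale_by)
    return math.ceil(out_w / tile_width) * math.ceil(out_h / tile_height)

# A 480x640 source frame with the tile sizes mentioned above:
print(tile_count(480, 640, 2.0, 512, 512))  # 6 tiles per frame
print(tile_count(480, 640, 2.0, 720, 720))  # 4 tiles per frame
print(tile_count(480, 640, 1.5, 720, 720))  # 2 tiles per frame
```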
The odd thing about this workflow that may stand out is the steps to add and remove padding at the front and back of the video. The workflow takes the first 15 frames, reverses them, and prepends them to the front. Then it does the same thing to the last 15 frames and appends them to the end. After upscaling, these extra frames are dropped. The reason I do this is that I've noticed the upscaler seems to need a few frames to "settle" on what the video should look like. That's just the behavior I've observed; I'm not sure how accurate the explanation is. By giving it extra frames that are similar to the video content, the upscaler does its messing up on frames that are never going to be used. By the time it gets to the frames you are actually going to keep, it is generating more accurate images.
If you don't believe me, try upscaling a video with and without steps 3, 4, 6 and 7. Without the padding, you may see some color shifting at the beginning of the video and some weird blurriness at the end of the video. Both of those go away when using padding.
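In list terms, the padding groups boil down to something like this (integers standing in for actual frames; the real workflow does it with image batch nodes, not Python):

```python
# Sketch of the mirror-padding trick: prepend/append reversed copies of the
# first/last 15 frames, upscale, then drop the padding again afterwards.
PAD = 15

def add_padding(frames: list) -> list:
    head = list(reversed(frames[:PAD]))   # first 15 frames, reversed
    tail = list(reversed(frames[-PAD:]))  # last 15 frames, reversed
    return head + frames + tail

def remove_padding(frames: list) -> list:
    return frames[PAD:-PAD]

original = list(range(81))      # stand-in for 81 rendered frames
padded = add_padding(original)  # 111 frames go into the upscaler
restored = remove_padding(padded)
assert restored == original     # the kept frames come through untouched
```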
Closing
Enough rambling. I'll wait for feedback/questions before going in depth with anything else on these workflows. If you have any suggestions for improvements, feel free to comment.