
Updated Wan 2.2 I2V Workflow


Feb 1, 2026

(Updated: 2 months ago)


DO NOT TREAT THIS WORKFLOW AS 'These are the perfect settings, don't change anything, this is the ultimate workflow'! Try different samplers, use whatever models you want to use, change settings and watch what happens and LEARN from it. One of the most wonderful things about this workflow is its brutal honesty. I have included this entire explainer in the workflow along with the GitHub links for any custom nodes required so you don't have to hunt for them or smash your face into your desk. I absolutely hate when I download a workflow to check it out and I get a popup that says I am missing 324322 nodes and that shit don't work.

I also uploaded a video with a version that has upscale and interpolation built in here:

https://civitai.com/posts/26295253

As you pan around this workflow you will notice some extreme differences from other commonly distributed workflows. I didn't create any subgraphs and left everything out in the open swinging in the breeze. Feel free to make yourself at home. A short summary of key stand-out concepts is in order:

Precision Gating

Precision gating uses Q2K, Q5K_S, and fp16 models to ensure the model only acts with the level of precision desirable at each specific denoising stage of video generation. You can read about this concept here:

https://civitai.com/articles/25076/q2k-greater-fp8-the-precision-trap
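As a rough sketch of the idea only: stage gating boils down to a noise-level-to-model mapping. The filenames and sigma thresholds below are made-up placeholders for illustration, not the workflow's actual files or switch points.

```python
# Hypothetical precision-gating table: one quantization level per denoising stage.
# Names and thresholds are illustrative, not the workflow's real settings.
PRECISION_STAGES = {
    "high_noise": "wan2.2_i2v_high_Q2K.gguf",        # coarse layout: low precision is enough
    "mid_noise":  "wan2.2_i2v_mid_Q5K_S.gguf",       # structure and motion: medium precision
    "low_noise":  "wan2.2_i2v_low_fp16.safetensors", # final detail: full precision
}

def model_for_sigma(sigma: float) -> str:
    """Map the current (normalized) noise level to the model used at that stage."""
    if sigma > 0.7:
        return PRECISION_STAGES["high_noise"]
    if sigma > 0.3:
        return PRECISION_STAGES["mid_noise"]
    return PRECISION_STAGES["low_noise"]
```

The point is only that precision is a per-stage decision, not a single global choice.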

renormCFG and rescaleCFG

Proper signal conditioning by renormalizing and rescaling key elements of the model and conditioning at key stages. Renormalization ensures the conditional signal has a comparable magnitude to the model's internal activations. Rescaling ensures that signal is applied proportionally instead of spiking or collapsing. At CFG 1.0, precise, clean conditioning is critical. Without renorm and rescale, the conditional signal is numerically weak and uneven, so the model defaults to its prior or LoRA bias and appears to "ignore" the prompt. Renorm and rescale shape the signal so the model receives clean, proportional conditioning.
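For intuition only, here is a minimal NumPy sketch of the rescale idea, loosely following the common RescaleCFG formulation (match the guided output's standard deviation back to the conditional signal's, then blend). The actual renormCFG/rescaleCFG nodes do more than this; function names and the 0.7 blend factor are assumptions for illustration.

```python
import numpy as np

def rescale_cfg(cond, uncond, guidance_scale=1.0, rescale=0.7):
    """Toy CFG rescale: keep the guided prediction's magnitude proportional
    to the conditional prediction instead of letting it spike."""
    # Standard classifier-free guidance combination
    cfg = uncond + guidance_scale * (cond - uncond)
    # Renormalize the guided output to the conditional signal's std
    rescaled = cfg * (cond.std() / cfg.std())
    # Blend between the rescaled and the raw CFG output
    return rescale * rescaled + (1.0 - rescale) * cfg
```

At guidance_scale 1.0 the CFG combination collapses to the conditional prediction, so the rescale is a no-op; its value shows up when the scale (or a LoRA) pushes the signal out of proportion.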

scaleROPE

RoPE stands for Rotary Positional Embeddings. I have included a Scale RoPE node in the conditioning of each model to ensure Wan gets a clean RoPE signal. RoPE encodes relative position directly into attention, which is why Wan can maintain motion, camera movement, and spatial coherence across frames instead of re-deciding space every step. Wan's own RoPE is... special. RoPE was never designed to be used outside of 2 dimensions, but it was the only reasonable option they (and most current video models) had, and as such it is something of a hack job. The folks that wrote the Scale RoPE node really know the ins and outs of how RoPE works, and the node allows for a similar normalization and stabilization around a known good baseline of 1.0, giving Wan cleaner, more reliable, and more precise positional awareness over time.
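To show what "scaling" RoPE even means, here is a toy NumPy rotary embedding with a scale knob on the positions. The real Scale RoPE node operates on Wan's multi-axis video RoPE and is considerably more involved; this is just the one-dimensional textbook version with an added scale parameter.

```python
import numpy as np

def rope_angles(positions, dim, scale=1.0, base=10000.0):
    """Rotation angles for rotary embeddings; `scale` stretches positions."""
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions * scale, freqs)

def apply_rope(x, positions, scale=1.0):
    """Rotate channel pairs of x by position-dependent angles."""
    ang = rope_angles(positions, x.shape[-1], scale)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because RoPE is pure rotation, it never changes the magnitude of the features, only their phase; the scale parameter stretches or compresses how fast that phase advances with position, which is the knob being normalized around 1.0.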

FreSca

Frequency and noise cleanup using nodes like FreSca. When you are dealing with quantization, or any model signal in general, the highest frequency elements of the model typically represent just... noise. Garbage. Especially in heavily quantized models. The FreSca nodes in the High and Mid noise chains clip off the highest frequency portions of the signal, mitigating jitter, quantization noise, harmonic spikes, ringing, and the like.

The freq_cutoff value of the FreSca node represents its FFT cutoff point. The model signal is represented as a square grid by FreSca: the center of the grid is low frequency, and as the grid expands outwards it becomes higher and higher frequency. Mathematically, Wan is represented in FreSca as a 32x32 grid. The node does nothing until the freq_cutoff is 32 or lower. As you lower the value, FreSca trims off more and more of the grid, so a value of 30 cuts the model signal into a 30x30 grid and eliminates the 2 highest frequency bands of the signal. The current settings are pretty conservative and only trim off the very highest frequency elements of Q2K and Q5K_S. The really interesting thing about this is that when you cut the higher frequencies, the model is forced to recreate something in their place using lower, more stable frequency parts of the model.
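A toy NumPy version of the freq_cutoff behavior looks like the following: a centered box mask in 2D FFT space, where lowering the cutoff shaves bands off the outer (highest frequency) edge of the grid. This is an illustration of the mechanism, not FreSca's actual code.

```python
import numpy as np

def fresca_lowpass(latent, freq_cutoff=30, grid=32):
    """Zero out 2D FFT bins outside a centered freq_cutoff x freq_cutoff box.
    At freq_cutoff == grid the mask keeps everything and nothing changes."""
    spec = np.fft.fftshift(np.fft.fft2(latent))  # center = low frequency
    h, w = spec.shape
    keep = freq_cutoff / grid                    # fraction of each axis to keep
    kh, kw = int(h * keep), int(w * keep)
    mask = np.zeros_like(spec, dtype=bool)
    y0, x0 = (h - kh) // 2, (w - kw) // 2
    mask[y0:y0 + kh, x0:x0 + kw] = True          # keep the low-frequency box
    return np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
```

Trimming the outer bands removes spectral energy, which is exactly why the model has to re-synthesize that content from the lower, more stable frequencies that survive.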

The page for FreSca focuses on image generation but if you want to read more or look at the code:

https://github.com/WikiChao/FreSca/

The signals in video diffusion are just like the signals in audio production or electronics; their integrity must be established and maintained. The fact that I never see anyone discussing this blows my mind. Shaping and conditioning these signals adds near-zero overhead and can make such a massive difference just on its own. I could write a whole article series on this but hopefully this short synopsis will give a basic rationale for now.

Each individual LoRA is given its own node with values for model and CLIP signal strength. The best way to think of this relationship is that the model strength dictates how much the LoRA alters the visual appearance and shape of the video and CLIP controls how loud the LoRA screams at the model to do its bidding and use its biases. One of the most overlooked parts of video generation is how these elements factor into the generative process and how semantic pressure from CLIP affects each sampling stage.

You will notice that I have turned the CLIP strength completely off for the Lightning LoRA and for all LoRA at low noise. At low noise, the decisions and commitments to things like identity, motion, spatial relationships, etc. are decided and ideally fully finalized. Allowing CLIP to keep screaming at the model in low noise gives the LoRA the opportunity to force the model to reopen and reconsider decisions and commitments made during high noise. If Lightning LoRA CLIP is also active, it further amplifies this effect. This is one of the main ways you end up with changing faces, deformed, half-finished anatomy, etc. In general it seems low noise is often just looked at as 'polish', but it is in fact much closer to a military death squad on a mop-up operation. Anything that isn't set in stone MUST be converged and completed, and the death squad of low noise grants no mercy. By setting the CLIP values to 0.00, the LoRA aren't standing around when the death squad arrives telling them 'actually, I don't think this face is quite done yet, I would actually prefer something different'.

With the model strength of the LoRA still active in low noise the visual elements and biases of the LoRA are still used during refinement by the model, acting as a manual for completion.
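Conceptually, the per-stage LoRA settings boil down to something like the table below. The names and numbers are placeholders to show the pattern (CLIP active only for non-Lightning LoRA at high noise, muted everywhere at low noise), not my actual values.

```python
# Illustrative per-stage LoRA stacks: (lora_name, model_strength, clip_strength).
# Filenames and strengths are hypothetical examples, not the workflow's settings.
LORA_STACK_HIGH_NOISE = [
    ("lightning_4step.safetensors", 1.0, 0.0),  # Lightning: CLIP off at every stage
    ("style_lora.safetensors",      0.8, 0.6),  # style LoRA: CLIP speaks at high noise
]
LORA_STACK_LOW_NOISE = [
    ("lightning_4step.safetensors", 1.0, 0.0),
    ("style_lora.safetensors",      0.8, 0.0),  # CLIP muted: no reopening decisions
]
```

Model strength stays on in both stacks so the LoRA's visual biases still guide refinement; only the semantic pressure from CLIP gets silenced before the low-noise pass.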

I have also included some other note nodes in the workflow and you should definitely take the full tour. I really wanted to spend more time working on this and making it presentable and all that jazz but I am back in school and wanted to make sure I could get this out before other obligations got in the way, as many have requested an example workflow and I didn't want to give out some weak-ass lame example without most of the bells and whistles. With that said, this is not even my complete current workflow, and I still need to finalize my custom sampler code and custom sampling node to really round this out. For now, this should give all of you plenty to chew on.

WARNING: This workflow is brutally honest. It will not pamper you and fix bad prompting habits by overcorrecting in low noise. It will not take a bad input image and spit out a 'well, here is at least something based on the overwhelming and intense LoRA biases slamming the model'. If you try to use a poorly designed or sloppily merged checkpoint, it will expose the checkpoint's issues and failure modes. If you try to use badly trained LoRA, or LoRA that rely on sloppy workflows allowing continuous renegotiation of identity and motion even into low noise, they will fail, but they will also expose their mechanism. Above all, EXPERIMENT AND HAVE FUN!
