Basic understanding of video generation with WAN2.2 Lightx2v and SVIPro

This guide covers the basic aspects of video generation based on knowledge (not copy-pasted AI text bullshit), with the intention of helping you understand what you are doing when using WAN2.2, SVIPro and Lightx2v all together.

This is not easy: it is what it is even when simplified to this level.

Put this guide side by side with the WAN template or the SVIPro template, and read it while looking at the node properties and how the nodes are connected.

If you wonder how to do the long videos, jump to SVI.

WAN2.2 ComfyUI template is enough

Look at the default WAN video generation template from ComfyUI and you'll see all the components I mention except for SVIPro (which can be downloaded from its repository).

As I mention the components you'll progressively understand the connections and the meaning of the nodes. I hope so. That will allow you to add more nodes (not only lora nodes).

Generating a segment of video is like solving a set of equations, a matrix multiplication problem, where a state (LATENT) is represented in a multidimensional space: people like to call these tensors, because it sounds cool. A tensor is just a state representation in a matrix, no matter how many dimensions the matrix has. A state is the set of values, and their relations, that describe a model at a given moment.

Forget about what you know about videos: a WAN video is a single state.

Latent (very important)

A latent is the video expressed in the AI's natural language or parameters (tensor space).

The fun part of WAN2.2 video latents is that all frames of the video are solved at once: the entire latent is processed at once, since the entire video represents a state*. Space and time are processed and solved at the same time: you don't generate a frame, then another, then another. Imagine the whole video as a stack of frames: solving (denoising) happens at the same time along the stack (time) direction and in the XY directions of each frame, as well as in all the other dimensions of the tensor.

The solving process consists of converting a latent composed of NOISE into something you would recognize; this process is called DENOISE.

That's why your video requires a number of frames and a frame size: to generate the latent.

And that is why most latent operations are not compatible with the WAN2.2 latent. Each latent must match its model. If you try to cut, trim or batch-operate with generic latent nodes you'll probably get errors like "expected this dimension but got this other".

The total operations in one step represent all the matrix multiplications your video, as a matrix (tensor), must suffer. Just a thought: matrices are made of numbers, and in computers numbers have a limited number of variations (bits); stacking loras stacks values in one multiplication. So when the video is multiplied by the stacked loras, it may end up multiplied by a number which reaches the limit of your precision (bits): that's when videos generate rain, noise, etc. Those are points whose value has "overflown": a value bigger (or smaller) than the biggest (or smallest) possible for your precision.
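To make the "overflow" thought concrete, here is a toy sketch. It assumes 16-bit floats (other number formats have other limits) and treats lora stacking as a plain product of gains, which is a simplification for illustration and not how a lora is actually applied. The activation value and the gains are made up.

```python
import struct

FP16_MAX = 65504.0  # largest finite value a 16-bit (half precision) float can hold


def fits_in_fp16(value: float) -> bool:
    """Return True if value can be stored as a finite 16-bit float."""
    try:
        struct.pack("<e", value)  # "e" is the IEEE 754 half-precision format
        return True
    except OverflowError:
        return False


def stacked_gain(gains):
    """Toy stand-in for stacking: multiply all the gains together."""
    total = 1.0
    for gain in gains:
        total *= gain
    return total


activation = 9000.0  # hypothetical large value somewhere inside the tensor
single = activation * stacked_gain([1.5])            # 13500.0
stacked = activation * stacked_gain([1.5, 2.0, 2.5])  # 67500.0

print(fits_in_fp16(single))   # True
print(fits_in_fp16(stacked))  # False
```

One gain alone stays inside the representable range; three stacked gains push the same value past it.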

VAE / VACE

These are the translators. The VAE is the encoder/decoder that translates between the latent world and the human-readable world. In the case of video it transforms the latent into a succession of images. VACE sounds similar but is a different piece: it is the part that lets the output be guided as well (for example by reference images or control videos). Remember then: if the VAE is the translator, the better the VAE the better the translation. That is why you use the VAE to bring the latent to the video world, and also to transform an image into a latent. Two models may share the encoding process (WAN 2.1, WAN 2.2) and so may use the same basic VAE. The same happens in images with SDXL and some of its variants: they share a single VAE. In the case of WAN, whose latent is a tensor of vectors, each element of the tensor is a multidimensional vector which, evaluated with its surroundings, can provide several pixels of information: 16x16 once decoded. Is it a form of compression? Yes, done with convolutions. Is it like video compression? Not exactly.

Denoise

Denoise goes from 1 (total noise) to 0 (which is considered denoised). Denoise is the process where something meaningful emerges from a totally random set of values (noise). The latent starts as noise, you put some initial information in it, and then it is processed under some guidance.

WAN2.2 denoises in two ranges: HIGH and LOW, using what are called two EXPERTS or denoisers.

The denoise process is done in STEPS; each step is an "increment of variation" where you check if your computation is going "ok". That's why you have HIGH steps and LOW steps.

HIGH noise (and their steps) goes from 1 to 0.9

LOW noise (and their steps) goes from 0.9 to 0

Shift Value

This is very important since SHIFT is used to determine where the high-to-low transition happens. The shift value tells the computing process when to pass the computation from the high to the low expert.

SHIFT is a bad parameter because WAN2.2 models are trained to transition at that very specific value of 0.9 (which is called the BOUNDARY). A bad shift value may move the process from high to low too early or too late, resulting in what we see as garbage.

Let's see why by understanding the Scheduler.

Scheduler

The SCHEDULER is in charge of calculating the values called SIGMAS, which are the places where you check if stuff is "ok". Since there are many ways to go from 1 to 0 there are many schedulers. Each scheduler follows a different pre-defined route.

Yes: sigmas are just a set of ordered numbers from 1 to 0 which must contain 0.9 in between. The size of that set of numbers is the number of steps plus one, since each step goes from one sigma to the next.

So: you have noise (noise = 1), you use a Scheduler to determine step values (SIGMAS), which will gradually solve some equations (DENOISE --> noise = 0) to reach a denoised latent.

Thus, if you know the SCHEDULER, the STEPS and the BOUNDARY, why do you need a shift? You don't, mostly, if you can pre-calculate the sigmas.

My advice: use WanMoScheduler

This node can pre-calculate the sigmas for many schedulers, forcing a specific boundary (which you need), if you tell it how many low and high steps you want.

This node, and working with sigmas instead of shift, avoids the "shift" trial-and-error you have probably experienced. Sigmas are calculated as a vector and you can assign them as you please. Shift must still be routed in your usual workflow, but it is calculated automatically by the mentioned node, no need to guess. Sigmas before 0.9 are HIGH, sigmas after 0.9 are LOW. 0.9 belongs to both sides.
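To see why a given split of high and low steps implies one specific shift, here is a minimal sketch. It assumes linearly spaced base sigmas warped by the usual flow-matching time shift, sigma = shift*t / (1 + (shift-1)*t), and solves for the shift that puts the hand-over sigma exactly on the boundary. The function names are made up for illustration, and I am not claiming this is what the node does internally.

```python
def shifted_sigma(t: float, shift: float) -> float:
    """Flow-matching time shift: warps a linear sigma t in [0, 1]."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def shift_for_boundary(high_steps: int, low_steps: int, boundary: float = 0.9) -> float:
    """Shift that makes the sigma after the last HIGH step equal the boundary.

    Assumes linearly spaced base sigmas and at least one step on each side.
    """
    if high_steps < 1 or low_steps < 1:
        raise ValueError("need at least one high step and one low step")
    total = high_steps + low_steps
    t = 1.0 - high_steps / total  # linear sigma at the hand-over point
    return boundary * (1.0 - t) / (t * (1.0 - boundary))


def build_sigmas(high_steps: int, low_steps: int, boundary: float = 0.9):
    """Return (shift, sigmas) with steps + 1 sigma values going from 1 to 0."""
    total = high_steps + low_steps
    shift = shift_for_boundary(high_steps, low_steps, boundary)
    sigmas = [shifted_sigma(1.0 - i / total, shift) for i in range(total + 1)]
    return shift, sigmas


shift, sigmas = build_sigmas(6, 6)
print(round(shift, 3))                    # 9.0
print([round(s, 3) for s in sigmas[:7]])  # HIGH part, ends on 0.9
print([round(s, 3) for s in sigmas[6:]])  # LOW part, starts on 0.9
```

Change the split (say 2 high and 6 low) and the shift that keeps 0.9 in the right place changes too, which is the guessing game you avoid by pre-calculating.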

No more shift problems using this method.

Example of 6 high and 6 low steps, with the high steps afterwards divided into 2+4 using the "splitsigmas" node at step 2. Notice how 0.9 is shared between both high and low. The curve you see is determined by the "simple" scheduler; other schedulers will provide other curves. The node also calculates a "shift" value specific to the combination: the CORRECT value that must be fed into the "model" part of the workflow. Both shift and sigmas must be used together.
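The splitting in that example can be sketched in a few lines. This assumes the split keeps the sigma at the split step on both sides, which is how I understand the SplitSigmas node to behave. The sigma values below are made up for illustration; only the 0.9 boundary comes from this guide.

```python
def split_sigmas(sigmas, step):
    """Split a sigma list at a step; the sigma at that step is kept on both sides."""
    return sigmas[: step + 1], sigmas[step:]


# 12 steps need 13 sigmas (illustrative values, 0.9 sits at index 6)
all_sigmas = [1.0, 0.99, 0.98, 0.97, 0.95, 0.93, 0.9,
              0.8, 0.65, 0.5, 0.3, 0.15, 0.0]

high, low = split_sigmas(all_sigmas, 6)          # 6 high steps, 6 low steps
high_nolight, high_light = split_sigmas(high, 2)  # 2 + 4 high steps

for name, part in [("HIGH_NOLIGHT", high_nolight), ("HIGH", high_light), ("LOW", low)]:
    print(name, len(part) - 1, "steps:", part)
```

Each sampler receives one more sigma than it has steps, and the last sigma of one sampler is the first sigma of the next.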

Sampler, Guider, Model (and Loras)

For this denoise process you need three more basic ingredients:

First, the calculating machine, the processor or way of denoising: that's called the sampler. The sampler calculates the jumps of values in the latent from sigma to sigma, from 1 to 0.

Second, the meaning or guide for the denoise, the excuse, something to check if that noise is becoming something specific: the GUIDER or CONDITIONER.

Third, the model and complementary model information (loras/checkpoints), which is the basic sieve or set of gradual nets that the noise is computed or pushed through.

You may have heard of the sampler by its name: Euler, Euler A, etc. Those are just ways to evolve the noise.

The guider is where the prompt resides, and it can take many forms applied simultaneously.

The model in this case is WAN2.2, which is only a set of levers with different weights that noise is pushed through (like a sieve), conditioned by the GUIDE and by the modifying extra levers and weights (loras and checkpoints), following a path (scheduler) with a specific direction (sampler) for each variable.

Thousands of millions of levers with very specific positions or weights.

For the sake of simplicity let's reduce the guidance to only two elements (you have probably used or heard about more types of guidance):

- The prompt

- The SVI anchor_samples (use the smallest SVIPro template from their repository)

Controlnet, masks, etc. are other ways to guide the denoise process. You've probably seen them applied in complex workflows where you force a motion or change only part of an image or video.

Every model has its prompting scheme, a way to translate human language to the model's language: the TOKENIZER. "One token one concept".

You can modify, enrich, constrain or push that translation by means of model modifiers which are called (among others) checkpoints or loras. These are patches to the sieve or net with specific meanings, which change the model weights themselves or the relation between the model and the tokens.

That is why some loras require specific words (which trigger certain tokens with modified weights in the overall sum of model plus modifications).

Every model needs to translate our concepts to the mathematical levers of that model, that is the tokenizer or the clip.

Every model also has a way to translate "images" into the natural model language as well: the VAE encoder.

And a way to recover our image interpretation from the model's language: the VAE decoder.

There are lots of other components not covered in this guide.

If you add modifiers to the model (loras/checkpoints) the same prompt reacts differently as the levers and sieve have changed: the set of tensors where the noise is pushed through.

CFG

The CFG is the way you tell the computing process how hard it must force adherence between the tokenized prompt and the resulting noise of a STEP: the "ok".

WAN2.2 has positive and negative prompt: dos and do nots.

As a mathematical process, and this is very important: negative prompts are only used if CFG is greater than 1. Which implies another thing: if it's greater than 1 you must compute both positive and negative prompts, so that's "twice" the computing time.

Don't waste your time filling the negative prompt with complex "do nots" if your CFG is 1, because they are mathematically ignored: 0.
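The "mathematically ignored" part is easy to check with the standard classifier-free guidance formula, result = negative + cfg * (positive - negative). Below is a minimal sketch on plain lists of numbers; the values are made up, and a real sampler does the same thing on whole tensors.

```python
def cfg_combine(cond, uncond, cfg):
    """Classifier-free guidance: start at the negative and push towards the positive."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


def model_calls_per_step(cfg):
    """With CFG at 1 the negative branch cancels out, so it need not be computed."""
    return 1 if cfg == 1.0 else 2


cond = [0.5, -1.0, 2.0]     # hypothetical prediction for the positive prompt
uncond = [0.25, 0.5, -1.0]  # hypothetical prediction for the negative prompt

print(cfg_combine(cond, uncond, 1.0))  # [0.5, -1.0, 2.0] -> identical to cond
print(cfg_combine(cond, uncond, 3.0))  # [1.0, -4.0, 8.0] -> negative now matters
print(model_calls_per_step(1.0), model_calls_per_step(3.0))  # 1 2
```

At CFG 1 the negative term cancels out completely, whatever you wrote in it.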

But hey, there is a trick: CFG is STEP wise! It means you can force the adherence differently at each step.

Tokens, Adherence and Special LoRAs

Token weights alone are not enough to maintain some structures over the denoise process, for example a face, at least not enough to fool us into thinking it's the "same" face.

That is why there are "character" loras which induce or force weights on the model, so that the model denoises in specific ways when certain tokens are present, which helps fool us into thinking it's the same car, person or prop.

These modifiers must be created (trained): that is the process of distilling from the model which tokens, and in which way (weights), make something what it is.

So you need something to compare: samples. And something to relate it to: prompts. Training brings to the surface the relation between samples (images or videos in this case), some associated prompts (our meaning) and the model's tokens (the model's meanings).

Special Meaningful Loras

So in this complex process of loras, nobody said that these additional components were limited to representing a character or the characteristics of a car brand. You now know they can represent a concept, a kind of movement: a dance, a jump, all dances and all jumps...

LIGHTx2v

Lightx2v is actually a variation of the WAN model (a whole model) trained with what we perceive as motion in mind, accelerating it in the way we perceive it. It is a very powerful model trained to accelerate the motion in terms of our reality and understanding, which helps reduce the steps needed to get what we perceive as motion.

Some people extracted the levers or concepts of the model affected by this motion and turned them into an independent file to attach to the original WAN2.2 model and its variants: those are the lightx2v loras.

The LIGHTx2v loras are an understanding of motion as our senses perceive it (well, actually as videos record it, since the model is built from videos), so they can skip some motion calculation or STEPS.

That is why you get "similar" movement with fewer steps with lightx2v, but that acceleration comes with a cost: the motion in between and the detail preservation.

LIGHTx2v works with a CFG of 1 and 1 alone; away from that value the results "differ" from what it was meant for.

1030: https://huggingface.co/Kijai/WanVideo_comfy/tree/main/LoRAs/Wan22_Lightx2v

480p: https://huggingface.co/Kijai/WanVideo_comfy/tree/main/Lightx2v

(remember these are I2V, and you can use any rank from 2 to 128; what rank is falls outside the scope of this guide, use r64 or r128).

The SVI folks have found a way to have a "generic" patch which, with a VAE-encoded image (anchor samples), forces some characteristics (levers and weights) without training a lora. That's cool.

But think about what it is: this tells the denoise process "don't touch that too much, or go back to that value" instead of "this should be this way" (specific lora).

So we have a Model with modifiers: loras/checkpoints. The model is the basic sieve which changes with the loras for the same prompted concept.

We work here with two contradicting loras:

SVI: which wants to preserve some lever values in place (the anchor). Easy to "use" because its weight is always 1.

lightx2v: which wants to accelerate the motion faster than what the model was designed for. Difficult to use because its weight depends on itself and on other loras. Not to mention there are lots of lightx2v loras.

Loras & Templates: https://huggingface.co/Kijai/WanVideo_comfy/tree/main/LoRAs/Stable-Video-Infinity/v2.0

Specific workflows with SVI Pro:

https://github.com/vita-epfl/Stable-Video-Infinity/tree/svi_wan22/comfyui_workflow

(Warning: these workflows are intermediate stuff: you'll need to download many extensions and they will barely do what you intend)

If you're thinking...

So I have to use CFG to force my prompt to get what I want, but I cannot, because if it's bigger than 1 I'll lose the lightx2v effect. How do I force the prompt then??

So you're telling me SVIPro will keep my image at the anchor but lightx2v will accelerate motion: how do pixels move, forced by lightx2v, but remain in place with SVIPro??

And what about clip?? You didn't mention clip.

You understand more than 99.999% of AI video content generators in the world.

WOFF, that explains so many "workflows" full of boxes you don't understand, sets of nodes you don't have and so many lines you want to alt+f4. Recipes you use without understanding what's going on.

But there is more.

Other

Your videos have a length: [frames] and size: [width] and [height].

As explained before, in the latent space THE WHOLE VIDEO is a big set of numbers (tensor); frames are not "separated" from time (*more or less*), so the whole space [width and height] and time [length] must be computed at once.

There are techniques to fragment the process, but it's important to understand that the video itself is a whole, single element.

How loras behave changes with those parameters as well.

Longer videos tend to degrade in image quality because lightx2v overdoes it, the number of steps is low (or not high enough) and the computing numbers aren't precise enough.

Clip is also part of the prompting, in the sense that you're changing the default model behaviour towards some bias (weights). Clip forces the concept; weight forces the value.

Bigger videos (in frame size) require different lightx2v weights to get the same motion.

In latent space you have 1 latent frame for every 4 of our regular video frames. Also, in I2V you need an initial image, which uses 1 whole latent frame; that's the reason why WAN videos are 4*n+1 frames long.
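The 4*n+1 rule can be written down as a tiny helper. This is a sketch of the arithmetic only; the function names are made up for illustration.

```python
def latent_frames(video_frames: int) -> int:
    """Latent frames for a WAN video: 1 for the initial image plus 1 per 4 frames."""
    if video_frames < 1 or (video_frames - 1) % 4 != 0:
        raise ValueError("WAN length must be 4*n+1 frames (1, 5, 9, ... 81, ...)")
    return (video_frames - 1) // 4 + 1


def nearest_valid_length(video_frames: int) -> int:
    """Round a requested length down to the closest 4*n+1 value (minimum 1)."""
    return max(1, ((video_frames - 1) // 4) * 4 + 1)


print(latent_frames(81))          # 21
print(nearest_valid_length(100))  # 97
print(latent_frames(nearest_valid_length(100)))  # 25
```

So an 81-frame video is only 21 latent frames, and asking for 100 frames really means 97.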

I leave the explanation of these concepts for another guide.

The bigger the model, the more parameters; the higher the precision of the model (longer numbers per concept), the more precise the result; but bigger numbers == bigger memory requirements.

The more precise the numbers, the longer the video can be sustained without degeneration; the less "acceleration", the longer the video can be sustained without degeneration.

The more steps, the less degeneration (smaller increments of "variation" and more "checks") but the longer the computing time. Even more so if CFG is bigger than 1, which requires checking both positive and negative prompts (twice the time).

That is why you can generate either longer, smaller videos or shorter, bigger videos with the same VRAM. Or fit a bigger model with smaller videos, or a smaller model with bigger and longer videos.

That is why nGreedia is restricting VRAM on consumer GPUs. Any consumer card can compute the highest video quality with the highest-quality models available; it's just a matter of linear computing time if VRAM is properly sized. If you have to go back and forth to your regular RAM the time is multiplied by orders of magnitude, making the computing burden impractical**.

Practical Recommendations (BASIC)

For Wan 2.2 use lightx2v 1030 HIGH, and 480p rank64 LOW with weights 1.35-1.15 and 1.10-1.05 in 1280x704 videos.

Motion is different at every resolution, but those settings work for 832x480 testing as well (the motion and detail are different, probably more accelerated). These light loras give much better results in detail and in long videos than the wan21 light loras, but may give worse motion with wan21-trained loras. The detail outweighs the loss in motion. Lowering the low weight from 1.10 to 1.00 increases the prompt following, as well as the jitter and craziness in movements.

For WAN 2.2 and the aforesaid lightx2v loras, don't prompt in the "Subject, Scene, Action, Camera" style. It doesn't work for most of its variants. It's better to use a succession of actions with the details within.

Under these circumstances for better quality use at least 4H+4L steps.

Under these circumstances for better motion use sgm_uniform instead of simple scheduler.

Download only the nodes required for comfyUI template and SVIPro template.

Download the node I recommended (it is just one node).

Append default WAN templates for longer videos. It's easy and it works:

The WAN2.2 I2V box is a default WAN template. We extract the last frame and use it as the starting frame for the next prompt. You have to solve the shift problem as I explained. Each segment can be of a different length. You can save independent segments as well.
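The chaining logic itself is tiny. Here is a minimal sketch in which `generate_segment` is a hypothetical stand-in for one whole WAN2.2 I2V box (it is not a real API), and frames are plain strings so the example runs without any model.

```python
def chain_segments(first_image, prompts, generate_segment):
    """Run one I2V segment per prompt, feeding each last frame to the next segment.

    generate_segment(start_image, prompt) must return the list of frames of that
    segment. Segments are kept separate so they can be saved independently.
    """
    segments = []
    start = first_image
    for prompt in prompts:
        frames = generate_segment(start, prompt)
        segments.append(frames)
        start = frames[-1]  # last frame becomes the next starting frame
    return segments


def fake_generate(start_image, prompt):
    """Dummy generator for illustration: the start frame plus two new frames."""
    return [start_image, f"{prompt}-1", f"{prompt}-2"]


segments = chain_segments("photo", ["walk", "sit", "wave"], fake_generate)
for segment in segments:
    print(segment)
```

Note that each segment starts on the frame the previous one ended on, so that frame appears twice if you simply glue the segments together.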

Whats inside each box?

Each of the WAN 2.2 I2V boxes has no secret: it's the default WAN workflow. You should add here the shift system I mentioned and specific loras for each segment. You can append about 3 videos without any problem, and the frame count is up to your RAM.

Practical Recommendations (INTERMEDIATE)

For WAN 2.2 add an additional HIGH sampler and split the sigmas. All samplers will have the same model and loras except for the first one, which won't have the lightx2v lora.

So you have 3 samplers: HIGH_NOLIGHT, HIGH, and LOW.

HIGH_NOLIGHT with 1 or 2 steps, where, without lightx2v, you're free to increase the CFG.

HIGH, which is your regular HIGH sampler with the remaining high steps. CFG = 1.

LOW, which is your regular LOW sampler. CFG = 1

Then you can use a 6+6 scheme where: 2 are for HIGH without light, 4 for HIGH with light and 6 for low.
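The 6+6 scheme can be written as a small plan, which also shows what the extra CFG costs in computing. This is a sketch: the `Stage` fields are made-up names, and the CFG of 3.0 on the first stage is only a placeholder value, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    steps: int
    cfg: float
    lightx2v: bool


def step_ranges(stages):
    """Return (name, start_step, end_step) for each stage, in order."""
    ranges, start = [], 0
    for stage in stages:
        ranges.append((stage.name, start, start + stage.steps))
        start += stage.steps
    return ranges


def model_calls(stages):
    """Model evaluations: CFG above 1 needs positive and negative, so twice per step."""
    return sum(s.steps * (1 if s.cfg == 1.0 else 2) for s in stages)


plan = [
    Stage("HIGH_NOLIGHT", steps=2, cfg=3.0, lightx2v=False),  # cfg is a placeholder
    Stage("HIGH", steps=4, cfg=1.0, lightx2v=True),
    Stage("LOW", steps=6, cfg=1.0, lightx2v=True),
]

print(step_ranges(plan))  # [('HIGH_NOLIGHT', 0, 2), ('HIGH', 2, 6), ('LOW', 6, 12)]
print(model_calls(plan))  # 14
```

Two steps with CFG above 1 cost 14 model evaluations instead of 12, a small price for the extra prompt adherence.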

Three samplers: note how a new lora loader is added for the second sampler, containing all loras plus Lightx2v, leaving room on the first sampler to increase CFG at will. Beware of high CFG. Sigma Split splits the high sigmas between the two samplers. The non-lightx2v sampler must go first.

You're free to split low as well and increase its CFG too.

This configuration allows you to play with the 1217 lightx2v loras, which are even harder to control than 1030.

Use different weights and sets of loras for each video segment on SVIPro, but do not deactivate a lora unless the item/action prompted by that lora is no longer required, otherwise it will vanish.

Remember you have to have the sigmas pre-calculated and split them accordingly for each sampler; there is a great node I recommended for that: WanMoScheduler

You can use: VisualizeSigmasKJ node and Preview Image to see how sigmas are split.

You can use: SplitSigmas and step to split your sigmas across as many samplers as you need.

Practical Recommendations (SVI/SVI Pro)

I assume you have read the SVI point of this guide. SVI has two features:

1) It uses the initial frame of WAN as an anchor point, so for the model, if "you don't know" then "look at that reference frame (latent information)", let's say. That's what creates the character consistency. WAN 2.5 and later versions are character consistent with more than one character, among many improvements. Let's limit the discussion to WAN 2.2.

2) It can translate motion vectors to the next clip. That is why SVI Pro videos are seamless. This is done with the motion latent count, counted in latent frames (i.e. 1 lframe = 4 frames). "It can" means "it can't" as well, if you require an abrupt cut: 0 motion frames gives a complete, abrupt motion cut, a scene cut for example. 1 is for high motion continuity, the regular value. 2 carries more information but is also more rigid; it's also good for valleys of motion (the "stop point" between going forward and going backward).

To do so you overlap lframes, which gives the motion continuity (and also ports noise, errors and slop to the next clip!). Look at the figure. You can select how much motion to bring to the next clip by the number of overlapping lframes: 0, 1, 2, etc. In red is an example of a 1 lframe overlap; in green, an example of a 2 lframe overlap with the final result. When overlapping you lose the number of lframes you overlapped, which are the motion latent frames. Both motion latent count and overlap frames must be coherent.

Here is an example of how to control it dynamically:

The iif() function checks the index of the segment (a); you can nest iif calls. If the index (a) is the desired value, apply a given motion latent count (1), else apply 1 as well (a=1, apply 1, else 1). This is linked to the overlap frames, which must be 1 (the anchor) plus n latent frames: 4*motion latent count.

Another example: iif(a==3, 2, iif(a==5, 2, 1)) --> if segment is 3, motion = 2, else if segment is 5, motion = 2, else motion = 1.
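The same nested expression can be sketched in Python to see what it returns for each segment, together with the matching overlap frames (1 anchor frame plus 4 frames per motion latent). Here `iif` is re-implemented as a plain function for illustration; inside the workflow it is whatever your math/expression node provides.

```python
def iif(condition, if_true, if_false):
    """Inline if: returns if_true when condition holds, else if_false."""
    return if_true if condition else if_false


def motion_latent_count(a: int) -> int:
    """Segments 3 and 5 carry 2 motion latents, every other segment carries 1."""
    return iif(a == 3, 2, iif(a == 5, 2, 1))


def overlap_frames(motion_count: int) -> int:
    """1 anchor frame plus 4 video frames per motion latent frame."""
    return 1 + 4 * motion_count


for a in range(1, 7):
    count = motion_latent_count(a)
    print(f"segment {a}: motion latents = {count}, overlap frames = {overlap_frames(count)}")
```

Keeping both values derived from the same function is one way to make sure motion latent count and overlap frames stay coherent.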

*If you dig deeper it is true that the latent can be processed (and in some workflows it is) in chunks. But that doesn't change the fact that the state is formed equally in the spatial and the time dimensions of the video.

About Similes

I don't usually like similes, I hate them actually, but I used sieve (in this case) for tensor operations. Anyone who has solved an Euler integration, which consists of multiplying matrices (or tensors) that are usually constructed from the sum of transformations (sums and multiplications) of other matrices, knows that a sieve/sieving is a good enough mathematical approximation of what's going on. Another would be: making fresh pasta.

You push (convert, go through spaces, push through transformations) a vector or a matrix over other matrices (sieves).

I use the word lever in the design of experiments (DOE) sense, which is exactly the weight of a variable of a convolution and has been used in many texts regarding the matter.

Personal notes

I started with ComfyUI and WAN in the last week of December 2025; this guide was published 3 months later (end of March 2026), after about 2000 videos generated on local hardware with the implied limitations.
I've lots to learn, but I haven't found, even with the help of AI, a comprehensive way to understand what this video generation stuff is about; that is why I wrote this guide. Any corrections will be appreciated.
The reason why the denoised stuff I publish does not contain the "recipe" is because I've found that reverse engineering others' workflows leads to:
- hundreds of hours lost on dead ends
- installation of lots of missing components which end up breaking your comfyUI installation
- lots of specific user-corrections without any practical sense but to correct mistakes on the workflow
- workflow monsters which achieve 96% of what templates achieve with small modifications