This guide covers the basic aspects of video generation based on knowledge (not copy-pasted AI text bullshit), with the intention of helping you understand what you are doing when you use WAN2.2, SVIPro and Lightx2v all together.
This is not easy: it is what it is, even when simplified to this level.
Put this guide side by side with the WAN template or the SVIPro template, read it and look at the nodes.
If you are wondering how to make the long videos, jump to SVI.
WAN2.2 ComfyUI template is enough
Look at the default WAN video generation template from ComfyUI and you'll see all the components I mention except for SVIPro (which can be downloaded from its repository).
As I go through the components you'll understand the connections. I hope so.
Generating a segment of video is like solving a set of equations, a matrix multiplication problem, where a state (the LATENT) is represented in a multidimensional space. People like to call these tensors because it sounds cool. A tensor is just a state representation in a matrix, no matter how many dimensions the matrix has. A state is the set of values, and their relations, that describes a model at a given moment.
Latent
A latent is the video expressed in the AI's natural language or parameters (tensor space).
The fun part of WAN2.2 video latents is that all the frames of the video are solved at once: the entire latent is processed at once, since the entire video represents one state*.
The solving process consists of converting a latent composed of NOISE into something you recognize. This process is called DENOISE.

The total operations in one step are all the matrix multiplications your video, as a matrix (tensor), must go through. Just a thought: matrices are made of numbers, and in computers numbers have a limited number of variations (bits). Stacking loras stacks values into one multiplication. So when the video is multiplied by the stacked loras, it may end up multiplied by a number that reaches the limit of your precision (bits): that's when videos generate rain, noise, etc. Those are points whose value has "overflown": a value bigger (or smaller) than the biggest (or smallest) possible for your precision.
Denoise
Denoise goes from 1 (total noise) to 0 (which is considered denoised). Denoise is the process where something meaningful emerges from a totally random set of values (noise). The latent starts as noise, you put some initial information in it, and then it is processed under some guidance.
WAN2.2 denoises in two ranges, HIGH and LOW, with what are called two EXPERTS or denoisers.
The denoise process is done in STEPS. Each step is an "increment of variation" where you check whether your computation is going "ok". That's why you have HIGH steps and LOW steps.
HIGH noise (and their steps) goes from 1 to 0.9
LOW noise (and their steps) goes from 0.9 to 0
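As a minimal sketch of the two ranges above: a step can be handed to the HIGH or the LOW expert by looking at the sigma where the step starts. The function name, the example values and the rule for a step starting exactly on 0.9 (it goes to LOW) are my assumptions for illustration, not code taken from WAN2.2 or ComfyUI.

```python
BOUNDARY = 0.9  # boundary value stated in this guide for WAN2.2

def expert_for_step(sigma_start, boundary=BOUNDARY):
    """Return which expert handles a step that starts at sigma_start.

    Assumption: a step starting above the boundary is HIGH, anything
    starting at or below the boundary is LOW.
    """
    if not 0.0 <= sigma_start <= 1.0:
        raise ValueError("sigma must be between 0 and 1")
    return "HIGH" if sigma_start > boundary else "LOW"

# Illustrative values: 4 steps, 2 on each side of the boundary.
sigmas = [1.0, 0.95, 0.9, 0.5, 0.0]
plan = [expert_for_step(s) for s in sigmas[:-1]]  # the last value starts no step
print(plan)
```

The last number in the list is only a destination, so it never starts a step of its own.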
Shift Value
This is very important, since SHIFT is used to determine where the high-to-low transition happens. The shift value tells the computing process when to pass the computation from the high expert to the low expert.
SHIFT is a bad parameter because WAN2.2 models are trained to transition at that very specific value of 0.9 (which is called the BOUNDARY). A bad shift value may move the process from high to low too early or too late, resulting in what we see as garbage.
Let's see why by understanding the scheduler.
Scheduler
The SCHEDULER is in charge of calculating the values called SIGMAS, which are the places where you check if stuff is "ok". Since there are many ways to go from 1 to 0, there are many schedulers. Each scheduler follows a different pre-defined route.
Yes: sigmas are just a set of ordered numbers from 1 to 0 which must contain 0.9 in between. Each step goes from one sigma to the next, so the set holds one more number than the number of steps.
So: you have noise (noise = 1) and you use a scheduler to determine the step values (SIGMAS), which will gradually solve some equations (DENOISE --> noise = 0) to reach a denoised latent.
Thus, if you know the SCHEDULER, the STEPS and the BOUNDARY, why do you need a shift? You mostly don't, if you can pre-calculate the sigmas.
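To see why a guessed shift is fragile, here is a minimal sketch. It assumes evenly spaced sigmas as a stand-in for a real scheduler, and it assumes the shift remaps each sigma as shift * s / (1 + (shift - 1) * s), a form commonly used with flow-matching models. Check your own nodes before trusting the exact numbers. The sketch counts how many steps start above the 0.9 boundary for a few shift values.

```python
def shifted_sigmas(steps, shift):
    """Evenly spaced sigmas from 1 to 0 (steps + 1 values), remapped by shift.

    Assumption: sigma' = shift * s / (1 + (shift - 1) * s).
    """
    base = [1.0 - i / steps for i in range(steps + 1)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in base]

def count_high_steps(steps, shift, boundary=0.9):
    """Number of steps whose starting sigma lies above the boundary."""
    sigmas = shifted_sigmas(steps, shift)
    return sum(1 for s in sigmas[:-1] if s > boundary)

for shift in (1.0, 5.0, 8.0):
    print(shift, count_high_steps(8, shift))
```

With 8 steps, under these assumptions, shifts of 1, 5 and 8 leave 1, 3 and 4 steps on the high side. The same step count lands on a different high/low split every time the shift changes, which is the guessing game that pre-calculated sigmas avoid.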
My advice: use WanMoScheduler
This node can pre-calculate the sigmas for many schedulers while forcing a specific boundary (which you need), if you tell it how many low and high steps you want.
Using this node and working with sigmas instead of shift avoids the "shift" trial-and-error you have probably experienced. Sigmas are calculated as a vector and you can assign them as you please. Shift must still be routed into your usual workflow, but it is calculated automatically by the mentioned node, no need to guess. Sigmas before 0.9 are HIGH, sigmas after 0.9 are LOW. 0.9 belongs to both sides.
No more shift problems using this method.

Example of 6 high and 6 low steps, with the high steps divided afterwards into 2+4 using the "splitsigmas" node at step 2. Notice how 0.9 is shared between high and low. The curve you see is determined by the "simple" scheduler; other schedulers will give other curves. The node also calculates a "shift" value specific to the combination: the CORRECT value that must be fed into the "model" workflow. Both shift and sigmas must be used together.
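The 6+6 example with a 2+4 split can be sketched in a few lines. This is not the algorithm of the node: `build_sigmas` uses plain linear ramps on each side of the boundary (my assumption, only to get a vector that contains 0.9), and `split_sigmas` simply keeps the value at the cut on both sides, which is the shared 0.9 described above.

```python
def ramp(start, end, steps):
    """steps + 1 evenly spaced values from start to end (end is exact)."""
    return [start + (end - start) * i / steps for i in range(steps)] + [end]

def build_sigmas(high_steps, low_steps, boundary=0.9):
    """Sigma vector that is forced to contain the boundary.

    Assumption: plain linear ramps on each side, only for illustration.
    """
    high = ramp(1.0, boundary, high_steps)
    low = ramp(boundary, 0.0, low_steps)
    return high + low[1:]  # do not repeat the boundary inside one vector

def split_sigmas(sigmas, step):
    """Split after `step` steps; the value at the cut is kept on both sides."""
    return sigmas[:step + 1], sigmas[step:]

sigmas = build_sigmas(6, 6)              # 12 steps -> 13 values
high, low = split_sigmas(sigmas, 6)      # cut at the boundary
high_a, high_b = split_sigmas(high, 2)   # 2 + 4 high steps
print(len(sigmas), high[-1], low[0], len(high_a) - 1, len(high_b) - 1)
```

Each piece has one more value than steps, because every sampler needs both the sigma it starts from and the sigma it ends on.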
Sampler, Guider, Model (and Loras)
For this denoise process you need three more basic ingredients:
First, the calculating machine, the processor or way of denoising: that's called the SAMPLER. The sampler calculates the jumps of values in the latent from sigma to sigma, from 1 to 0.
Second, the meaning or the guide for the denoise, the excuse, something to check whether that noise is becoming something specific: the GUIDER or CONDITIONER.
Third, the model and the complementary model information (loras/checkpoints), which is the basic sieve, or set of gradual nets, that the noise is computed or pushed through.
You may have heard of samplers by name: Euler, Euler A, etc. Those are just ways to evolve the noise.
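To make "jumps from sigma to sigma" concrete, here is a toy Euler sampler. It is a sketch under strong assumptions: the latent is a short list of numbers, and the "model" is a stand-in that already knows the clean answer and returns the direction noise minus clean (a flow-matching style formulation, which I am assuming here). A real model has to predict that direction from the prompt and its weights.

```python
def euler_sample(noise, velocity_fn, sigmas):
    """Walk the latent from sigmas[0] to sigmas[-1] with plain Euler steps."""
    x = list(noise)
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)      # the "model" call
        dt = sigma_next - sigma        # negative: we move towards 0
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy data: a 3-number "latent".
clean = [0.2, -0.5, 0.8]
noise = [1.3, 0.4, -0.7]

def toy_velocity(x, sigma):
    # Assumption: x_sigma = (1 - sigma) * clean + sigma * noise,
    # so the change per unit of sigma is noise - clean.
    return [n - c for n, c in zip(noise, clean)]

sigmas = [1.0, 0.95, 0.9, 0.6, 0.3, 0.0]
result = euler_sample(noise, toy_velocity, sigmas)
print(result)
```

Because the toy direction never changes, any set of sigmas ends on the clean values. With a real model the direction is re-estimated at every sigma, which is why the number and placement of the sigmas matter.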
The guider is where the prompt resides. It can take many forms, applied simultaneously.
The model in this case is WAN2.2, which is only a set of levers with different weights that the noise is pushed through (like a sieve), conditioned by the GUIDE and by the extra modifying levers and weights (loras and checkpoints), following a path (scheduler) with a specific direction (sampler) for each variable.
Billions of levers with very specific positions or weights.
For the sake of simplicity, let's reduce the guidance to only two elements (you have probably used or heard about more types of guidance):
- The prompt
- The SVI anchor_samples (use the smallest SVIPro template from their repository)
Controlnet, masks, etc. are other ways to guide the denoise process. You have probably seen them applied in complex workflows where you force a motion or change only part of an image or video.
Every model has its prompting scheme, a way to translate human language to the model's language: the TOKENIZER. "One token one concept".
You can modify, enrich, constrain or push that translation by means of model modifiers, which are called (among others) checkpoints or loras. They are patches to the sieve or net with specific meanings, or patches that change the very model weights or the relation between the model and the tokens.
That is why some loras require specific words (which trigger certain tokens with modified weights in the overall sum of model plus modifications).
Every model needs to translate our concepts to the mathematical levers of that model: that is the tokenizer or the CLIP.
Every model also has a way to translate "images" into the model's natural language: the VAE encoder.
And a way to get back from the model's language to our image interpretation: the VAE decoder.
There are lots of other components not covered in this guide.
If you add modifiers to the model (loras/checkpoints), the same prompt reacts differently because the levers and the sieve have changed: the set of tensors that the noise is pushed through is different.
CFG
CFG is the way you tell the computing process how hard it must force adherence between the tokenized prompt and the resulting noise of a STEP: the "ok".
WAN2.2 has a positive and a negative prompt: dos and do nots.
As a mathematical process, and this is very important: negative prompts are only used if CFG is greater than 1. Which implies another thing: if it is greater than 1 you must compute both the positive and the negative prompt, so that's "twice" the computing time.
Don't waste your time filling the negative prompt with complex "do nots" if your CFG is 1, because they are mathematically ignored: 0.
But hey, there is a trick: CFG is STEP-wise! It means you can force the adherence differently at each step.
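A small sketch of why the negative prompt drops out at CFG = 1 and why higher CFG costs double. It assumes the usual classifier-free guidance form, negative + cfg * (positive - negative). The function names and the numbers are made up for illustration, and the predictions are short lists instead of real latents.

```python
def apply_cfg(cond, uncond, cfg):
    """Combine the positive (cond) and negative (uncond) predictions.

    Assumed form: uncond + cfg * (cond - uncond).
    At cfg == 1 the negative prediction cancels out, so it is skipped.
    """
    if cfg == 1.0:
        return list(cond)
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]

def model_calls(cfg_per_step):
    """Model evaluations needed: 2 per step when cfg is not 1, else 1."""
    return sum(1 if cfg == 1.0 else 2 for cfg in cfg_per_step)

cond, uncond = [1.0, 2.0], [0.5, 4.0]
same = apply_cfg(cond, uncond, 1.0)      # negative prompt has no effect
pushed = apply_cfg(cond, uncond, 3.0)    # pushed away from the negative

# Step-wise CFG: strong guidance on the first 2 steps only.
step_wise = [3.0, 3.0, 1.0, 1.0, 1.0, 1.0]
calls = model_calls(step_wise)
print(same, pushed, calls)
```

Six steps at CFG 1 cost six model calls; raising CFG on only the first two steps costs eight instead of twelve.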
Tokens, Adherence and Special LoRAs
Token weights alone are not enough to maintain some structures over the denoise process, a face for example, at least not enough to fool us into thinking it's the "same" face.
That is why there are "character" loras, which induce or force weights on the model. The model then denoises in specific ways when certain tokens are present, which helps fool us into thinking it's the same car, person or prop.
These modifiers must be created (trained): training is the process of distilling from the model which tokens, and in which way (weights), make something what it is.
So you need something to compare: samples. And something to relate it to: prompts. Training brings to the surface the relation between the samples (images or videos in this case), some associated prompts (our meaning) and the model's tokens (the model's meanings).
Special Meaningful Loras
So, in this complex world of loras, nobody said these additional components were limited to representing a character or the characteristics of a car brand.
LIGHTx2v
There is a very powerful kind of lora trained to accelerate motion in terms of our reality and understanding, which helps reduce the steps needed to get what we perceive as motion: the lightx2v loras.
LIGHTx2v loras encode an understanding of motion as our senses perceive it (well, actually as videos record it), so they can skip some motion calculation, or STEPS.
That is why you get "similar" movement with fewer steps with lightx2v. But that acceleration comes at a cost: the motion in between and the detail preservation.
LIGHTx2v works with a CFG of 1 and 1 alone. Away from that, the results "differ" from what it was meant for.
1030: https://huggingface.co/Kijai/WanVideo_comfy/tree/main/LoRAs/Wan22_Lightx2v
480p: https://huggingface.co/Kijai/WanVideo_comfy/tree/main/Lightx2v
(Remember these are I2V, and you can use any rank from 2 to 128. What rank is falls outside the scope of this guide; use r64 or r128.)
SVI
The SVI folks have found a way to have a "generic" patch which, given a VAE-encoded image (the anchor samples), forces some characteristics (levers and weights) without training a lora. That's cool.
But think about what it is: it tells the denoise process "don't touch that too much, or go back to that value" instead of "this should be this way" (a specific lora).
So we have a Model with modifiers: loras/checkpoints. The model is the basic sieve which changes with the loras for the same prompted concept.
We work here with two contradicting loras:
SVI: which wants to keep some lever values in place (the anchor). Easy to "use" because its weight is always 1.
lightx2v: which wants to accelerate the motion beyond what the model was designed for. Difficult to use because its weight depends on itself and on the other loras. Not to mention there are lots of lightx2v loras.
Loras & Templates: https://huggingface.co/Kijai/WanVideo_comfy/tree/main/LoRAs/Stable-Video-Infinity/v2.0
Specific workflows with SVI Pro:
https://github.com/vita-epfl/Stable-Video-Infinity/tree/svi_wan22/comfyui_workflow
(Warning: these workflows are intermediate stuff. You'll need to download many extensions and they will barely do what you intend.)
CFG
CFG: what lets me force what I want (the prompt), but which I cannot use because I'll lose the lightx2v effect.
WOFF, that explains so many "workflows" full of boxes you don't understand, sets of nodes you don't have and so many lines you want to alt+f4. Recipes you use without understanding what's going on.
But there is more.
Other
Your videos have a length: [frames] and size: [width] and [height].
As explained before, in the latent space THE WHOLE VIDEO is one big set of numbers (tensor). Frames are not "separated" from time (more or less*), so the whole space [width and height] and time [length] must be computed at once.
There are techniques to fragment the process, but it's important to understand that the video itself is a single, whole element.
How loras behave changes with those parameters as well.
Longer videos tend to degrade in image quality because lightx2v overdoes it, the number of steps is low (or not high enough) and the computing numbers aren't precise enough.
Bigger videos (in frame size) require different lightx2v weights to get the same motion.
In latent space you have 1 latent frame for every 4 of our regular video frames. Also, in I2V you need an initial video image, which uses 1 whole latent frame. That's the reason why WAN videos are 4*n+1 frames long.
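The 4*n+1 rule is easy to check with a couple of helper functions. The names are mine, for illustration only: one converts a valid video length into latent frames, the other rounds a wished length down to the nearest valid one.

```python
def latent_frames(video_frames):
    """Latent frames for a video of 4*n + 1 frames: 1 for the first image + n."""
    if video_frames < 1 or (video_frames - 1) % 4 != 0:
        raise ValueError("frame count must be 4*n + 1")
    return (video_frames - 1) // 4 + 1

def nearest_valid_length(wanted_frames):
    """Largest 4*n + 1 length that does not exceed wanted_frames."""
    return max(1, (wanted_frames - 1) // 4 * 4 + 1)

print(latent_frames(81))          # an 81-frame video
print(nearest_valid_length(100))  # 100 frames is not a valid length
```

An 81-frame video is 1 + 20 = 21 latent frames, and a request for 100 frames has to be rounded to 97 (or pushed up to 101).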
So
The bigger the model, the more parameters. The higher the precision of the model (longer numbers per concept), the more precise the result, but bigger numbers == bigger memory.
The more precise the numbers, the longer the video will hold up without degeneration. The less "acceleration" in the process, the longer the video will hold up without degeneration.
The more steps, the less degeneration (smaller increments of "variation" and more "checks") but the longer the computing time. Even more so if CFG is bigger than 1, which requires checking both the positive and the negative prompt (twice the time).
That is why, with the same VRAM, you can generate either longer smaller videos or shorter bigger videos. Or fit a bigger model with smaller videos, or a smaller model with bigger and longer videos.
That is why nGreedia is restricting VRAM on consumer GPUs. Any consumer card can compute the highest video quality with the highest quality models available; it's just a matter of linear computing time if the VRAM is properly sized. If you have to go back and forth to your regular RAM, the time is multiplied by orders of magnitude, making the computing burden impractical**
Practical Recommendations (BASIC)
For Wan 2.2 use lightx2v 1030 on HIGH and 480p rank64 on LOW, with weights 1.35 and 1.10, in 1280x704 videos.
Motion is different at every resolution, but those settings work for 832x480 testing as well (the motion and detail are different). These light loras give much better results in detail and in long videos than the wan21 light loras, but may give worse motion with wan21-trained loras. The gain in detail outweighs the loss in motion. Lowering the LOW weight from 1.10 to 1.00 increases prompt following, as well as the jitter and craziness in movements.
For wan 2.2 and the aforesaid lightx2v loras, don't prompt in the Subject, Scene, Action, Camera style. It doesn't work for most of its variants. It's better to use a succession of actions with the details within.
Under these circumstances for better quality use at least 4H+4L steps.
Under these circumstances for better motion use sgm_uniform instead of simple scheduler.
Download only the nodes required for the ComfyUI template and the SVIPro template.
Download the node I recommended (it is just one node).
Append default WAN templates for longer videos. It's easy and it works:

The WAN2.2 I2V box is a default WAN template. We extract the last frame and use it as the starting frame for the next prompt. You have to solve the shift problem as I explained. Each segment can be of a different length.
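The chaining itself is just a loop. In this sketch `fake_i2v` is a hypothetical stand-in for one run of the I2V template (it returns frame labels instead of images), and dropping the repeated first frame of every later segment is my assumption about how you would join the clips.

```python
def fake_i2v(start_frame, prompt, length):
    """Stand-in for one WAN2.2 I2V template run (hypothetical).

    Returns `length` frame labels; the first one is the starting image.
    """
    return [start_frame] + [f"{prompt}#{i}" for i in range(1, length)]

def chain_segments(first_image, segments):
    """Run the segments in order, feeding each last frame to the next one."""
    video = []
    start = first_image
    for prompt, length in segments:
        frames = fake_i2v(start, prompt, length)
        # Assumption: the first frame of a later segment repeats the
        # previous last frame, so it is skipped when joining.
        video.extend(frames if not video else frames[1:])
        start = frames[-1]
    return video

video = chain_segments("photo", [("walk", 5), ("sit", 9)])
print(len(video), video[4], video[5])
```

Each segment has its own prompt and length, and nothing stops you from giving each one its own loras and sigmas too.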
What's inside each box?

Each WAN 2.2 I2V box has no secret: it's the default WAN workflow. You should add here the shift system I mentioned and the specific loras for each segment. You can append about 3 videos without any problem, and the number of frames is up to your RAM.
Practical Recommendations (INTERMEDIATE)
For Wan 2.2 add an additional HIGH sampler and split the sigmas. All samplers will have the same model and loras, except for the first one, which won't have the lightx2v lora.
So you have 3 samplers: HIGH_NOLIGHT, HIGH, and LOW.
HIGH_NOLIGHT with 1 or 2 steps where, without lightx2v, you're free to increase the CFG.
HIGH, which is your regular HIGH sampler with the remaining high steps. CFG = 1.
LOW, which is your regular LOW sampler. CFG = 1
Then you can use a 6+6 scheme where: 2 are for HIGH without light, 4 for HIGH with light and 6 for low.

Three samplers: note how a new lora is added on the second sampler, so that it has all the loras plus Lightx2v, which leaves room on the first sampler to increase CFG at will. Beware of high CFG. Sigma Split splits the high sigmas between the two samplers. The non-lightx2v sampler must go first.
You're free to split low as well and increase CFG as well.
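The 6+6 scheme with three samplers can be written down as a plain plan. The sigma values below are hand-written for illustration (only the 0.9 at position 6 matters), 2.5 is an arbitrary example CFG, and `split_at` is a hypothetical helper that shares each cut value between neighbours.

```python
def split_at(sigmas, cuts):
    """Split a sigma vector at the given step indexes, sharing each cut value."""
    chunks, start = [], 0
    for cut in list(cuts) + [len(sigmas) - 1]:
        chunks.append(sigmas[start:cut + 1])
        start = cut
    return chunks

# Illustrative sigmas for 6 high + 6 low steps (0.9 sits at index 6).
sigmas = [1.0, 0.985, 0.97, 0.955, 0.94, 0.92, 0.9,
          0.75, 0.6, 0.45, 0.3, 0.15, 0.0]

names = ["HIGH_NOLIGHT", "HIGH", "LOW"]
cfgs = [2.5, 1.0, 1.0]           # 2.5 is an arbitrary example value
uses_light = [False, True, True]

plan = [
    {"sampler": n, "cfg": c, "lightx2v": light, "sigmas": s, "steps": len(s) - 1}
    for n, c, light, s in zip(names, cfgs, uses_light, split_at(sigmas, [2, 6]))
]
for p in plan:
    print(p["sampler"], p["steps"], p["sigmas"][0], "->", p["sigmas"][-1])
```

Each sampler starts on the sigma where the previous one stopped, so the three pieces cover 1 to 0 without gaps.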
This configuration lets you play with the 1217 lightx2v loras, which are even harder to control than the 1030 ones.
Use different weights and sets of loras for each video segment on SVIPro, but do not deactivate a lora unless the item/action prompted by that lora is no longer required, otherwise it will vanish.
Remember you have to have the sigmas pre-calculated and split accordingly for each sampler. There is a great node I recommend for that: WanMoScheduler
You can use the VisualizeSigmasKJ node and Preview Image to see how the sigmas are split.
You can use SplitSigmas and its step input to split your sigmas across as many samplers as you need.
Practical Recommendations (SVI/SVI Pro)
I assume you have read the SVI section of this guide. SVI has two features:
1) It uses the initial frame of WAN as an anchor point: "if you don't know, look at that frame", let's say. That's what creates the character consistency. WAN 2.5 and later are character consistent with more than one character, by the way. Let's limit the discussion to WAN 2.2.
2) It can carry motion vectors over to the next clip. That is why SVI Pro videos are seamless. This is done with the motion latent count, counted in latent frames (1 lframe = 4 frames). "It can" means it also can not: 0 motion latent frames gives a complete, abrupt motion cut at the seam, a scene cut for example. 1 gives high motion continuity and is the regular value. 2 carries over more information but is also more rigid; it is also good for valleys of motion (the "stop point" between going forward and going backward).
To do so you overlap lframes, which gives the motion continuity (and also carries noise, errors and slop over to the next clip!). Look at the figure. You select how much motion to bring to the next clip through the number of overlapping lframes: 0, 1, 2, etc. In red is an example of a 1 lframe overlap; in green, an example of a 2 lframe overlap, with the final result. When overlapping you lose the lframes you overlapped, which are the motion latent frames. The motion latent count and the overlap frames must be coherent.

Here is an example of how to control it dynamically:

The iif() function checks the index of the segment (a); you can nest iif calls. If the index (a) is the desired value, apply a given motion latent count (1), else apply 1 as well: (a==1, apply 1, else 1). This is linked to the overlap frames, which must be 1 (the anchor) plus the motion latent frames: 4 * motion latent count.
Another example: iif(a==3, 2, iif(a==5, 2, 1)) --> if the segment is 3, motion = 2; else if the segment is 5, motion = 2; else motion = 1.
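The nested expression reads the same way in Python. This sketch mirrors the second example and adds the overlap rule stated above (1 anchor frame plus 4 frames per motion latent frame). The function names are mine, and `iif` here is a plain Python helper, not the node's implementation.

```python
def iif(condition, if_true, if_false):
    """Inline if: same idea as the iif() used in the expression above."""
    return if_true if condition else if_false

def motion_latent_count(a):
    """Segments 3 and 5 use 2 motion latent frames, every other segment uses 1."""
    return iif(a == 3, 2, iif(a == 5, 2, 1))

def overlap_frames(count):
    """1 anchor frame plus 4 video frames per motion latent frame."""
    return 1 + 4 * count

for a in range(1, 7):
    count = motion_latent_count(a)
    print(a, count, overlap_frames(count))
```

Segments 3 and 5 get a motion latent count of 2 and an overlap of 9 frames; all the others get 1 and 5.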
*If you dig deeper, it is true that the latent can be processed (and in some workflows it is) in chunks. But that doesn't change the fact that the state is formed equally in the spatial and the time dimensions of the video.
About Similes
I don't usually like similes, I hate them actually, but I used the sieve (in this case) for tensor operations. Anyone who has solved an Euler integration, which consists of multiplying matrices (or tensors) that are usually built from the sum of transformations (sums and multiplications) of other matrices, knows that a sieve/sieving is a good enough mathematical approximation of what's going on. Another would be: making fresh pasta.
You push (convert, go through spaces, subject to transformations) a vector or a matrix through other matrices (sieves).
I use the word lever in the design of experiments (DOE) sense, which is exactly the weight of a variable of a convolution and has been used in many texts on the matter.
Personal notes
I started with ComfyUI and WAN in the last week of December 2025. This guide was published 3 months later (end of March 2026), after about 2000 videos generated on local hardware with the implied limitations.
I have lots to learn, but I haven't found, even with the help of AI, a comprehensive way to understand what this video generation stuff is about. That is why I wrote this guide. Any corrections will be appreciated.
The reason why the denoised stuff I publish does not contain the "recipe" is that I've found reverse engineering others' workflows leads to:
- hundreds of hours lost on dead ends
- installation of lots of missing components which end up breaking your ComfyUI installation
- lots of specific user corrections with no practical sense other than correcting mistakes in the workflow
- workflow monsters which achieve 96% of what the templates achieve with small modifications

