Darksidewalker's WAN 2.2 14B I2V - Usage guide - Definitive Edition
This guide will provide you with:
👄 Explanation of wordings and basics
🧠 About WAN 2.2 checkpoints
🤖 WAN 2.2 requirements
🫴 Guidance on how to use it
✋ WAN 2.2 limitations / known issues
🔍 Explanation about frames and times
👣 Steps and motion
👀 Frame Interpolation
🎏 Up-casting
📝 Basic examples on how to write prompts
☄️ Speed-Up tricks
🧺 Caching (memory cache implementations)
🔝 GPU considerations/recommendations
This is based on information and SOTA implementations as of 10/2025.
TL;DR: You just missed the point of a guide! 🤣
So let's gooo! ~
Introduction
I'll focus on the basics. They are the same regardless of the software you use. A fair amount of these basics and settings will work for t2v too; t2v is just simpler to use.
👄Explanation of wordings and basics
Checkpoint
refers to the file containing the AI model's saved weights, and the type of model provided
Model
is the checkpoint type and generation currently in use to run your inference
WAN
Is the model family released by Wan-AI for generating videos from text or images
LoRA
Think of it like a micro checkpoint/model trained for a desired outcome
LoRAs typically weigh in between roughly 100 MB and 1 GB, much less than a checkpoint
They are not functional without the base model they were trained on (a WAN-type checkpoint)
They can be added freely to the main model as adapters
Try not to mix LoRAs made for one base checkpoint into another (e.g. WAN 2.1 LoRAs into WAN 2.2 models) unless you know precisely what you are doing -> it can be done, but may produce unstable or corrupted results
File types of checkpoints (.safetensor / .gguf / .pth)
.safetensor is the most common and efficient format, the de facto standard for AI-checkpoints
.gguf is a format designed for heavily quantised models with built-in metadata; imagine it as a container where everything needed to run the model can be packed in
.pth is an old, pickle-based format with high security risks (it can execute arbitrary code on load); avoid using .pth files
Quantisation
Describes how many bits per weight the model has to do its thing
More bits = better, smarter, more accurate - more RAM/VRAM needed
Lower bits = higher compression, losing precision and understanding - less RAM/VRAM needed
Descending for safetensor: FP32, BF16, FP16, FP8, NF4 ...
Descending for gguf: Q8, Q6, Q4, Q3 ...
As a rule of thumb for consumer-grade GPUs, a good compromise between precision and compression is (a rough size estimate follows below):
24/32GB VRAM + 64/128 GB RAM: BF16/FP16 or FP8, Q8
16 GB VRAM + 32/64GB RAM: FP8 or Q8
<16 GB VRAM: you have to sacrifice a good amount of precision and use whatever does not OOM - most likely GGUF Q6 or lower - or use large amounts of swap/pagefile and sacrifice a lot of speed
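To get a feeling for these numbers, here is a rough back-of-the-envelope sketch in plain Python. It only counts the weights of a single 14B checkpoint and ignores the text encoder, VAE, activations and framework overhead, so treat the results as a lower bound, not a guarantee against OOM.

```python
# Rough weight-size estimate for one 14B-parameter WAN checkpoint at common quantizations.
# Ignores the CLIP/text encoder, VAE, activations and framework overhead (lower bound only).
PARAMS = 14e9

bytes_per_param = {
    "FP32": 4.0,
    "BF16/FP16": 2.0,
    "FP8 (e4m3fn/e5m2)": 1.0,
    "GGUF Q8": 1.0,    # ~8 bits per weight (plus a bit of metadata)
    "GGUF Q6": 0.75,   # ~6 bits per weight
    "GGUF Q4": 0.5,    # ~4 bits per weight
}

for name, bpp in bytes_per_param.items():
    print(f"{name:<20} ~{PARAMS * bpp / 1024**3:5.1f} GB of weights")
```

Remember that WAN 2.2 needs both a HIGH and a LOW checkpoint; most backends swap them in and out rather than keeping both in VRAM at once.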
Weight dtype e4m3fn / e5m2
e4m3fn: more precise, less dynamic range
Better for Nvidia 4000 series or newer (CUDA)
Usable with "Zluda" for AMD
e5m2: more dynamic range, less precise
More compatible with older hardware in terms of torch.compile
Most compatible with AMD
Basically it trades 1 bit either for precision or for dynamic range.
It matters if you use many steps or want to use a torch.compile mechanism.
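If you run a recent PyTorch (2.1 or newer exposes the float8 dtypes), you can see the trade-off yourself; a small sketch:

```python
import torch  # float8 dtypes require a fairly recent PyTorch (>= 2.1)

# e4m3fn spends more bits on the mantissa (precision, max value ~448),
# e5m2 spends more bits on the exponent (dynamic range, max value ~57344).
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, eps={info.eps}")
```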
I2V /i2v
Image To Video: the model generates videos from a provided initial image plus text as guidance; text alone will not work
The "Init Image Creativity" must be set to 0 (zero)
T2V /t2v
Text To Video: the model generates videos from text alone, without an initial image
FLF, FLF2V, End Frame, Last Frame
These are synonyms for i2v techniques where you provide an initial image and an end image
Both images are used so the video transitions from the initial image to the end image, giving you more control over the desired outcome
Sampler + scheduler
The sampler is the algorithm used to solve your problem, e.g. to make the video from your input (t2v/i2v)
It will have a heavy impact on how the result turns out and on how much noise/variance is used
The sampler follows a scheduler as a given rule; this also has an impact on your outcome
They are always used as pairs
Steps
The amount of steps the sampler+scheduler will use to generate the outcome
More steps mean more time; as a rule of thumb each step takes about the same amount of time, so 4 steps take half the time of 8, and so on.
CFG
The amount of guidance the sampler+scheduler will use with the provided input (i2v/t2v) to produce the desired outcome.
More guidance means the output is forced to follow your input more strongly
More CFG also means more computing time is needed
Too much CFG will burn your output through too much pressure (imagine forcing a too-big object into a tiny space 😥)
CFG of 1 disables the negative prompting
Noise
Is the amount of randomness and therefore motion, action, added things, variation
OOM / oom
Out Of Memory: the point where your VRAM and RAM together cannot fit all the data and run out of space. This will abort your generation.
Speed-Up's - Lightspeed, Lightning, Litespeed, Distilled and others
These are all techniques to drastically reduce the steps needed to produce a video
They sacrifice some precision and quality, and sometimes motion speed
They will cut the time needed to produce the video and, at some point, reduce RAM/VRAM requirements
Use them if you can live with the quality loss
VAE
A VAE file contains the saved weights of the component that encodes images into the latent space the model works in and decodes the generated latents back into pixels. WAN 2.2 14B can only be used with the WAN 2.1 VAE file. Strange, but the 2.2 VAE is made for the 5B model, which I will not take into account here.
CLIP
Basically a CLIP file is for encoding the text and/or vision, so the checkpoint can understand what your prompt or input is.
Seed
This number is the unique number that is used to generate your noise. Normally the same seed and settings will always produce the same output.
But if you alter any setting (resolution, aspect, LoRA strength, ...) you will get a different output even with the same seed.
Frames per second (FPS)
This is how many single images/pictures are going into 1 second of time.
So 16 FPS means you get 16 chained images in 1 second.
Here is the math: 80 frames at 16 FPS = 5 s
Frame Interpolation
This is a technique that adds frames in between the actually generated frames for a higher FPS (smoother playback). It does not add detail to your video in any way; it just inserts frames between two reference frames, with minimal changes in each added frame.
Typically doubling your FPS, e.g. from 16 -> 32.
You can decide how many frames to interpolate between each frame in the video.
Higher values are smoother, but may take significant time to save the output and may introduce quality artifacts. They also add to the file size of the video.
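The frame math for interpolation is simple; a small sketch (using the multiplier convention of 1 = off, 2 = double frame rate that SwarmUI uses, see the Frame Interpolation section further down - other tools may count differently):

```python
def interpolated_frames(frames: int, multiplier: int) -> int:
    """New total frame count: extra frames are inserted between each existing pair."""
    return (frames - 1) * multiplier + 1

# 81 generated frames (5 s at 16 FPS), interpolated 2x and played back at 32 FPS:
print(interpolated_frames(81, 2))        # 161 frames
print(interpolated_frames(81, 2) / 32)   # ~5.0 seconds - same length, just smoother
```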
🧠 About WAN 2.2 checkpoints
WAN 2.2 is a MoE (Mixture of Experts) model.
That means there are multiple experts (models/checkpoints) that are used together in 2 stages to create your video.
The first, HIGH checkpoint works on the high-noise part of the process; it provides the overall motion but cannot define details.
The second, LOW checkpoint works on the low-noise part and adds the details to what the HIGH checkpoint generated.
They are used one after the other, normally split 50%/50%: in the first half the HIGH checkpoint generates a noisy, dynamic, high-motion video, and after that the LOW checkpoint refines it to add detail, sharpness and fine motion.
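If your backend expresses the split as start/end step ranges (as ComfyUI's advanced sampler nodes do), the handoff can be pictured with this minimal sketch; the function name and the 50/50 default are just illustrations, not any backend's actual API:

```python
def split_steps(total_steps: int, high_fraction: float = 0.5):
    """Split one sampling run between the HIGH-noise and LOW-noise experts."""
    boundary = round(total_steps * high_fraction)
    high_range = (0, boundary)            # HIGH expert: lays down motion on heavy noise
    low_range = (boundary, total_steps)   # LOW expert: refines details on the rest
    return high_range, low_range

print(split_steps(8))          # ((0, 4), (4, 8)) - the standard 50/50 split
print(split_steps(6, 4 / 6))   # ((0, 4), (4, 6)) - motion-biased split (see "Steps and motion")
```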
⚠️ WAN 2.2 supports resolutions from 480p to 720p; almost anything outside this range will introduce heavy artifacts, while anything inside it will produce comparably good results.
🎲 At the time of writing, WAN 2.2 is always a gamble: you will almost always need more than one try, and often many more, to get what you want.
⛔️ Do not expect:
A WAN checkpoint to produce the desired result on the first try
WAN 2.2 is very prompt sensitive, 1 word may alter the outcome completely
Basic WAN 2.2 does not know anything about sensitive/explicit content
Anything WAN 2.2 does not know but you prompt for will produce trash/flickering/anything else
WAN 2.2 is a major upgrade to WAN 2.1 in understanding and motion, but far from perfect.
🤖 WAN 2.2 requirements
Your favourite backend like SwarmUI or ComfyUI
Enough RAM + VRAM and space on your SSD
You need a HIGH and LOW checkpoint
They have to be the same file type and quantization to produce desired results
The chosen checkpoint should meet your system specs
You need to know the basic settings of the checkpoint you are using (sampler+scheduler, steps, cfg) and use them!
Use 50% steps on HIGH and 50% steps on LOW
(optional) LoRAs you want to use for a specific outcome
The prompt you want to use
An image if you want i2v
🫴 Guidance on how to use it (i2v)
If you use i2v you always need an initial image
"Init Image Creativity" must be set to 0 (zero)
Speed-Up's (LoRAs/build-in) need a CFG of 1
This will disable the negative prompt, so do not bother!
Set steps according to the model description, like 4 or 8 steps (speed-up's) or 20-30 steps (native)
Set the correct aspect ratio for the video corresponding to the initial image aspect, if the image is 9:16 the video should be too
Set a output video resolution equivalent to your system specs
Too high will OOM or drastically increase time to produce the video
Try multiple resolutions until you find a good balance between the time needed and your desired quality
The resolution must match the aspect ratio and the boundaries of the model (480p up to 720p)
Make sure your VAE is a WAN VAE 2.1 file
Load a compatible CLIP file
If you use an FP8 checkpoint you can use an FP8 CLIP
CLIP and checkpoint do not have to match, but scaled CLIPs help reduce RAM/VRAM consumption if you are limited
For a WAN 2.2 FP8 checkpoint like my "DaSiWa-WAN 2.2 I2V 14B Lightspeed", "umt5_xxl_fp8_e4m3fn_scaled" is a good choice
Add LoRAs for trained outputs
You can add any fitting LoRA that is compatible with WAN 2.2 and the checkpoint
Do not just try to mix i2v LoRAs with t2v checkpoints or vice versa
Do not try using LoRAs that are trained for other basic models like mixing WAN 2.2 5B LoRAs into the 14B models
Mind the LoRA creator's usage guidelines
Be creative with mixing and using the strength modifier
When using multiple LoRAs, it usually works better with a lower strength for each
Strength over 1.5 will most likely result in burned output
Not all combinations and checkpoints work the same with every LoRA, try multiple combinations
A LoRA will not magically produce perfect outcomes; you may have to try multiple seeds, or the same seed with different settings
Shift (Sigma Shift) - WAN 2.2 wants 5, but experiments from 3 to 8 are possible (a small sketch of the remapping follows at the end of this list)
This influences how much time the model spends on the high-noise part to refine it
A higher sigma shift leads to more detail with less variation
A lower sigma shift leads to broader detail with more variation
This indirectly influences motion (read "Steps and motion")
Options if you OOM
more RAM/VRAM
More heavily quantised checkpoints (from FP16 to FP8 / from Q8 to Q6)
Adding virtual RAM (swap, pagefile, zRAM)
Lowering any of these parameters: resolution, steps, frame count
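Back to the Shift setting above: WAN uses a flow-matching schedule, and the shift is commonly implemented as a remapping of the noise timetable roughly like the sketch below. This is an illustration of the idea, not your backend's exact code:

```python
def shift_sigma(sigma: float, shift: float = 5.0) -> float:
    """Common flow-shift remapping: a higher shift keeps the schedule in the
    high-noise region for longer before details get locked in."""
    return shift * sigma / (1 + (shift - 1) * sigma)

# A mid-schedule sigma of 0.5 gets pushed up more strongly with a higher shift:
for s in (3, 5, 8):
    print(f"shift={s}: {shift_sigma(0.5, s):.2f}")   # 0.75, 0.83, 0.89
```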
✋ WAN 2.2 limitations / known issues
WAN 2.2 is far from perfect, yet it is the most advanced open source video generation model.
Here are some known limitations of any WAN base checkpoint:
Heavily censored
MoE is difficult to use and set-up
Heavy on needed resources
Local usage is only practical with scaled/quantised checkpoints
Generations with the exact same settings but a different seed can differ wildly from each other
LoRAs give good guidance, but do not always deliver the desired results
Training is difficult
Detailed movement of fine things like eyes, fingers, lips and hair is grainy at lower resolutions
It has a strong bias towards adding movement to lips, eyes and fingers
You have to use good prompts for good results; it is not very forgiving, and every word can change the result completely.
🔍 Explanation about frames and times
WAN 2.2 is basically trained for 720p @24 FPS.
It works well at 16 FPS and lower resolutions down to 480p, but will lose some motion.
Almost all LoRAs work with 16 FPS and 81 frames - that is, a total of 5 seconds
So you want a similar output length of 5 seconds to get optimal results. Longer videos can be generated, but will most likely degrade in quality and prompt adherence.
WAN 2.2 wants a total frame count of (seconds × 16) + 1 or (seconds × 24) + 1, i.e. a multiple of the FPS plus one (see the small helper below the examples)
You want a full number of seconds worth of total frames:
Example:
81 (80+1) frames @16 FPS = 5 seconds
97 (96+1) frames @16 FPS = 6 seconds
121 (120+1) frames @24 FPS = 5 seconds
What you do not want to do is cut frames, like:
81 (80+1) frames @24 FPS ≈ 3.4 seconds
This will cut your motion or add artifacts
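A tiny helper for the frame-count rule, nothing more than the formula above:

```python
def total_frames(seconds: int, fps: int) -> int:
    """WAN-friendly frame count: a full number of seconds times the FPS, plus 1."""
    return seconds * fps + 1

print(total_frames(5, 16))   # 81
print(total_frames(6, 16))   # 97
print(total_frames(5, 24))   # 121
```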
👣 Steps and motion
Since the HIGH checkpoint is for motion and the LOW one for details, you can experiment by deviating from the standard 50/50 rule.
What does that mean:
You always want at least the minimum number of steps a checkpoint/model requires
You can change the steps at which point that swap from HIGH to LOW happens
More steps in HIGH means more motion, less detail
More steps in LOW means more detail, less motion
Example:
You have a 4-step checkpoint.
You could do 6 total steps, 4 in HIGH and 2 in LOW
This will add more motion while getting the needed 2 steps in LOW for the details, but will increase computing time and requirements.
LOW steps always slow things down a bit, but without them there are no fine details.
You could sacrifice LOW steps to get more motion: 3 steps HIGH, 1 step LOW.
You could sacrifice HIGH steps to get super fine details: 1 step HIGH, 3 steps LOW.
Extra information:
In high-motion examples you can see the lack of detail on hands, eyes, hair and liquids, and the introduction of artifacts.
Slower-paced videos have the details, but less action.
With speed-up checkpoints and LoRAs you can degrade your video by adding too many steps, so do not overdo it. They can be run at 2x their standard steps (e.g. a 4-step model with 8 steps); anything above that is useless or will degrade the result.
Basic WAN 2.2 runs in a range of 20-40 steps.
👀 Frame Interpolation
There are multiple techniques to add your extra frames. They all do a good job.
The most common are: RIFE, FILM, GIMM-VFI.
The standard is RIFE, which is also very fast
FILM is higher quality, but slow
GIMM-VFI is the most advanced and highest quality, but needs significant computing time.
It adds to the resources and time needed to generate a video, but does not add much to the quality. Even worse, with too much interpolation you will get artifacts.
If you really want to have that little bit of extra smoothness go with 1 interpolated frame, doubling your total frames in your given seconds.
Sometimes the wording inside the various applications differs. So in SwarmUI a multiplier by 1 means disabled. A multiplier by 2 means double frame rate.
On 5-second videos at medium (480p) to high (720p) resolutions the differences are very small.
My advice is to try without or try RIFE first.
🎏Up-casting
This process means you simply play your generated frames at another frame rate.
If you made an 81-frame video (5 s at 16 FPS) and up-cast it to 32 FPS (without frame interpolation), you get those same 81 frames at 32 FPS, resulting in about 2.5 s.
This plays all your frames in half the time; it will look fast, but will be short.
On the other hand, if you have a 20 s video and want it to play faster, in 10 s, without generating double the frames, this is a good method.
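The up-casting math is just frames divided by FPS; a quick sketch:

```python
def duration_seconds(frames: int, fps: int) -> float:
    """Playback length of a fixed set of frames at a given frame rate."""
    return frames / fps

print(duration_seconds(81, 16))   # ~5.06 s - as generated
print(duration_seconds(81, 32))   # ~2.53 s - same frames up-cast to 32 FPS: twice as fast, half as long
```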
📝 Basic examples on how to write prompts
Prompting checklist
Here comes the fun part. Remember, every word counts! There is no absolute formula, but here are some basic ✅ Do's and ❌ Don'ts I would recommend for the beginning:
✅ Write 1-2 sentences of the actual setting, even if you provide an initial image
✅ Provide details and choose precise words - "blue eyes" are better than just "eyes"; "white" is not "whitish"; ...
✅ Describe what should happen in active speech ~ing "doing, walking, blinking,..."
✅ Describe sequential actions separated by periods "."
✅ Get the order right; the order of the description matters a lot
✅ Be creative with wording; try swapping "zoom" for "pan" if something is not working as expected
✅ Long videos (>5s) need more prompting or WAN 2.2 will start to guess, repeat, or make things up
✅ Different resolutions/aspect ratios may alter the outcome completely even with the same prompt; every setting matters - in the end WAN 2.2 is trained for 720p (16:9)
✅ Do some fast samples at low resolution (e.g. 368x624) to get a hint whether your prompt and other settings reflect your desired outcome, then raise to your favourite resolution
❌ No one-liners; they can work, but mostly won't
❌ Do not endlessly repeat an action, unless you want a messy repetition of that action
❌ Don't make WAN guess things because you did not include them in the prompt
Prompt structure
The standard structure of a prompt could be like this:
Scene > Action > Mood > camera composition
Scene: A woman with blue eyes and a dinner dress is standing next to a round wooden table, she is holding a glass filled with milk.
Action: She is raising her hand with the glass to her mouth, drinking the milk in one sip.
Mood: The room is filled with warm light.
Camera composition: The camera is fixed. Still camera.
Put together: A woman with blue eyes and a black dinner dress is standing next to a round wooden table, she is holding a glass filled with milk. She is raising her hand with the glass to her mouth, drinking the milk in one sip. The room is filled with warm light. The camera is fixed. Still camera.
Testing your prompt
The prompt is the most important part. I render multiple prompts at low resolution to see if the video is going in the right direction. If things happen as expected, I switch to a higher resolution, which will add details. This way I can test whether a prompt works for the situation or not. The resolution has a real impact on the outcome, so a low resolution will not give exactly the same result as a high resolution with the same prompt.
But the overall direction will still be the same. So you can elaborate and get as close as you can to your desired result.
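If you test a lot of prompts, a tiny helper can keep the Scene > Action > Mood > Camera order straight. The function and field names below are just an illustration of the structure suggested above, not anything WAN requires:

```python
def build_prompt(scene: str, action: str, mood: str, camera: str) -> str:
    """Join the four parts in the recommended order, each ending with a period."""
    parts = (scene, action, mood, camera)
    return " ".join(p.strip().rstrip(".") + "." for p in parts)

print(build_prompt(
    scene="A woman with blue eyes and a black dinner dress is standing next to a round wooden table, she is holding a glass filled with milk",
    action="She is raising her hand with the glass to her mouth, drinking the milk in one sip",
    mood="The room is filled with warm light",
    camera="The camera is fixed. Still camera",
))
```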
☄️ Speed-Up tricks
They all have the same goal: lowering the steps needed to get your result.
Just keep in mind that this power comes at a cost: they all trade quality for speed.
Sometimes more, sometimes less.
Here is my opinion on the two common speed-up tricks.
SageAttention
Will save you time, but is not very compatible with other speed-ups
Complicated to set up
Needs many steps to bring good results (it does little for <10-step runs); it will save you time on successive steps
It can cause issues, introduce artifacts, and more ...
Sometimes incompatible with LoRAs and other add-ons
Will slightly reduce quality
(I don't use this at all ... more problems than benefits)
Distilled/SelfForcing - aka Lightning (4-8 step LoRAs/Checkpoints)
Minor quality loss, huge benefit
This cuts the time needed to a fraction of the original
Extremely high compatibility with LoRAs and other add-ons
🧺 Caching (memory cache implementations)
Both known caching implementations speed things up by caching and reusing work across successive steps. The benefit is biggest with many steps; with a low step count the benefit is small. Both will degrade visual quality to some degree. They are a tool for when you are fine with sacrificing quality for speed. Mind that on speed-ups like 4-step runs the inference speed gain is not as significant.
MagCache
Uses a natural pattern in how models work — the "magnitude" (size) of changes between steps decreases predictably — to decide when to skip steps.
Needs just one random prompt to calibrate, making it faster, simpler, and more reliable across different models and prompts.
TeaCache
Skips unnecessary steps in video generation by predicting when it's safe to do so, based on learned patterns from many prompts.
Needs calibration on 70 different prompts, which is slow and can overfit (works well only on similar prompts).
Comparison
Speed: MagCache is faster — up to 2.8× speedup vs TeaCache’s 1.6×.
Quality: MagCache keeps better video quality (higher SSIM, PSNR, lower LPIPS). TeaCache often causes blurry or distorted videos, especially in color and details.
MagCache is simpler, faster, and better quality than TeaCache. It’s more general, needs less setup, and works well across different video models.
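For the curious, the shared idea behind both caches can be sketched in a few lines of toy Python: skip the expensive model call when the output is barely changing anyway, and force a real call once the estimated drift gets too large. This is only an illustration of the concept, not MagCache's or TeaCache's actual code.

```python
import numpy as np

def cached_sampling(model_step, latents, sigmas, threshold=0.15):
    """Toy step-caching loop: reuse the last model output while the accumulated
    estimated change stays under `threshold`, otherwise do a real forward pass."""
    prev_out = None
    accumulated = 0.0
    last_rel_change = 1.0          # force a real call early on
    for sigma in sigmas:
        accumulated += last_rel_change
        if prev_out is not None and accumulated < threshold:
            out = prev_out                      # cheap: reuse the cached output
        else:
            out = model_step(latents, sigma)    # expensive: real model call
            if prev_out is not None:
                last_rel_change = float(
                    np.linalg.norm(out - prev_out) / (np.linalg.norm(prev_out) + 1e-8)
                )
            prev_out = out
            accumulated = 0.0
        latents = latents - sigma * out         # placeholder update rule, not a real solver
    return latents
```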
🔝 GPU considerations (which GPU to use)
S-Tier
Nvidia
All Nvidia GPUs from the last 10 years (since Maxwell/GTX 900) are supported in pytorch and they work very well.
3000 series and above are recommended for best performance. More VRAM is always preferable.
Why you should avoid older generations if you can.
Older generations of cards will work; however, performance might be worse than expected because they don't support certain operations.
Here is a quick summary of what is supported on each generation:
50 series (blackwell): fp16, bf16, fp8, fp4
40 series (ada): fp16, bf16, fp8
30 series (ampere): fp16, bf16
20 series (turing): fp16
10 series (pascal) and below: only slow full precision fp32.
Models are inferenced in fp16 or bf16 for best quality depending on the model, with the option of fp8 on some models for less memory/more speed at lower quality.
Note that this table doesn't mean it is completely unsupported to use fp16 on the 10 series, for example; it just means it will be slower because the GPU can't handle it natively.
Don't be tempted by the cheap pascal workstation cards with lots of vram, your performance will be bad.
Anything older than 2000 series like Volta or Pascal should be avoided because they are about to be deprecated in cuda 13.
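A quick way to check what your card supports natively from Python; the capability thresholds in the comments are a simplification of the table above:

```python
import torch

# Compute capability 8.x = Ampere (native bf16), 8.9 = Ada (native fp8),
# 10.x/12.x = Blackwell; older cards fall back to slower paths.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
    print("native bf16:", torch.cuda.is_bf16_supported())
    print("native fp8 :", (major, minor) >= (8, 9))
else:
    print("No CUDA device visible to PyTorch.")
```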
B Tier
AMD (Linux)
Officially supported in pytorch.
Works well if the card is officially supported by ROCm but can be a bit slow compared to price equivalent Nvidia GPUs depending on the GPU. The later the GPU generation the better things work.
RDNA 4, MI300X: Confirmed "A tier" experience on latest ComfyUI and latest pytorch nightly.
Unsupported cards might be a real pain to get running.
AMD (Windows)
There is an official pytorch version that works, but it can be a bit slow compared to the Linux builds. The oldest officially supported generation is the 7000 series.
Intel (Linux + Windows)
Officially supported in pytorch. People seem to get it working fine.
D Tier
Mac with Apple silicon
Officially supported in pytorch. It works but they love randomly breaking things with OS updates.
Very slow. A lot of ops are not properly supported. No fp8 support at all.
F Tier
Qualcomm AI PC
Pytorch doesn't work at all.
They are "working on it"; until they actually get it working I recommend avoiding them completely, because it might take so long that the current hardware will be completely obsolete by then.
♨️ Source of the GPU recommendations: ComfyUI
Congratulations, you reached the END!
Now go and make that awesome art you are here for! ~ 💥
Remember to post something to the resource pages of the creators you want to support, and hit their support buttons!