
Darksidewalker's WAN 2.2 14B I2V - Usage guide - Definitive Edition


This guide will provide you with:

  • 👄 Explanation of wordings and basics

  • 🧠 About WAN 2.2 checkpoints

  • 🤖 WAN 2.2 requirements

  • 🫴 Guidance on how to use it

  • ✋ WAN 2.2 limitations / known issues

  • 🔍 Explanation about frames and times

  • 👣 Steps and motion

  • 👀 Frame Interpolation

  • 🎏 Up-casting

  • 📝 Basic examples on how to write prompts

  • ☄️ Speed-Up tricks

  • 🧺 Caching (memory cache implementations)

  • 🔝 GPU considerations/recommendations

This is based on information and SOTA implementations as of 10/2025

TL/DR: You now missed the point of a guide! 🤣

So let's gooo! ~

Introduction

I'll focus on the basics. They are the same regardless of the software you use. A fair amount of these basics and settings will work for t2v too; t2v is just simpler to use.

👄Explanation of wordings and basics

Checkpoint

  • refers to the file and type of AI model provided

Model

  • is the checkpoint type and generation currently in use to run your inference

WAN

  • Is the model family provided by Wan-AI for making videos out of text or images

LoRA

  • Think of it like a micro checkpoint/model trained for a desired outcome

  • LoRAs are typically between 100 MB and 1 GB in size, much smaller than a checkpoint

  • They are not functional without the major model they are trained on (a WAN type checkpoint)

  • They can be added freely to the main model as adapter

  • Do not mix LoRAs made for one base checkpoint into another (e.g. LoRAs for WAN 2.1 into WAN 2.2 models) unless you know precisely what you are doing -> this can be done, but may have unstable or corrupting results

File types of checkpoints (.safetensors / .gguf / .pth)

  • .safetensors is the most common and efficient format, the de facto standard for AI checkpoints

  • .gguf is a format built for running highly quantised models with built-in metadata; imagine it as a container where everything needed can be packed in

  • .pth is an old pickle-based format with high security risks (it can execute arbitrary code on load); avoid using .pth files

Quantisation

  • Describes how many bits the model has to do its thing

  • More bits = better, smarter, more accurate, but more RAM/VRAM needed

  • Fewer bits = higher compression, losing precision and understanding, but less RAM/VRAM needed

  • Descending for safetensors: FP32, BF16, FP16, FP8, NF4 ...

  • Descending for gguf: Q8, Q6, Q4, Q3 ...

  • As a rule of thumb for consumer grade GPUs a good compromise between precision and compression is:

    • 24/32GB VRAM + 64/128 GB RAM: BF16/FP16 or FP8, Q8

    • 16 GB VRAM + 32/64GB RAM: FP8 or Q8

    • <16 GB VRAM: you have to sacrifice a good amount of precision and use whatever does not OOM, most likely gguf Q6 or lower, or use high amounts of swap/pagefile and sacrifice a lot of speed
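As a quick sanity check against the table above, here is a back-of-the-envelope sketch (my own illustration, not an official sizing tool; the GGUF bits-per-weight values are approximations, and activations, VAE and CLIP come on top of the weights):

```python
# Rough VRAM needed just to hold the weights of one 14B expert.
BYTES_PER_WEIGHT = {
    "fp32": 4.0,
    "fp16": 2.0,       # same size as bf16
    "fp8":  1.0,
    "q8":   8.5 / 8,   # ~8.5 bits/weight incl. scales (approx.)
    "q6":   6.6 / 8,   # ~6.6 bits/weight (approx.)
    "q4":   4.5 / 8,   # ~4.5 bits/weight (approx.)
}

def weight_gb(params: float, dtype: str) -> float:
    """Gigabytes of memory for `params` weights stored as `dtype`."""
    return params * BYTES_PER_WEIGHT[dtype] / 1e9

for dtype in ("fp16", "fp8", "q6"):
    print(f"{dtype}: ~{weight_gb(14e9, dtype):.1f} GB per expert")
# roughly 28 GB (fp16), 14 GB (fp8), 11.6 GB (q6)
```

Remember WAN 2.2 14B has two experts (HIGH and LOW), so the totals above apply per checkpoint, which is why offloading between RAM and VRAM matters so much.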

Weight dtype e4m3fn / e5m2

  • e4m3fn more precise less dynamic

    • Better for Nvidia 4000 Series or newer (CUDA)

    • Usable with "Zluda" for AMD

  • e5m2 more dynamic less precise

    • more compatible with older hardware in terms of torch.compile

    • Most compatible with AMD

Basically it trades 1 bit either for precision or for dynamic range.

This matters if you use many steps or want to use a torch.compile mechanism.

I2V /i2v

  • Image To Video: tells you that the model generates videos from a provided initial image plus text as guidance; text alone will not work

  • The "Init Image Creativity" must be set to 0 (zero)

T2V /t2v

  • Text To Video, tells you that the model can generate videos out of text, but without initial images

FLF, FLF2V, End Frame, Last Frame

  • These are synonyms for i2v techniques where you are able to provide an initial image and an end image

  • This uses both images to make the video transition from the initial to the end image, giving you control over the desired outcome

Sampler + scheduler

  • The sampler is the provided algorithm to solve your problem, e.g. to make the video from your input (t2v/i2v)

  • It will have a heavy impact on how the result turns out or how much noise/variance is used

  • The sampler uses a scheduler as its given rule set; this also has an impact on your outcome

  • They are always used as pairs

Steps

  • The amount of steps the sampler+scheduler will use to generate the outcome

  • More steps equal more time; as a rule of thumb each step takes about the same amount of time, so 4 steps take half the time of 8, and so on.

CFG

  • The amount of guidance the sampler+scheduler will use with the provided input (i2v/t2v) to produce the desired outcome.

  • More guidance means more force to your input on the output

  • More cfg also means more computing time must be used

  • Too much CFG will burn your outcome through too much pressure (imagine forcing a too-big object into a tiny place 😥)

  • CFG of 1 disables the negative prompting

Noise

  • Is the amount of randomness and therefore motion, action, added things, variation

OOM / oom

  • Out Of Memory: the point where your VRAM and RAM together cannot fit all the data and run out of space. This results in your generation being aborted

Speed-Up's - Lightspeed, Lightning, Litespeed, Distilled and others

  • These are all techniques to drastically reduce the steps needed to produce a video

  • They sacrifice some precision and quality, and sometimes motion speed

  • They cut the time needed to produce the video and at some point reduce RAM/VRAM requirements

  • Use them if you can live with the quality loss

VAE

A VAE (Variational Autoencoder) encodes your images into the latent space and decodes latents back into pixels; the VAE file contains the saved weights for this component. WAN 2.2 14B can only be used with a WAN 2.1 VAE file. Strange, but the 2.2 VAE is made for the 5B model, which I will not take into account here.

CLIP

Basically a CLIP file is for encoding the text and/or vision, so the checkpoint can understand what your prompt or input is.

Seed

This number is the unique number that is used to generate your noise. Normally the same seed and settings will always produce the same output.

But if you alter any setting (resolution, aspect, LoRA strength, ...) you will get a different output even with the same seed.

Frames per second (FPS)

This is how many single images/pictures are going into 1 second of time.

So 16 FPS means you got 16 chained images in 1 second.

Here is the math: 80 images at 16 FPS = 5 seconds

Frame Interpolation

This is a technique that adds frames between the actually generated frames for higher FPS (smoother playback). It will not add detail to your video in any way; it just inserts frames between 2 reference frames with minimal changes.

Typically this doubles your FPS, e.g. 16 -> 32.
You can decide how many frames to interpolate between each pair of frames in the video.
Higher values are smoother, but may take significant time to save the output, and may introduce quality artifacts. It will also add to the file size of the video.
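The frame math above can be sketched like this (a hypothetical helper; the multiplier convention is 1 = disabled, 2 = double frame rate):

```python
def interpolated(frames: int, fps: int, multiplier: int) -> tuple[int, int]:
    """Frame count and FPS after interpolation.

    New frames go *between* existing ones, so the
    playback duration stays the same.
    """
    out_frames = (frames - 1) * multiplier + 1
    return out_frames, fps * multiplier

frames, fps = interpolated(81, 16, 2)
print(frames, fps)         # 161 32
print((frames - 1) / fps)  # 5.0 -> duration unchanged, just smoother
```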

🧠 About WAN 2.2 checkpoints

WAN 2.2 is a MoE (Mixture of Experts) model.

That means there are multiple experts (models/checkpoints) that are used together in 2 stages to create your video.

The first HIGH checkpoint adds more noise to provide a higher amount of motion and will not be able to define details.

The second LOW checkpoint adds less noise, but is able to add the details for the generated parts of the HIGH checkpoint.

They are used one after the other in equal shares (50%/50%): for the first half of the steps the HIGH checkpoint generates a noisy, dynamic, high-motion video, and after that the LOW checkpoint refines it to add the details, sharpness and fine motion.

⚠️ WAN 2.2 supports resolutions from 480p to 720p; almost anything outside this range will introduce heavy artifacts, while anything inside will produce comparably good results.

🎲 At the time of writing, WAN 2.2 outcomes are always a gamble: you will almost always need more than 1 try, and expect many more tries to get what you want.

⛔️ Do not expect:

  • A WAN checkpoint to produce the desired result on the first try

  • WAN 2.2 is very prompt sensitive: 1 word may alter the outcome completely

  • Base WAN 2.2 does not know anything about sensitive/explicit content

  • Prompting for something WAN 2.2 does not know will produce trash/flickering/anything else

WAN 2.2 is a major upgrade to WAN 2.1 in understanding and motion, but far from perfect.

🤖 WAN 2.2 requirements

  1. Your favourite backend like SwarmUI or ComfyUI

  2. Enough RAM + VRAM and space on your SSD

  3. You need a HIGH and LOW checkpoint

    • They have to be the same file type and quantization to produce desired results

    • The chosen checkpoint should meet your system specs

    • You need to know the basic settings of the checkpoint you are using (sampler+scheduler, steps, cfg) and use them!

    • Use 50% steps on HIGH and 50% steps on LOW

  4. (optional) LoRAs you want to use for a specific outcome

  5. The prompt you want to use

  6. An image if you want i2v

🫴 Guidance on how to use it (i2v)

  1. If you use i2v you always need an initial image

  2. "Init Image Creativity" must be set to 0 (zero)

  3. Speed-ups (LoRAs/built-in) need a CFG of 1

    • This will disable the negative prompt, so do not bother!

  4. Set steps according to the model description, like 4 or 8 steps (speed-ups) or 20-30 steps (native)

  5. Set the correct aspect ratio for the video corresponding to the initial image aspect, if the image is 9:16 the video should be too

  6. Set a output video resolution equivalent to your system specs

    • Too high will OOM or drastically increase time to produce the video

    • Try multiple resolutions until you find a good balance between time needed and desired quality

    • The resolution must match the aspect ratio and the boundaries of the model (480p up to 720p)

  7. Make sure your VAE is a WAN VAE 2.1 file

  8. Load a compatible CLIP file

    • If you use an FP8 checkpoint you can use an fp8 CLIP

    • CLIP and checkpoint do not have to match, but scaled CLIPs help reduce RAM/VRAM consumption if you are limited

    • For WAN 2.2 FP8 checkpoints like my "DaSiWa-WAN 2.2 I2V 14B Lightspeed", "umt5_xxl_fp8_e4m3fn_scaled" is a good choice

  9. Add LoRAs for trained outputs

    • You can add any fitting LoRA that is compatible with WAN 2.2 and the checkpoint

    • Do not blindly mix i2v LoRAs with t2v checkpoints or vice versa

    • Do not try using LoRAs that are trained for other basic models like mixing WAN 2.2 5B LoRAs into the 14B models

    • Mind the LoRA creator's usage guidelines

    • Be creative with mixing and using the strength modifier

    • When using multiple LoRAs, a lower strength for each usually works better

    • Strength over 1.5 will most likely result in burned output

    • Not all combinations and checkpoints work the same with every LoRA, try multiple combinations

    • A LoRA will not magically result in perfect outcomes; you may need multiple tries (seeds), or the same seed with different settings

  10. Shift (Sigma Shift) - WAN 2.2 wants 5, but experiments from 3 to 8 are possible

    • This influences how much time the model spends on high noise to refine it

    • A higher shift leads to more detail with less variation

    • A lower shift leads to broader details with more variation

    • This indirectly influences motion (read "Steps and motion")

  11. Options if you OOM

    • more RAM/VRAM

    • More strongly quantised checkpoints (from FP16 to FP8 / from Q8 to Q6)

    • Adding virtual RAM (swap, pagefile, zRAM)

    • Lowering any of these parameters: resolution, steps, frame count
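Points 5 and 6 above (matching the aspect ratio and staying inside the 480p-720p range) can be sketched as a small helper. This is my own illustration, not part of any UI; the multiple-of-16 snap is a common requirement of WAN workflows and slightly distorts odd aspect ratios:

```python
def video_size(img_w: int, img_h: int, target_short: int = 480) -> tuple[int, int]:
    """Pick a video resolution matching the init image's aspect ratio.

    target_short is the desired short side, clamped to WAN 2.2's
    480-720 range. Both dimensions are snapped to multiples of 16.
    """
    target_short = max(480, min(720, target_short))
    scale = target_short / min(img_w, img_h)
    snap = lambda v: max(16, round(v * scale / 16) * 16)
    return snap(img_w), snap(img_h)

print(video_size(1080, 1920))        # portrait 9:16  -> (480, 848)
print(video_size(1920, 1080, 720))   # landscape 16:9 -> (1280, 720)
```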

✋ WAN 2.2 limitations / known issues

WAN 2.2 is far from perfect, yet it is the most advanced open source video generation model.

Here are some known limitations of any WAN base checkpoint:

  • Heavily censored 

  • MoE is difficult to use and set-up

  • Heavy on needed resources

    • local usage only works with scaled/quantised checkpoints

  • Generations with the exact same settings but different seeds will differ a lot from each other

  • LoRAs give good guidance, but not always the desired results

  • Training is difficult

  • Detailed movements of fine things like eyes, fingers, lips and hair all turn grainy at lower resolutions

  • It has a strong bias toward adding movement to lips, eyes and fingers

  • You have to use good prompts for good results. It is not very forgiving: every word can change the result completely.

🔍 Explanation about frames and times

WAN 2.2 is basically trained for 720p @24 FPS.

It works well with 16 FPS and lower resolutions down to 480p, but will lose motion.

Almost all LoRAs work with 16 FPS and 81 frames - a total of 5 seconds

So you want a similar output of 5 seconds to get optimal results. Longer videos can be generated, but will most likely degrade in quality and lose prompt adherence.

WAN 2.2 wants a total frame count of (16 × n) + 1 or (24 × n) + 1, with n being a whole number of seconds.

You want a whole number of seconds with the matching total frames:

Example:

  • 81 (80+1) frames @16 FPS = 5 seconds

  • 97 (96+1) frames @16 FPS = 6 seconds

  • 121 (120+1) frames @24 FPS = 5 seconds

What you do not want to do is cut into the frames, like:

  • 81 (80+1) frames @24 FPS = 3.4 seconds

  • This will cut your motion short or add artifacts
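The frame-count rule above as a quick sketch:

```python
def total_frames(seconds: int, fps: int) -> int:
    # Full seconds plus one frame: (fps * n) + 1
    return fps * seconds + 1

def duration(frames: int, fps: int) -> float:
    # Seconds of motion covered by a frame count at a given FPS
    return (frames - 1) / fps

print(total_frames(5, 16))   # 81
print(total_frames(6, 16))   # 97
print(total_frames(5, 24))   # 121
print(duration(81, 24))      # not a whole number of seconds -> avoid
```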

👣 Steps and motion

Since the HIGH checkpoint is for the motion and the LOW for the details, you can experiment by deviating from the standard 50/50 rule.

What does that mean:

  • You always want at least the minimum number of steps the checkpoint/model wants

  • You can change the step at which the swap from HIGH to LOW happens

  • More steps in HIGH means more motion, less detail

  • More steps in LOW means more detail, less motion


Example:

You have a 4-step checkpoint.

You could do 6 total steps, 4 in HIGH and 2 in LOW

This will add more motion while getting the needed 2 steps in LOW for the details, but will increase computing time and requirements.

LOW steps always slow things down a bit, but without them you get no fine details.

You could sacrifice LOW steps to get more motion: 3 steps HIGH, 1 step LOW.

You could sacrifice HIGH steps to get super fine details: 1 step HIGH, 3 steps LOW.
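The splits above can be sketched as a tiny helper (my own illustration; `high_fraction` is a hypothetical parameter name):

```python
def step_split(total_steps: int, high_fraction: float = 0.5) -> tuple[int, int]:
    """Split sampler steps between the HIGH and LOW expert.

    high_fraction > 0.5 biases toward motion, < 0.5 toward detail.
    Always keeps at least 1 step on each expert.
    """
    high = round(total_steps * high_fraction)
    high = min(max(high, 1), total_steps - 1)
    return high, total_steps - high

print(step_split(8))          # (4, 4) - the standard 50/50 rule
print(step_split(6, 4 / 6))   # (4, 2) - the motion-biased example above
print(step_split(4, 0.25))    # (1, 3) - super fine details
```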


Extra information:

In all these examples you can see that high motion lacks details on hands, eyes, hair and liquids, and introduces artifacts.

All videos that are slower paced have the details, but less action.

On speed-up checkpoints and LoRAs you can degrade your video by adding too many steps, so do not overdo this. They can run at 2x their standard steps, e.g. a 4-step model can run with 8; everything above that is useless or will degrade the result.

Base WAN 2.2 can run in a range of 20-40 steps.

👀 Frame Interpolation

There are multiple techniques to add your extra frames. They all do a good job. 

The most common are: RIFE, FILM, GIMM-VFI.

  • RIFE is the standard and very fast

  • FILM is higher quality, but slow

  • GIMM-VFI is the most advanced and highest quality, but needs significant computing time.

Interpolation adds to the resources and time needed to generate a video, but does not add much quality. Even worse, with too much interpolation you will get artifacts.

If you really want that little bit of extra smoothness, go with 1 interpolated frame, doubling your total frames within the given seconds.

Sometimes the wording differs between applications. In SwarmUI a multiplier of 1 means disabled; a multiplier of 2 means double the frame rate.

For 5-second videos in medium (480p) to high (720p) resolutions the differences are very small.

My advice is to try without or try RIFE first.


🎏Up-casting

This process means you just fit your initial video in a other desired frame rate.

You did 81 frames video (5s in 16 fps) and up-cast it to 32 fps (without frame interpolation) you get that 81 frames with 32 fps resulting in 2.5s.

This will play your whole frames in just half the time, this will look fast, but will be short.

On the other hand if you have a 20s video and want it to play just faster in 10s without doing double the frames this would be a good method to do it.
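The up-casting math as a sketch:

```python
def playback_seconds(frames: int, fps: int) -> float:
    # Same frames, different playback rate - nothing added or removed.
    return (frames - 1) / fps

print(playback_seconds(81, 16))  # 5.0
print(playback_seconds(81, 32))  # 2.5 -> same content in half the time
```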

📝 Basic examples on how to write prompts

Prompting checklist

Here comes the fun part. Remember every word counts! There is no absolute formula, but here are some basic ✅ Do's and ❌ Dont's I would recommend for the beginning:

✅ Write 1-2 sentences of the actual setting, even if you provide an initial image

✅ Provide details and choose precise words - "blue eyes" are better than just "eyes"; "white" is not "whitish"; ...

✅ Describe what should happen in active speech ~ing "doing, walking, blinking,..."

✅ Describe sequential actions separated by periods "."

✅ Get the order right; the order of the description matters a lot

✅ Be creative with wording; try changing "zoom" to "pan" if something is not working as expected

✅ Long videos (>5s) need more prompting or WAN 2.2 will start to guess and repeat or add things

✅ Different resolution/aspect ratios may alter the outcome completely even with the same prompt, every setting matters - In the end WAN 2.2 is trained for 720p (16:9)

✅ Do some fast samples at low resolution (e.g. 368x624) to get a hint whether your prompt and other settings reflect your desired outcome, then raise to your favourite resolution

❌ No one-liners; they can work, but mostly will not

❌ Do not endlessly repeat an action, unless you want a messy repetition of that action

❌ Don't make WAN guess things because you did not include them in the prompt


Prompt structure

The standard structure of a prompt could be like this:

Scene > Action > Mood > camera composition

Scene: A woman with blue eyes and a dinner dress is standing next to a round wooden table, she is holding a glass filled with milk.

Action: She is raising her hand with the glass to her mouth, drinking the milk in one sip.

Mood: The room is filled with warm light.

Camera composition: The camera is fixed. Still camera.

A woman with blue eyes and a black dinner dress is standing next to a round wooden table, she is holding a glass filled with milk. She is raising her hand with the glass to her mouth, drinking the milk in one sip. The room is filled with warm light. The camera is fixed. Still camera.
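If you script your workflows, the Scene > Action > Mood > Camera structure can be assembled like this (a hypothetical helper, not part of any UI):

```python
def build_prompt(scene: str, action: str, mood: str, camera: str) -> str:
    # Keep the order fixed - the order of the description matters a lot.
    return " ".join(part.strip() for part in (scene, action, mood, camera))

print(build_prompt(
    scene="A woman with blue eyes and a black dinner dress is standing "
          "next to a round wooden table, holding a glass filled with milk.",
    action="She is raising the glass to her mouth, drinking the milk in one sip.",
    mood="The room is filled with warm light.",
    camera="The camera is fixed. Still camera.",
))
```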

Testing your prompt

The prompt is the most important part. I render multiple prompts at low resolution to see whether the video is going in the right direction. If things happen as expected I switch to a higher resolution, which adds details. This way I can test whether a prompt is good for the situation or not. The resolution has a real impact on the outcome, so a low resolution will not give the same result as a high resolution with the same prompt.

But the overall direction will still be the same. So you can elaborate and get as near as you can to your desired result.


☄️ Speed-Up tricks

They have all the same goal: Lowering the steps needed to get your result.

Just keep in mind this power comes from sacrifice. They all sacrifice quality against speed.

Sometimes more, sometimes less.

Here is my opinion on the 2 common speed-up tricks.

  • SageAttention

    • Will save you time, but is not very compatible with other speed-ups

    • Complicated to set up

    • Needs many steps (<10 brings little benefit); it will save you time on successive runs

    • It can cause issues, introduce artifacts, and more ...

    • Sometimes incompatible with LoRAs and other add-ons

    • Will slightly reduce quality

    • (I don't use this at all ... more problems than benefits)

  • Distilled/SelfForcing - aka Lightning (4-8 step LoRAs/Checkpoints)

    • Minor quality loss, huge benefit

    • This will cut the time needed to a fraction of the original

    • Extremely high compatibility with LoRAs and other add-ons


🧺 Caching (memory cache implementations)

Both known caching implementations speed things up by caching data across successive steps. The biggest benefit arises when there are many steps; with a low step count the benefit is small. All of these degrade visual quality by some amount, more or less. They are a tool for when you are fine with sacrificing quality for speed. Mind that on speed-ups like 4 steps the inference speed gain is not as significant.

MagCache

Uses a natural pattern in how models work — the "magnitude" (size) of changes between steps decreases predictably — to decide when to skip steps.

Needs just one random prompt to calibrate, making it faster, simpler, and more reliable across different models and prompts.

TeaCache

Skips unnecessary steps in video generation by predicting when it's safe to do so, based on learned patterns from many prompts.

Needs calibration on 70 different prompts, which is slow and can overfit (works well only on similar prompts).

Comparison

  • Speed: MagCache is faster — up to 2.8× speedup vs TeaCache’s 1.6×.

  • Quality: MagCache keeps better video quality (higher SSIM, PSNR, lower LPIPS). TeaCache often causes blurry or distorted videos, especially in color and details.

  • MagCache is simpler, faster, and better quality than TeaCache. It’s more general, needs less setup, and works well across different video models.
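The MagCache idea (skip a step when the predicted change is small) can be illustrated with a toy sketch. This is not the real implementation; the magnitudes and threshold are made up for illustration:

```python
def plan_skips(step_magnitudes: list[float], threshold: float = 0.05) -> list[bool]:
    """True = compute the step, False = reuse the cached residual.

    Step 0 is always computed; later steps are skipped when the
    magnitude of the predicted change falls below the threshold.
    """
    return [i == 0 or mag >= threshold
            for i, mag in enumerate(step_magnitudes)]

# Magnitudes typically shrink over the schedule:
print(plan_skips([1.0, 0.4, 0.12, 0.04, 0.02]))
# [True, True, True, False, False] -> the last two steps reuse the cache
```

With few total steps (e.g. a 4-step speed-up) almost nothing falls below the threshold, which is why the caching gain is small there.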


🔝 GPU considerations (which GPU to use)

S-Tier

Nvidia

All Nvidia GPUs from the last 10 years (since Maxwell/GTX 900) are supported in pytorch and they work very well.

3000 series and above are recommended for best performance. More VRAM is always preferable.

Why you should avoid older generations if you can.

Older generations of cards will work; however, performance might be worse than expected because they don't support certain operations.

Here is a quick summary of what is supported on each generation:

  • 50 series (blackwell): fp16, bf16, fp8, fp4

  • 40 series (ada): fp16, bf16, fp8

  • 30 series (ampere): fp16, bf16

  • 20 series (turing): fp16

  • 10 series (pascal) and below: only slow full precision fp32.

Models are inferenced in fp16 or bf16 for best quality depending on the model, with the option of fp8 on some models for less memory / more speed at lower quality.

Note that this table doesn't mean it's completely unsupported to use fp16 on the 10 series, for example; it just means it will be slower because the GPU can't handle it natively.

Don't be tempted by the cheap Pascal workstation cards with lots of VRAM; your performance will be bad.

Anything older than the 2000 series, like Volta or Pascal, should be avoided because they are about to be deprecated in CUDA 13.
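The support table above as a quick lookup (my own summary, not an official API):

```python
# Natively supported inference dtypes per Nvidia generation,
# as summarized in the list above. Anything missing falls back
# to a slower emulated path.
NATIVE_DTYPES = {
    "blackwell": {"fp16", "bf16", "fp8", "fp4"},
    "ada":       {"fp16", "bf16", "fp8"},
    "ampere":    {"fp16", "bf16"},
    "turing":    {"fp16"},
    "pascal":    set(),  # fp32 only
}

def runs_natively(generation: str, dtype: str) -> bool:
    return dtype in NATIVE_DTYPES.get(generation, set())

print(runs_natively("ada", "fp8"))     # True
print(runs_natively("ampere", "fp8"))  # False -> fp8 falls back, slower
```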

B Tier

AMD (Linux)

Officially supported in pytorch.

Works well if the card is officially supported by ROCm but can be a bit slow compared to price equivalent Nvidia GPUs depending on the GPU. The later the GPU generation the better things work.

RDNA 4, MI300X: Confirmed "A tier" experience on latest ComfyUI and latest pytorch nightly.

Unsupported cards might be a real pain to get running.

AMD (Windows)

An official pytorch version exists and works, but can be a bit slow compared to the Linux builds. The oldest officially supported generation is the 7000 series.

Intel (Linux + Windows)

Officially supported in pytorch. People seem to get it working fine.

D Tier

Mac with Apple silicon

Officially supported in pytorch. It works but they love randomly breaking things with OS updates.

Very slow. A lot of ops are not properly supported. No fp8 support at all.

F Tier

Qualcomm AI PC

Pytorch doesn't work at all.

They are: "working on it", until they do actually get it working I recommend avoiding them completely because it might take them so long to make it work that the current hardware will be completely obsolete.

♨️ Source of GPU recommendations is ComfyUI


Congratulations, you reached the END!

Now go and make that awesome art you are here for! ~ 💥

Remember to post something to the resource pages of the person you want to support by hitting these buttons:

