
Darksidewalker's WAN 2.2 14B I2V - Usage guide - Definitive Edition


This guide will provide you with:

  • 👄 Explanation of wordings and basics

  • 🧠 About WAN 2.2 checkpoints

  • 🤖 WAN 2.2 requirements

  • 🫴 Guidance on how to use it

  • ✋ WAN 2.2 limitations / known issues

  • 🔍 Explanation about frames and times

  • 👣 Steps and motion

  • 👀 Frame Interpolation

  • 🎏 Up-casting

  • 📝 Basic examples on how to write prompts

  • ☄️ Speed-Up tricks

  • 🧺 Caching (memory cache implementations)

  • 🔝 GPU considerations/recommendations

This is based on information and SOTA implementations as of 10/2025

TL/DR: You now missed the point of a guide! 🤣

So let's gooo! ~

Introduction

I'll focus on the basics. They are the same regardless of the software you use. A fair amount of these basics and settings will work for t2v too; t2v is just simpler to use.

👄Explanation of wordings and basics

Checkpoint

  • refers to the file and type of AI model provided

Model

  • is the checkpoint type and generation currently in use to run your inference

WAN

  • Is the model family provided by Wan-AI for making videos out of text or images

LoRA

  • Think of it like a micro checkpoint/model trained for a desired outcome

  • LoRAs are typically between 100 MB and 1 GB in size, much smaller than a checkpoint

  • They are not functional without the major model they are trained on (a WAN type checkpoint)

  • They can be added freely to the main model as adapter

  • Do not mix LoRAs made for one base checkpoint into another (e.g. LoRAs for WAN 2.1 into WAN 2.2 models) unless you know precisely what you are doing -> this can be done, but may have unstable or corrupting results

File types of checkpoints (.safetensors / .gguf / .pth)

  • .safetensors is the most common and efficient format, the de facto standard for AI checkpoints

  • .gguf is a format built for running highly quantised models with built-in metadata; imagine it as a container where everything needed can be packed in

  • .pth is an old pickle-based format with high security risks (it can execute arbitrary code on load); avoid using .pth files

Quantisation

  • Describes how many bits the model has to do its thing

  • More bits = better, smarter, more accurate, but more RAM/VRAM needed

  • Fewer bits = higher compression, losing precision and understanding, but less RAM/VRAM needed

  • Descending for safetensors: FP32, BF16, FP16, FP8, NF4 ...

  • Descending for gguf: Q8, Q6, Q4, Q3 ...

  • As a rule of thumb for consumer grade GPUs a good compromise between precision and compression is:

    • 24/32GB VRAM + 64/128 GB RAM: BF16/FP16 or FP8, Q8

    • 16 GB VRAM + 32/64GB RAM: FP8 or Q8

    • <16 GB VRAM: you have to sacrifice a good amount of precision and use whatever does not OOM, most likely gguf Q6 or lower, or use high amounts of swap/pagefile and sacrifice a lot of speed
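As a quick sanity check against the table above, here is a back-of-the-envelope sketch (my own illustration, not an official sizing tool; the GGUF bits-per-weight values are approximations, and activations, VAE and CLIP come on top of the weights):

```python
# Rough VRAM needed just to hold the weights of one 14B expert.
BYTES_PER_WEIGHT = {
    "fp32": 4.0,
    "fp16": 2.0,       # same size as bf16
    "fp8":  1.0,
    "q8":   8.5 / 8,   # ~8.5 bits/weight incl. scales (approx.)
    "q6":   6.6 / 8,   # ~6.6 bits/weight (approx.)
    "q4":   4.5 / 8,   # ~4.5 bits/weight (approx.)
}

def weight_gb(params: float, dtype: str) -> float:
    """Gigabytes of memory for `params` weights stored as `dtype`."""
    return params * BYTES_PER_WEIGHT[dtype] / 1e9

for dtype in ("fp16", "fp8", "q6"):
    print(f"{dtype}: ~{weight_gb(14e9, dtype):.1f} GB per expert")
# roughly 28 GB (fp16), 14 GB (fp8), 11.6 GB (q6)
```

Remember WAN 2.2 14B has two experts (HIGH and LOW), so the totals above apply per checkpoint, which is why offloading between RAM and VRAM matters so much.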

Weight dtype e4m3fn / e5m2

  • e4m3fn more precise less dynamic

    • Better for Nvidia 4000 Series or newer (CUDA)

    • Usable with "Zluda" for AMD

  • e5m2 more dynamic less precise

    • more compatible with older hardware in terms of torch.compile

    • Most compatible with AMD

Basically it trades 1 bit either for precision or for dynamic range.

This matters if you use many steps or want to use a torch.compile mechanism.

I2V /i2v

  • Image To Video: tells you that the model generates videos from a provided initial image plus text as guidance; text alone will not work

  • The "Init Image Creativity" must be set to 0 (zero)

T2V /t2v

  • Text To Video, tells you that the model can generate videos out of text, but without initial images

FLF, FLF2V, End Frame, Last Frame

  • These are synonyms for i2v techniques where you are able to provide an initial image and an end image

  • This uses both images to make the video transition from the initial to the end image, giving you control over the desired outcome

Sampler + scheduler

  • The sampler is the provided algorithm to solve your problem, e.g. to make the video from your input (t2v/i2v)

  • It will have a heavy impact on how the result turns out or how much noise/variance is used

  • The sampler uses a scheduler as its given rule set; this also has an impact on your outcome

  • They are always used as pairs

Steps

  • The amount of steps the sampler+scheduler will use to generate the outcome

  • More steps equal more time; as a rule of thumb each step takes about the same amount of time, so 4 steps take half the time of 8, and so on.

CFG

  • The amount of guidance the sampler+scheduler will use with the provided input (i2v/t2v) to produce the desired outcome.

  • More guidance means more force to your input on the output

  • More cfg also means more computing time must be used

  • Too much CFG will burn your outcome through too much pressure (imagine forcing a too-big object into a tiny place 😥)

  • CFG of 1 disables the negative prompting

Noise

  • Is the amount of randomness and therefore motion, action, added things, variation

OOM / oom

  • Out Of Memory: the point where your VRAM and RAM together cannot fit all the data and run out of space. This results in your generation being aborted

Speed-Up's - Lightspeed, Lightning, Litespeed, Distilled and others

  • These are all techniques to drastically reduce the steps needed to produce a video

  • They sacrifice some precision and quality, and sometimes motion speed

  • They cut the time needed to produce the video and at some point reduce RAM/VRAM requirements

  • Use them if you can live with the quality loss

VAE

A VAE (Variational Autoencoder) encodes your images into the latent space and decodes latents back into pixels; the VAE file contains the saved weights for this component. WAN 2.2 14B can only be used with a WAN 2.1 VAE file. Strange, but the 2.2 VAE is made for the 5B model, which I will not take into account here.

CLIP

Basically a CLIP file is for encoding the text and/or vision, so the checkpoint can understand what your prompt or input is.

Seed

This number is the unique number that is used to generate your noise. Normally the same seed and settings will always produce the same output.

But if you alter any setting (resolution, aspect, LoRA strength, ...) you will get a different output even with the same seed.

Frames per second (FPS)

This is how many single images/pictures are going into 1 second of time.

So 16 FPS means you got 16 chained images in 1 second.

Here is the math: 80 images at 16 FPS = 5 seconds

Frame Interpolation

This is a technique that adds frames between the actually generated frames for higher FPS (smoother playback). It will not add detail to your video in any way; it just inserts frames between 2 reference frames with minimal changes.

Typically this doubles your FPS, e.g. 16 -> 32.
You can decide how many frames to interpolate between each pair of frames in the video.
Higher values are smoother, but may take significant time to save the output, and may introduce quality artifacts. It will also add to the file size of the video.
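The frame math above can be sketched like this (a hypothetical helper; the multiplier convention is 1 = disabled, 2 = double frame rate):

```python
def interpolated(frames: int, fps: int, multiplier: int) -> tuple[int, int]:
    """Frame count and FPS after interpolation.

    New frames go *between* existing ones, so the
    playback duration stays the same.
    """
    out_frames = (frames - 1) * multiplier + 1
    return out_frames, fps * multiplier

frames, fps = interpolated(81, 16, 2)
print(frames, fps)         # 161 32
print((frames - 1) / fps)  # 5.0 -> duration unchanged, just smoother
```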

🧠 About WAN 2.2 checkpoints

WAN 2.2 is a MoE (Mixture of Experts) model.

That means there are multiple experts (models/checkpoints) that are used together in 2 stages to create your video.

The first HIGH checkpoint adds more noise to provide a higher amount of motion and will not be able to define details.

The second LOW checkpoint adds less noise, but is able to add the details for the generated parts of the HIGH checkpoint.

They are used one after the other in equal shares (50%/50%): for the first half of the steps the HIGH checkpoint generates a noisy, dynamic, high-motion video, and after that the LOW checkpoint refines it to add the details, sharpness and fine motion.

⚠️ WAN 2.2 supports resolutions from 480p to 720p; almost anything outside this range will introduce heavy artifacts, while anything inside will produce comparably good results.

🎲 At the time of writing, WAN 2.2 outcomes are always a gamble: you will almost always need more than 1 try, and expect many more tries to get what you want.

⛔️ Do not expect:

  • A WAN checkpoint to produce the desired result on the first try

  • WAN 2.2 is very prompt sensitive: 1 word may alter the outcome completely

  • Base WAN 2.2 does not know anything about sensitive/explicit content

  • Prompting for something WAN 2.2 does not know will produce trash/flickering/anything else

WAN 2.2 is a major upgrade to WAN 2.1 in understanding and motion, but far from perfect.

🤖 WAN 2.2 requirements

  1. Your favourite backend like SwarmUI or ComfyUI

  2. Enough RAM + VRAM and space on your SSD

  3. You need a HIGH and LOW checkpoint

    • They have to be the same file type and quantization to produce desired results

    • The chosen checkpoint should meet your system specs

    • You need to know the basic settings of the checkpoint you are using (sampler+scheduler, steps, cfg) and use them!

    • Use 50% steps on HIGH and 50% steps on LOW

  4. (optional) LoRAs you want to use for a specific outcome

  5. The prompt you want to use

  6. An image if you want i2v

🫴 Guidance on how to use it (i2v)

  1. If you use i2v you always need an initial image

  2. "Init Image Creativity" must be set to 0 (zero)

  3. Speed-ups (LoRAs/built-in) need a CFG of 1

    • This will disable the negative prompt, so do not bother!

  4. Set steps according to the model description, like 4 or 8 steps (speed-ups) or 20-30 steps (native)

  5. Set the correct aspect ratio for the video corresponding to the initial image aspect, if the image is 9:16 the video should be too

  6. Set a output video resolution equivalent to your system specs

    • Too high will OOM or drastically increase time to produce the video

    • Try multiple resolutions until you find a good balance between time needed and desired quality

    • The resolution must match the aspect ratio and the boundaries of the model (480p up to 720p)

  7. Make sure your VAE is a WAN VAE 2.1 file

  8. Load a compatible CLIP file

    • If you use an FP8 checkpoint you can use an fp8 CLIP

    • CLIP and checkpoint do not have to match, but scaled CLIPs help reduce RAM/VRAM consumption if you are limited

    • For WAN 2.2 FP8 checkpoints like my "DaSiWa-WAN 2.2 I2V 14B Lightspeed", "umt5_xxl_fp8_e4m3fn_scaled" is a good choice

  9. Add LoRAs for trained outputs

    • You can add any fitting LoRA that is compatible with WAN 2.2 and the checkpoint

    • Do not blindly mix i2v LoRAs with t2v checkpoints or vice versa

    • Do not try using LoRAs that are trained for other basic models like mixing WAN 2.2 5B LoRAs into the 14B models

    • Mind the LoRA creator's usage guidelines

    • Be creative with mixing and using the strength modifier

    • When using multiple LoRAs, a lower strength for each usually works better

    • Strength over 1.5 will most likely result in burned output

    • Not all combinations and checkpoints work the same with every LoRA, try multiple combinations

    • A LoRA will not magically result in perfect outcomes; you may need multiple tries (seeds), or the same seed with different settings

  10. Shift (Sigma Shift) - WAN 2.2 wants 5, but experiments from 3 to 8 are possible

    • This influences how much time the model spends on high noise to refine it

    • A higher shift leads to more detail with less variation

    • A lower shift leads to broader details with more variation

    • This indirectly influences motion (read "Steps and motion")

  11. Options if you OOM

    • more RAM/VRAM

    • More strongly quantised checkpoints (from FP16 to FP8 / from Q8 to Q6)

    • Adding virtual RAM (swap, pagefile, zRAM)

    • Lowering any of these parameters: resolution, steps, frame count
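Points 5 and 6 above (matching the aspect ratio and staying inside the 480p-720p range) can be sketched as a small helper. This is my own illustration, not part of any UI; the multiple-of-16 snap is a common requirement of WAN workflows and slightly distorts odd aspect ratios:

```python
def video_size(img_w: int, img_h: int, target_short: int = 480) -> tuple[int, int]:
    """Pick a video resolution matching the init image's aspect ratio.

    target_short is the desired short side, clamped to WAN 2.2's
    480-720 range. Both dimensions are snapped to multiples of 16.
    """
    target_short = max(480, min(720, target_short))
    scale = target_short / min(img_w, img_h)
    snap = lambda v: max(16, round(v * scale / 16) * 16)
    return snap(img_w), snap(img_h)

print(video_size(1080, 1920))        # portrait 9:16  -> (480, 848)
print(video_size(1920, 1080, 720))   # landscape 16:9 -> (1280, 720)
```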

✋ WAN 2.2 limitations / known issues

WAN 2.2 is far from perfect, yet it is the most advanced open source video generation model.

Here are some known limitations of any WAN base checkpoint:

  • Heavily censored 

  • MoE is difficult to use and set-up

  • Heavy on needed resources

    • local usage only works with scaled/quantised checkpoints

  • Generations with the exact same settings but different seeds will differ a lot from each other

  • LoRAs give good guidance, but not always the desired results

  • Training is difficult

  • Detailed movements of fine things like eyes, fingers, lips and hair all turn grainy at lower resolutions

  • It has a strong bias toward adding movement to lips, eyes and fingers

  • You have to use good prompts for good results. It is not very forgiving: every word can change the result completely.

🔍 Explanation about frames and times

WAN 2.2 is basically trained for 720p @24 FPS.

It works well with 16 FPS and lower resolutions down to 480p, but will lose motion.

Almost all LoRAs work with 16 FPS and 81 frames - a total of 5 seconds

So you want a similar output of 5 seconds to get optimal results. Longer videos can be generated, but will most likely degrade in quality and lose prompt adherence.

WAN 2.2 wants a total frame count of (16 × n) + 1 or (24 × n) + 1, with n being a whole number of seconds.

You want a whole number of seconds with the matching total frames:

Example:

  • 81 (80+1) frames @16 FPS = 5 seconds

  • 97 (96+1) frames @16 FPS = 6 seconds

  • 121 (120+1) frames @24 FPS = 5 seconds

What you do not want to do is cut into the frames, like:

  • 81 (80+1) frames @24 FPS = 3.4 seconds

  • This will cut your motion short or add artifacts
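The frame-count rule above as a quick sketch:

```python
def total_frames(seconds: int, fps: int) -> int:
    # Full seconds plus one frame: (fps * n) + 1
    return fps * seconds + 1

def duration(frames: int, fps: int) -> float:
    # Seconds of motion covered by a frame count at a given FPS
    return (frames - 1) / fps

print(total_frames(5, 16))   # 81
print(total_frames(6, 16))   # 97
print(total_frames(5, 24))   # 121
print(duration(81, 24))      # not a whole number of seconds -> avoid
```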

👣 Steps and motion

Since the HIGH checkpoint is for the motion and the LOW for the details, you can experiment by deviating from the standard 50/50 rule.

What does that mean:

  • You always want at least the minimum number of steps the checkpoint/model wants

  • You can change the step at which the swap from HIGH to LOW happens

  • More steps in HIGH means more motion, less detail

  • More steps in LOW means more detail, less motion


Example:

You have a 4-step checkpoint.

You could do 6 total steps, 4 in HIGH and 2 in LOW

This will add more motion while getting the needed 2 steps in LOW for the details, but will increase computing time and requirements.

LOW steps always slow things down a bit, but without them you get no fine details.

You could sacrifice LOW steps to get more motion: 3 steps HIGH, 1 step LOW.

You could sacrifice HIGH steps to get super fine details: 1 step HIGH, 3 steps LOW.
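The splits above can be sketched as a tiny helper (my own illustration; `high_fraction` is a hypothetical parameter name):

```python
def step_split(total_steps: int, high_fraction: float = 0.5) -> tuple[int, int]:
    """Split sampler steps between the HIGH and LOW expert.

    high_fraction > 0.5 biases toward motion, < 0.5 toward detail.
    Always keeps at least 1 step on each expert.
    """
    high = round(total_steps * high_fraction)
    high = min(max(high, 1), total_steps - 1)
    return high, total_steps - high

print(step_split(8))          # (4, 4) - the standard 50/50 rule
print(step_split(6, 4 / 6))   # (4, 2) - the motion-biased example above
print(step_split(4, 0.25))    # (1, 3) - super fine details
```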


Extra information:

In all these examples you can see that high motion lacks details on hands, eyes, hair and liquids, and introduces artifacts.

All videos that are slower paced have the details, but less action.

On speed-up checkpoints and LoRAs you can degrade your video by adding too many steps, so do not overdo this. They can run at 2x their standard steps, e.g. a 4-step model can run with 8; everything above that is useless or will degrade the result.

Base WAN 2.2 can run in a range of 20-40 steps.

👀 Frame Interpolation

There are multiple techniques to add your extra frames. They all do a good job. 

The most common are: RIFE, FILM, GIMM-VFI.

  • RIFE is the standard and very fast

  • FILM is higher quality, but slow

  • GIMM-VFI is the most advanced and highest quality, but needs significant computing time.

Interpolation adds to the resources and time needed to generate a video, but does not add much quality. Even worse, with too much interpolation you will get artifacts.

If you really want that little bit of extra smoothness, go with 1 interpolated frame, doubling your total frames within the given seconds.

Sometimes the wording differs between applications. In SwarmUI a multiplier of 1 means disabled; a multiplier of 2 means double the frame rate.

For 5-second videos in medium (480p) to high (720p) resolutions the differences are very small.

My advice is to try without or try RIFE first.


🎏Up-casting

This process means you just fit your initial video in a other desired frame rate.

You did 81 frames video (5s in 16 fps) and up-cast it to 32 fps (without frame interpolation) you get that 81 frames with 32 fps resulting in 2.5s.

This will play your whole frames in just half the time, this will look fast, but will be short.

On the other hand if you have a 20s video and want it to play just faster in 10s without doing double the frames this would be a good method to do it.
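The up-casting math as a sketch:

```python
def playback_seconds(frames: int, fps: int) -> float:
    # Same frames, different playback rate - nothing added or removed.
    return (frames - 1) / fps

print(playback_seconds(81, 16))  # 5.0
print(playback_seconds(81, 32))  # 2.5 -> same content in half the time
```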

📝 Basic examples on how to write prompts

Prompting checklist

Here comes the fun part. Remember every word counts! There is no absolute formula, but here are some basic ✅ Do's and ❌ Dont's I would recommend for the beginning:

✅ Write 1-2 sentences of the actual setting, even if you provide an initial image

✅ Provide details and choose precise words - "blue eyes" are better than just "eyes"; "white" is not "whitish"; ...

✅ Describe what should happen in active speech ~ing "doing, walking, blinking,..."

✅ Describe sequential actions separated by periods "."

✅ Get the order right; the order of the description matters a lot

✅ Be creative with wording; try changing "zoom" to "pan" if something is not working as expected

✅ Long videos (>5s) need more prompting or WAN 2.2 will start to guess and repeat or add things

✅ Different resolution/aspect ratios may alter the outcome completely even with the same prompt, every setting matters - In the end WAN 2.2 is trained for 720p (16:9)

✅ Do some fast samples at low resolution (e.g. 368x624) to get a hint whether your prompt and other settings reflect your desired outcome, then raise to your favourite resolution

❌ No one-liners; they can work, but mostly will not

❌ Do not endlessly repeat an action, unless you want a messy repetition of that action

❌ Don't make WAN guess things because you did not include them in the prompt


Prompt structure

The standard structure of a prompt could be like this:

Scene > Action > Mood > camera composition

Scene: A woman with blue eyes and a dinner dress is standing next to a round wooden table, she is holding a glass filled with milk.

Action: She is raising her hand with the glass to her mouth, drinking the milk in one sip.

Mood: The room is filled with warm light.

Camera composition: The camera is fixed. Still camera.

A woman with blue eyes and a black dinner dress is standing next to a round wooden table, she is holding a glass filled with milk. She is raising her hand with the glass to her mouth, drinking the milk in one sip. The room is filled with warm light. The camera is fixed. Still camera.
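If you script your workflows, the Scene > Action > Mood > Camera structure can be assembled like this (a hypothetical helper, not part of any UI):

```python
def build_prompt(scene: str, action: str, mood: str, camera: str) -> str:
    # Keep the order fixed - the order of the description matters a lot.
    return " ".join(part.strip() for part in (scene, action, mood, camera))

print(build_prompt(
    scene="A woman with blue eyes and a black dinner dress is standing "
          "next to a round wooden table, holding a glass filled with milk.",
    action="She is raising the glass to her mouth, drinking the milk in one sip.",
    mood="The room is filled with warm light.",
    camera="The camera is fixed. Still camera.",
))
```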

Testing your prompt

The prompt is the most important part. I render multiple prompts at low resolution to see whether the video is going in the right direction. If things happen as expected I switch to a higher resolution, which adds details. This way I can test whether a prompt is good for the situation or not. The resolution has a real impact on the outcome, so a low resolution will not give the same result as a high resolution with the same prompt.

But the overall direction will still be the same. So you can elaborate and get as near as you can to your desired result.


☄️ Speed-Up tricks

They have all the same goal: Lowering the steps needed to get your result.

Just keep in mind this power comes from sacrifice. They all sacrifice quality against speed.

Sometimes more, sometimes less.

Here is my opinion on the 2 common speed-up tricks.

  • SageAttention

    • Will save you time, but is not very compatible with other speed-ups

    • Complicated to set up

    • Needs many steps (<10 brings little benefit); it will save you time on successive runs

    • It can cause issues, introduce artifacts, and more ...

    • Sometimes incompatible with LoRAs and other add-ons

    • Will slightly reduce quality

    • (I don't use this at all ... more problems than benefits)

  • Distilled/SelfForcing - aka Lightning (4-8 step LoRAs/Checkpoints)

    • Minor quality loss, huge benefit

    • This will cut the time needed to a fraction of the original

    • Extremely high compatibility with LoRAs and other add-ons


🧺 Caching (memory cache implementations)

Both known caching implementations speed things up by caching data across successive steps. The biggest benefit arises when there are many steps; with a low step count the benefit is small. All of these degrade visual quality by some amount, more or less. They are a tool for when you are fine with sacrificing quality for speed. Mind that on speed-ups like 4 steps the inference speed gain is not as significant.

MagCache

Uses a natural pattern in how models work — the "magnitude" (size) of changes between steps decreases predictably — to decide when to skip steps.

Needs just one random prompt to calibrate, making it faster, simpler, and more reliable across different models and prompts.

TeaCache

Skips unnecessary steps in video generation by predicting when it's safe to do so, based on learned patterns from many prompts.

Needs calibration on 70 different prompts, which is slow and can overfit (works well only on similar prompts).

Comparison

  • Speed: MagCache is faster — up to 2.8× speedup vs TeaCache’s 1.6×.

  • Quality: MagCache keeps better video quality (higher SSIM, PSNR, lower LPIPS). TeaCache often causes blurry or distorted videos, especially in color and details.

  • MagCache is simpler, faster, and better quality than TeaCache. It’s more general, needs less setup, and works well across different video models.
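The MagCache idea (skip a step when the predicted change is small) can be illustrated with a toy sketch. This is not the real implementation; the magnitudes and threshold are made up for illustration:

```python
def plan_skips(step_magnitudes: list[float], threshold: float = 0.05) -> list[bool]:
    """True = compute the step, False = reuse the cached residual.

    Step 0 is always computed; later steps are skipped when the
    magnitude of the predicted change falls below the threshold.
    """
    return [i == 0 or mag >= threshold
            for i, mag in enumerate(step_magnitudes)]

# Magnitudes typically shrink over the schedule:
print(plan_skips([1.0, 0.4, 0.12, 0.04, 0.02]))
# [True, True, True, False, False] -> the last two steps reuse the cache
```

With few total steps (e.g. a 4-step speed-up) almost nothing falls below the threshold, which is why the caching gain is small there.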


🔝 GPU considerations (which GPU to use)

S-Tier

Nvidia

All Nvidia GPUs from the last 10 years (since Maxwell/GTX 900) are supported in pytorch and they work very well.

3000 series and above are recommended for best performance. More VRAM is always preferable.

Why you should avoid older generations if you can.

Older generations of cards will work; however, performance might be worse than expected because they don't support certain operations.

Here is a quick summary of what is supported on each generation:

  • 50 series (blackwell): fp16, bf16, fp8, fp4

  • 40 series (ada): fp16, bf16, fp8

  • 30 series (ampere): fp16, bf16

  • 20 series (turing): fp16

  • 10 series (pascal) and below: only slow full precision fp32.

Models are inferenced in fp16 or bf16 for best quality depending on the model, with the option of fp8 on some models for less memory / more speed at lower quality.

Note that this table doesn't mean it's completely unsupported to use fp16 on the 10 series, for example; it just means it will be slower because the GPU can't handle it natively.

Don't be tempted by the cheap Pascal workstation cards with lots of VRAM; your performance will be bad.

Anything older than the 2000 series, like Volta or Pascal, should be avoided because they are about to be deprecated in CUDA 13.
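The support table above as a quick lookup (my own summary, not an official API):

```python
# Natively supported inference dtypes per Nvidia generation,
# as summarized in the list above. Anything missing falls back
# to a slower emulated path.
NATIVE_DTYPES = {
    "blackwell": {"fp16", "bf16", "fp8", "fp4"},
    "ada":       {"fp16", "bf16", "fp8"},
    "ampere":    {"fp16", "bf16"},
    "turing":    {"fp16"},
    "pascal":    set(),  # fp32 only
}

def runs_natively(generation: str, dtype: str) -> bool:
    return dtype in NATIVE_DTYPES.get(generation, set())

print(runs_natively("ada", "fp8"))     # True
print(runs_natively("ampere", "fp8"))  # False -> fp8 falls back, slower
```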

B Tier

AMD (Linux)

Officially supported in pytorch.

Works well if the card is officially supported by ROCm but can be a bit slow compared to price equivalent Nvidia GPUs depending on the GPU. The later the GPU generation the better things work.

RDNA 4, MI300X: Confirmed "A tier" experience on latest ComfyUI and latest pytorch nightly.

Unsupported cards might be a real pain to get running.

AMD (Windows)

An official pytorch version exists and works, but can be a bit slow compared to the Linux builds. The oldest officially supported generation is the 7000 series.

Intel (Linux + Windows)

Officially supported in pytorch. People seem to get it working fine.

D Tier

Mac with Apple silicon

Officially supported in pytorch. It works but they love randomly breaking things with OS updates.

Very slow. A lot of ops are not properly supported. No fp8 support at all.

F Tier

Qualcomm AI PC

Pytorch doesn't work at all.

They are: "working on it", until they do actually get it working I recommend avoiding them completely because it might take them so long to make it work that the current hardware will be completely obsolete.

♨️ Source of GPU recommendations is ComfyUI


Congratulations, you reached the END!

Now go and make that awesome art you are here for! ~ 💥

Remember to post something to the resource pages of the person you want to support by hitting these buttons:

