
"WAN2.2 i2v" and "AI-Toolkit" Beginner's Training Guide (For Local PC Training)

This article was machine-translated from Japanese to English, so I would appreciate it if you could point out any typos or errors.

You are free to translate or reproduce this article, and there is no need to report reprints.

This article was written in March 2026, and the information in it will likely be outdated within a few months.



Prerequisites

This article explains how to carry out WAN2.2 I2V training on a local PC.

The training tool is "AI-Toolkit", which is installed via "StabilityMatrix".

    

🚩If you only want to see the training settings, please refer only to ■Chapter 5■ (●Create JOB).    

⚠️The content of this article may be updated periodically based on the author's experience.



■Chapter 1 - Physical Environment -■

WAN2.2 training has very demanding operating environment requirements.

As a practical minimum requirement, you need a CUDA-compatible GPU with 32GB or more of VRAM.

For local training, the minimum is either an "NVIDIA RTX 5090 (32GB)" or an "NVIDIA RTX PRO 4500 Blackwell (32GB)".

00014-2807205323.png

■Chapter 2 - Software Environment -■

"StabilityMatrix" is ​​the ideal environment for trying out the latest AI tools.

Installing various packages is easy.

Updates are also easy.

However, for full-scale operation, it may be better for various reasons to set up the application separately from "StabilityMatrix" via Git.

⚠️For the most up-to-date installation guide, it might be best to consult an AI.

 

1. "StabilityMatrix" Installation Guide

GitHub: https://github.com/LykosAI/StabilityMatrix

Guide(JP): https://shichisan-blog.com/stability-matrix_dounyuu/

 

2. "AI-ToolKit" Installation Guide

GitHub: https://github.com/ostris/ai-toolkit

Guide(JP): https://note.com/aiaicreate/n/nd4f8e7d95efe

 

3. "HENTAI APP" Installation & User Manual Guide

When creating clip videos for datasets, I use the HENTAI APP.

GitHub: https://github.com/akitoshi1/HENTAI_APP

Guide(en): https://civitai.com/articles/26629/create-wan22-datasets-from-videos-optimized-software-has-been-released

00022-1778083124.png

■Chapter 3 - HIGH or LOW -■

Conclusion first:

  1. For simple motions, prioritize HIGH.

  2. If detailed additional rendering is required, prioritize LOW.

 

Before starting the training, you need to reconfirm the special specifications of WAN2.2.

That is, the existence of "High noise" (hereafter "HIGH") and "Low noise" (hereafter "LOW").

WAN2.2 has gained very flexible video generation capabilities by dividing the model into HIGH and LOW.

However, users must stay constantly aware of HIGH and LOW, which makes it a difficult specification to work with.

Furthermore, for model trainers, this model division specification is truly crap, a towering pile of crap. Crap!

  • HIGH: Generates overall motion such as camera work and character movement motion.

  • LOW: Generates detailed rendering such as object details and additional objects.

 

Before training a model,

  • Is it a motion model that prioritizes HIGH?

  • Is it a detailed rendering model that prioritizes LOW?

 

Please be sure to understand this before beginning to create your dataset.

This HIGH or LOW recognition will have a significant impact on the model's training results.

 

 

This article presents the following two models as examples.

[Explanation Sample] HIGH Model

Walking Rotation(WAN2.2 I2V 14B) Loop walking concept

image.png

 

[Explanation Sample] LOW Model

Drift stop from close-up face (WAN2.2 I2V) (ANIME) AKIRA in Kaneda's Bike Drift Stop

image.png

■Chapter 4 - Creating Dataset -■

Conclusion first:

  1. If you have 32GB of GPU memory, prepare video clips with playback times of 3 seconds or less.

  2. The starting point is of utmost importance; be sure to specify a starting point that matches the starting image specified in I2V.

  3. For a HIGH model, 5 clips or fewer are sufficient.

  4. For a LOW model, prepare 10 to 20 clips.

 

HIGH and LOW Dataset Samples

https://mega.nz/folder/1gRzUAoY#sYIcL4lPvPZWKaeeZ--VTw

 

HIGH Dataset Sample (4 clips)

image.png

LOW Dataset Sample (21 clips)

image.png

● How to Generate Clips

For the WAN2.2 dataset, you specify a video file.

There are two main approaches to preparing video files:

  • Extract clips from existing videos

  • Create clips using WAN2.2.

  

I handle each case individually.

  • For HIGH models that prioritize motion and camera work that cannot be reproduced with WAN2.2, I extract clips from existing videos using the HENTAI APP.

  • For LOW models that can be reproduced with WAN2.2 but require detailed rendering of the final frame, I generate clips by specifying the start and end images in WAN2.2.

● Clip Playback Time

Please try to keep the clip playback times as consistent as possible.

If there are variations of even a few seconds in playback time, it may have a significant impact on the model's quality.

If your GPU memory is 32GB, 4 seconds is the limit.

Even at 4 seconds, you should expect considerable degradation due to dropped frames.

Please try to unify the playback time to within 3 seconds whenever possible.
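The playback-time rule above is easy to check mechanically. Below is a minimal sketch; the file names and durations are hypothetical example data, while the 3-second target comes from this article.

```python
# Flag clips whose playback time exceeds the 3-second target recommended
# above. File names and durations are hypothetical example data.

def check_clip_durations(clips, target=3.0):
    """Split (name, seconds) pairs into those within the target and those over it."""
    ok, too_long = [], []
    for name, seconds in clips:
        (ok if seconds <= target else too_long).append((name, seconds))
    return ok, too_long

clips = [("walk_01.mp4", 2.9), ("walk_02.mp4", 3.0), ("walk_03.mp4", 4.5)]
ok, too_long = check_clip_durations(clips)
for name, seconds in too_long:
    print(f"{name}: {seconds:.1f}s exceeds the 3s target - consider trimming")
```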

  

● Number of Clips

There is no definitive answer regarding the appropriate number of clips. The following are my guidelines based on experience.

  • For HIGH models: 1 to 5 clips should be sufficient. For truly simple models, even 1 clip is enough.

  • For LOW models: Prepare 10 to 20 clips. Increasing the number further doesn't seem to contribute much to diversity.

 

●Clip Content

Naturally, a dataset of disjointed clips will generate a disjointed model.

  • The starting point of each clip should match the starting image you will give the I2V model.

  • When extracting clips from a video, carefully select a starting point that matches that starting image.

  • For the clip content, keep the parts you want the model to reproduce consistent in movement and rendering across clips.

  • For the parts you don't want reproduced, vary them as much as possible.

 

●Regarding video captions

For captions, specify the same common positive prompt you use when generating videos with WAN2.2.

If the starting images differ between clips, note that difference in the caption.

(e.g., Fullbody or Cowboy shot, Standing or Sitting)
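The captioning rule above can be sketched as follows. The trigger word and tags here are hypothetical examples, not values from this article; the pattern (one shared prompt, plus a per-clip tag only where the starting image differs) is the article's.

```python
# One shared positive prompt for every clip, plus a per-clip difference tag
# only where the starting image differs. Trigger word and tags are
# hypothetical examples.

COMMON_PROMPT = "mytrigger, walking rotation, anime style"  # shared across clips

def build_caption(common, difference=None):
    """Join the common prompt with an optional starting-image difference."""
    return f"{common}, {difference}" if difference else common

captions = {
    "clip_01.mp4": build_caption(COMMON_PROMPT, "fullbody, standing"),
    "clip_02.mp4": build_caption(COMMON_PROMPT, "cowboy shot, sitting"),
    "clip_03.mp4": build_caption(COMMON_PROMPT),  # no difference to note
}
```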



■Chapter 5 - Training with AI-Toolkit -■

●Datasets

Place the dataset in the "datasets" folder of AI-Toolkit.

In this example, the path is as follows:

D:\StabilityMatrix\Packages\AI-Toolkit\datasets

image.png

  

Click "Datasets" in the menu on the left.

Verify that you can select the dataset you want to use for training in the list that appears on the right.

image.png

Check each video and its caption in the dataset.

You can edit captions and delete clips from here.

image.png
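Before starting a job, it can help to sanity-check the folder from a script. The sketch below assumes the same-base-name .txt sidecar convention AI-Toolkit uses for dataset captions; the dataset subfolder name is a hypothetical example.

```python
# Check that every clip in a dataset folder has a caption file next to it,
# assuming the same-base-name .txt sidecar convention used for AI-Toolkit
# dataset captions. The dataset subfolder name is a hypothetical example.
from pathlib import Path

VIDEO_EXTS = {".mp4", ".webm", ".mov"}

def missing_captions(dataset_dir):
    """Return names of video files in dataset_dir that lack a matching .txt file."""
    root = Path(dataset_dir)
    return sorted(
        p.name for p in root.iterdir()
        if p.suffix.lower() in VIDEO_EXTS and not p.with_suffix(".txt").exists()
    )

demo = Path(r"D:\StabilityMatrix\Packages\AI-Toolkit\datasets\my_dataset")
if demo.exists():
    for name in missing_captions(demo):
        print(f"missing caption: {name}")
```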

   

 

●New JOB

Click "New Job" in the menu on the left.

📘For a detailed explanation of each parameter, please refer to this article.

https://www.runcomfy.com/trainer/ai-toolkit/wan-2-2-i2v-14b-lora-training

  

The initial state will be displayed as follows:

image.png

 

🚩This article only describes the parameter settings for a local PC (GPU 32GB).

This section explains common items that should be changed from their default values.

1️⃣JOB

image.png

Training Name:

Please enter the model name to be output. (The suffixes "_high_noise" and "_low_noise" will be appended automatically.)

 

Trigger Word:

Set a common trigger prompt that you will also specify when generating with WAN2.2 I2V.


2️⃣MODEL

image.png

Model Architecture:

Select "WAN 2.2 I2V (14B)"

 

Options-Low VRAM:

image.png

3️⃣Quantization

image.png

Transformer:

Select "4 bit with ARA"


4️⃣Multistage

Case of HIGH Model

Switch Every:

Set '10'

image.png

 

Case of LOW Model

Switch Every:

Set '35'

image.png

5️⃣Target

image.png

No changes.


6️⃣Save

image.png

No changes.


7️⃣Training

Case of HIGH Model

Steps:

Set '500'-'1000'

( I specify 500, and if that fails, I specify 1000. It might be wiser to specify 1000 from the start.)

image.png

 

Case of LOW Model

Steps:

Set '1500'-'2000'

( I specify 1500, and if that fails, I specify 2000. It might be wiser to specify 2000 from the start.)

image.png

 

Reference training times:

⏳500 steps: 3 hours...

⏳1500 steps: 10 hours ...

⏳2000 steps: 15 hours ...

⏳3000 steps: 24 hours ...

Time is the most expensive cost. Train smartly.
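For planning, the observed times above can be turned into a rough estimator. This is simple linear interpolation over this article's measurements on the author's hardware, not a guarantee; your times will vary with GPU, settings, and dataset.

```python
# Rough training-time estimator built from the observed data points above
# (500 steps -> 3 h, 1500 -> 10 h, 2000 -> 15 h, 3000 -> 24 h). These are
# this article's measurements; treat the output as a ballpark only.

OBSERVED = [(500, 3.0), (1500, 10.0), (2000, 15.0), (3000, 24.0)]

def estimate_hours(steps):
    """Linearly interpolate training time (hours) between observed points."""
    pts = sorted(OBSERVED)
    if steps <= pts[0][0]:
        return pts[0][1] * steps / pts[0][0]
    for (s0, h0), (s1, h1) in zip(pts, pts[1:]):
        if steps <= s1:
            return h0 + (h1 - h0) * (steps - s0) / (s1 - s0)
    return pts[-1][1] * steps / pts[-1][0]  # extrapolate past the last point

print(f"~{estimate_hours(1000):.1f} hours for 1000 steps")
```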


8️⃣DATASETS

image.png

Target Dataset:

Please select the dataset you wish to use for training.

 

Num Frames:

For clips of 1 second: 18

For clips of 2 seconds: 33

For clips of 3 seconds or more: 39-42

If you have 32GB of GPU memory, you can specify 42... however, depending on the number of clips, training may exceed the physical GPU memory.

For safety, I specify the maximum value as 39.
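The Num Frames table above can be encoded as a small helper. The values are this article's recommendations for a 32GB GPU, with the author's safe maximum of 39 as the default cap.

```python
# Encode the Num Frames table above: 1 s -> 18, 2 s -> 33, 3 s or more -> 39-42.
# The default cap of 39 reflects the author's "safe" maximum for a 32GB GPU.

def num_frames(clip_seconds, safe_cap=39):
    """Pick a Num Frames value from the article's table for a given clip length."""
    if clip_seconds <= 1:
        return 18
    if clip_seconds <= 2:
        return 33
    return min(42, safe_cap)  # 3 s or more

print(num_frames(3))  # -> 39 with the default safe cap
```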

 

Resolutions:

image.png

9️⃣Sample

image.png

 

Advanced Sampling:

image.png

⚠️If Disable Sampling is not set to "ON", a fatal error will occur at the start of training.


Create Job

Once you have finished setting the parameters, please click "Create Job".

image.png

 

Click the "▶️" button in the upper right corner of the screen to start the training.

image.png

  

Once the progress bar shows "1," your training has begun.

Congratulations!!

Now all you have to do is wait!!

image.png

⚠️As of March 2026, a fatal error occurs when starting training in the initial installation state of AI-Toolkit🤪🤘.

・Error running job: Failed to import diffusers.schedulers.scheduling_dpmsolver_multistep because of the following error (look up to see its traceback):

image.png

Solution: The most reliable solution is to downgrade NumPy to the stable version 1.26.4.

1.Please select AI-Toolkit from the Packages tab in StabilityMatrix.

image.png

2.Click the "︙ (ellipsis)" in the upper right corner.

image.png

 

3.Open Python Packages and search for numpy in the list.

image.png

4.Change (downgrade) the version to 1.26.4 and apply.

image.png

・Error running job: The size of tensor a (36) must match the size of tensor b (16) at non-singleton dimension 1

image.png

Solution: Please set "Disable Sampling" in the sample to ON.

image.png

⬇️

image.png

I hope this annoying error is fixed as soon as possible.


■Chapter 6 - What is WAN2.2 Training? -■

WAN2.2 training requires an enormous amount of time.

⏳500 steps: 3 hours...

⏳1500 steps: 10 hours or more...

⏳2000 steps: 15 hours or more...

⏳3000 steps: 24 hours or more...

 

During this time, GPU utilization will sit at 100%, and on an NVIDIA RTX 5090, GPU power draw will hold at 550W.

Of course, the PC cannot be used for any other purpose during training.

  

What is gained from this?

WAN2.2 is already becoming a legacy platform.

WAN2.2 training skills will quickly become obsolete.

Nevertheless, WAN2.2 possesses a "freedom" that other platforms lack.

That freedom granted me unlimited and unrestricted video generation capabilities.

I believe Illustrious has granted you unlimited and unrestricted image generation capabilities.

Wouldn't you like to acquire unlimited and unrestricted video generation capabilities next?


■Finally■

Take Wednesday off from training.

Otherwise, Windows Update will ruin everything and drive you crazy...

00046-3828026895.png

I express my heartfelt respect and gratitude to everyone involved with CivitAI and all of its users.

I sincerely hope that CivitAI remains the most ROCK place there is, overflowing with love, freedom, peace, equality, and HENTAI.

thank you !!

(suteakasuteakasuteka434)
