WAN 2.2 local lora training guide [Windows/Linux]

Intro

I've had a lot of questions about how I train my Loras locally on my poor overworked 3090, so I figured I'd write a quick guide to keep everything in one place. I use diffusion-pipe for training, and it's not exactly user-friendly, so this guide assumes enough technical experience that you're comfortable typing commands into a terminal.

Training a WAN video Lora is a fair bit more involved than training an SDXL/Pony/Illustrious/whatever image Lora, so if you haven't done an image Lora before, I'd suggest you try one of those first. There are plenty of excellent guides available online and much more user-friendly tools like kohya_ss and OneTrainer, and an image Lora only takes 1-1.5 hours to bake rather than the 10-20 that WAN requires.

Setup

First things first, we need to set up diffusion-pipe, the program we will use for training. I followed SingularUnity's excellent guide for my initial setup; follow that guide until step 16, and it will take you through enabling WSL, installing Ubuntu and diffusion-pipe, and updating your WSL environment. (If you are using Linux, skip steps 1-4, as they pertain to Windows installations.)
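For reference, the setup boils down to roughly the sketch below. This is not a substitute for the linked guide (which may use conda rather than a plain venv, and also covers drivers and WSL specifics); it just shows the gist of cloning diffusion-pipe and installing its Python dependencies.

git clone --recurse-submodules https://github.com/tdrussell/diffusion-pipe
cd diffusion-pipe
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt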

We will now download the necessary WAN 2.2 files. From your diffusion-pipe directory, run the following commands:

Model repository (I2V)

huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir models/wan/Wan2.2-I2V-A14B

Model repository (T2V)

huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir models/wan/Wan2.2-T2V-A14B

Base model safetensors (I2V)

wget -P models/wan https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_i2v_high_noise_14B_fp16.safetensors
wget -P models/wan https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_i2v_low_noise_14B_fp16.safetensors

Base model safetensors (T2V)

wget -P models/wan https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_t2v_high_noise_14B_fp16.safetensors
wget -P models/wan https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_t2v_low_noise_14B_fp16.safetensors

LLM (UMT5) model

wget -P models/wan https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp16.safetensors

These files are massive and will take a while to download; I think everything above adds up to over 100GB in total!
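One note: the huggingface-cli command used above ships with the huggingface_hub Python package, and it should be able to resume an interrupted download if you simply re-run the same command. If your shell complains that huggingface-cli isn't found, install the package inside your diffusion-pipe environment first:

pip install -U "huggingface_hub[cli]"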

Configuration

Now that all of our files have downloaded, we can configure diffusion-pipe. I'll attach sample configuration files, which you can place in diffusion-pipe/examples. BE SURE TO UPDATE THE PATHS TO MATCH YOUR ENVIRONMENT.

These files are how you configure your dataset & training parameters including repeats, epochs, optimizer and learning rate. I'm no expert, but I have gotten good results with the included learning rates & optimizer settings. I generally prefer to have a high number of epochs and just 1 repeat per epoch.

The default example configs in diffusion-pipe/examples have excellent comments that describe what each setting does. If you have questions or encounter issues, check those examples first!!! I have removed many of the comments from my example files so it's easier to see which settings I have actually changed, but I would encourage you to keep the original commented files on hand for reference.
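To give you an idea of the shape of these files before you open the attached ones, here is a minimal dataset config sketch. The field names come from diffusion-pipe's bundled dataset example, but the values and the path below are placeholders I made up for illustration; treat the attached files and the commented originals as the source of truth.

# my_dataset.toml (sketch only)
# Train on small square clips.
resolutions = [256]
# Frame-count buckets; the 1 bucket catches any still images mixed into the set.
frame_buckets = [1, 33]

[[directory]]
# Placeholder path: the folder holding the .mp4 clips and their matching .txt captions.
path = '/home/YOUR_USER/training/my_lora_dataset'
num_repeats = 1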

Dataset collection & preparation

I train my loras on small square videos at 256x256 resolution, typically 10-20 seconds long. I've found that 20 clips is a good amount, though you could probably go as low as 12-15 if they are high enough quality. (By "high quality" I mean that the clips are representative of the motions you want to reproduce, not that they are high resolution/framerate/whatever.)

There are probably professional tools that can trim videos, but I do it the caveman way: I play the clip and use the built-in Windows screen capture tool (Win + Shift + S) to capture a 256x256 section of the screen. (This also ensures that the clips all end up at the same framerate.)
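If you'd rather script this than screen-capture by hand, ffmpeg can do the trim, crop, and resize in one pass. A sketch, where the timestamps, frame rate, and file names are placeholders: this takes a 15-second cut starting at the 5-second mark, center-crops to a square, scales it to 256x256, forces a constant 16 fps (match whatever your setup expects), and drops the audio.

ffmpeg -ss 5 -i input.mp4 -t 15 -vf "crop='min(iw,ih)':'min(iw,ih)',scale=256:256" -r 16 -an output.mp4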

The clips are then captioned. Simply create a .txt file with the same name as the video: the caption for 1.mp4 goes in 1.txt. Ironically, I haven't found a good AI tool that will accurately caption nsfw videos yet, so I do this manually. It sucks, but it is very important. You describe who is doing what, and where. I've found that being very literal and not overly descriptive leads to good results (I have edited some of the words to ensure that this guide remains PG):


The video shows an Asian woman with long brown hair. She is completely clothed and has large eyes.

She is bouncing up and down, having fun with a man in the recovery position. The view is POV. His credit card is sliding in and out of her wallet. She is leaning back slightly, placing her arm on the man's thigh to support her body weight. The woman is aggressively moving her hips up and down, slamming the card inside her wallet. The card goes completely inside the wallet, then comes back out. They are having fast, rough, intense fun.

The video focuses on her wallet, eyes, and face.
As the video progresses, she moans and tilts her head as an expression of ecumenical pleasure.

The background has a white bookshelf with a plant and a few books. The scene takes place indoors on a bed.


I would use the above as a template, changing mostly the description of the actors in paragraph 1, the scenery in paragraph 4, and any outlier/exceptional actions in paragraph 3. The description of the action in paragraph 2 is mostly unchanged from clip to clip.
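One last sanity check before training: make sure every clip actually has a caption file. From inside the dataset folder (assuming the clips are .mp4), a quick loop will flag any video that is missing its .txt:

for f in *.mp4; do [ -f "${f%.mp4}.txt" ] || echo "missing caption: $f"; done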

Training

Now that diffusion-pipe is installed, the configs are configured, and the dataset is created, captioned, and moved to the folder indicated in dataset.toml, it is finally time to begin training.

Simply run the deepspeed command with your config files to start training:

deepspeed --num_gpus=1 train.py --deepspeed --config examples/wan_video_high.toml && deepspeed --num_gpus=1 train.py --deepspeed --config examples/wan_video_low.toml 

The attached example configs work for me with a 3090 and 32GB of system RAM; you will likely need to adjust them for your system. If you get out-of-memory errors (which tend to appear about 2 minutes after you run the training command), try adjusting the blocks_to_swap option.
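While the run spins up, it's worth keeping an eye on VRAM so you can catch an impending OOM early and tweak blocks_to_swap before wasting time. In another terminal (this should work inside WSL with a recent NVIDIA driver; if not, run nvidia-smi from a regular Windows terminal instead):

watch -n 1 nvidia-smi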

And that's it! If all goes well, in 10-20 hours diffusion-pipe will have generated Loras called adapter_model.safetensors in whichever directory you specified in your training TOMLs. Now generate some videos and post the best ones to Civitai so we can all benefit!
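If you're not sure where diffusion-pipe put them, something like this will list every saved Lora under your output directory (substitute the output_dir you set in the training TOMLs; the path here is a placeholder):

find /path/to/your/output_dir -name adapter_model.safetensors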

Shameless plug

All of my models, guides, and images will always be free, and I do my best to answer questions in comments or DMs. Everything I have learned about image generation I have learned from free resources that others have taken the time & effort to create, so it wouldn't be right to turn around and put things behind a paywall. That said, if you like my models or this guide and feel like subsidizing my electric bill, I would be thrilled if you would consider buying me a coffee.

Outro

That should be it! Let me know in the comments if you have any questions, feedback, or suggestions. I plan on continually updating this guide as I learn more, so do check back from time to time.
