
Ryuko Matoi šŸŽ„ Wan2.1-T2V-14B

Type: LoRA (SafeTensor, verified)
Published: Mar 15, 2025
Base Model: Wan Video
Training: 15,050 steps, 35 epochs
Usage Tips: Strength 1
Trigger Words: Ryuko-chan
Hash (AutoV2): 110B060966
Author: seruva19

Description

"To Hell With Your Opinion. I'll Take My Own Path No Matter What Anyone Else Says." - Ryuko Matoi

Usage

The trigger word is "Ryuko-chan".

I don't use a specific prompt format yet (still experimenting), but I usually start all prompts with "High quality 2D art animation." For example: "High quality 2d art animation. Ryuko-chan, with long red hair and a black trench coat, stands on top of a ruined skyscraper. The city burns in the distance, smoke rising into the dark sky. The wind makes her coat billow as she crosses her arms. The camera suddenly zooms in on her determined face. She smirks, tilts her head slightly, and winks."

For inference, I use Kijai's wrapper. Unfortunately, Civitai does not seem to parse metadata from it, but the workflow should be embedded in each clip. Just in case, here's an example workflow in JSON format. It's still a work in progress, so the right side is a bit messy, but overall fully functional.

All clips I post are the raw output of the LoRA; I do not use upscaling or frame interpolation (they would misrepresent the LoRA's true capabilities). The following parameters are constant across all clips:

Sampler: unipc
Steps: 20
Cfg: 6
Shift: 7
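
For batch scripting, these constants (plus the prompt prefix and trigger word mentioned above) can be bundled in one place. A minimal sketch with hypothetical names, not tied to any particular wrapper's API:

# Hypothetical settings bundle for batch generation scripts; the values
# simply mirror the constants and prompt convention described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class GenSettings:
    sampler: str = "unipc"
    steps: int = 20
    cfg: float = 6.0
    shift: float = 7.0
    lora_strength: float = 1.0          # recommended strength for this LoRA
    prompt_prefix: str = "High quality 2D art animation. "
    trigger: str = "Ryuko-chan"

settings = GenSettings()
prompt = settings.prompt_prefix + settings.trigger + " dancing in the bar."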

If you check my workflow, you'll see that I use all possible optimizations introduced since Wan2.1 came out. And I really want to express my appreciation to the software developers and ML engineers (Comfyanonymous, Kijai, the TeaCache team, and many others) who made it possible to run Kling-level video models on a consumer GPU at satisfying speed and with (almost) no compromise in quality.

I started with "pure" FP8 + SDPA, and below is a breakdown of the speed improvements I achieved with each optimization technique.

For reference, my setup is: RTX 3090, 64 GB RAM, Win 11, Python 3.11.6, Torch 2.7.0.dev20250311+cu126, Sage Attention 2.1.1, Triton 3.2.0, NVidia driver 572.16, ComfyUI portable v0.3.26.

The times I mention below are not for rendering the showcase clips in the gallery; most of those are 640x480x81, which adds about 2 minutes to the total time. That's fine, though, since I usually launch 50-60 prompts at once and check back a few hours later to collect the results.

However, the times that matter most to me are for running Wan-14B T2V during testing, when I compare different LoRA versions. During this phase I render at 512x512x65. 99% of these test clips go straight to the recycle bin after being generated, but they provide a clear assessment of LoRA quality, so I need test-time inference to be as fast as possible.

For 512x512x65 clip, 20 steps, UniPC sampler:

(Only the DiT inference phase is considered, as optimizations do not affect the time required for encoding prompts with UMT5 or decoding latents with VAE. Additionally, this time is negligible, taking only 2ā€“3 seconds.)

  • fp8_e4m3fn + sdpa -> 09:24 (28.24s/it)

  • fp8_e4m3fn + sageattention 2 -> 06:53 (20.70s/it) +36% (1.36x)

  • fp8_e5m2 + torch.compile + sageattention 2 -> 06:21 (19.08s/it) +48% (1.48x)

  • fp8_e5m2 + torch.compile + sageattention 2 + teacache (0.250/6) -> 04:29 (13.50s/it) +109% (2.09x)

  • fp8_e5m2 + torch.compile + sageattention 2 + teacache + fp16_fast -> 03:23 (10.19s/it) +177% (2.77x)
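
For clarity, the speedup figures are just the baseline seconds-per-iteration divided by the optimized value; a quick check in Python:

# Sanity check for the speedups listed above: baseline s/it over optimized s/it.
baseline = 28.24  # fp8_e4m3fn + sdpa
for label, s_per_it in [("sageattention 2", 20.70),
                        ("+ torch.compile", 19.08),
                        ("+ teacache", 13.50),
                        ("+ fp16_fast", 10.19)]:
    print(f"{label}: {baseline / s_per_it:.2f}x")  # 1.36x, 1.48x, 2.09x, 2.77x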

So, almost a 3x speedup with minimal quality loss (well, for my use case; your experience may differ). I compared clips using the same seed, rendered with and without TeaCache/Sage Attention 2, and honestly, I couldn't see a clear drop in quality. Maybe there's a slight difference in very complex prompts with a lot of motion, but even in "raw" mode, those tend to struggle anyway. If it matters, I also make extensive use of Enhance-A-Video and SLG (layer 9, 0.2-0.8). They seem to have a positive impact on clips, improving quality and mitigating motion artifacts.

The speed boost was one of the major factors in my decision to switch from HV to Wan. While Wan provides better quality, I was used to fast rendering in HV, so getting similar speeds with Wan made the transition worthwhile.

Training

I used 215 images (most of them were manually selected from 18461 screencaps of all 25 episodes of Kill la Kill, plus a few official artworks). They were captioned with THUDM/cogvlm2-llama3-chat-19B, using the following prompt:

"Describe this artwork in detail, focusing on the visual style, setting, atmosphere, and artistic techniques. When a female character with dark hair is present, refer to her as Ryuko-chan. Mention her clothing, accessories, and pose, but do not describe her facial features, body proportions, or physical attributes. Include specific art style terminology (e.g., cel-shaded, painterly, watercolor, digital illustration) and visual elements that define the aesthetic. Describe lighting, color palette, composition, and any notable artistic influences."

My idea was to caption in a way that would allow me to freely alter Ryuko's clothing and hair color, while keeping all other aspects of her appearance (facial features, body, etc.) authentic and recognizable. And this actually worked; WanVideo did an amazing job on its part (as I hope you can see from the examples I posted). Unfortunately, the gear-shaped pupils were not memorized, but I did not have enough close-up shots in the dataset.

I didnā€™t caption anything by name besides Ryuko, so while the model likely learned visual aspects of Senketsu and the Scissor Blade, explicitly calling them out may not work (though I haven't tested it much). But this was intentional - I wanted to teach the model only about Ryuko and her physical appearance.

If her outfit and hair color are not explicitly described, they will most likely default to Ryuko's original clothing. In everyday scenes, she will wear her regular uniform, while in battle and intense scenes, she tends to prefer Senketsu.

(Initially I planned to create a character-only LoRA, not a style one, so I could also render Ryuko in a realistic manner, but the dataset bias was too strong, so it can only render in the style of the original series - which actually isn't a bad thing. That said, it's probably more precise to refer to it as a mixed character/style LoRA.)

The first version of this LoRA was trained for the 1.3B model with diffusion-pipe, but I didn't like the result: the 1.3B model (imho) is too small and can't really compete in quality with HV (its only advantages are speed and hardware requirements). The second version was trained with ai-toolkit, but that LoRA also didn't turn out as well as I had hoped (I attribute this to the fact that ai-toolkit doesn't yet support training with captions).

Finally, the current (successful) version was trained for the 14B model with musubi-tuner. Below are the commands I used (I typically create .bat files for launching musubi-tuner pipelines, hence the format of the train command). For a brief breakdown: almost all important parameters are left at their defaults, except the learning rate, which was changed to 7e-5. I trained for 40 epochs (17,630 steps, with 430 steps per epoch) and, after three days of testing, selected the checkpoint at step 15,050 as the most successful. I could have trained for more steps, since the model still seemed to be improving and the dynamics of the training loss were promising.
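
As a quick sanity check on those numbers: with batch_size = 1 and num_repeats = 2 (see the dataset config below), the steps per epoch and the selected checkpoint work out as follows:

# 215 images, each repeated twice per epoch, batch size 1.
images, num_repeats, batch_size = 215, 2, 1
steps_per_epoch = images * num_repeats // batch_size  # 430
selected_step = 35 * steps_per_epoch                  # 15,050 -> the published checkpoint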

  • cache_latents (vae):

python wan_cache_latents.py --dataset_config G:/samples/musubi-tuner/_ryuko_matoi_wan14b_config.toml --vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors
  • cache_prompts (t_enc):

python wan_cache_text_encoder_outputs.py --dataset_config G:/samples/musubi-tuner/_ryuko_matoi_wan14b_config.toml --t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth --batch_size 16 

  • dataset config (I did not write it by hand; I made a small script that scans a folder of image files, calculates a scaled resolution for each folder so that the max dimension of the scaled image does not exceed 768px, preserving the approximate aspect ratio while keeping both dimensions divisible by 16 - it does not resize anything physically, just prepares ahead for musubi-tuner's bucketing mechanism - then groups images with the same dimensions into subfolders and generates a TOML file with metadata for musubi-tuner; a rough sketch of this kind of script is included after the train command below):

[general]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false

[[datasets]]
resolution = [528, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1057x1516x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1057x1516x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 432]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1280x720x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1280x720x1/cache"
num_repeats = 2

[[datasets]]
resolution = [592, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1600x2033x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1600x2033x1/cache"
num_repeats = 2

[[datasets]]
resolution = [400, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1727x3264x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1727x3264x1/cache"
num_repeats = 2

[[datasets]]
resolution = [720, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1917x2002x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1917x2002x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 432]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1920x1080x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1920x1080x1/cache"
num_repeats = 2

[[datasets]]
resolution = [736, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1920x1963x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1920x1963x1/cache"
num_repeats = 2

[[datasets]]
resolution = [480, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1920x3038x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1920x3038x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 576]
image_directory = "H:/datasets/ryuko_matoi_wan_video/2363x1813x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/2363x1813x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 624]
image_directory = "H:/datasets/ryuko_matoi_wan_video/3877x3208x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/3877x3208x1/cache"
num_repeats = 2

[[datasets]]
resolution = [640, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/690x820x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/690x820x1/cache"
num_repeats = 2

[[datasets]]
resolution = [576, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/690x920x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/690x920x1/cache"
num_repeats = 2

[[datasets]]
resolution = [512, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/800x1195x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/800x1195x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 496]
image_directory = "H:/datasets/ryuko_matoi_wan_video/935x608x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/935x608x1/cache"
num_repeats = 2
  • sampling file:

# prompt 1
Ryuko-chan with blonde hair is walking on the beach, camera zoom out.  --w 384 --h 384 --f 45 --d 7 --s 20

# prompt 2
Ryuko-chan dancing in the bar. --w 384 --h 384 --f 45 --d 7 --s 20
  • train command:

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 wan_train_network.py ^
    --task t2v-14B ^
    --dit G:/samples/musubi-tuner/wan14b/dit/wan2.1_t2v_14B_bf16.safetensors ^
    --vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors ^
    --t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth ^
    --dataset_config G:/samples/musubi-tuner/_ryuko_matoi_wan14b_config.toml ^
    --sdpa ^
    --mixed_precision bf16 ^
    --fp8_base ^
    --fp8_t5 ^
    --optimizer_type adamw8bit ^
    --learning_rate 7e-5 ^
    --gradient_checkpointing ^
    --max_data_loader_n_workers 2 ^
    --persistent_data_loader_workers ^
    --network_module networks.lora_wan ^
    --network_dim 32 ^
    --network_alpha 32 ^
    --timestep_sampling shift ^
    --discrete_flow_shift 3.0 ^
    --max_train_epochs 50 ^
    --save_every_n_epochs 1 ^
    --seed 42 ^
    --output_dir G:/samples/musubi-tuner/output ^
    --output_name ryuko_matoi_wan14b ^
    --log_config ^
    --log_with tensorboard ^
    --logging_dir G:/samples/musubi-tuner/logs ^
    --sample_prompts G:/samples/musubi-tuner/_ryuko_matoi_wan14b_sampling.txt ^
    --save_state ^
    --sample_every_n_epochs 1

During training the speed was around 4 s/it, and VRAM usage was around 21GB. I didn't apply any specific optimizations aside from the --fp8_base and --fp8_t5 flags.
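
As mentioned in the dataset-config step above, the TOML file was generated by a small script rather than written by hand. Here is a rough sketch of that kind of scan-and-bucket script (hypothetical paths and details, not the exact code I used); it only reproduces the logic described above: scale so the longest side is 768, round down to multiples of 16, group by source resolution, and emit the musubi-tuner config.

from collections import defaultdict
from pathlib import Path
import shutil

from PIL import Image

SRC = Path("H:/datasets/ryuko_matoi_wan_video_raw")  # hypothetical flat folder of captioned images
DST = Path("H:/datasets/ryuko_matoi_wan_video")
MAX_DIM, STEP = 768, 16

def bucket_resolution(w: int, h: int) -> tuple[int, int]:
    # Scale so the longest side becomes MAX_DIM, then round each side down
    # to a multiple of STEP (this reproduces the resolutions listed above).
    scale = MAX_DIM / max(w, h)
    return (int(w * scale) // STEP * STEP, int(h * scale) // STEP * STEP)

groups: dict[tuple[int, int], list[Path]] = defaultdict(list)
for img_path in sorted(SRC.glob("*.png")):
    with Image.open(img_path) as im:
        groups[im.size].append(img_path)

lines = ['[general]', 'caption_extension = ".txt"', 'batch_size = 1',
         'enable_bucket = true', 'bucket_no_upscale = false', '']
for (w, h), paths in sorted(groups.items()):
    folder = DST / f"{w}x{h}x1"
    folder.mkdir(parents=True, exist_ok=True)
    for p in paths:
        shutil.copy2(p, folder / p.name)  # images are never resized, only grouped
        if p.with_suffix(".txt").exists():  # keep captions next to images
            shutil.copy2(p.with_suffix(".txt"), folder / p.with_suffix(".txt").name)
    bw, bh = bucket_resolution(w, h)
    lines += ['[[datasets]]',
              f'resolution = [{bw}, {bh}]',
              f'image_directory = "{folder.as_posix()}"',
              f'cache_directory = "{(folder / "cache").as_posix()}"',
              'num_repeats = 2', '']

Path("G:/samples/musubi-tuner/_ryuko_matoi_wan14b_config.toml").write_text(
    "\n".join(lines), encoding="utf-8")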

Compatibility testing with other LoRAs has not been conducted (and is not planned).

Also I did not test it with I2V models (in fact, I haven't even downloaded any of them yet).

(Oh, and I also published the dataset alongside the LoRA, but there is nothing notable in it.)

Conclusion

This is my first (successful) LoRA for Wan Video 2.1-T2V 14B, and I can say I feel very excited about this model. It has been a pleasure to train, and it grasps concepts and styles exceptionally well (based on my current, limited experience). I can't wait to train all the LoRAs I have planned! Until now, I have only trained generative AI models for style, but now I feel very enthusiastic about training models for not only style but also VFX, concepts, movements, etc.

Returning to this LoRA, I can't say it's perfect, but out of every 3 clips I generate, 1 is successful (i.e., it adheres to the prompt and doesn't have undesirable artifacts), which I consider a good result. I attribute this to the model itself. Although I trained the LoRA on images only (for upcoming LoRAs, I will also use video clips), the model learned (or rather, extrapolated) a lot about the animation techniques and visual features of the original series. This includes, for example, the excessive (and not always appropriate xD) use of lens flare in outdoor scenes, exaggerated facial animations, etc.

I also think this model benefits from diverse datasets (at least 100 images) and low learning rates (1e-4 is too high). For my next LoRA, I will use no more than 5e-5. It might require more steps, but in this case, it learns all the details better without overfitting.