Part Two: Training and Testing
Introduction
If you haven’t yet, please go read part one of my guide here. That guide covers how to prepare the dataset and the technical details of how the Wan 2.2 high/low models work together to generate a video. I’ll write assuming you already know the information from that guide.
Where to start: The high model
Unlike, say, a traditional character lora (of a real person), the high model is crucial for a Wan 2.2 anime style lora to work properly. To review: the high model lays the foundation that the low model adds details to. It is in charge of the general shape and movement of things in the scene; the low model then works on top of what the high model has already laid out.
It’s like painting a Christmas tree. If you start with a sketch of a Christmas tree, then lay out the line work with a few triangles and a square, you can paint over the top quite easily. But if you start out with two circles and a pentagon, you will have to work hard to fit the painted details onto a wonky-looking Christmas tree. Here is a crude visual of what I mean:

“A” would be a well-trained high lora on what a Christmas tree is (assuming, for the example’s sake, Wan doesn’t have it in its training data). “B” would be a poorly trained (or absent) lora for the high model. Coloring in the green leaves and brown stump is much easier in “A” because the general shape is already there; “A” has the Christmas tree shape in its final form. Take this same concept and apply it to motion, color, shape, etc. The more work the low model has to take on setting these up, the more inaccurate the output will be.
On the other side, if you are training a concept already in Wan's base model (like an actual Christmas tree, for example), the high model can already lay out the shape and color, and you may not need to train the high model at all. This is often the case when making character loras of real people: Wan knows what a person is, just not the details of that person (details which are taken care of by the low model). In that case the high model is actually very easy to overtrain, and you may be fine with a lower learning rate and stopping at epoch 1 for the high model.
I’ll give a real example below showing how big an impact the high lora has on training, using my recently made “Scooby Doo Mystery Incorporated” style lora. Consider an epoch to be around 800-1000 steps for this dataset.
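If you’re unsure how an “epoch” maps to steps, most LoRA trainers use the same arithmetic: one epoch is one pass over (items × repeats), split into batches. A minimal sketch — the dataset size and repeats below are made-up numbers for illustration, not my actual config:

```python
import math

def steps_per_epoch(num_items: int, repeats: int = 1, batch_size: int = 1) -> int:
    """Rough steps-per-epoch math used by most LoRA trainers:
    one epoch = one pass over (items x repeats), divided into batches."""
    return math.ceil(num_items * repeats / batch_size)

# Hypothetical example: ~900 clips/images, 1 repeat, batch size 1
print(steps_per_epoch(900))  # prints 900, i.e. "an epoch is ~800-1000 steps"
```

This is why the epoch numbers below only mean anything relative to this dataset; a smaller dataset at the same epoch count has seen far fewer steps.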

You can see here that with no high lora, the result is an uncanny mix of real life and the cartoon style. The details are mostly right, but the style is not right at all.

High epoch 1 (around 1K steps), no low lora. The animated style is triggered, but it's nothing like our Scooby Doo Mystery Inc. show style at all. It looks like a generic anime/cartoon mashup.

Adding a low lora at epoch 30 on top of high epoch 1 fills in the details much better. It's much closer to what our show looks like, but some things are still off: Velma's hair is poofy, and the clothes are slightly different. This isn't bad, but it can get better.

Here we see a drastic jump in accuracy. Velma’s hair is the right shape and we’re seeing the bow.

Again another slight improvement: Velma’s glasses get a more transparent look, and her sweater has the ribbing on the collar and sleeves.

The final and biggest jump. The irises of the eyes no longer have the white part around them (just like in the show). The clothes match the shape of the characters. Their faces are less round, with sharper, more pointed curves.

Here is a screen capture from the show for comparison, to illustrate what the characters should look like. You can see the hair and glasses style and the ribbed shape of her collar; that's how those are supposed to be generated. Missing are Velma's freckles and mouth style. Freckles can be prompted in, and the lips are hit or miss. I made those trade-offs rather than trying to get the model to figure it out. For Daphne, it's 1:1 for the most part.
Understanding: The low model
Here is our high model from the previous example at epoch 43, with a low model trained to various epochs (again, consider an epoch to be around 800-1000 steps of training for this dataset).

First of all, if we have no high lora and put in a well-trained low lora, we see that the style does not trigger, as explained in the last section.

Here is epoch 5 (4-5K steps). We can start to see some details, but again they're all wrong; it just looks generic. Even with a well-trained high lora, the details are all wrong (the eyes especially, and the colors).

I'm going to skip ahead a bit, because you get the point. At epoch 24 on the low model we can see the details getting more fine-grained: the eyes look right with less white around the iris, there are more flat colors rather than gradient shading, and the faces/lines are less rounded (sharper cheeks and chins, for example).

We skip ahead again to epoch 30, which is where I more or less felt we were close enough to the style. Much closer to the show, with a few sacrifices here and there, such as Velma's mouth. There are still occasional issues, like an extra leg, but that's more of a seed/prompting issue. If you notice lots of disfigured people, that's a sign of overtraining, but I don't think that's the case here. We can overcome some of the issues with just some seed hunting.
How to evaluate: Tensor graphs are not so helpful
You can look at tensor graphs, but in general just understand that a high-model loss graph will look like a smooth C shape that eventually flattens out into a straight line. In my case, I often end up overfitting much sooner than the point where the curve straightens out. Here is a crude representation of such a graph; I won't bother setting up TensorBoard for this because I barely use tensor graphs at all anymore.

It's like this: it will flatten at some point, but if you have lots of data and stick with the same adamw_optimi optimizer that I use, you will probably overtrain before that point. If you use something like automagic, you will probably get more benefit out of the graph, as it will likely flatline before overfitting. And the low model's graph will look like this:

The low model's loss has periods where it stabilizes slightly, bounces up and down, then takes a big jump downward, so it slowly forms a downward slope over time and in theory will flatten at some point. I once overtrained a low model past 200K+ steps, and it overfit before flatlining. I am of the opinion that you can stop without it flatlining the way the high does. Just know that if it consistently curves upward, something is wrong with your settings or dataset.
At any rate, optimizer choice and graph evaluation are not what this guide is for. I recommend using tensor graphs in Wan 2.2 for one thing only: the trend. Make sure it does not trend upward or become overly erratic over time; that is a sign that something is wrong with your training settings or dataset (for example, too high or too low a learning rate, bad caption data, corrupted files, etc.). Also, the loss number itself doesn't mean anything anymore. It used to be that you aimed for around 0.02 loss, but I've had great loras at a loss close to 1.0 (around 0.9, etc.). So for Wan 2.2, the old rules for loss as a number don't apply; just make sure it trends downward, even at a rate of 0.001 over time.
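If you do want a quick sanity check on the trend without staring at TensorBoard, fitting a straight line through your logged loss values and looking only at the sign of the slope captures the "just make sure it trends downward" rule. A minimal sketch — it reads a plain list of loss values, which is a stand-in for whatever your trainer actually logs:

```python
def loss_trend(losses):
    """Least-squares slope of loss vs. step index.
    Negative slope = trending down (what you want); positive = a red flag."""
    n = len(losses)
    mean_x = (n - 1) / 2
    mean_y = sum(losses) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(losses))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

# Noisy but slowly decreasing loss values (made-up numbers):
noisy_down = [1.05, 0.98, 1.02, 0.95, 0.99, 0.93, 0.96, 0.91]
print(loss_trend(noisy_down) < 0)  # small negative slope -> True
```

Note the absolute loss values here hover around 1.0, which per the above is fine; only the sign and rough magnitude of the slope matter.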
Now, if you are training on Wan 2.1, tensor graphs and target loss numbers are very helpful, and you can follow others' advice for training on Wan 2.1, but I won't cover that here.
Training Methodology: High vs Low
Now, the question I always get is: which model do you start with, and how long do you train? It's very subjective, but the best way, in my opinion, is to start with the high model, train, and check in every now and then with a sample generation to see how it's looking in the preview. Like below (an example of the high model at 4 of 6 steps):

You can see it's blurry, but the general shape, color (and motion, if that's part of what you're training) is there. If it's not generating the right shapes for things like the hair, eyes, and sweater, then you will have the weird-looking Christmas tree from our earlier example.
Below is an example from high epoch 1.

We can already see the issues: our foundation shapes and colors are still not there yet. It needs more time. We don't need to see how the low model works with this yet, because we know it's not ready. So we train the high to, let's say, epoch 16 or epoch 30, and then check in on the low.
Train the low to a similar number, let's say epoch 17, with our high at epoch 30:

On the left is the high model at epoch 30; on the right, the low model at epoch 17. We can see the details being added on top of the high model, clearing up things like the eyes to match the style.
Final result of high 30, low 17 epochs:
Looking good. But there are some general shape issues; I would now go back to the high model and train it some more, to get things like the shape of the faces better. But we are already in a better place.
Here is high 43, low 30:

Good enough for me!
You will want to start with the high, check in with the low, then go back to the high. Once you feel the high is in a good place, just keep training the low until the details are where you want them, and that is when you should stop.
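This back-and-forth boils down to a simple decision rule. A toy sketch of it — the function name and its two booleans are just my own framing; the "evaluation" in practice is you eyeballing sample previews:

```python
def next_action(high_foundation_ok: bool, low_details_ok: bool) -> str:
    """Encodes the workflow: fix the high model's shapes/colors/motion first,
    then grind the low model for details, and stop when both look right.
    The booleans stand in for a human judging the sample generations."""
    if not high_foundation_ok:
        return "train high more"
    if not low_details_ok:
        return "train low more"
    return "stop"

# Foundation still wonky? Keep training the high before touching the low.
print(next_action(high_foundation_ok=False, low_details_ok=False))
```

The point of structuring it this way is that "train low more" is never the answer while the foundation is still wrong, which matches the Christmas tree analogy from earlier.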
In conclusion
The high makes gradual, large changes, which are important. The low is very sensitive because it relies on the foundation set by the high model. You will have more control over finding the best place to stop training by starting with the high as your baseline, rather than the other way around. Over time I've found it's faster to just overtrain the high and then start working on the low; you can always go back to earlier epochs of the high. But if you want to be precise, you can jump back and forth between training the low and the high. Apply the same principles to motion if that's also part of what you're training.
Regarding settings for generating: Speed ups
All the examples here were generated using lightning/lightx loras, but you want to make sure FIRST that your lora works without them. These speed-up loras change the style drastically. You will want to know that your style activates and does well without them, because speed-up loras constantly come out with new versions and some people use old ones; otherwise you will be too dependent on something you cannot control. Get it working without speed-ups, then use them if you like (I often do).
Regarding settings for generating: Sampler settings
I tend to like dpm++_sde. Euler is too sensitive and pretty much doesn't work with the high model at all. I like 6 steps with lightning loras instead of 4, and shift 5 works best for me. But these settings are subjective, so play around with them yourself. You can also interpolate to get more frames, just understand that it may look sped up. I personally interpolate to 32 frames.
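For reference, here are those generation settings collected in one place. The key names are illustrative only and not tied to any specific ComfyUI node or workflow:

```python
# The generation settings described above, as a plain config dict.
gen_settings = {
    "sampler": "dpm++_sde",     # euler barely works with the high model for me
    "steps": 6,                 # 6 with lightning loras, rather than the usual 4
    "shift": 5,                 # subjective; works best for me
    "interpolated_frames": 32,  # interpolation adds frames but can look sped up
}
```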
About triggering styles:
Sometimes, certain prompts will not trigger the animated style. These cursed prompts cannot be helped; maybe something in the Wan base model just takes precedence over the lora's own trigger. In that case, rewrite the prompt differently, or, as a last Hail Mary, put the word "animated" at the end of the prompt. I don't like doing this, though, as I think it makes the high model use its own biased 3D-animated look and feel over your style. A well-trained model should trigger the style without prompting, but I am of the mind that a trigger word helps, and if you are training characters into your style lora, they will also act as trigger words when you prompt them.
Here is an example of the prompt style I have been using lately:
ScoobyDooStyle, A sunlit grassy roadside with a vintage "Mystery Machine" van is parked in the background. Two young women sit in the grass near the bike, framed in a medium-wide shot. The woman on the left leans back on one hand and lifts her other hand to tuck stray hair behind her ear. Her posture shifts slightly, and her skirt moves gently with the motion. The woman on the right sits cross-legged and holds a half-eaten sandwich in one hand. As she looks over at her friend, she raises her free hand mid-gesture, fingers outstretched. She has a simple line for a mouth with no lipstick. The motion of her hand causes the sleeve to slide slightly down her forearm. Both women’s hair shifts lightly in the breeze. The camera is positioned at chest height, facing them directly, capturing their gestures, facial expressions, and clothing motion. Shadows move subtly across the grass from nearby tree leaves. angle, clearly showing her facial expression, upper body, and the background details.
char_daphneblake
Primary Outfit: A purple long-sleeved mini-dress with lighter purple stripes at the hem, a lime-green scarf tied around the neck, pink tights, and purple heels.
Appearance: Fair skin, shoulder-length wavy orange hair, a purple headband, purple eyes, and pink lipstick.
char_velmadinkley
Primary Outfit: A baggy orange turtleneck sweater, a red pleated skirt, orange knee-high socks, and red Mary Jane shoes.
Appearance: Fair skin, freckles, short reddish-brown bob hair with a small red bow, and thick black-framed square glasses with light blue lenses with visible eyes. Her mouth is a single simple line, no lips.
I captioned the characters as "char_name" and then, at the bottom of the prompt, used the appearance and outfit blocks as trigger text for them. So far it has been pretty consistent.
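That prompt layout can be sketched as a small helper. The function name and structure below are my own illustration of the pattern, not part of any tool:

```python
def build_prompt(style_trigger: str, scene: str, char_blocks: dict) -> str:
    """Assemble a prompt in the layout described above: the style trigger
    word first, then the scene description, then each character trigger
    ("char_name") followed by its appearance/outfit block at the bottom."""
    parts = [f"{style_trigger}, {scene}"]
    for trigger, block in char_blocks.items():
        parts.append(f"{trigger}\n{block}")
    return "\n".join(parts)

prompt = build_prompt(
    "ScoobyDooStyle",
    "Two young women sit in the grass near a vintage van.",
    {"char_velmadinkley": "Primary Outfit: A baggy orange turtleneck sweater."},
)
print(prompt.splitlines()[0])  # first line starts with the style trigger
```

Keeping the character blocks at the bottom keeps the scene description clean, and the "char_name" triggers match how the characters were captioned during training.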
Closing:
And that's it. I thought this would be a short guide, but it's clocking in at around 40 pages. This is part two of three. I'm not sure what's left to write for part three, but I'll get to it eventually.
Do check out my captioning tool at https://huggingface.co/spaces/comfyuiman/loracaptionertaz, which I vibe-coded with Google Gemini Build. It works with local Qwen and Google Gemini to caption in the exact same way I caption. I used it to build this Scooby Doo lora.

