Hello all, I have been asked to write an article on how I do my Wan Video Training on my Local PC using Musubi Tuner.
This guide accompanies my YouTube video on the subject and should help those who have watched the video and simply want to copy settings, etc.
This is such a massive subject that I'm going to make this a series of articles. This series will go something like this (and I'll update this list as I write each part):
Fundamental concepts in LoRA Training (all models and training suites) (this article)
Musubi Tuner - introduction, installation and organisation of workspace
Captions
Configuration files
Training and monitoring with Tensorboard
Training outputs and inference testing
Fundamentals of Motion LoRA Training for all models and training suites
Teaching a video AI model a new motion is challenging, and significantly trickier than creating a character LoRA for SD, for instance. So at the outset, you want to ask yourself:
Is the motion relatively simple (and therefore trainable)?
Does the base model checkpoint or a merged checkpoint with embedded LoRAs already know this motion?
Has someone else already made a LoRA that performs this motion?
Have I successfully trained SD LoRAs for image generation and so know the basics of LoRA training?
Ideally you want favourable answers to all of these questions: simple motions are the most trainable, experience with SD image/character LoRAs means you already understand LoRA training in a simpler form, and it's only worth the effort if the motion is unknown to the base model and no one has already made a LoRA of it.
If prompting doesn't work on a model then the training data doesn't exist
The fundamental rule of AI models regarding motions, shapes, anatomy or objects remains the same: if the model was never trained on that subject, no amount of prompting will produce a result.
You test this by prompting the model for the motion (e.g. "girl licks and sucks her fingers" and all variations of that). If the result is a mess, or nothing happens at all, it tells you the model has not been trained on videos of anyone licking or sucking their fingers.
This gives you the motivation to add your own training data, in the form of videos of people licking/sucking fingers, by training a LoRA (low-rank adaptation) which you then add to the model during inference (prompting).
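For those who like to see the idea in code, here is a minimal sketch of what "low-rank" means: a tiny trainable correction layered on top of a frozen base weight. This is illustrative only, not Wan's or Musubi Tuner's actual implementation, and every dimension and value below is hypothetical.

```python
import torch

# Minimal sketch of what "low-rank adaptation" means. Illustrative only -
# not Wan's or Musubi Tuner's implementation; all dimensions are hypothetical.
d_out, d_in, rank, alpha = 1024, 1024, 16, 16

W = torch.randn(d_out, d_in)                            # frozen base-model weight (never updated)
A = (torch.randn(rank, d_in) * 0.01).requires_grad_()   # small trainable "down" matrix
B = torch.zeros(d_out, rank, requires_grad=True)        # small trainable "up" matrix
# B starts at zero, so before any training the LoRA changes nothing at all.

def forward(x):
    # Base output plus the scaled low-rank correction that training adjusts.
    return x @ W.T + (x @ A.T @ B.T) * (alpha / rank)

x = torch.randn(2, d_in)     # a couple of example inputs
print(forward(x).shape)      # torch.Size([2, 1024])
```

Because the trainable matrices are small relative to the base weight, the resulting LoRA file stays small and can be mixed into the model at inference time.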
Fundamental Concept in LoRA Training: Loss
Motion LoRA training works on the basis of you providing videos of a motion along with captions. From those captions the model attempts to create its own "video" (in latent space) of that motion, then compares your video with its own and computes the difference as a "loss". The further away its attempt is, the worse the loss. So it tries again, and again, and if the loss goes down it means it has gotten better; it learns from that by strengthening certain weights in the LoRA and focusing its next iteration in that direction, gradually improving its prediction and "understanding" of what the motion is. This concept applies to any machine learning, by the way, not just motion or image LoRAs.
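Stripped to its bare bones, that comparison looks something like the sketch below. This is illustrative only, not Musubi Tuner's actual training loop (real diffusion training compares noise/velocity predictions rather than finished videos, but the principle is the same), and the tensor shape is made up.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the "loss" idea - not Musubi Tuner's code, and the tensor
# shape below is invented purely for the example.
target_latents = torch.randn(1, 16, 8, 60, 104)                     # your encoded training clip
pred_latents = torch.randn(1, 16, 8, 60, 104, requires_grad=True)   # the model's attempt

loss = F.mse_loss(pred_latents, target_latents)   # bigger difference -> bigger loss
loss.backward()                                   # gradients point toward a smaller loss
print(loss.item())
# An optimizer then nudges the trainable LoRA weights a small step in that
# direction, and the next attempt should land a little closer to your video.
```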
Dataset Selection: Train only one thing at a time
You don't want to overload the training model with too much disparate information. If the motion involves a hand holding an object and moving back and forth in a certain bodily orifice, for instance, then you want to reduce as much as possible any other variations in those videos, particularly regarding the pose and camera angle. So ideally you would have many videos of people doing that same action in the same pose from the same angle.
This would be the basis for your first optimum checkpoint (completed set of model weights for your LoRA that is perfectly fitted to that motion). You can then layer new poses on top of this first checkpoint in a new training session to generalise the LoRA so that it can reproduce the motion in different poses and actually increase its understanding to the point where it can potentially do the motion in poses that you haven't trained it in.
The key is to train one pose and camera angle at a time, then add each new pose or camera angle in successive training sessions to build a generalised understanding of the motion into your final polished LoRA.
An important note here: you must vary the things you don't want trained. If you use the same person or the same background in every training video, the model will decide that that face or that background is part of the "thing" you're training, and will then change the face of your character during inference - a common complaint from users about poorly trained LoRAs. In short:
Don't vary that thing you want training (the specific motion)
Vary those things you don't want training (faces, backgrounds, etc.)
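To make that concrete, here is a purely hypothetical folder layout for a first session and the session that follows it. The video/caption-file pairing shown is a common convention among LoRA trainers; how Musubi Tuner actually expects the dataset to be organised is covered in Part 2.

```
training_data/
  pose01_front/        # first session: one motion, one pose, one camera angle
    clip_001.mp4
    clip_001.txt       # caption for clip_001.mp4 (different person/background per clip)
    clip_002.mp4
    clip_002.txt
    ...
  pose02_side/         # later session: same motion, new pose/angle, to generalise
    ...
```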
How to Caption: tags or natural language? Caption or void?
This is probably the hardest thing to get right when training, and it basically splits into two questions:
Should my captions be a series of tags ("1girl, fingers, mouth, licking") or natural language ("a young woman licks her finger then sucks two fingers")?
Do I caption something or leave a void (no caption) for that thing?
The first question, the style of captioning, is quite subjective for some models, where the authors have been vague about what works (e.g. Wan Video), and more objective for others, like LTX, where they lay out specifically how to arrange your captions and in what order. Personally I have tried both tag and natural-language captions in Wan and gotten good results with either, so perhaps this model isn't particularly sensitive to the choice.
The other question, whether to caption at all, is more interesting. Models in general are pretty smart, owing to their billions of training parameters, and when you present your training video they can usually see what the basic set-up is, for example "a girl sitting on a sofa with her legs to one side". If you used a keyword ("anything_works_here") and described everything in the scene except the motion, then in theory the model would still learn that motion: over many iterations it would stumble onto predictions that happen to reduce the loss, reinforce the weights responsible, and build on them in the next iteration, with your keyword being the common touch point in every caption and therefore the thing it needs to learn.
I haven't tried to prove this myself, but the practical question is: do you caption in detail or caption minimally? I have tried both and the results are not compelling either way. I'm sure there are trainers out there who can explain their captioning strategy better than I can.
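For illustration only, here is the same imaginary clip captioned in both styles, each led by a made-up trigger keyword:

```
Tag style:
  f1nger_l1ck, 1girl, sitting on sofa, licks finger, sucks two fingers

Natural language style:
  f1nger_l1ck, a young woman sitting on a sofa licks her finger, then sucks two fingers
```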
Overfitting
Every system has a sweet-spot and LoRA training is no exception. The progression of a training run, if it is successful, would be something like this:
Model fumbles about initially trying to figure out what you're trying to train (underfitting)
Model learns the pattern of motion and is able to reproduce it in inference (perfectly fitted)
Model has learned the motion pattern but has also "learned" content in the training data which you don't want (such as a person's face, or the colour/content of the background) and will puke out those unwanted things in inference (overfitting)
The sweet spot is often found when the loss stops reducing and flattens out, but not always. The only way to really know is exhaustive testing in inference with the different checkpoints that you produce in a particular training session.
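If you want a feel for what "flattens out" means in numbers, the sketch below shows the kind of judgement you would normally make by eye on the smoothed loss curve in TensorBoard. The loss values and the threshold are entirely made up.

```python
# Rough sketch of spotting a loss plateau, with made-up numbers. In practice
# you eyeball the smoothed curve in TensorBoard; the threshold is arbitrary.
losses = [0.42, 0.31, 0.22, 0.15, 0.12, 0.11, 0.105, 0.104, 0.103, 0.103]

def moving_average(values, window=3):
    # simple smoothing to hide step-to-step noise
    return [sum(values[max(0, i - window + 1): i + 1]) / (i - max(0, i - window + 1) + 1)
            for i in range(len(values))]

smoothed = moving_average(losses)
for step in range(1, len(smoothed)):
    if smoothed[step - 1] - smoothed[step] < 0.005:   # barely improving any more
        print(f"Loss roughly flattens around step {step}; test checkpoints from here onward")
        break
```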
These concepts apply to any video model (Wan, LTX, Hunyuan etc.) and any training ecosystem, so you can port this understanding to future work you may do in other models and training suites.
In the next part, Part 2, I'll get into the specifics of Musubi Tuner.
If you found this article helpful then please consider supporting me at my RiotModels Page, and your reward is exclusive explicit video content of the Seven Sisters of Love.
