Wan Motion Training with Musubi Tuner: Part 3

Part 3 of my series on training new motions for Wan Video in Musubi Tuner.

This guide accompanies my YouTube video on this subject and should be of assistance to those who have watched the video and simply want to copy the settings etc.

This article is part of a series:

  1. Fundamental concepts in LoRA Training (all models and training suites)

  2. Musubi Tuner - organisation of workspace, datasets and commands

  3. Datasets (this article)

  4. Captions

  5. Configuration files

  6. Training and monitoring with Tensorboard

  7. Training outputs and inference testing

This part assumes you've read and gained an understanding of the basic concepts of LoRA training (Part 1) and have installed Musubi Tuner and its dependencies and initially organised your workspace (Part 2). With that done, let's get into the details of preparing your training dataset.

Dataset Introduction

The dataset is the basis of the real-world training information you impart to the model. In the case of video motion training, it will be short videos of real people performing a specific motion.

As we discussed in Part 1, this motion should be relatively simple. And naturally, it should be something you've proven the base model doesn't already know, by prompting the base model in ComfyUI inference and finding non-compliance in the output video.

Simple Motion Concept

What do we mean by "simple"? We mean a motion that encapsulates a single, consistent concept, even though the set of body articulations involved may be complex. Examples:

GOOD:

  • licking and sucking fingers

  • spinning the whole body 360 degrees while standing

  • rolling the head about the shoulders

  • a single pole dancing move such as lifting up the legs 90 degrees

BAD:

  • a series of varying pole dancing moves

  • Olympic triple jump (hop, skip and a jump)

  • a ballet dancer skipping forward and back and then twirling around

Can you see that the good examples each have one motion concept, despite the complexity of joint articulation required? The bad examples mash together multiple different motions, which makes it much harder for the AI to "learn" or separate them out.

Consistent Pose and Camera Angle

An additional important concept is to keep the camera angle and pose as similar as possible across the dataset. This helps the model learn the motion as it attempts to "guess" it, producing the motion from your caption and then computing the loss. It can do this more reliably and quickly if you're not throwing in variations in the initial pose of the body, or a completely different camera angle such as from the side or the front.

But what if I want a more complex motion, multiple different motions, or more poses in the same LoRA? Yes, this is something you ultimately do want, in order to generalise the motion so it's reproducible in potentially any pose or camera angle. But you do this by layering each new motion on top of the existing trained outputs in separate training sessions. How to do this will be covered in the commands section in a later article.

Preventing Overfitting

Overfitting is where the model has latched onto things you don't want trained, such as the colour of the curtains or the sofa, or the shape of the person's face or their eye colour. A common complaint about improperly trained motion LoRAs is "it changed my character's eye colour to brown" or "it changed my character's face shape". Now, if you were doing text-to-video character LoRA training then yes, you'd want that, but that is an entirely different subject. Note the title of this series: "Motion Training", not "Character Training" LOL.

So how do you prevent this? Simply by varying your dataset so it doesn't use the same person, or the same background, all the time. This is why a minimum of 20 videos in your dataset is recommended. Use different people, ideally a minimum of four, and don't have them all be brown-eyed with dark hair. Chuck in some blue-eyed blondes like Kayla here to break it up.

Dataset Size

Your dataset should be of a certain size, meaning a certain number of videos of that motion. The question of "how many videos do I need?" is hotly debated. Too few and you will quickly run into overfitting, as we just discussed. Too many, and the effort of framing and captioning a large dataset becomes burdensome.

Personally I have made successful LoRAs with as few as 18 videos, but typically I aim for between 20 and 30. You get into diminishing returns for your efforts beyond 30, in my opinion.

Video Length

Due to constraints on VRAM, you want your dataset videos to be only as long as necessary to capture the motion or, if it's a rhythmic motion, a small number of cycles of that motion to provide some organic variation. The longer the videos, the more VRAM will be required, which will mean you'll need to cut corners some other way, such as reducing the resolution, skipping every other frame, or "sliding" along the videos (we'll cover this in the configuration files article).

Practically, this all boils down to your videos being on the order of a few seconds in duration, e.g. 2s to 8s.

The videos don't all need to be the same length, since you'll specify a final fixed length in your config file. Bear in mind, though, that if you specify a length in the config file that is longer than the source video, you're just wasting VRAM. A quick way to check your clip durations is sketched below.
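Here's a minimal sketch of that duration check. It assumes ffprobe (bundled with ffmpeg) is on your PATH; the dataset directory and the 2-second threshold are hypothetical placeholders you'd swap for your own setup.

```python
# Minimal sketch: report the duration of each dataset clip so you can spot
# any that are shorter than the fixed length you plan to set in the config.
# Assumes ffprobe (part of ffmpeg) is on your PATH.
import subprocess
from pathlib import Path

DATASET_DIR = Path("datasets/my_motion")  # hypothetical dataset directory
MIN_LENGTH_S = 2.0                        # hypothetical config length to check against

for video in sorted(DATASET_DIR.glob("*.mp4")):
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(video)],
        capture_output=True, text=True, check=True,
    )
    duration = float(result.stdout.strip())
    flag = "  <-- shorter than config length, wasted VRAM" if duration < MIN_LENGTH_S else ""
    print(f"{video.name}: {duration:.2f}s{flag}")
```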

Video Resolution

VRAM usage increases linearly with video length, but with the square of the resolution, so video resolution is something you have to watch carefully. Ask yourself, "Do I really need that level of resolution to capture the motion?" Probably not. You might be able to get away with 192x192. But if there is some subtle and complex motion, such as soft body distension on flaps of skin or whatever, then you might need to go higher. The most I've ever pushed it with Wan Video is 512x512, but in that case I had to reduce the length down to 2s and slide along each video in order not to explode my VRAM.
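To put rough numbers on that scaling: stepping up from 192x192 to 512x512 multiplies the pixels per frame by (512/192)² ≈ 7.1, so by the rule above you'd need roughly seven times the VRAM at the same video length. That's why, in the 512x512 example, the length had to drop and the videos had to be slid along rather than loaded whole.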

If there's any doubt, capture your videos at as high a resolution as possible. When you get to the configuration step you can always set the training resolution much lower, and later experimentation, or a reduced video length in a subsequent training run, will let you move to a higher resolution without having to completely re-capture everything.

And keep things simple: only use a 1:1 aspect ratio. I've never used anything else.
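If you want to verify the resolution and aspect ratio of every clip without opening each one, here's a minimal sketch in the same spirit as the duration check above. Again, ffprobe on the PATH is assumed and the dataset path is a hypothetical placeholder.

```python
# Minimal sketch: print each clip's resolution and flag anything that
# isn't a 1:1 aspect ratio. Assumes ffprobe is on your PATH.
import subprocess
from pathlib import Path

DATASET_DIR = Path("datasets/my_motion")  # hypothetical dataset directory

for video in sorted(DATASET_DIR.glob("*.mp4")):
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "csv=p=0", str(video)],
        capture_output=True, text=True, check=True,
    )
    width, height = (int(x) for x in result.stdout.strip().split(","))
    note = "" if width == height else "  <-- not 1:1, needs recropping"
    print(f"{video.name}: {width}x{height}{note}")
```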

Video Framing

Framing the action you need, and only the action you need, is key to making full use of your limited resolution to capture the detail of that action. Anything else in frame is potentially wasted, unless, of course, the rest of the body is required to provide context for the action.

Use the video editor of your choice to tightly frame the action. For example, if you're training "finger licking and sucking", you only need to frame the face, and indeed you should frame only the face, so you capture the subtle detail of the fingers and the shape of the lips. If you framed the whole body for this, you'd have (a) made the model look at superfluous and irrelevant details, potentially confusing it, and (b) guaranteed that so few pixels are available for the tiny fingers and mouth that you'll miss the motion entirely.
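If you'd rather script the cropping than do it in an editor, here's a minimal sketch using ffmpeg's crop filter. The source file, output path, and crop coordinates are all hypothetical; you'd read the coordinates for each clip off your own footage.

```python
# Minimal sketch: crop a source clip to a tight 512x512 window around the
# action using ffmpeg's crop filter (crop=w:h:x:y). Assumes ffmpeg is on
# your PATH; all paths and coordinates are hypothetical placeholders.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "raw/take01.mp4",      # hypothetical source clip
     "-vf", "crop=512:512:600:80",          # 512x512 window, top-left at (600, 80)
     "-an",                                 # drop audio; training doesn't use it
     "datasets/my_motion/00.mp4"],          # hypothetical output location
    check=True,
)
```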

For NSFW-related motions you might need to include enough of the rest of the body for the model to gain the context of what is going on "down there", because base models can famously lack understanding of NSFW human anatomy. Roll on the day when base models do know about that stuff, so you can zoom in on the action for fine details.

Assembling your Dataset

With all of the above in mind, you've now assembled the perfect set of training videos: a simple, consistent motion concept, similar pose and camera angle, variance in actor appearance and background, and tightly framed action. Put them in a directory under your data base directory with a suitable name. I rename them to a simple number format xx (i.e. 00, 01, 02, 03, etc.) to help with caption preparation later; a sketch of that step follows below.
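Here's a minimal sketch of that renaming step, copying finished clips into the dataset directory as zero-padded numbers. The staging and dataset paths are hypothetical placeholders for your own workspace layout from Part 2.

```python
# Minimal sketch: copy edited clips into the dataset directory, renamed to
# the simple zero-padded number format (00.mp4, 01.mp4, ...) so the names
# line up with the caption files later. All paths are hypothetical.
import shutil
from pathlib import Path

SOURCE_DIR = Path("edited_clips")               # hypothetical staging directory
DATASET_DIR = Path("datasets/my_motion")        # hypothetical dataset directory
DATASET_DIR.mkdir(parents=True, exist_ok=True)

for index, video in enumerate(sorted(SOURCE_DIR.glob("*.mp4"))):
    target = DATASET_DIR / f"{index:02d}.mp4"   # 00.mp4, 01.mp4, 02.mp4, ...
    shutil.copy2(video, target)
    print(f"{video.name} -> {target.name}")
```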

You're now ready for the next, and hardest, part (oh yes, you thought datasets were hard? LOL): captioning your videos. We'll pick this up in the next part.

(Continued in Part 4)

If you found this article helpful then please consider supporting me at my RiotModels Page, and your reward is exclusive explicit video content of the Seven Sisters of Love.
