Establishing Credentials
I've generated some fairly popular LoRAs on here. The quality is solid, the movement is consistent, they don't override faces or styles, and they stack well with other LoRAs. Feel free to check out my profile to see some of them; you may need to log in to view a few of them, though.
This is primarily a guide for Motion LoRAs that use natural language.
Experimentation and Learnings
Over the past few months, a group of engineers and AI enthusiasts has been testing various LoRA training methodologies and configurations: modifying learning rate, batch and gradient accumulation sizes, weight decay, dataset size, captioning, and so on. This guide provides a succinct summary of our findings, so that you can not only configure your trainings better, but also understand them.
Captioning
The most important thing to get right
ChatGPT and a lot of other online discourse have apparently been instructing people to caption their datasets incorrectly.
Common mistakes:
Thoroughly describing every aspect of the image, including what you want the LoRA to learn.
If you are training a character, describing their hair color, face, clothes, body, etc. in detail will actually hinder your training, because the generated output and the dataset will not have a significant difference to learn from on those attributes.
Adding all of your trigger words at the beginning of your captions.
Failing to caption the background and unrelated elements.
Simple tips:
Caption your dataset as if the LoRA already works. For example, a dataset caption to teach the Scrubs-inspired Fortnite Default Dance might look like this:
A dark-skinned man in blue scrubs, with a name tag pinned to his scrubs. His hair is shaved very short. He is wearing sneakers. He is in what appears to be a hospital break room with smooth grey floors and white walls with a wooden accent in the middle. Wooden cabinets with a microwave on top, an orange chair, a metal rolling bookshelf, and paintings on the wall are in the background. He is doing the default dance.
Realistic.
Caption your dataset as if you were trying to recreate the video/image perfectly.
LoRA training works by generating an image with your caption, then comparing it to the corresponding dataset image. It identifies the differences between the two (known as loss) and updates its weights to try to correct them in the future.
In the caption example above, if you do not describe the background, there's a good chance the generated videos/images wouldn't include that break room, and that mismatch would be counted as loss to be corrected. The training would end up teaching the LoRA to put everyone in a hospital break room with that furniture in the background.
Your captions should not be redundant. Notice how the above example doesn't mention "He claps his hands and moves his arms up"? That would be redundant, because the dance LoRA should already handle that.
If you included "The man claps and moves his arms up", then the generated videos/images would likely already contain clapping and arms moving up, the LoRA would not identify much loss there, and it wouldn't be able to learn those behaviors as easily.
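To make that feedback loop concrete, here is a minimal sketch of what a training step is effectively doing. Real video-diffusion trainers predict noise in latent space rather than comparing raw frames, and every name below (lora_model, caption_embedding, etc.) is just a placeholder, but the caption-driven comparison is the same idea:

```python
import torch.nn.functional as F

def training_step(lora_model, caption_embedding, dataset_frames, optimizer):
    # Generate output conditioned on your caption.
    generated_frames = lora_model(caption_embedding)

    # "Loss" is just the difference between what was generated and your
    # dataset clip. Anything your caption leaves out (background, clothing,
    # faces) shows up here as a difference the LoRA will try to absorb.
    loss = F.mse_loss(generated_frames, dataset_frames)

    loss.backward()        # work out how to nudge the LoRA weights
    optimizer.step()       # apply the update
    optimizer.zero_grad()
    return loss.item()
```

The takeaway: whatever your caption fails to pin down becomes loss, and loss is exactly what gets baked into the LoRA's weights.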
Resolutions and Datasets
Common mistakes:
Using the highest resolution data possible.
Using a ton of very different input data, different angles, etc.
Training for multiple hours without verifying the output and updating.
Not standardizing your framerate.
Using upscaling/downscaling methods that cause pixelation and artifacts.
Not captioning the background.
Simple tips:
Train your LoRA at a low resolution first.
128x128 or 160x160 is usually good enough.
Do a low-resolution run for a few thousand steps at a higher learning rate (5e-5 or 1e-4 is usually what I do).
This allows you to make sure the LoRA is learning correctly.
You can update captions wherever you see unexpected overcooking, or wherever the LoRA is training on things it shouldn't be (like background elements).
Once you have a LoRA that you think is learning correctly, you can now use a higher resolution.
I usually use 256x256, or 16:9/9:16 datasets of a similar size, downscaled with a method that does not produce pixelation or artifacts. Bicubic is a good choice, or you can use a video editing program or a commercial upscaler/downscaler (see the preprocessing sketch after this list).
LoRAs learn by comparing the generated 256x256 output to your 256x256 dataset images. Image quality is important; raw resolution much less so.
256x256 and similar resolutions will usually contain enough detail to train the model.
Training at lower resolutions like this allows the model enough creative freedom to still hallucinate the finer details when used in generation, while staying true to the LoRA.
Keep your training datasets small in scope and specific. Train variations in separate sessions!
You can train individual elements of a LoRA piece by piece.
Train camera angles one at a time, not all together.
Train different outfits in separate datasets, not together.
You can enhance and "further train" a working LoRA, so focus on building it piece by piece instead of trying to get everything working in one fell swoop.
Start with a higher learning rate, and reduce it as you finetune.
Change your datasets!
If you train on the same 20 videos and 10 images forever, your loss will keep going down simply because the LoRA starts memorizing those exact clips. Introducing a novel dataset is much better protection against overfitting.
Loss is not everything! If your captions are not good, your loss can go down over time without the LoRA functioning as expected. Loss is just a measure of how closely your LoRA's generated output matches your input dataset.
Caption everything that your LoRA is not supposed to control. Detail the background colors, background elements, lighting, style, etc. If your LoRA is not supposed to control faces, describe the faces as well. I like to crop my datasets so that faces are out of frame, which lets me skip the face descriptions and just write "their face is out of frame" in the caption instead.
Standardize your dataset framerate. If you have some of your dataset at 60fps, some at 24fps, and some at 8fps, your LoRA may have unintended slow-motion or fast-motion behavior. Try to keep your dataset framerate relatively consistent.
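Here is a rough preprocessing sketch that handles both the bicubic downscaling and the framerate standardization in one pass. It assumes you have ffmpeg installed; the paths, the 256x256 target, and the 16 fps value are placeholders to adjust for your own dataset:

```python
import subprocess
from pathlib import Path

TARGET_FPS = 16          # pick one framerate for the whole dataset
TARGET_SIZE = "256:256"  # or a 16:9 / 9:16 size of similar area

src = Path("dataset/raw")
dst = Path("dataset/processed")
dst.mkdir(parents=True, exist_ok=True)

for clip in src.glob("*.mp4"):
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", str(clip),
            # bicubic scaling avoids the pixelation that cruder methods cause;
            # you could also add a crop filter here, e.g. to keep faces out of frame
            "-vf", f"scale={TARGET_SIZE}:flags=bicubic",
            # force a single output framerate so clips don't mix 60/24/8 fps
            "-r", str(TARGET_FPS),
            str(dst / clip.name),
        ],
        check=True,
    )
```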
Quick Notes:
Build an initial dataset of 10-20 videos/images.
Use new datasets of 10-20 videos/images to finetune and train new behaviors.
Caption this dataset with high detail and without redundancy, using the LoRA trigger words as if they already work perfectly. Describe each image/video in enough detail that it could be recreated exactly.
Train a low resolution LoRA (128x128 or 160x160) first. This is much faster and you can catch any dataset issues early.
Train one thing at a time.
Once you are confident your low-resolution training is working as expected, you can increase the resolution and retrain. I have never found a need to train higher than 256x256.
Keep a standard framerate.
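If it helps, here is that workflow written out as a hypothetical two-phase plan. The key names are purely illustrative (they don't come from any particular trainer), so translate them into whatever config format your training tool actually uses:

```python
# Illustrative settings only; map these onto your trainer's real config keys.
phases = [
    {
        "name": "low-res sanity check",
        "resolution": (160, 160),   # 128x128 also works
        "learning_rate": 1e-4,      # higher LR, a few thousand steps
        "goal": "catch dataset/caption issues before spending real GPU time",
    },
    {
        "name": "higher-res fine-tune",
        "resolution": (256, 256),
        "learning_rate": 5e-5,      # reduce the LR as you fine-tune
        "goal": "train the final LoRA once the low-res run looks correct",
    },
]

for phase in phases:
    print(f"{phase['name']}: {phase['resolution']} @ lr={phase['learning_rate']}")
```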
Here is a template I use when captioning my datasets:
[Describe the actors and their poses/positions]
[Clothing and accessories]
[Body shape/size, skin color, tattoos and skin details]
[Hair color and style, eye color, eyebrow shape, lip color, etc]
[Describe the location, furniture, background elements]
[Describe their actions, where they're looking, what they're doing]
[Style, camera movement, camera angle]
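And if you want to keep your captions consistent with that template, here is a tiny optional helper. The field values below are loosely based on the Scrubs example from earlier; a couple of them are made up just to fill every bracket:

```python
def build_caption(*parts: str) -> str:
    # Join the template fields into one caption, one sentence per field.
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p.strip())

caption = build_caption(
    "A dark-skinned man stands in the center of the frame",                # actors and poses
    "He wears blue scrubs with a name tag pinned to them, and sneakers",   # clothing and accessories
    "He has an average build",                                             # body shape, skin details
    "His hair is shaved very short",                                       # hair and face details
    "He is in a hospital break room with smooth grey floors, white walls, "
    "wooden cabinets with a microwave on top, an orange chair, and paintings on the wall",  # location
    "He is doing the default dance",                                       # actions
    "Realistic, static camera at eye level",                               # style, camera movement, angle
)
print(caption)
```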
In fact, using the above guide, I generated a really rough and simple Default Dance LoRA, and here's a cat doing the default dance: