Training Musings: Flux Rapid Training (or How I Spent 4000 Buzz trying to break it so you don't have to)
Last night, I decided to test Civit's new Flux rapid LoRA training (along with the new JoyCaption module).
The Claim: A functional LoRA within 5 minutes.
The Cost: 4000 Buzz for 100 pictures.
I set out to see how much I could throw at it.
Based on the relevant section of Civit's Education site, I figured that character training would probably work pretty well, so I considered concepts and styles to test instead. I finally settled on the art of Akira Toriyama, legendary mangaka of Dragon Ball fame.
Link to Lora: https://civitai.com/models/747595?modelVersionId=836035
Part 1: Collecting a Dataset
While it's not always my methodology, for this experiment, I wanted something different:
Only official Akira Toriyama art.
No synthetic data
No fanart, just pieces from the man's own hand.
Wide spread of works, to avoid Dragon Ball bias - that meant mostly Chrono Trigger and Dragon Quest pieces.
Not only simple character art (and I did not use any monsters from DQ/DW)
I hoped to get the full 100 pictures, but because of the last two points, the final dataset came to 45 high-resolution images.
Some of these images have 4, 5 and even 6 characters in them. That was intentional. They are also set against very busy backgrounds and environments. I also did not take art of every single main character across DB/CT/DQ; I did take some of Goku, despite Flux already knowing his adult likeness pretty well.
Part 2: Captioning
I was originally going to simply use an activation tag, but after re-reading CrasH's training diary for another Manga/Anime art style (Fullmetal Alchemist), I opted to also include JoyCaptions.
This is where I likely could have done better. Being used to the full model's overly flowery language and unnecessary details, I kept the 'temperature' at 0.5 and set a max of 75 new tokens. That was not enough to fully describe the images in many, many cases, because so many of them have very busy backgrounds, multiple characters, or both.
Even if I'd left the setting maxed out, 100 tokens on the Civit Trainer would likely have been too low for some images.
As we'll see though, it wasn't character descriptions that ended up being the main issue. My failing here was not going through each caption to properly edit and fill them out. I did some, but I did miss a few that were incomplete.
Action List:
Review captions
Edit captions
Be more thorough in making sure the number of people mentioned is accurate!!!
You can find the dataset, and captions (in all their improper glory), attached to this article.
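If you want a head start on the review step, here is a minimal sketch of a caption audit. It is not part of the Civit trainer and not what I ran for this LoRA. It assumes the dataset is a folder where each image has a caption in a same-named .txt file, and the names caption_problems and audit_dataset, along with the 8-word threshold, are made up for illustration. The idea: a caption that doesn't end in sentence punctuation probably got cut off by the token cap.

```python
from pathlib import Path

# Assumed layout: image.png sits next to image.txt holding its caption.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
END_MARKS = (".", "!", "?", '"', "'")


def caption_problems(caption: str, min_words: int = 8) -> list[str]:
    """Return the reasons a caption deserves a manual look (empty list = fine)."""
    text = caption.strip()
    if not text:
        return ["empty"]
    problems = []
    # No closing punctuation usually means the captioner hit its token cap.
    if not text.endswith(END_MARKS):
        problems.append("possibly truncated")
    # Arbitrary threshold: very short captions rarely cover a busy scene.
    if len(text.split()) < min_words:
        problems.append("very short")
    return problems


def audit_dataset(folder: str) -> dict[str, list[str]]:
    """Map each flagged image file name to its list of caption problems."""
    report = {}
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue
        caption_file = image.with_suffix(".txt")
        if not caption_file.exists():
            report[image.name] = ["missing caption"]
            continue
        problems = caption_problems(caption_file.read_text(encoding="utf-8"))
        if problems:
            report[image.name] = problems
    return report
```

Calling audit_dataset on your dataset folder gives you a short list of files to open by hand. It won't tell you whether the number of people in a caption is right; that part still needs eyes on each image.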
Part 3: Baking a Cake (or the LoRA)
There's nothing to really specify here: Rapid Training doesn't let you pick any parameters.
Just pay your buzz and Drink Water. There won't be time to make a cake.
It'll probably be done by the time you're back.
Total training time: 2.5 minutes
The results!
One file is returned: a single epoch, and in my case the file size was 86MB, meaning a Network Dim of roughly 8. As I've only done this once, I have no idea if that's fixed or if the value is dynamic based on the number of images.
So how good are the images?
Pros
It absolutely nails the style. It is more likely to break on well-known characters that Flux may already associate with their own art style. Judge for yourself:
You can see that Mario is not as accurate as the others in terms of the style, but neither is he in his default style. The "Krillin" image is interesting, because he wasn't even part of the dataset, and yet, through prompting, you can get an almost exact likeness.
Cons
It learned far too many extraneous details. This is likely partly due to captioning, but I believe that the Network Dim was also simply higher than it needed to be for this set - but would the results have been as good on style otherwise?
The biggest failings:
Generates extra people when I only prompted for a single one
Adds a lot of extra detail and clutter to the scene (potentially good?)
Overfit for warrior types: I asked for Bender from Futurama and it made him into the protagonist of a Dragon Quest game. The image looks great though!
A lot of grand green landscapes and barren, rocky landscapes for backgrounds. Obvious, given DBZ and DQ, but the dataset was not overly biased towards those. My theory is that because the JoyCaptions recognized the art as being Dragon Ball-like, it incorporated some of those concepts into the LoRA on its own...
Extra people
Busy/Detailed background
CONCLUSIONS
This is probably excellent for character/celebrity likeness LoRAs.
Curate your dataset for quality and training efficiency: while I feel the group action shots are important for me to include in this LoRA, as they demonstrate the style more than random characters, it's clear they had a heavy influence.
Do not simply auto-caption and forget. If the Civit tagger is coming up short for complex scenes, use the full JoyCaption model and edit things down, or complete the captions manually.
Share your own conclusions if you use it!
Hopefully this helps people along!
And please, if you can spare it, donate some buzz so I can test this more and compare it to 'traditional' training on the same dataset once I fix the captions.