Lessons Learned from the Urban Decay Flux LoRA
I’ve completed another round of experiments with Flux LoRA training and wanted to share what I’ve found. I trained five versions of Urban Decay, using a different method each time and carefully judging the results. Only two of these have been published, but I can make the others available to anyone who wants to try their own experiments. First, I’ll go over my baseline settings, then dive deeper into each experiment and what I’ve found.
Typical Training Configuration
I’ve done quite a bit of experimenting and found some settings that tend to do well with the more artistic style LoRAs I’ve been training. I build the datasets according to the guidelines in my dataset preparation guide. From there, my LoRAs have trained well using the following settings, which have become my “default” for the first version of a new LoRA (a rough config sketch follows the list):
20-30 images
Captioned with a GPT vision model, with human edits to remove references to the style
512 resolution, 1-2 repeats, batch size 2
~2000 training steps (rounded up to the nearest epoch)
0.0003 learning rate (0.00015 x batch size)
64 alpha, 64 rank
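For anyone who wants a concrete starting point, here’s roughly how that baseline translates into a trainer config. This is an illustrative Python sketch, not the exact file for any particular trainer; the key names and model path are my own placeholders, so map them onto whatever your tool actually expects.

```python
# Illustrative baseline only: key names and the checkpoint path are hypothetical
# and will differ between trainers.
batch_size = 2

base_config = {
    "base_model": "models/flux1-dev.safetensors",   # hypothetical path to Flux-dev
    "resolution": 512,
    "num_repeats": 1,                       # 1-2 depending on dataset size (20-30 images)
    "train_batch_size": batch_size,
    "max_train_steps": 2000,                # rounded up to the nearest whole epoch
    "learning_rate": 0.00015 * batch_size,  # = 0.0003
    "network_rank": 64,
    "network_alpha": 64,
}
```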
These settings worked out well when I trained version 1 of Urban Decay. The learning rate is low, but the step count is high; training takes longer, but I’m much less likely to overshoot the sweet spot on a first version. The alpha and rank are set high, but an initial training run with larger values beats wasting 6-7 hours of training on a LoRA that did nothing because it was too small. These numbers matter mainly because they stayed constant throughout (other than whatever I changed to test each version), so everything could be compared back to version 1.0.
Version 1.0 came out okay, especially for the first version of a new LoRA. The style was captured perfectly, and the LoRA worked well when triggered at full strength. However, people prompted into the images sometimes came out distorted or disfigured, much like you’d see in early SD models. At first, I thought it was because there were very few people in the training dataset, but after a few experiments, I think it’s more likely an artifact of training on Flux itself and may be possible to improve.
Original Flux-dev versus Flux-Dev2Pro
John Shi posted an article on Medium.com discussing some of the inherent difficulties of using distilled models for Flux LoRA training. As a (partial) solution, he recommended training on a fine-tuned Flux-dev model, dubbed Flux-Dev2Pro, to work around some of those issues. Since v1 was trained on the standard Flux-dev model, I kept the dataset and everything else the same and trained v2 on Flux-Dev2Pro. Here’s a sample of the results:
Looking at v1 and v2 side by side, you can see that the earlier distortions and disfigurements I mentioned are present in the v1 model but not so much in the v2 model. The style may be more heavily applied in v1, but v2 is far superior in every other way. I ran at least 30 image tests and found that v2 was consistently better than v1 (but not always perfect). To be sure, I trained another model featuring primarily people and got similar results. Though the original Flux-dev did just fine for most of my art style LoRAs, when people were involved, Flux-Dev2Pro was better.
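For reference, the only difference between the v1 and v2 runs was which base checkpoint the trainer loaded; everything else came from the same baseline. A minimal sketch reusing the hypothetical base_config dict from earlier (both file paths are placeholders):

```python
# Same settings, different starting weights (paths are hypothetical).
v1_config = {**base_config, "base_model": "models/flux1-dev.safetensors"}
v2_config = {**base_config, "base_model": "models/flux-dev2pro.safetensors"}
```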
Regularization Datasets
Regularization datasets are a funny thing. Many creators swore by them for SD 1.5, but I saw fewer mentions of them for SDXL training (and my SDXL LoRAs came out fine without them). I’d guess that by now only a handful of creators are using them for Flux training. So for v3, I loaded up the v1 training config and dropped in a regularization dataset. See the end of this section for more details on how I built it.
So, v3 had some unique qualities that set it apart from v1 and v2. Notice how clean and neat everything on the table is compared to the other versions. The room (what the LoRA is supposed to focus on) is the same, but everything else looks “normal.” This was pretty typical of the rest of the results as well. Using the regularization dataset helped the model determine exactly what was part of the LoRA (the room) and what wasn’t (the table and people) and separate those concepts. Overall, this isn’t what I wanted from this particular LoRA, since I wanted everything in the image to adopt the same entropic, decayed look. Still, it’s certainly a helpful technique to have in your toolkit, especially for concepts Flux finds hard to isolate from the other elements in the training dataset.
In addition, using the regularization dataset helped prevent a lot of conceptual bleeding. Every other version of the LoRA held onto some of the style even when the trigger word wasn't used (the amount varied by version); v3 was the only exception. When using v3 without the trigger, the images were almost identical to images generated with no LoRA at all. This makes me believe the model has a much firmer grasp on the concept being trained and the token that represents it.
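If you want to check your own LoRA for bleeding, the test is simple: generate the same prompt with the same seed, once without the LoRA and once with the LoRA loaded but the trigger word left out of the prompt, and compare. Here’s a minimal sketch using the diffusers FluxPipeline; the LoRA filename and prompt are just placeholders.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a cozy living room with a wooden table and two chairs"  # no trigger word
seed = 42

# Baseline: no LoRA loaded at all.
baseline = pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
baseline.save("no_lora.png")

# Bleed test: LoRA loaded, trigger word deliberately omitted from the prompt.
pipe.load_lora_weights("urban_decay_v3.safetensors")  # hypothetical filename
bled = pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
bled.save("lora_no_trigger.png")
```

If the two images come out nearly identical, the concept is well isolated behind its trigger; if the style shows up anyway, you have bleed.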
I’m not going into all the details of how to use a regularization dataset (since that varies by trainer), but I will explain how I made mine. Making the regularization dataset is a relatively simple task (a code sketch follows these steps):
Grab all the captions from your regular dataset. Note that your captions shouldn’t mention what you are training, so they are pretty much descriptions of what the images would look like without the LoRA.
Take those captions and use them as prompts. Generate AI images for each image in your training dataset. Rename each AI image to match its non-LoRA counterpart.
Caption your AI images using the prompt that created them.
Place all of your AI-generated images in a folder with their caption files. Regularization dataset done!
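Here’s a minimal sketch of that process in Python, using the diffusers FluxPipeline for generation. The folder paths, file layout (one .txt caption per image), and generation settings are my own assumptions, so adjust them to match your dataset.

```python
from pathlib import Path

import torch
from diffusers import FluxPipeline

TRAIN_DIR = Path("dataset/urban_decay")    # training images + .txt captions (hypothetical path)
REG_DIR = Path("dataset/regularization")   # output folder for the regularization set
REG_DIR.mkdir(parents=True, exist_ok=True)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

for caption_file in sorted(TRAIN_DIR.glob("*.txt")):
    prompt = caption_file.read_text().strip()

    # Use the style-free caption as a prompt, generate a "normal" image,
    # and name it to match its training-set counterpart.
    image = pipe(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]
    image.save(REG_DIR / f"{caption_file.stem}.png")

    # Caption the AI image with the prompt that created it.
    (REG_DIR / caption_file.name).write_text(prompt)
```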
Captionless Datasets
I’ve already written about captionless datasets (and other creators have covered them as well), but I wanted to confirm some of my previous conclusions by including them as a variable in this experiment. So, v4 was trained with no captions. The results confirm some of my earlier findings.
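Configuration-wise, the only change for v4 was telling the trainer not to read caption files, so every image trains against the trigger word alone (or an empty caption, depending on your trainer). A hedged sketch reusing the hypothetical base_config from earlier; these key names are placeholders, not any specific trainer’s options.

```python
# Hypothetical keys: a captionless run differs from the baseline only in how
# (or whether) captions are read.
v4_config = {
    **base_config,
    "caption_extension": None,               # ignore .txt caption files entirely
    "default_caption": "urban decay style",  # placeholder trigger used for every image
}
```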
The style worked well with the captionless LoRA, but it had trouble generalizing and following some prompts. The prompt for the above image was, “a cluttered walk-in closet with clothes and shoes everywhere, a person searching for something on the top shelf.” The v4 LoRA struggled to figure out what a closet should look like in the Urban Decay style, since there were no examples in the dataset. Despite the same lack of examples, the v1 model had no problem figuring it out. The v1 LoRA also had no trouble creating a person reaching for the top shelf, which the v4 model either ignored or didn’t understand. I believe the captionless training method will work best if:
The LoRA doesn’t have to generalize much outside of its training data, and/or
The training dataset is very diverse.
Multiple Resolutions
I’ve also previously written about multiple resolutions and the effect of training subsets of your dataset at different resolutions. For v5, I trained at both 512 and 768 resolutions. The effect on image crispness is subtle but noticeable. Interestingly, v5 is also somewhat more colorful than v1, which stands out given how drab this LoRA is. This tells me that higher-resolution training is better, which we already knew.
However, I can’t confirm my earlier conclusions about multi-resolution training here. Urban Decay changes the image so radically that the compositions of the base image and the LoRA image end up very different. So, while higher-resolution training will always be better, whether multiple-resolution training is better remains inconclusive.
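For completeness, here’s how the v5 setup might be expressed: the same images registered twice, once per resolution. How you actually declare multiple dataset subsets depends entirely on your trainer, so treat the keys below as placeholders.

```python
# Hypothetical multi-resolution layout: per-subset resolution is meant to
# override the single top-level value from the baseline sketch.
v5_config = {
    **base_config,
    "datasets": [
        {"image_dir": "dataset/urban_decay", "resolution": 512, "num_repeats": 1},
        {"image_dir": "dataset/urban_decay", "resolution": 768, "num_repeats": 1},
    ],
}
```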
Conclusion
Through these experiments with Flux LoRA training, I've found that the choice of base model, the use of captions and regularization datasets, and training resolutions significantly impact the results:
Using Flux-Dev2Pro over the original Flux-dev model improves the handling of human figures and reduces distortions.
Regularization datasets can help the model better separate the concepts you're training from the rest of the image, though they may not always produce the desired stylistic effects. If your LoRA suffers from concept bleeding, one may be necessary.
Captionless datasets may work for styles that don't require generalization beyond the training data, but they can struggle with new concepts or prompts.
Multiple training resolutions show slight improvements in image crispness and color, but results are inconclusive due to the radical changes introduced by the Urban Decay style.
You can produce better and more consistent results by carefully selecting your training configurations and understanding how these variables affect your LoRA. Keep experimenting, and you'll continue to improve your models.