Dataset
Most of my datasets come from Flickr. I download the largest image size available, and then I crop my images into one of the following formats:
512px * 512px (or 1:1 aspect ratio)
384px * 640px (or 3:5 aspect ratio)
640px * 384px (or 5:3 aspect ratio)
576px * 448px (or 9:7 aspect ratio)
448px * 576px (or 7:9 aspect ratio)
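A quick sanity check of those crop sizes (plain arithmetic, nothing Kohya-specific): each width/height pair reduces to the stated aspect ratio.

```python
from math import gcd

# Crop sizes as (width, height); each reduces to the listed aspect ratio.
crops = [(512, 512), (384, 640), (640, 384), (576, 448), (448, 576)]

for w, h in crops:
    d = gcd(w, h)
    print(f"{w}x{h} -> {w // d}:{h // d}")
# 512x512 -> 1:1
# 384x640 -> 3:5
# 640x384 -> 5:3
# 576x448 -> 9:7
# 448x576 -> 7:9
```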
In selecting a dataset, I look for images that are not very complicated or filled to the brim with details. I try to make sure I have images of the following in my dataset:
All times of day (night, sunset, twilight, dawn, mid-day, etc)
All types of indoor lighting
Neon lighting
Lens flare
Out of focus subject
Bokeh background
Studio portrait
City
Nature
Gradients
I remove any borders or other unwanted items in the images.
Captions
I used to caption the entire dataset, but I've since realized that's not worth the time. Writing extensive captions for 1,000+ images takes hours. And while captioning works, it never works perfectly: captions aid generalization, but they also introduce unwanted biases of their own.
Regularization
I used regularization images for a while, but I found it too difficult to balance against my dataset.
Kohya parameters
Word of advice on parameters:
Start out making use of as few parameters as possible. You want to start out with a baseline where you are capable of learning from the dataset. Don't mind the bad generalization and unwanted side effects.
If you add too many unknown factors (parameters) at a time, it's impossible to attribute any effect to a single factor. I made this mistake myself, trying to rush into finding the perfect workflow. As a result, I ended up going back and forth on most parameters many times. It took me far longer to get to where I am than it should have.
I would recommend a baseline of:
Optimizer=AdamW
Weight decay=0
Scheduler=constant
Dimension=64
Alpha=64
U-net only
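Expressed as a config fragment, that baseline might look like the following. The argument names are taken from kohya's sd-scripts (`train_network.py`); treat the exact names as assumptions and verify them against your installed version.

```python
# Baseline kohya sd-scripts arguments (names assumed from train_network.py;
# verify against your version before use).
baseline = {
    "optimizer_type": "AdamW",
    "optimizer_args": ["weight_decay=0"],   # no weight decay to start
    "lr_scheduler": "constant",
    "network_dim": 64,                      # rank
    "network_alpha": 64,                    # alpha = rank
    "network_train_unet_only": True,        # U-net only
}
```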
Then I would conduct a series of experiments for each parameter I'm curious about. Start with the working baseline and add a single parameter. Try it with a few different values and write down your observations. Then move on to the next parameter, removing the one from the previous experiment. You can later go on to test things in combination.
Type
I currently use the Kohya LoCon preset. The expanded capabilities provide better reproduction, but compared to LoRA, it's a bit more difficult to minimize the effect on composition.
Batch size & gradient accumulation steps
It is a lot easier to train a film stock LoRA with a high batch size, as it helps avoid bad learning which would require weight decay to smooth away. I currently use a batch size of 8 and 4 accumulation steps (as batch size 8 is the highest my GPU is capable of).
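To see why gradient accumulation stands in for a larger batch: averaging the mean gradients of 4 equally sized micro-batches of 8 gives the same result as the mean over all 32 samples. A toy demonstration with stand-in numbers:

```python
# Gradient accumulation approximates a larger batch: the average of the
# mean "gradients" of 4 micro-batches of 8 equals the mean over all 32
# samples (when micro-batches are equally sized).
grads = [float(i) for i in range(32)]  # stand-in per-sample gradients

micro = [grads[i:i + 8] for i in range(0, 32, 8)]
accumulated = sum(sum(m) / len(m) for m in micro) / len(micro)
full_batch = sum(grads) / len(grads)

print(accumulated == full_batch)  # True: effective batch size is 8 * 4 = 32
```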
Update
Not sure about this any longer. In the latest version I'm using a batch size of 1 and a low max grad norm. I find that a batch size higher than 1 can sometimes result in unwanted generalizations. This might not be an issue if you can train with very high batch sizes.
LR Scheduler
You will get better results with a scheduler that ends on a lower learning rate: cosine, polynomial, etc. However, that means you need to get the length of your training right. If you decide that epoch 4 out of 8 looks best, you probably would have been better off training for just 4 or 5 epochs, since the schedule would then have ended on a lower learning rate.
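To make that concrete, here is the shape of a plain cosine decay in pure Python (illustrative numbers, not any particular trainer's implementation): stopping halfway through a planned schedule leaves the learning rate at half its base value, whereas a schedule planned to end at that epoch would have decayed much further.

```python
import math

def cosine_lr(step, total_steps, base_lr):
    # Cosine schedule decaying from base_lr towards 0 over total_steps.
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

base_lr = 1e-4

# Stopping at epoch 4 of a planned 8: the schedule is only halfway down.
print(cosine_lr(4, 8, base_lr))  # ~5e-05, still half the base LR

# Planning 5 epochs instead: by epoch 4 the LR has nearly decayed away.
print(cosine_lr(4, 5, base_lr))  # ~9.5e-06
```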
My favorite is OneCycleLR. The only downside is that it doesn't (really) work with Prodigy or D-Adaptation.
Update
Lately, I favor a constant learning rate. This avoids a bias towards parts of the dataset. It's especially useful if you run a shorter training to (let's say) refine existing weights.
Optimizer
I have tried to incorporate learning-rate-free optimizers (Prodigy and D-Adaptation) into my workflow many times. In my experience, the learning rate adaptation of D-Adaptation and (especially) Prodigy is very fickle. It gets thrown off easily by other settings, leaving you with wildly varying results.
I recommend AdamW, Lion or Adafactor.
Loss
I recommend using loss type Smooth L1 with the SNR Huber schedule and a Huber C of 1. I find that L2 tends to overfit on details instead of picking up traits. The SNR Huber schedule adjusts the Huber C value according to the signal-to-noise ratio of each training timestep.
Update
After much testing, I prefer a constant Huber schedule where I set the Huber C myself. I recommend a value between 0.9 and 1.0, as it reflects the fact that a film stock dataset is highly noisy with lots of outliers. It's more conservative and less likely to create unwanted generalizations. Keep in mind that as your training proceeds and your weights become more accurate, the MSE part of the Huber loss comes more into effect. If you want to refine the weights of a LoRA you've already trained, it's probably a good idea to increase the Huber C to avoid rapidly overfitting the good weights.
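The quadratic/linear tradeoff that Huber C controls can be seen in the textbook Smooth L1 / Huber form below (note: this is the standard formulation, not necessarily kohya's exact implementation). Small residuals behave like MSE; residuals beyond the delta (the Huber C) are only penalized linearly, which is what makes the loss tolerant of outliers.

```python
def smooth_l1(error, delta=1.0):
    # Textbook Smooth L1 / Huber loss: quadratic (MSE-like) for small
    # errors, linear (outlier-tolerant) beyond |error| = delta.
    # kohya's implementation may differ in scaling.
    e = abs(error)
    if e <= delta:
        return 0.5 * e * e
    return delta * (e - 0.5 * delta)

print(smooth_l1(0.5, delta=1.0))  # 0.125 (quadratic region, same as MSE)
print(smooth_l1(3.0, delta=1.0))  # 2.5 (linear region; pure L2 would give 4.5)
```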
Weight decay
A film stock LoRA is a complicated concept to learn. It is defined by many subtle traits that will be spread unevenly across your dataset. If your weight decay is too high, you will smooth away all of the subtle traits. Your LoRA will only be able to reproduce the most obvious and frequent traits in your dataset.
Dropout
I recommend using rank, module, and network dropout. You need these to create a robust LoRA that balances the traits of the concept well. However, I don't use dropout as my main tool for generalization, as it will smooth away subtler traits.
Network dropout: 0 - 0.25
Rank dropout: 0.1 - 0.25
Module dropout: 0.1 - 0.33
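Expressed as kohya-style arguments, values within those ranges might look like the sketch below. The argument names (`network_dropout`, and `rank_dropout` / `module_dropout` passed via `network_args`) are assumed from sd-scripts; verify them against your version.

```python
# Dropout settings within the recommended ranges, as kohya-style arguments
# (names assumed from sd-scripts; verify against your installed version).
args = {
    "network_dropout": 0.1,        # range: 0 - 0.25
    "network_args": [
        "rank_dropout=0.15",       # range: 0.1 - 0.25
        "module_dropout=0.2",      # range: 0.1 - 0.33
    ],
}
```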
Scale weight norms
I don't recommend using scale weight norms for anything other than monitoring the maximum and average key weights, which you enable by setting it to 10. I've yet to see a LoRA reach key weights of 10 unless it was already overfitted.
On that note, I find that film stock LoRAs should have an average weight between 1 and 2.
Network Rank / Network Alpha
In my opinion, a network rank of 16 is the lowest you can go. I have been happy with both 32 and 64. It is easier and faster to train a higher rank, but it requires more GPU memory.
There's a lot of confusion online about what network alpha is and what it is intended for. I have had no problem setting it to the same value as network rank, leaving me free to focus on other parameters to fine-tune my results.
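One reason alpha = rank is a sane default: the LoRA update is scaled by alpha / rank, so matching them keeps that scale at 1.0 and the effective learning rate is not silently attenuated.

```python
# LoRA's weight update is scaled by alpha / rank. Setting alpha equal to
# rank keeps the scale at 1.0, so the learning rate is not silently reduced.
def lora_scale(alpha, rank):
    return alpha / rank

print(lora_scale(64, 64))  # 1.0
print(lora_scale(32, 64))  # 0.5 (halves the effective update strength)
```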
Blocks & timesteps
It is possible to optimize blocks and timesteps to minimize the effect your LoRA has on composition. However, I would only recommend experimenting with this as a last step, once you have a good LoRA to compare against. Stripping away blocks and timesteps can seriously hamper your LoRA's ability to impact the output, and can also introduce instability.
Epoch / Repeat
There's a lot of conflicting information about what epoch means and what its practical implications are. You will find people online talking about how many epochs and repeats you should or shouldn't use. I suspect this is only meaningful if people are using the same parameters, from the same guide, and training the same kind of LoRA. My understanding is that an epoch is simply one pass over your dataset in its entirety - one repetition of it. Repeats, on the other hand, simply allow you to increase or decrease the presence of one group of images relative to others (in essence, balancing concepts).
I use total steps instead to configure how long I want to train, and simply apply 1 repeat to all images. As I mentioned earlier when talking about weight decay, the concept of a film stock is very subtle. I think the only chance you have of balancing your dataset would be after you've created your first version, when you can see which traits you want to strengthen (although I don't do that myself). Typically, if you increase batch size, gradient accumulation steps, or network dropout, the LoRA becomes more balanced.
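Converting between total steps and epochs is simple arithmetic. A sketch with illustrative numbers (1 repeat for all images, counting optimizer steps; the exact step accounting may differ in your trainer):

```python
import math

# Illustrative numbers: convert a target step count into epochs,
# with 1 repeat applied to all images.
num_images = 1000
repeats = 1
batch_size = 8
grad_accum = 4

steps_per_epoch = math.ceil(num_images * repeats / (batch_size * grad_accum))
target_steps = 2000  # hypothetical training length
epochs_needed = math.ceil(target_steps / steps_per_epoch)

print(steps_per_epoch)  # 32 optimizer steps per pass over the dataset
print(epochs_needed)    # 63
```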
Training
My folders are structured like this:
img / 1_ (1 repeat - no class token)
Or this:
img / 1_misc
1_day
1_night
1_studio
1_twilight
1_neon lights
1_flash
The class tokens are ignored when caption_dropout_rate is set to 1.
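The folder-name convention above is `<repeats>_<class token>`: kohya derives the repeat count from the number before the underscore and treats the rest as the class token. A minimal sketch of that parsing:

```python
# kohya derives repeats and the class token from the folder name,
# "<repeats>_<class token>". A minimal sketch of that convention:
def parse_folder(name):
    repeats, _, token = name.partition("_")
    return int(repeats), token  # token may be empty, as in "1_"

print(parse_folder("1_neon lights"))  # (1, 'neon lights')
print(parse_folder("1_"))             # (1, '')
```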