I've been doing a lot of experiments with LoRA training since SDXL came out, and I've got some advice I'd like to share. This is to some extent a response to the excellent article by @AI_Characters, which I'd encourage you to read first if you haven't already. I've found that most of the settings in that article work very well in a lot of cases and are an excellent baseline to work with; I fall back on the method described there in the odd case where my own settings don't work well, and it's generally been very helpful. Anyway, on with the tips:
First and foremost, if you take away nothing else from this article, it's that there's no single universal way to train LoRAs. Your success with different settings depends a lot on what concept or style you're trying to train and the data you're training with. I have yet to find any particular settings that work optimally in every situation, and the best way to get a feel for what works and what doesn't is to experiment! Don't just take my word (or anyone else's) for it; try different things and see if you get better results. Specifically, train more than once on the same data, varying one setting at a time, and see how things change. If you change more than one thing at a time, it'll be difficult to tell how each change affected your results.
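To make that concrete, here's a minimal sketch of how I'd structure a one-variable-at-a-time comparison. The train_lora function is a hypothetical stand-in for whatever trainer you actually use (kohya_ss command line, a GUI preset, etc.), and the specific settings are just example values:

```python
# Minimal sketch of a one-variable-at-a-time comparison.
# train_lora() is a placeholder -- swap in your actual trainer invocation.

def train_lora(output_name, **config):
    print(f"would train {output_name} with {config}")  # replace with the real call

base_config = {
    "optimizer": "Prodigy",
    "learning_rate": 1.0,
    "network_rank": 48,
    "epochs": 20,
}

# Each experiment overrides exactly one key, so any difference between the
# resulting LoRAs can be attributed to that one setting.
experiments = {
    "baseline": {},
    "rank32": {"network_rank": 32},
    "epochs10": {"epochs": 10},
}

for name, override in experiments.items():
    train_lora(output_name=f"style_{name}", **{**base_config, **override})
```

Naming each output after the setting you changed makes it much easier to compare sample grids afterwards.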
Secondly, you want to make sure that your dataset is as consistent as possible, particularly when training a style. In my experience it's better to have a smaller number of images in a very consistent style than a larger number that includes a few that don't fit quite as well. I was once struggling to train on a set of about 30 style images; after I removed the two or three that didn't fit very well, the training suddenly converged in just a couple of epochs.
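I don't have a rigorous process for spotting the misfits, but one trick that can help (this is my own addition, not something from the article above) is to embed every image with CLIP and look at which ones are least similar to the rest of the set:

```python
# Rough sketch: flag dataset images whose CLIP embedding is least similar to
# the dataset's average embedding -- good candidates to consider removing.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = sorted(Path("dataset").glob("*.png"))  # adjust path/extension to your data
images = [Image.open(p).convert("RGB") for p in paths]

with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)

# Cosine similarity of each image to the mean embedding of the whole set.
mean = feats.mean(dim=0, keepdim=True)
mean = mean / mean.norm(dim=-1, keepdim=True)
scores = (feats @ mean.T).squeeze(1)

for p, s in sorted(zip(paths, scores.tolist()), key=lambda x: x[1])[:5]:
    print(f"{s:.3f}  {p.name}")  # the lowest-scoring images are the likely outliers
```

It's not a substitute for eyeballing the set yourself, but it's a quick sanity check on larger datasets.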
Third, try training with different optimizers. AdamW8Bit is kind of a brute force method: it'll get you there eventually, but you have to go pretty slow with it to make it work, so it can take a lot longer than other methods. Prodigy is generally the first thing I try, and it will often converge in just a few epochs with results that are at least as good as I'd get from AdamW8Bit. I haven't tried all the optimizers exhaustively (I have had some success with DAdaptation), but you may find it's worth experimenting with them as well. I generally run Prodigy for 15-25 epochs, although it'll often work in just a few.
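For reference, here's roughly what a Prodigy setup looks like using the standalone prodigyopt package (most trainers expose these same knobs as "optimizer args"). The specific values below are common starting points I'm assuming, not settings you have to use:

```python
# Sketch of a typical Prodigy configuration via the prodigyopt package.
# This only constructs the optimizer for illustration; your trainer does this for you.
import torch
from prodigyopt import Prodigy

# Stand-in for the LoRA parameters your trainer would actually pass in.
params = [torch.nn.Parameter(torch.zeros(48, 48))]

optimizer = Prodigy(
    params,
    lr=1.0,                    # a multiplier on Prodigy's own estimated LR, not an absolute rate
    weight_decay=0.01,
    decouple=True,             # decoupled (AdamW-style) weight decay
    use_bias_correction=True,
    safeguard_warmup=True,     # helps when the start of training is unstable
)
```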
Occasionally, Prodigy will train too aggressively and you'll end up with a mess. If this happens, I recommend reducing the learning rate. Prodigy's learning rate setting (usually 1.0) is actually a multiplier on the learning rate that Prodigy determines dynamically over the course of training, so if your results are all over the place, try turning it down. I've never had to go below 0.25 or so (although it wouldn't hurt to dip below that if you're really having trouble). One other thing I like to do when a LoRA looks overtrained is to test it at a reduced weight; sometimes a LoRA that looks terrible at 1.0 will look great at 0.4-0.6. One final note: when training on a 4090, I had to set my batch size to 6 as opposed to 8 (that's with a network rank of 48; the batch size you can fit may be higher or lower depending on your network rank).
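On testing at reduced weight: most UIs let you just dial down the LoRA strength (e.g. <lora:name:0.5>). If you're scripting it with diffusers, it looks roughly like this; the output directory and filename are placeholders, and newer diffusers versions may prefer set_adapters for scaling instead of cross_attention_kwargs:

```python
# Sketch: trying a possibly-overtrained LoRA at a few reduced strengths (diffusers).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./output", weight_name="my_style_lora.safetensors")

# An overtrained LoRA often looks much better somewhere around 0.4-0.6.
for scale in (1.0, 0.6, 0.4):
    image = pipe(
        "a test prompt in the trained style",
        cross_attention_kwargs={"scale": scale},
    ).images[0]
    image.save(f"test_scale_{scale}.png")
```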
Fourth, try playing around with training layer weights. When I'm training a concept as opposed to a style (and I want to minimize the amount that the style of my training data affects the output LoRA), I adjust the layer training weights. I haven't experimented with this very much so these may not be the optimal settings, but I've had good luck with the following:
Down weights: 1,1,1,1,1,1,1,0.5,0.25,0,0,0
Mid weight: 0
Up weights: 0,0,0,0.25,0.5,1,1,1,1,1,1,1
Please bear in mind that this is FOR CONCEPTS ONLY. If you're trying to train a style with these weights, you'll probably end up with garbage (although I haven't tried it, so give it a go if you're curious and see what happens). The theory here is that the inner layers handle fine details while the outer layers handle the larger features of the image, and style is carried more by those fine details. One thing I haven't attempted yet is training styles with the reverse of these settings; I'll update this article once I've done that.
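If you're training with kohya-based scripts, these per-block weights usually get passed through the network args for block-wise learning rates. Here's a sketch of what that looks like; the argument names (down_lr_weight, mid_lr_weight, up_lr_weight) come from kohya's sd-scripts documentation, but double-check them against your trainer's version before relying on this:

```python
# Sketch: passing the concept-only layer weights above to a kohya-style trainer
# as block-wise learning-rate network args. Script name and other flags are placeholders.
network_args = [
    "down_lr_weight=1,1,1,1,1,1,1,0.5,0.25,0,0,0",
    "mid_lr_weight=0",
    "up_lr_weight=0,0,0,0.25,0.5,1,1,1,1,1,1,1",
]

cmd = [
    "accelerate", "launch", "sdxl_train_network.py",
    # ...your usual model/dataset/optimizer flags go here...
    "--network_args", *network_args,
]
print(" ".join(cmd))  # inspect the command, then launch it from your shell
```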
Fifth, sometimes LoCons (which extend LoRA training to the U-Net's convolutional layers) work better, so they're worth a try when a standard LoRA isn't getting you where you want to go.
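If you're on kohya-based scripts with the LyCORIS package installed, switching a run from a plain LoRA to a LoCon is mostly a matter of swapping the network module and adding convolution dims. A sketch (the argument names come from LyCORIS; the dim/alpha values are just examples):

```python
# Sketch: extra arguments that turn a kohya LoRA run into a LoCon run.
# Requires the LyCORIS package; conv_dim/conv_alpha values are only examples.
locon_args = [
    "--network_module", "lycoris.kohya",
    "--network_args", "algo=locon", "conv_dim=16", "conv_alpha=8",
]
print(" ".join(locon_args))  # append these to your usual training command
```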
Finally, when selecting which model to train a LoRA on, remember that a LoRA is a delta. For concepts, you'll almost always want to train on vanilla SDXL, but for styles it can often make sense to train on a model that's closer to the style you're going for. If you want an exact style, just train your LoRA on the model you're going to be using it with. If you're looking for something that will be usable with multiple checkpoints, there's a bit more to consider.
A good way to think of it is that a LoRA works kind of like a latent vector: applying that vector to different checkpoints changes their output in the same way. For example, if you want your style LoRA to thicken outlines (assuming your data has outlines), train it on a checkpoint that doesn't produce outlines at all, so that the LoRA ends up altering the checkpoint to add them. Then, when you apply that same LoRA to a checkpoint that already produces images with outlines, it will strengthen the outlines that are already there. If you want your LoRA not to affect outlines at all, train it on a checkpoint whose outlines are roughly the same as those in your data images. If you want your LoRA to exaggerate the anime style in general (assuming your data is in an anime style), train it on a checkpoint that isn't anime style or has only minimal anime influence... and so on. In general, the less similar your checkpoint's output and your training data are, the more those differences will be exaggerated in your final LoRA.
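Here's a toy numerical illustration of that framing, using the standard LoRA merge formula W + (alpha/rank) * strength * (B A). The shapes and values are arbitrary; the point is just that the same delta gets added no matter which base weights you start from, so what changes visually depends on how far that base already is from your training data:

```python
# Toy illustration of "a LoRA is a delta": the same low-rank update is added
# to whichever base checkpoint's weights you apply it to.
import torch

rank, dim, alpha, strength = 4, 16, 4, 1.0
down = torch.randn(rank, dim)          # LoRA "down" matrix (A)
up = torch.randn(dim, rank)            # LoRA "up" matrix (B)
delta = (alpha / rank) * strength * (up @ down)

base_a = torch.randn(dim, dim)         # stands in for checkpoint A's weight
base_b = torch.randn(dim, dim)         # stands in for checkpoint B's weight

merged_a = base_a + delta
merged_b = base_b + delta

# Prints True: the shift applied to both checkpoints is the same.
print(torch.allclose(merged_a - base_a, merged_b - base_b))
```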
Anyway, that's all for now! I'm always interested in learning from other people's training experiences, so if any of this stuff doesn't work for you (or works really well), please leave a comment and let me know!