UPDATE from 19.11.2023: I have finally published my new training guide and you can find it here: Create perfect 100mb SDXL models for all concepts using 48gb VRAM - with Vast.ai GPU rental guide | Civitai
This is a guide on how to train a good-quality SDXL 1.0 LoRA with good likeness, diversity and flexibility, using my tried and true settings, which I discovered by spending countless euros and hours on training over the past 10 months. I use the Kohya-GUI trainer by bmaltais for all my models and I always rent an RTX 4090 GPU on vast.ai.
I will explain how to:
Build the right dataset
Caption it well
Set up an RTX 4090 GPU and Kohya-GUI on vast.ai
Use my optimal training configuration
Evaluate your model
Just look at my most recently published LoRAs for proof of the effectiveness of my methods.
Still, this guide will not always give you the best possible outcome. Consider it a very good starting point and feel free to deviate from it and experiment.
I will also not explain what each parameter within Kohya-GUI does, because there are already countless explanations out there for that. Going over each parameter would take too much time.
Last but not least, please do not message me on Discord or elsewhere to ask for help with your training. I receive these messages almost daily and I just do not have time for that. I have a full-time job (40 hours a week) with a 1-hour commute each way. I want to use my remaining free time for model training and testing as well as gaming. This guide should explain most of what you need to know, and whatever it doesn't cover should already be explained by other guides.
First you need a good dataset. The dataset is the most important part of any model. Any issues with the dataset will trickle down to the final output.
I generally use Artstation, DeviantArt, Pinterest, Reddit, Twitter and Instagram to find images for my datasets.
You should make sure that each image is of good quality. I found that even 1 bad-quality image in a dataset of 20 images can heavily degrade the final output. In fact, I found this to apply to individual concepts within a bigger dataset as well. For example, say you have 300 images for an artstyle, split into 20 images of objects, 50 images of people, etc. Just 1 bad image among the 20 images of objects can then negatively affect the output for that part of the dataset, even though it is just 1 bad image out of 300.
I find that you need at least ~20 images per character, style or concept for the likeness to come through without overtraining. I have successfully trained with fewer images than that, but it doesn't always work. More images are better as long as they add something to the model. They can increase likeness (some concepts need more images to reach good likeness; the most extreme example I have had was a Makoto Shinkai-esque anime artstyle that I am currently training with 2300 images), add diversity (say, adding an image of a car in a particular artstyle when the dataset for that artstyle only contains people so far), or improve flexibility (like adding cosplay photos to a character dataset that so far only included screencaps from a show). I would generally advise against adding more images just because you can. That increases training time for no reason and can actually make the model worse by decreasing diversity, for example when you end up with 500 images of cars in a particular artstyle but only 20 images of people.
When training characters or concepts I recommend including as many different styles of them as possible, so for characters include cosplay photos, fanart, screencaps, etc… Without these it'll be hard to impossible to portray the character in different styles! The same goes for outfits. Including images of the character wearing non-standard outfits can help a lot with being able to prompt the character in different outfits later!
Also, specifically for characters, I recommend including as many different POVs as possible, e.g. facial closeups, medium-shots, and full-length bodyshots! If you use only facial closeups, for example, producing a full-length bodyshot will be very hard.
Furthermore, I found that for characters that are typically portrayed in a style that SDXL doesn’t already know well, like the Ghibli screencap artstyle for Nausicaä, including images of that style in the character dataset greatly increased the style likeness of the character. For example, I trained v2.0 of my Zeitgeist Nausicaä character model using 23 images of Nausicaä plus 24 images from my generic Ghibli artstyle dataset (a reduced subset of the full 131-image dataset).
Once you have assembled the dataset, there is no need to crop the images to a particular resolution, as old (or bad) tutorials recommend. Kohya and other trainers have long since implemented “bucketing”, a technique which sorts images with different aspect ratios into different buckets, each with its own resolution, and then trains them at that resolution. I won’t go into more detail here, because there is plenty of documentation online about this already.
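To make the bucketing idea above concrete, here is a minimal Python sketch. It is not Kohya's actual implementation: the function names, the 64-pixel step and the side-length limits are assumptions chosen purely for illustration.

```python
from collections import defaultdict

def make_buckets(max_area=1024 * 1024, step=64, min_side=512, max_side=2048):
    """Enumerate (width, height) buckets whose pixel count stays at or below max_area.

    All default values are illustrative assumptions, not Kohya's settings.
    """
    buckets = set()
    for width in range(min_side, max_side + 1, step):
        # Largest height (a multiple of step) that keeps the area within budget.
        height = (max_area // width) // step * step
        if min_side <= height <= max_side:
            buckets.add((width, height))
            buckets.add((height, width))  # the mirrored portrait/landscape bucket
    return sorted(buckets)

def assign_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's aspect ratio."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))

def group_images(image_sizes, buckets):
    """Map each bucket to the names of the images assigned to it."""
    groups = defaultdict(list)
    for name, (w, h) in image_sizes.items():
        groups[assign_bucket(w, h, buckets)].append(name)
    return dict(groups)
```

In this sketch a square 2000x2000 image lands in the 1024x1024 bucket, while a portrait image ends up in a taller, narrower bucket with roughly the same pixel count, so neither has to be cropped to a square beforehand.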
Since this tutorial is about training an SDXL-based model, you should make sure your training images are at least 1024x1024 in resolution (or the equivalent in a different aspect ratio), as that is the resolution SDXL was trained at (in different aspect ratios). Training at a lower resolution can massively worsen the model, so I don’t recommend it. It is fine to use a few lower-resolution images in your dataset if they are of good quality and you have plenty of normal-resolution images to go alongside them. I find that the upscaling script of Kohya works quite well.
EDIT: Here is an example showcasing the positive effect captions have on training vs. not using captions:
https://imgur.com/a/s4l6ZoP (image was too big to be posted here)
Once the dataset is finished, you need to caption it. You want to describe everything in the image that you:
Want to be able to specifically prompt for later
Do not want to be an inherent part of a generation unless specifically prompted for
Generally I describe the following things in all my training images:
POV (headshot, closeup, medium-shot, full-length, longshot)
style (screencap in ghibli artstyle, photo, digital artwork, etc…)
subject (a woman, a car, a house, etc…)
action (running, etc…)
outfits (wearing a dress, etc…)
hair (with long blonde hair in a ponytail hairstyle, etc…)
background (with a forest in the background, standing in front of a wall, etc…)
watermarks, text, low-quality, etc…
I caption characters, their outfits, and specific artstyles using rare tokens. If you want to know exactly what tokens are, google it; there are lots of explanations of them in regards to SD, but generally just think of them as words and letters. Rare in this case means a letter, word, or combination thereof that does not already have an associated meaning in SD. E.g. if you prompt car in SD, you will get images of cars. If you prompt painting, you will get paintings. A rare token will just spew out a random incoherent mess of an image. I use rare tokens because I found that using tokens with already established meanings in SD negatively affected my training. E.g. when I captioned my Nausicaä character LoRA with nausicaa, I was completely unable to portray her in a photographic style, because the nausicaa token inside SD is heavily overtrained on a cartoonish style.
So for example, I caption all artstyles as xxst wherein xx are just two letters (st for style basically), so my Ghibli artstyle was captioned as “screencap in ghst artstyle”. I caption all characters as xxpp (pp for person basically), their outfits as xxcc (cc for clothing), and hairstyles as xxhh (hair).
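The naming scheme above can be written down as a tiny helper. This is only a sketch: the function name and the dictionary are made up for illustration, and the code cannot tell you whether a token is actually rare in SD, which you still have to check by prompting it.

```python
# Suffixes follow the scheme described above; the two-letter prefix is picked per concept.
SUFFIXES = {"style": "st", "person": "pp", "clothing": "cc", "hair": "hh"}

def rare_token(prefix, kind):
    """Combine a two-letter prefix with the suffix for the given kind of concept."""
    if len(prefix) != 2 or not prefix.isalpha():
        raise ValueError("prefix must be exactly two letters")
    return prefix.lower() + SUFFIXES[kind]

# Rebuilds the Ghibli style caption fragment mentioned above.
caption = f"screencap in {rare_token('gh', 'style')} artstyle"
```

Keeping the suffixes in one place makes it harder to accidentally reuse the same token for a character and their outfit.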
EDIT: The position of the rare token inside your caption matters a lot. Putting the rare token "kncr" at the beginning of the caption, like "kncr, a woman with purple hair wearing futuristic armor", will mean that the token is associated with the entire image and will be diluted to the point of not working correctly. Putting it next to the outfit instead, like "a woman wearing kncr outfit" will instead associate that token with the entire look of the character, including the hair, and work much better. If you instead caption it as "a woman with purple hair wearing kncr outfit" it will associate the rare token with the outfit - minus the hairstyle. This will enable you to prompt the outfit without the hairstyle alongside it - if you can also provide some counter examples of the outfit without the hairstyle. Likewise, captioning it as "a woman with kncr hairstyle wearing futuristic armor" will have the opposite effect and associate the token with the hairstyle, not the outfit.
EDIT: I was partially wrong on the subject of rare tokens. I now recommend trying a rare token for your concept first, but if that doesn't work out (usually due to a lack of likeness), I advise trying a token with prior knowledge instead. See also this example graphic:
https://imgur.com/a/JbZYJus (image was too big to be posted here)
Lastly, you want to caption all irregularities within an outfit or person. Say you have an outfit that generally has no helmet, but in this shot the character is wearing one. In that case you simply caption it as “wearing xxcc outfit with a helmet”. Also, I do not assign tokens to generic one-time outfits, such as a piece of fanart of an anime character wearing a t-shirt.
So one of my Nausicaä captions is for example:
longshot screencap in ghst artstyle with forest background of ncpp with srhh hairstyle wearing bmcc outfit with hood with sword on her back holding shovel in hands with small fox-like creature sitting on her shoulder standing next to people
I use filenames as the captions, as I find them easier to edit using "Bulk File Rename" (a paid tool for Windows 10). But the Kohya trainer wants captions as .txt files, so I use a simple Python script to convert the filenames to .txt files (I had ChatGPT generate that script for me). You can find it here.
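For readers who cannot reach the linked script, the conversion can be sketched in a few lines of Python. This is not the script referred to above, just a minimal stand-in; the function name and the list of image extensions are assumptions.

```python
from pathlib import Path

# Assumed set of image file types; extend it if your dataset uses others.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}

def write_captions_from_filenames(folder):
    """Write '<image name>.txt' next to each image, containing the filename without extension."""
    written = []
    for image_path in sorted(Path(folder).iterdir()):
        if not image_path.is_file() or image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip subfolders, existing .txt files and anything that is not an image
        caption_path = image_path.with_suffix(".txt")
        caption_path.write_text(image_path.stem, encoding="utf-8")
        written.append(caption_path)
    return written
```

Calling `write_captions_from_filenames("dataset/3_")` (the folder path is just an example) would create one caption file per image in that folder. Note that it overwrites existing .txt files with the same name.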
EDIT: Here is a good example showcasing the importance of correct captioning:
vast.ai GPU Setup
I always rent GPUs on vast.ai for training (I am not being paid to advertise them). I find 4090s to offer the best training speed for the cost. They are typically around 0.5€/h at the time of writing this article (15.08.2023). 3090s are cheaper but slower, while A100s don't offer any real speed increase and are much more expensive and much less readily available.
The 24 GB of VRAM offered by a 4090 is enough to run this training config using my setup.
Here is a short description of the steps I take to set up the training:
Run the following console commands to install the Github repository I use for training and the required dependencies:
git clone -b dev2 https://github.com/bmaltais/kohya_ss
sudo apt-get update
sudo apt-get install -y libx11-6 libgl1 libc6
Create a folder called "pretrained" and upload the SDXL 1.0 model with the 0.9 VAE to it. I uploaded that model to my Dropbox and run the following command in a Jupyter cell to transfer it to the GPU instance (you may do the same):
filename = 'sd_xl_base_1.0_0.9vae.safetensors'
Once done, upload your training images to the "dataset" folder. Make sure to upload them to a subfolder within that folder, with the subfolder name being "x_", where x is the number of repeats as specified further down in the guide (so anywhere from 1-3).
If you used filenames for the captions as I do, upload my small python script to the same folder to convert the filenames into .txt file captions and run "python create_txt_from_images.py" as a console command. You can find the script here!
Then just run "./gui.sh --share --headless" to start the GUI! Then adjust the settings as necessary (with my training config almost nothing should need adjustment) and start training!
Last but not least, if you use my training config, place it inside "/workspace/kohya_ss/presets/lora/user_presets/" so that you can simply load it in the GUI for a faster setup. You need to do this before you start the GUI, however!
My training config
This training config was optimized to produce a model with a good balance of likeness, diversity, and flexibility within a 50 epoch range. Generally I find that my models are often already finished within the 10 to 25 epoch range. Most often it starts overtraining past that. Only very rarely do I have a concept that is so hard to train it needs more than that.
With this config I recommend saving as many epochs as you can (I generally save every single one) and then testing them to find the optimal model. I do not recommend just running it for 50 epochs and taking that version. That will likely be heavily overtrained.
With this config the repeats should be set to 3 and no class tokens should be set. Meaning, your folder name should always be “3_”. If you do not know what I am talking about, there are many tutorials out there on Kohya and folder names and class tokens.
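As a small illustration of the folder naming described above, here is a Python sketch that builds the "3_" subfolder path and reads the repeat count back out of a folder name. The helper names and the "dataset" root used in the example are assumptions for illustration only.

```python
from pathlib import Path

def repeat_folder(dataset_root, repeats=3, class_tokens=""):
    """Build the '<repeats>_<class tokens>' subfolder path for the training images.

    With no class tokens, as recommended above, this is simply '3_'.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return Path(dataset_root) / f"{repeats}_{class_tokens}"

def repeats_from_folder(folder):
    """Read the repeat count back out of a folder name such as '3_'."""
    return int(Path(folder).name.split("_", 1)[0])
```

To actually create the folder you could call `repeat_folder("dataset").mkdir(parents=True, exist_ok=True)` before uploading your images.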
Apart from the learning rate, this config should not need any adjustments at all; it's ready to go as is if you set your folder to 3 repeats. I find 3e-5 to be a good starting point for the learning rate, but occasionally I have concepts that do not train well, where I use a higher learning rate, usually 8e-5. Generally I find that characters can already attain good likeness with 3e-5, but that they do not have a lot of flexibility and I need to train at 8e-5 for that.
EDIT: After having trained many more LoRAs, created many tens of thousands more sample images, and delved more into using multiple LoRAs in a prompt, I came to realise that many of my LoRAs were more overtrained than I thought and that I could go with a much lower LR for many of them. Currently my optimal LR range using my config is anywhere from 3e-6 to 8e-5, highly dependent on the dataset in question.
Thus I must amend my initial post to make it clear that, to my current knowledge, this config is perfectly usable out of the box except for the LR. The optimal LR for this config can range anywhere from 3e-6 to 8e-5, highly dependent on the dataset, and you should test on your own what LR is optimal for you and your dataset. Sorry that there is no one-size-fits-all answer to this, it's just the nature of model training.
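If you want to test several learning rates within that range, one possible way to pick candidates is to space them evenly on a log scale. The log spacing and the default of five candidates are assumptions of this sketch, not part of the config; only the 3e-6 and 8e-5 endpoints come from the text above.

```python
import math

def lr_candidates(low=3e-6, high=8e-5, count=5):
    """Return `count` learning rates spaced evenly on a log scale from low to high."""
    if count < 2:
        raise ValueError("count must be at least 2")
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (count - 1)
    return [10 ** (log_low + i * step) for i in range(count)]

# Print the candidates in the scientific notation used for the learning rate.
for lr in lr_candidates():
    print(f"{lr:.1e}")
```

Log spacing is used here because the range spans more than an order of magnitude, so evenly spaced values would cluster almost all candidates near the upper end.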
But at least you do not have to worry about any of the other parameters for now, as I still find them to be optimal no matter the dataset. That may change as I become more experienced and knowledgeable.
I have tested each parameter many times in different combinations over a month-long period and found this combination to work best. Please do not ask me why I didn’t go for 64 Network rank/dim, or why I went with AdamW instead of Prodigy. I tested it, and this is what I found to work best. Going over each parameter and explaining why I chose it would take far too long.
Evaluating your model
I generally recommend prompting one or more of your captions 1 to 1. If it outputs images similar to your training images in pose, style, likeness, etc., it very likely indicates overtraining. In addition to that, I also recommend prompting for generic stuff like people, objects, landscapes, etc., and doing so in batches of 4.
Many people seem to misunderstand what undertraining looks like. E.g. when they are unable to portray the character in a different style, they assume the model is overtrained on the main style. However, this can actually also be a sign of undertraining: the unet has already learned the style likeness, but the text encoder has not yet associated the style with your style token, so the model just assumes the style to be part of your character, as most of the training images are associated with that style. I found that later epochs or trainings with higher learning rates can actually be more flexible in that case. This is of course only true up until the point where it starts overtraining.
Regarding overtraining, I find that it can also actually reduce likeness, as the model wildly jumps around between different points of the training instead of settling on a sweet spot.
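When you have saved many epochs, it can help to write the test matrix down before generating anything. The sketch below only builds a list of what to generate; it does not call any image generator. The function name, the checkpoint names in the example and the choice to use a batch size of 4 for every prompt are assumptions.

```python
from itertools import product

def build_eval_plan(checkpoints, caption_prompts, generic_prompts, batch_size=4):
    """List every (checkpoint, prompt) combination to generate during evaluation."""
    prompts = [(p, "training caption") for p in caption_prompts]
    prompts += [(p, "generic") for p in generic_prompts]
    plan = []
    for checkpoint, (prompt, kind) in product(checkpoints, prompts):
        plan.append({
            "checkpoint": checkpoint,  # one saved epoch of the model
            "prompt": prompt,
            "kind": kind,              # a 1-to-1 training caption or a generic prompt
            "batch_size": batch_size,
        })
    return plan

# Hypothetical example: two saved epochs, one training caption, two generic prompts.
plan = build_eval_plan(
    ["epoch-10", "epoch-11"],
    ["longshot screencap in ghst artstyle of ncpp"],
    ["photo of a man", "photo of a car"],
)
```

Running the same prompts against every saved epoch makes it easier to see at which epoch the training captions start reproducing the training images.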
Also, to add to finding the optimal epoch for your model: Some concepts can be more sensitive to epoch changes than others and may have one single epoch that is perfect and the one below is undertrained and the one above is overtrained. Hence: test a lot!
If you find a model to underperform, I recommend checking your dataset before changing the training settings. I found some concepts to be extremely sensitive to some images. For example, my Nausicaä LoRA would often give me a bad 3D render when prompting for photos or realistic digital art, despite the training image in question being correctly captioned and the model not being overtrained. As it turns out, for whatever reason, the model was just extremely sensitive to that one image, and removing it from my dataset removed those issues.
Last but not least, I find that certain concepts can just be hard to train. For example, I find that a generic anime artstyle such as Makoto Shinkai's is much, much harder to train than anything else I have done up to this point, despite a Ghibli artstyle being very easy to train. Likewise, I find that Nausicaä acts very weirdly in regards to cosplay photos, and that nailing down a good model that can portray her in a photographic artstyle with good outfit likeness is very hard.
What I do differently compared to other model creators
Why I don't train on 1.5
I find SDXL to always train much closer to my training images than 1.5. It also has a bunch of other improvements over 1.5. Hence I have abandoned 1.5 completely and stick to SDXL.
Why I don't train on finetunes
I find base SDXL to be perfectly adequate for training. I do not want the potential inherent biases or issues of another finetune to negatively impact my training.
Why I don't use adaptive optimizers
I have tested adaptive optimizers like Prodigy and DAdaptation a lot and my conclusion is that they simply do not work. When you look at their learning rate graphs in tensorboard, you will find that they simply climb endlessly to very high learning rates and either stay there with a constant scheduler or decrease along the scheduler's curve with a cosine scheduler. They don’t adapt in the sense of also going down in learning rate at some point. I found these super high learning rates to then drastically overtrain and ruin my models. I thus do not recommend these optimizers at all.
Why I train the text encoder
Too many people recommend not training the text encoder, but I heavily advise against that! I found that models trained without the text encoder do not follow prompts at all, and that likeness is reduced a lot unless you heavily overtrain the unet. SDXL was trained with a TE for a reason, and so you should train it too.
Why I don’t use regularization images
I found regularization images to never work. They do not improve my model training at all and only add training time. Furthermore, they were initially recommended to be used for a specific Dreambooth method with very few training images, but that is not the case here. In my opinion regularization images are a completely outdated training method that at best does not do anything but increase training time and at worst makes the output worse.
Loss is useless
I find that the loss graphs in tensorboard do not correlate with how training is going at all. Without equivalent validation happening at the same time (which does not happen), these graphs are just useless.
Steps vs. repeats vs. epochs
A ton of people calculate optimal training by steps and recommend others to do the same. However, I have not used steps for this since my earliest training days. Steps are simply not an accurate way of training. Say you train a model for 10 steps with 9 images. This means each image will be seen once, except for one image that is seen twice. So suddenly your training is no longer balanced. Also, I find that optimal step counts vary much more based on how many images you train compared to repeats and epochs. Obviously those vary too, but less so I find. Also, repeats and epochs are always for the whole dataset so the above issue doesn’t happen. Last but not least, steps are and usually should be an end result of your training images times repeats times epochs (ignoring batch size here), but a lot of people align their models to the step count instead of aligning the step count to their model.
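The arithmetic behind this can be sketched in Python. The first function derives the step count from images, repeats and epochs; the second shows how a fixed step count leaves the dataset unbalanced. Both function names are made up for illustration, and the second one assumes images are visited in a fixed order that wraps around, whereas real trainers shuffle (the imbalance itself does not depend on the order).

```python
import math
from collections import Counter

def total_steps(num_images, repeats, epochs, batch_size=1):
    """Steps implied by images * repeats * epochs, with each epoch rounded up to whole batches."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs

def times_seen(num_images, steps, batch_size=1):
    """How often each image is seen when training is cut off after a fixed step count.

    Assumes images are visited in order and the order wraps around.
    """
    seen = Counter()
    for i in range(steps * batch_size):
        seen[i % num_images] += 1
    return [seen[i] for i in range(num_images)]
```

With these assumptions, `times_seen(9, 10)` gives `[2, 1, 1, 1, 1, 1, 1, 1, 1]`, which is the unbalanced case described above, while `total_steps(20, 3, 10)` gives 600 steps for 20 images at 3 repeats over 10 epochs with a batch size of 1.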
Why I don’t use autocaptioners and autodownloaders
I manually select and download all my images, and manually caption them as well. This is an extremely time consuming process, but I find that it produces the best quality. Autocaptioners are either fast but bad and inaccurate, or they are good but take too long (usually because they still require manual input). Also, with my token method you need to do at least some manual captioning anyway.
As for autodownloaders, I find that it is better to use fewer but higher-quality images than more but lower-quality ones. Using autodownloaders usually results in the latter. You also lose control over which images end up in your training set.
Why I don’t use danbooru tags
I simply do not think that they are superior to normal language captions. SDXL was trained on the latter and since I do not use autodownloaders to download a bunch of images from Danbooru, my images do not come with pre-created danbooru tags anyway.
If this guide worked for you and you like what I am doing, I would greatly appreciate donations to my Ko-Fi! Testing so much and training so many models costs a lot of money.