How To Fine-Tune SDXL with kohya_ss and a 24GB GPU

This is more of an "advanced" tutorial, for those with 24GB GPUs who have already been there and done that with training LoRAs and so on, and now want to take things one step further.

This is not DreamBooth; as far as I know, DreamBooth is not available for SDXL. Instead, as the name suggests, the SDXL model itself is fine-tuned on a set of image-caption pairs, and the output is a full checkpoint. Fine-tuning can produce impressive models; the usual hierarchy of fidelity/model capability is: fine-tuned model > DreamBooth model > LoRA > Textual Inversion (embedding). The advantage is that fine-tunes will much more closely resemble your training data; the disadvantage is that you need to provide your own captions.

This whole thing did not work for me in OneTrainer, and it also seems that OneTrainer does not let you train both text encoders. But I may be wrong on both counts.

The bar to entry is high: you will need a Turing-or-newer Nvidia GPU with at least 20GB of VRAM (e.g. a 3090/3090 Ti/4090), or access to one. But the fact that we can fine-tune SDXL with both text encoders on consumer cards at all is still incredible; normally a server GPU like an A100 40GB is required.
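If you are not sure what your card reports, you can check from inside the kohya_ss venv with a few lines of Python (this assumes PyTorch with CUDA is installed there, which the kohya_ss setup does by default):

import torch

# Turing is compute capability 7.5; a 3090 (Ampere) reports 8.6, a 4090 (Ada) 8.9
major, minor = torch.cuda.get_device_capability(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 2**30
print(f"Compute capability {major}.{minor}, {vram_gb:.1f} GB VRAM")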

You should also have a bit of experience with the kohya_ss GUI, otherwise it may be difficult to follow this tutorial. However, I am going to upload an example config that can simply be loaded into the kohya_ss GUI.

First of all, we need to make sure that the "bitsandbytes" package is working. If you are on Linux, it's pretty simple: just let the setup install it for you, or check it yourself if you know Python. bitsandbytes tends to have more issues on Windows. To make sure it is working, create a text file next to the folder called "venv" (in your kohya_ss folder), and paste this into the txt file:

REM activate the kohya_ss virtual environment
call venv\scripts\activate
REM remove whatever bitsandbytes build is currently installed
call pip uninstall bitsandbytes
REM install a Windows-compatible build from jllllll's package index
call pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui

Save the txt file as a .bat file and run it; hit "Y" when prompted to do so.
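To double-check that the new build actually loads, you can run a quick import test from inside the activated venv. This is just a minimal sanity check, nothing kohya-specific:

import bitsandbytes as bnb

# if this prints a version without CUDA errors, the package is working
print("bitsandbytes", bnb.__version__, "imported OK")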

Now that bitsandbytes should be working, we need the SDXL base model with the fp16-fix VAE. You can download it here: https://huggingface.co/bdsqlsz/stable-diffusion-xl-base-1.0_fixvae_fp16/tree/main

Go to the "Finetune" tab in the GUI, and load this .json config:

https://files.catbox.moe/8jrwr9.json

As I said before, you need image-caption pairs: each caption .txt file needs to have the same name as the image it accompanies. To do this quickly, create a single txt file and keep duplicating it with CTRL+C & V until you have as many txt files as you have images. Now select all images and rename one of them to e.g. "x". You will see that your images are now named "x (1).png", "x (2).png", and so on. Do the same for the txt files, and your folder should then contain "x (1).png", "x (1).txt", "x (2).png", "x (2).txt", and so on. Now you can fill in the .txt files with the captions you desire; just enter what you think the prompt should be for that image.
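If you would rather not do the rename dance by hand, here is a minimal Python sketch that creates an empty caption .txt next to every image in a folder, whatever the images are called. The folder path and extension list are placeholders to adjust:

from pathlib import Path

folder = Path(r"C:\path\to\your\dataset")  # change this to your image folder

for img in sorted(folder.iterdir()):
    if img.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
        caption = img.with_suffix(".txt")
        if not caption.exists():
            caption.touch()  # creates an empty .txt to fill in with the caption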

With full fp16 training and the 8-bit Adam optimizer, we can get VRAM usage down to around 21-22 GB, just enough to fit onto an XX90 card. The learning rates provided in the config are just a suggestion, but you should know that fine-tuning usually needs lower learning rates and takes longer than LoRA training.
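For orientation, these are the kinds of settings doing the heavy lifting. The key names and values below are illustrative assumptions written out as a Python dict, not a dump of the linked config, so verify everything against what the GUI actually shows after loading the .json:

# illustrative only; confirm against the loaded config in the kohya_ss GUI
settings = {
    "mixed_precision": "fp16",
    "full_fp16": True,            # keep gradients in fp16 too, saving VRAM
    "optimizer": "AdamW8bit",     # 8-bit Adam from bitsandbytes
    "train_text_encoder": True,   # fine-tune both SDXL text encoders
    "learning_rate": 1e-6,        # far lower than typical LoRA rates (~1e-4)
}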
