
SDXL model fine-tuning - Reddit post backup


This is a backup of the tutorial available here (to be clear, I am not the author, it's just a "backup" :D) in case it disappears from Reddit. Most training tutorials are about creating LoRAs; this one is fairly complete and covers fine-tuning a full model checkpoint.

More than the exact parameters, the commands and the overall logic are what I wanted to save :D

Intro

It feels like people who fine-tune models don't want to share their process. About a week ago I saw someone on CivitAI asking the creator of an SDXL fine-tune about their process, and the person was hilariously trying to dodge the question (they ended up saying that they couldn't share their settings because they don't speak English... an obvious dodge, since you can easily copy and paste the command used to start the training process).

If you search YouTube for "fine tuning stable diffusion", every single video is actually about LoRA training.

u/CeFurkan has said that he will eventually create a video on fine-tuning, so that should be good. But in the meantime, this is an attempt to help people actually run the fine-tuning script in Kohya_ss.

I can't provide any advice on things like a good learning rate or a reasonable ratio of images to steps. All I can do is regurgitate what I learned from reading the fine-tuning README in the Kohya_ss repo.

THIS TUTORIAL ASSUMES YOU ARE ON WINDOWS

Folder setup

Folder setup is not as strict as it is for LoRA training in Kohya_ss. You don't need any special naming pattern for the images folder, and where your training images and your model are stored doesn't really matter.

For convenience, I'll stick close to the LoRA setup though. So we'll put everything in a folder called fine_tune_job, and inside that folder we'll put another folder to hold our images. These are the images with which we want to fine-tune a model. We'll call the folder images_for_finetuning.

So let's say you want to fine-tune a model on a bunch of images you have of dogs wearing glasses. You'll put these images in images_for_finetuning. The contents of the folder:

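(The original screenshot is missing from this backup; a sketch with hypothetical filenames:)

D:\fine_tune_job
└── images_for_finetuning
    ├── image-1.png
    ├── image-2.png
    └── image-3.png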

Captioning and Tagging

With fine-tuning you have the option of using both captions and tags during training. I don't know whether there is an advantage to using both rather than just one or the other, but for our purposes we will use both.

For captions, we'll use BLIP2, and for tags we'll use WD14 captioning (it should be called "tagging", but Kohya_ss calls it "captioning"). After doing that, our images_for_finetuning folder now looks like this:

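(Again the screenshot is lost; roughly, with the same hypothetical filenames:)

D:\fine_tune_job
└── images_for_finetuning
    ├── image-1.png
    ├── image-1.caption
    ├── image-1.txt
    ├── image-2.png
    ├── image-2.caption
    ├── image-2.txt
    └── ...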

Each image has a .caption file and a .txt file: the caption (a single sentence) from BLIP2 goes in the former and the WD14 tags go in the latter. (Don't be confused by my file icons; that's just because I use Sublime Text as the default application for these file types. And don't be thrown by the .caption extension either. A lot of file extensions are really just facades for plain UTF-8 text, and you can open them with any text editor.)
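
To make that concrete, here's a hypothetical pair of files for image-1.png (the actual contents depend entirely on what BLIP2 and WD14 output for your images):

image-1.caption:
    a dog wearing a red and white striped shirt

image-1.txt:
    dog, glasses, no humans, indoors, sitting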

Creating the Metadata file: Adding Captions

I'm guessing this is often the part that trips people up. What's a metadata file and how do I make it? It's just a JSON file that (in its first stage) stores the filename of each image as a key and the caption and tags as values.

To create one, let's assume you have your terminal open in the root directory of Kohya_ss and have your virtual environment activated. To create the metadata file (in its first stage), you run the script named merge_captions_to_metadata.py located in the finetune subdirectory.

So if your terminal's current working directory is the root directory of Kohya_ss, you would run python .\finetune\merge_captions_to_metadata.py, BUT just running the script by itself won't work. You need to pass TWO arguments to it.

The first argument is the directory where your images are stored. In this case, our images are in the directory named images_for_finetuning, which is inside the directory called fine_tune_job; let's also assume it's on the D drive. So the full path would be: D:\fine_tune_job\images_for_finetuning.

The second argument is the path (including filename) where the metadata file should be saved. So if we want to create and save the file inside our fine_tune_job folder and name it meta_cap.json, then the argument would be this: D:\fine_tune_job\meta_cap.json.

Thus, your full command for creating the metadata file would be this:

python .\finetune\merge_captions_to_metadata.py D:\fine_tune_job\images_for_finetuning D:\fine_tune_job\meta_cap.json

After running this command, our fine_tune_job directory looks like this:

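(Screenshot lost; i.e. something like:)

D:\fine_tune_job
├── images_for_finetuning\
└── meta_cap.json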

And here is what our meta_cap.json file looks like:

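The screenshot is lost, but based on the description it's structured roughly like this, with each image's filename (minus extension) as a key (the captions here are hypothetical, apart from the striped-shirt one discussed below):

{
  "image-1": {
    "caption": "a dog wearing a red and white striped shirt"
  },
  "image-2": {
    "caption": "a dog wearing glasses lying on a couch"
  }
}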

Notice that it doesn't have our tags yet. (That will be the next stage.) Also notice that the caption for image-1 is not accurate: the dog isn't wearing a red and white striped shirt. It's easiest, imo, to correct captions here, but keep in mind that any changes you make to the captions in the metadata file won't be reflected in the original .caption files. So the next time you create a metadata file using the command above, the caption will be wrong again. You may want to use the JSON file as a quick way to check the captions and then write a script that uses the JSON file to re-write your caption files. Or just fix your .caption files and re-run the command that creates the metadata file.
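
Here's a minimal sketch of such a script, assuming the default behavior where the metadata keys are the image filenames without extension (if you used --full_path, the keys are full paths, which Path().stem handles too):

import json
from pathlib import Path

images_dir = Path(r"D:\fine_tune_job\images_for_finetuning")
meta_path = Path(r"D:\fine_tune_job\meta_cap.json")

# Load the (possibly hand-corrected) metadata file
metadata = json.loads(meta_path.read_text(encoding="utf-8"))

for image_key, entry in metadata.items():
    # Rewrite the .caption file next to the image, so future runs of
    # merge_captions_to_metadata.py pick up the corrected caption
    caption_file = images_dir / (Path(image_key).stem + ".caption")
    caption_file.write_text(entry["caption"], encoding="utf-8")
    print(f"wrote {caption_file}")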

Creating the Metadata file: Adding Tags

I won't go into as much detail here as above, since it's basically the same thing but with a different script (merge_dd_tags_to_metadata.py) that looks for the tag files. The command in our example case would be this:

python .\finetune\merge_dd_tags_to_metadata.py D:\fine_tune_job\images_for_finetuning --in_json D:\fine_tune_job\meta_cap.json D:\fine_tune_job\meta_cap.json

With the --in_json flag we tell it where to find the metadata from the previous stage, and notice that the final argument gives the output the same name as in the previous stage, so the file is overwritten. If you don't want it overwritten for whatever reason, just choose a different name. With that done, our directory should look exactly the same as in the previous step, but the meta_cap.json file should now look like this:

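Again the screenshot is lost, but each entry should now carry a tags field alongside the caption (the values here are illustrative):

{
  "image-1": {
    "caption": "a dog wearing a red and white striped shirt",
    "tags": "dog, glasses, no humans, indoors, sitting"
  },
  "image-2": {
    "caption": "a dog wearing glasses lying on a couch",
    "tags": "dog, glasses, no humans, couch, lying down"
  }
}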

[Optional] Cleaning the Metadata file

(I mentioned above that using both tags and captions is optional, so technically one or the other of those steps could be marked optional as well.)

This is a Kohya_ss script that unifies certain terms in the captions. Maybe it does other things too, I don't know. But it will change phrases like "a young man" or "an old woman" into "a boy" and "a girl" respectively. That may or may not be useful to you, and it definitely isn't in our case. Here's how you would run that script though:

python .\finetune\clean_captions_and_tags.py D:\fine_tune_job\meta_cap.json D:\fine_tune_job\meta_cap_cleaned.json

Notice that in the second argument, we create a new metadata file instead of overwriting the old one. That's more useful if you want to compare the changes. Also notice that it operates on the metadata file, not on the original captions.

Creating the Metadata file: Creating Latents, Adding Dimensions

This final stage creates a latent for each image, saved in an .npz file (which is just a numpy file), and saves the dimensions of each image to a new metadata file. I'll provide the full command first and then point out some of the arguments:

python .\finetune\prepare_buckets_latents.py D:\fine_tune_job\images_for_finetuning D:\fine_tune_job\meta_cap.json D:\fine_tune_job\meta_lat.json <path to some model> --batch_size 4 --max_resolution 1024,1024 --mixed_precision bf16

So we are running the script prepare_buckets_latents.py. The first argument is the directory where it can find the images, and the second argument points to our existing metadata file. The third argument creates a new metadata file called meta_lat.json (I think you could just overwrite meta_cap.json if you wanted). The fourth argument here is a placeholder: it should point to a Stable Diffusion checkpoint. So, if you have Realistic Vision saved in your Auto1111 directory on your D drive, then the fourth argument would actually be this: D:\stable-diffusion-webui\models\Stable-diffusion\realisticVisionV40_v40VAE.safetensors.

Finally, we set a few flags that should be self-explanatory. I'm not sure what all the available flags are here.
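
If you're curious, it's a standard argparse script, so (assuming the same working directory as before) you can list all of its flags with:

python .\finetune\prepare_buckets_latents.py --help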

After running this command, our images_for_finetuning directory will look like this:

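(Screenshot lost; roughly, each image now has a matching .npz latent file:)

D:\fine_tune_job
└── images_for_finetuning
    ├── image-1.png
    ├── image-1.caption
    ├── image-1.txt
    ├── image-1.npz
    └── ...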

And our meta_lat.json file will look like this:

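One more lost screenshot; as far as I can tell, each entry should now also record the bucket resolution the image was assigned to, roughly like this (values are illustrative):

{
  "image-1": {
    "caption": "a dog wearing a red and white striped shirt",
    "tags": "dog, glasses, no humans, indoors, sitting",
    "train_resolution": [1024, 1024]
  }
}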

Run Training

The only thing left now is to run the training. As I mentioned at the start, I can't provide advice regarding the parameters; it would be great if others could chime in with what they've had success with. This command just uses the defaults from the Kohya_ss readme and substitutes in the assumptions of our example case:

accelerate launch --num_cpu_threads_per_process 8 fine_tune.py --pretrained_model_name_or_path=D:\stable-diffusion-webui\models\Stable-diffusion\realisticVisionV40_v40VAE.safetensors --in_json D:\fine_tune_job\meta_lat.json --train_data_dir=D:\fine_tune_job\images_for_finetuning --output_dir=D:\fine_tune_job --shuffle_caption --train_batch_size=2 --learning_rate=5e-6 --max_train_steps=10000 --use_8bit_adam --xformers --gradient_checkpointing --mixed_precision=bf16 --save_every_n_epochs=4