Intro
Ok, so fine-tuning seems to be synonymous with training a LoRA these days, but that's not what fine-tuning formally is. This guide is for you if you just want to fine-tune Pony or SDXL the old-fashioned way. Why? Quality-wise it's better than a LoRA, and you can still extract a LoRA afterwards with the Supermerger extension by subtracting the base model from the fine-tuned model.
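If you'd rather do that extraction from the command line, the sd-scripts repo we install below also ships networks\extract_lora_from_models.py. This is a rough sketch only and the flag names are from memory, so run it with --help first and check whether your version needs an extra switch for SDXL models:
python .\networks\extract_lora_from_models.py --model_org ponyDiffusionV6XL.safetensors --model_tuned your_finetune.safetensors --save_to extracted_lora.safetensors --dim 64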
Requirements
You need a 24GB VRAM GPU, no doubt about it. When training using this guide, my machine uses about 20GB of VRAM and about 16GB of normal RAM. Pretty sure things could be more efficient, but hey, I'm not a computer detective. If your machine can't manage, just start up a RunPod.io instance; it isn't free, but it'll cost you maybe $5 to repeat all of these steps there. Don't wanna do that? Use the SDXL LoRA trainer on CivitAI for peanuts, merge the LoRA into the base model, and call it a day.
These instructions are for Windows but will work on Linux as well.
Install kohya-ss/sd-scripts
Copy-paste from the sd-scripts README, with my commentary added as COMMENT lines:
Install Python 3.10.6 and Git:
Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
COMMENT: Just use the Windows Store to install Python 3.10 if you don't have it already.
git: https://git-scm.com/download/win
COMMENT: You'll need git. It's a thing for downloading code, among other things.
Give unrestricted script access to powershell so venv can work:
Open an administrator powershell window
COMMENT: You'll be using Powershell for the whole thing, so maybe familiarise yourself with it. You'll be running commands in the thing like an elite hacker. PROTIP: to paste whatever is in your clipboard, such as the sample commands here, just RIGHT-CLICK.
Also yes, for this step specifically you need the ADMIN Powershell: just search for powershell in your Windows search bar, and in the context menu on the right you'll see "Run as administrator".
Type
Set-ExecutionPolicy Unrestricted
and answer A
COMMENT: You need to do this to "activate" a virtual environment using a script. By default Windows thinks you're an idiot, so it blocks that. This command lets you run arbitrary scripts like the elite hacker you are. Probably don't run random code though.
Close admin powershell window
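Side note from me: if you want to lock things down again once everything is installed, RemoteSigned is usually enough to let venv activation scripts run while still blocking unsigned downloaded scripts. In an admin Powershell:
Set-ExecutionPolicy RemoteSigned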
Windows Installation
Open a regular Powershell terminal and type the following inside:
COMMENT: This will create an sd-scripts folder with everything in it in your home directory. e.g. C:\Users\Advokat\sd-scripts
git clone https://github.com/kohya-ss/sd-scripts.git
cd sd-scripts
python -m venv venv
.\venv\Scripts\activate
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
pip install --upgrade -r requirements.txt
pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118
accelerate config
COMMENT: This absolutely didn't work for me, so this is what I did:
git clone https://github.com/kohya-ss/sd-scripts.git
cd sd-scripts
python -m venv venv
.\venv\Scripts\activate
So far the same. But I have CUDA 12 installed, so I used this command instead:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
For CUDA 11 use:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Don't know what CUDA you got? Use:
nvcc --version
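If nvcc isn't installed at all, nvidia-smi prints the CUDA version your driver supports in its header, which is what actually matters when picking a PyTorch wheel:
nvidia-smi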
Then run:
pip install --upgrade -r requirements.txt
Guess what: it probably won't install everything you need. If anything turns out to be missing later on, just pip install whatever it complains about. For me it was:
pip install timm
pip install fairscale
I didn't even bother installing xformers. Don't worry about it, it's way more painful than it should be.
Type
accelerate config
And answer:
- This machine
- No distributed training
- NO
- NO
- NO
- all
- fp16
If it craps out at the end because an accelerate config already exists in some random huggingface folder, just ignore it.
Congrats, you've now installed sd-scripts
If you ever close your powershell window and want to get back to this point open up a new one and type:
cd sd-scripts
.\venv\Scripts\activate
Preparing your images and folders
Ok, so for reference, in this example I have all of my images here:
C:\Users\Advokat\gel-dl\arts\ishikei
Along with the images, e.g. image1.png, I have image1.txt files which contain the image tags, e.g. "1girl, 1huge" etc.
How do you get those tags? Personally I strip them from Gelbooru etc. using a script. But most people use AUTOMATIC1111's web UI to generate tags automatically with the WD 1.4 Tagger extension.
To set it up follow these easy steps:
Open Web UI
Go to Extensions -> Available, click "Load from:"
Look for WD 1.4 Tagger in the list and click install.
Go back to the Installed tab and "Apply and restart UI" click you will
In the new "Tagger" tab on your webui go to "Batch from directory", paste the path with your images into the Input directory field and press the "Interrogate" button.
That's it.
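Either way, each image ends up with a matching .txt file containing a single line of comma-separated booru-style tags, something like this (made-up example):
1girl, solo, long hair, blue eyes, smile, school uniform, outdoors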
I've also made a directory here:
C:\Users\Advokat\Training\ishikei
This is where the trained model and the JSON metadata files for the images will go.
Make caption files
Let's get started. Here we will make some really dumb captions for the hell of it using the BLIP captioner. For reasons not clear to me, SDXL has two text encoders; one of them seems to be for tagged content a la Danbooru, the other for natural-language prompts. I think?
python .\finetune\make_captions.py C:\Users\Advokat\gel-dl\arts\ishikei\
Obviously replace the folder used with your own.
This will create .caption files matching the images in your image directory.
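Each .caption file is just a short natural-language description; a made-up example of the kind of thing BLIP spits out:
a woman with long hair standing in front of a building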
Now merge these captions to a metadata file:
python .\finetune\merge_captions_to_metadata.py C:\Users\Advokat\gel-dl\arts\ishikei C:\Users\Advokat\Training\ishikei\meta_cap.json
Again, do a search and replace to put your own directory in there instead of my training one.
Now merge the tags (1girl, 1huge, etc) with the captions:
python .\finetune\merge_dd_tags_to_metadata.py C:\Users\Advokat\gel-dl\arts\ishikei --in_json C:\Users\Advokat\Training\ishikei\meta_cap.json C:\Users\Advokat\Training\ishikei\meta_cap_dd.json
Make the captions less stupid:
python .\finetune\clean_captions_and_tags.py C:\Users\Advokat\Training\ishikei\meta_cap_dd.json C:\Users\Advokat\Training\ishikei\meta_cap_dd_clean.json
now... PREPARE THE LATENTS!
python .\finetune\prepare_buckets_latents.py C:\Users\Advokat\gel-dl\arts\ishikei C:\Users\Advokat\Training\ishikei\meta_cap_dd_clean.json C:\Users\Advokat\Training\ishikei\meta_lat.json C:\Users\Advokat\SD\sd.webui\webui\models\Stable-diffusion\ponyDiffusionV6XL.safetensors --batch_size 4 --max_resolution 1024,1024 --mixed_precision bf16
Note that:
C:\Users\Advokat\SD\sd.webui\webui\models\Stable-diffusion\ponyDiffusionV6XL.safetensors
is the path to my Pony.
This will generate a matching .npz file with the latents for each image, and meta_lat.json will be good to go for training. That file contains the image metadata we've constructed.
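If you're curious, each entry in that JSON is keyed by image and looks roughly like this. Treat it as a sketch, because the exact key format and field names can differ between sd-scripts versions:
"C:\\Users\\Advokat\\gel-dl\\arts\\ishikei\\image1.png": {
  "caption": "a woman with long hair standing in front of a building",
  "tags": "1girl, solo, long hair, blue eyes, smile",
  "train_resolution": [1024, 1024]
}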
Fine-tune command
Here's the command I use:
accelerate launch --num_cpu_threads_per_process 1 sdxl_train.py --pretrained_model_name_or_path=C:\Users\Advokat\SD\sd.webui\webui\models\Stable-diffusion\ponyDiffusionV6XL.safetensors --in_json C:\Users\Advokat\Training\ishikei\meta_lat.json --train_data_dir=C:\Users\Advokat\gel-dl\arts\ishikei --output_dir=C:\Users\Advokat\Training\ishikei --train_batch_size=1 --learning_rate=5e-6 --max_train_steps=5000 --gradient_checkpointing --mixed_precision=bf16 --save_every_n_steps=500 --save_model_as=safetensors --keep_tokens=255 --optimizer_type=adafactor --optimizer_args scale_parameter=False relative_step=False warmup_init=False --cache_latents --lr_warmup_steps=100 --max_grad_norm=0.0 --max_data_loader_n_workers=1 --persistent_data_loader_workers --full_bf16 --lr_scheduler=constant_with_warmup
Notes:
sdxl_train.py is the script for training SDXL models. For SD1.5 you'd use fine_tune.py
Pretrained model path is the path to your base model. In this case, Pony.
I use a train batch size of 1 otherwise my machine runs out of VRAM.
The learning rate is 5e-6, but it can also be 1e-5 or whatever; experiment, but those are good starting values.
Max train steps is the maximum number of steps you want to train. From my experience you want somewhere around 2x as many steps as you have images (e.g. ~500 images → ~1000 steps). This depends on the kind of images you have: if they're all from the same artist, 2x is enough, and past that the model will overfit and go bonkers. If there's variety in your artwork you can train for longer.
Because I'm paranoid I save a copy of the model every 500 steps. You can probably change that to something like 1000, or more steps.
The optimiser settings reduce the amount of VRAM needed for the fine-tune.
Add --train_text_encoder if you also want to train the text encoders. I usually don't, because historically I somehow end up breaking the text encoder and having to replace it with one that's not as broken anyway. I never said I knew what I was doing, ok?! Also, training the text encoders increases your VRAM usage.
Add --sample_sampler="ddim" --sample_prompts="yoursamplepromptfile.txt" --sample_every_n_steps=100 to generate samples during training, and put lines like "1girl, 1huge --w 1024 --h 1024 --l 8 --s 20" in the text file for a 1024x1024 image at CFG 8 and 20 steps (there's an example file after these notes).
Use --shuffle_caption to shuffle the tags if needed. If you want to keep a trigger word fixed at the front while shuffling, set --keep_tokens=1.
Use --save_state to also save the optimiser state so you can resume later with --resume="savestatefolder"
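For reference, the sample prompt file is just one prompt per line with those flags tacked on, something like this (made-up prompts; I believe --n adds a negative prompt, but check the sd-scripts docs):
1girl, 1huge, long hair, smile --w 1024 --h 1024 --l 8 --s 20
1girl, 1huge, outdoors --n lowres, bad anatomy --w 1024 --h 1024 --l 8 --s 20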
Hope that's helpful to someone.
-Advokat