Intro
Ok, so fine-tuning seems to be synonymous with training a LoRA these days, but that's not what fine-tuning formally is. This guide is for you if you just want to fine-tune Pony or SDXL the old-fashioned way. Why? Quality-wise it's better than a LoRA, and you can still extract a LoRA afterwards with the Supermerger extension by subtracting the base model from the fine-tuned model.
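If you'd rather do that extraction from the command line, the sd-scripts repo we install below also ships networks\extract_lora_from_models.py. This is a rough sketch only and the flag names are from memory, so run it with --help first and check whether your version needs an extra switch for SDXL models:
python .\networks\extract_lora_from_models.py --model_org ponyDiffusionV6XL.safetensors --model_tuned your_finetune.safetensors --save_to extracted_lora.safetensors --dim 64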
Requirements
You need a 24GB VRAM GPU, no doubt about it. When training using this guide, my machine uses about 20GB of VRAM and about 16GB of normal RAM. Pretty sure things could be more efficient, but hey, I'm not a computer detective. If your machine can't manage, just start up a RunPod.io instance; it isn't free, but it'll cost you maybe $5 to repeat all of these steps there. Don't wanna do that? Use the SDXL LoRA trainer on CivitAI for peanuts, merge the LoRA into the base model, and call it a day.
These instructions are for Windows but will work on Linux as well.
Install kohya-ss/sd-scripts
Copy-paste from the sd-scripts README, with my commentary added as COMMENT lines:
Install Python 3.10.6 and Git:
Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
COMMENT: Just use the Windows Store to install Python 3.10 if you don't have it already.
git: https://git-scm.com/download/win
COMMENT: You'll need git. It's a thing for downloading code, among other things.
Give unrestricted script access to powershell so venv can work:
Open an administrator powershell window
COMMENT: You'll be using Powershell for the whole thing, so maybe familiarise yourself with it. You'll be running commands in the thing like an elite hacker. PROTIP: to paste whatever is in your clipboard, such as the sample commands here, just RIGHT-CLICK.
Also yes, for this step specifically you need the ADMIN Powershell: just search for powershell in your Windows search bar, and in the context menu on the right you'll see "Run as administrator".
Type
Set-ExecutionPolicy Unrestricted
and answer A
COMMENT: You need to do this to "activate" a virtual environment using a script. By default Windows thinks you're an idiot, so it blocks that. This command lets you run arbitrary scripts like the elite hacker you are. Probably don't run random code though.
Close admin powershell window
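Side note from me: if you want to lock things down again once everything is installed, RemoteSigned is usually enough to let venv activation scripts run while still blocking unsigned downloaded scripts. In an admin Powershell:
Set-ExecutionPolicy RemoteSigned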
Windows Installation
Open a regular Powershell terminal and type the following inside:
COMMENT: This will create an sd-scripts folder with everything in it in your home directory. e.g. C:\Users\Advokat\sd-scripts
git clone https://github.com/kohya-ss/sd-scripts.git
cd sd-scripts
python -m venv venv
.\venv\Scripts\activate
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
pip install --upgrade -r requirements.txt
pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118
accelerate config
COMMENT: This absolutely didn't work for me, so this is what I did:
git clone https://github.com/kohya-ss/sd-scripts.git
cd sd-scripts
python -m venv venv
.\venv\Scripts\activate
So far the same. But I have CUDA 12 installed, so I used this command instead:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
For CUDA 11 use:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Don't know what CUDA you got? Use:
nvcc --version
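If nvcc isn't installed at all, nvidia-smi prints the CUDA version your driver supports in its header, which is what actually matters when picking a PyTorch wheel:
nvidia-smi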
Then run:
pip install --upgrade -r requirements.txt
Guess what: it probably won't install everything you need. If anything turns out to be missing later on, just pip install whatever it complains about. For me it was:
pip install timm
pip install fairscale
I didn't even bother installing xformers. Don't worry about it, it's way more painful than it should be.
Type
accelerate config
And answer:
- This machine
- No distributed training
- NO
- NO
- NO
- all
- fp16
If it craps out at the end because an accelerate config already exists in some random huggingface folder, just ignore it.
Congrats, you've now installed sd-scripts
If you ever close your powershell window and want to get back to this point open up a new one and type:
cd sd-scripts
.\venv\Scripts\activate
Preparing your images and folders
Ok, so for reference, in this example I have all of my images here:
C:\Users\Advokat\gel-dl\arts\ishikei
Along with the images, e.g. image1.png, I have image1.txt files which contain the image tags, e.g. "1girl, 1huge" etc.
How do you get those tags? Personally I strip them from Gelbooru etc. using a script. But most people use AUTOMATIC1111's web UI to generate tags automatically with the WD 1.4 Tagger extension.
To set it up follow these easy steps:
Open Web UI
Go to Extensions -> Available, click "Load from:"
Look for WD 1.4 Tagger in the list and click install.
Go back to the Installed tab and "Apply and restart UI" click you will
In the new "Tagger" tab on your webui go to "Batch from directory", paste the path with your images into the Input directory field and press the "Interrogate" button.
That's it.
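Either way, each image ends up with a matching .txt file containing a single line of comma-separated booru-style tags, something like this (made-up example):
1girl, solo, long hair, blue eyes, smile, school uniform, outdoors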
I've also made a directory here:
C:\Users\Advokat\Training\ishikei
This is where the trained model and the JSON metadata files for the images will go.
Make caption files
Let's get started. Here we will make some really dumb captions for the hell of it using the BLIP captioner. For reasons not clear to me, SDXL has two text encoders; one of them seems to be for tagged content a la Danbooru, the other for natural-language prompts. I think?
python .\finetune\make_captions.py C:\Users\Advokat\gel-dl\arts\ishikei\
Obviously replace the folder used with your own.
This will create .caption files matching the images in your image directory.
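Each .caption file is just a short natural-language description; a made-up example of the kind of thing BLIP spits out:
a woman with long hair standing in front of a building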
Now merge these captions to a metadata file:
python .\finetune\merge_captions_to_metadata.py C:\Users\Advokat\gel-dl\arts\ishikei C:\Users\Advokat\Training\ishikei\meta_cap.json
Again, do a search and replace to put your own directory in there instead of my training one.
Now merge the tags (1girl, 1huge, etc) with the captions:
python .\finetune\merge_dd_tags_to_metadata.py C:\Users\Advokat\gel-dl\arts\ishikei --in_json C:\Users\Advokat\Training\ishikei\meta_cap.json C:\Users\Advokat\Training\ishikei\meta_cap_dd.json
Make the captions less stupid:
python .\finetune\clean_captions_and_tags.py C:\Users\Advokat\Training\ishikei\meta_cap_dd.json C:\Users\Advokat\Training\ishikei\meta_cap_dd_clean.json
now... PREPARE THE LATENTS!
python .\finetune\prepare_buckets_latents.py C:\Users\Advokat\gel-dl\arts\ishikei C:\Users\Advokat\Training\ishikei\meta_cap_dd_clean.json C:\Users\Advokat\Training\ishikei\meta_lat.json C:\Users\Advokat\SD\sd.webui\webui\models\Stable-diffusion\ponyDiffusionV6XL.safetensors --batch_size 4 --max_resolution 1024,1024 --mixed_precision bf16
Note that:
C:\Users\Advokat\SD\sd.webui\webui\models\Stable-diffusion\ponyDiffusionV6XL.safetensors
is the path to my Pony.
This will generate a matching .npz file with the latents for each image, and meta_lat.json will be good to go for training. That file contains the image metadata we've constructed.
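If you're curious, each entry in that JSON is keyed by image and looks roughly like this. Treat it as a sketch, because the exact key format and field names can differ between sd-scripts versions:
"C:\\Users\\Advokat\\gel-dl\\arts\\ishikei\\image1.png": {
  "caption": "a woman with long hair standing in front of a building",
  "tags": "1girl, solo, long hair, blue eyes, smile",
  "train_resolution": [1024, 1024]
}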
Fine-tune command
Here's the command I use:
accelerate launch --num_cpu_threads_per_process 1 sdxl_train.py --pretrained_model_name_or_path=C:\Users\Advokat\SD\sd.webui\webui\models\Stable-diffusion\ponyDiffusionV6XL.safetensors --in_json C:\Users\Advokat\Training\ishikei\meta_lat.json --train_data_dir=C:\Users\Advokat\gel-dl\arts\ishikei --output_dir=C:\Users\Advokat\Training\ishikei --train_batch_size=1 --learning_rate=5e-6 --max_train_steps=5000 --gradient_checkpointing --mixed_precision=bf16 --save_every_n_steps=500 --save_model_as=safetensors --keep_tokens=255 --optimizer_type=adafactor --optimizer_args scale_parameter=False relative_step=False warmup_init=False --cache_latents --lr_warmup_steps=100 --max_grad_norm=0.0 --max_data_loader_n_workers=1 --persistent_data_loader_workers --full_bf16 --lr_scheduler=constant_with_warmup
Notes:
sdxl_train.py is the script for training SDXL models. For SD1.5 you'd use fine_tune.py
Pretrained model path is the path to your base model. In this case, Pony.
I use a train batch size of 1 otherwise my machine runs out of VRAM.
The learning rate is 5e-6, but it can also be 1e-5 or whatever; experiment, but those are good starting values.
Max train steps is the maximum number of steps you want to train. From my experience you want somewhere around 2x as many steps as you have images (e.g. ~500 images → ~1000 steps). This depends on the kind of images you have: if they're all from the same artist, 2x is enough, and past that the model will overfit and go bonkers. If there's variety in your artwork you can train for longer.
Because I'm paranoid I save a copy of the model every 500 steps. You can probably change that to something like 1000, or more steps.
The optimiser settings reduce the amount of VRAM needed for the fine-tune.
Add --train_text_encoder if you also want to train the text encoders. I usually don't, because historically I somehow end up breaking the text encoder and having to replace it with one that's not as broken anyway. I never said I knew what I was doing, ok?! Also, training the text encoders increases your VRAM usage.
Add --sample_sampler="ddim" --sample_prompts="yoursamplepromptfile.txt" --sample_every_n_steps=100 to generate samples during training, and put lines like "1girl, 1huge --w 1024 --h 1024 --l 8 --s 20" in the text file for a 1024x1024 image at CFG 8 and 20 steps (there's an example file after these notes).
Use --shuffle_caption to shuffle the tags if needed. If you want to keep a trigger word fixed at the front while shuffling, set --keep_tokens=1.
Use --save_state to also save the optimiser state so you can resume later with --resume="savestatefolder"
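For reference, the sample prompt file is just one prompt per line with those flags tacked on, something like this (made-up prompts; I believe --n adds a negative prompt, but check the sd-scripts docs):
1girl, 1huge, long hair, smile --w 1024 --h 1024 --l 8 --s 20
1girl, 1huge, outdoors --n lowres, bad anatomy --w 1024 --h 1024 --l 8 --s 20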
Hope that's helpful to someone.
-Advokat