Photorealistic SDXL LoRA tips after creating 10+ models:
SDXL has been around for a bit over six months, but many users are reluctant to learn the intricacies of the new model through their own trial and error. Compounding this difficulty is a lack of information on how it differs from training a model for SD1.5. These are some of my personal insights gained from developing and releasing 10 SDXL LoRA models after learning on SD1.5.
SDXL LoRAs absolutely have a higher quality ceiling, with much better text rendering and prompt adherence than SD1.5, but they are also less forgiving to create. Creating an SDXL LoRA requires careful attention to detail in order to obtain a high-quality result.
1. Forget BLIP, ViT-g-14-laion2B-s34B-b88K interrogator is my new best friend.
Robust, accurate tagging is essential for SDXL models; tagging mistakes are punished harder than in SD1.5. For best results when autotagging SDXL datasets, use the same interrogator model that was used to develop SDXL 1.0: the ViT-g-14-laion2B-s34B-b88K model.
- Install the interrogator extension for A1111: https://github.com/pharmapsychotic/clip-interrogator-ext
- Before you run the captioner you MUST* go to settings -> actions and click "unload SD Checkpoint to RAM", or you won't have enough VRAM to run the captioner without swapping and running at a terribly slow speed.
*Applies to systems with 24GB of VRAM or less.
- Select the "ViT-g-14-laion2B-s34B-b88K" model and batch-caption your images. Use "fast" captioning; "best" is not worth the time penalty.
- The caption quality is often good enough to require no manual modification, though a cursory glance at the generated captions is still recommended.
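The captioning can also be scripted outside the UI with the same clip-interrogator library the extension wraps. A minimal sketch, assuming a pip-installed clip-interrogator and a flat folder of PNGs (the folder name and the exact model string are placeholders; check `clip_interrogator` docs for the model list on your version):

```python
# Batch-caption sketch with the clip-interrogator library (same backend as
# the A1111 extension). "dataset" is a placeholder folder name.
from pathlib import Path
from PIL import Image
from clip_interrogator import Config, Interrogator

# Load the same OpenCLIP model SDXL 1.0 was trained against
# (open_clip's tag for it is lowercase: laion2b_s34b_b88k).
ci = Interrogator(Config(clip_model_name="ViT-g-14/laion2b_s34b_b88k"))

for img_path in sorted(Path("dataset").glob("*.png")):
    image = Image.open(img_path).convert("RGB")
    caption = ci.interrogate_fast(image)  # "fast" mode, as recommended above
    # Write a sidecar .txt with the same stem, the layout most trainers expect.
    img_path.with_suffix(".txt").write_text(caption)
```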
2. Larger datasets create better models
Remember that SDXL is trained on 1024x1024 images, so your training data should be higher resolution than for SD1.5! Upscaling your old SD1.5 data (or using it outright) can work, but it will lose significant quality compared to using images that are at least 1024x1024. Scale your images down so that they are 1024px in the smallest dimension: a portrait picture will often end up around 1024x1536px, and a landscape picture around 1536x1024px. The Microsoft PowerToys application on Windows is useful for defining resize presets available from the right-click context menu.
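If you'd rather script the resizing than use PowerToys, here is a minimal Pillow sketch (folder names are placeholders) that scales each image so its short edge is 1024px:

```python
# Minimal resize sketch using Pillow; folder names are placeholders.
from pathlib import Path
from PIL import Image

SHORT_EDGE = 1024  # SDXL's base resolution on the smallest dimension
src, dst = Path("raw_images"), Path("dataset")
dst.mkdir(exist_ok=True)

for img_path in src.glob("*.jpg"):
    img = Image.open(img_path)
    scale = SHORT_EDGE / min(img.size)
    if scale < 1:  # only downscale; upscaling past native resolution loses quality
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    img.save(dst / img_path.name)
```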
Take care to curate a dataset that is evenly distributed in terms of subject matter. While it is possible to create a LoRA with very few images, the end model will not be very flexible or accurate; it is likely to have "leaky" weights where undesirable aspects of the training images are combined to regularly produce artifacts unrelated to the prompt in the final output. I recommend using at least 30 diverse images, and prefer to gather at least 50 images.
3. Learning rates and network dimensions vary by LoRA
As high as 0.0012 for poses/concepts and as low as 0.000002 (2e-6) for subject likeness. Make sure to also set the TE (text encoder) and UNet learning rates to the same value as the chosen LR. The most consistent, high-quality results come from the constant scheduler with the AdamW8bit optimizer.
Notes on settings: use higher network rank (dimension) settings for likeness (256, 128, or 64) and lower ones for style/concept (16 or 32). Only set network rank to powers of 2, and keep network alpha at 1.
General Learning rates:
Pose/Concept: 0.0012 - train for fewer steps (<2000)
Likeness: 0.000002 (2e-6) - train for more steps (>1500)
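As a hedged sketch of how these settings map onto a command line, here is what a likeness-style run might look like with kohya-ss sd-scripts (the model path, data dir, and step count are placeholders; double-check the flag names against your trainer's docs):

```
accelerate launch sdxl_train_network.py \
  --pretrained_model_name_or_path="sd_xl_base_1.0.safetensors" \
  --train_data_dir="dataset" \
  --network_module=networks.lora \
  --network_dim=128 --network_alpha=1 \
  --learning_rate=2e-6 --unet_lr=2e-6 --text_encoder_lr=2e-6 \
  --optimizer_type=AdamW8bit --lr_scheduler=constant \
  --resolution=1024 --max_train_steps=2500
```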
You can also inspect the settings that were used to create any LoRA with the LoRA Inspector tool. This can be helpful to find what settings were used to create a particular type of LoRA, like a pose or concept.
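If you'd rather check a file yourself without the Inspector UI, kohya-trained LoRAs embed their settings in the .safetensors header, which is what LoRA Inspector surfaces. A small sketch; the filename is a placeholder and the "ss_" key names assume a kohya-trained file:

```python
# Sketch: read the training metadata embedded in a LoRA's .safetensors header.
from safetensors import safe_open

with safe_open("some_lora.safetensors", framework="pt") as f:
    meta = f.metadata() or {}

# kohya-ss prefixes its training settings with "ss_" (assumed key names).
for key in ("ss_learning_rate", "ss_unet_lr", "ss_text_encoder_lr",
            "ss_network_dim", "ss_network_alpha", "ss_lr_scheduler"):
    print(key, "=", meta.get(key, "<missing>"))
```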
4. Regularization images are really helpful for training an accurate likeness
And much less necessary for other categories. To get started with regularization images, the free FFHQ dataset is recommended. Clone the GitHub repo, then download the dataset using the download_ffhq.py script. It is not necessary to download the entire dataset; 10k images is likely sufficient. Downloading the full dataset would probably take multiple days due to bandwidth limitations anyway.
To download 1024x1024 regularization images of human subjects, type in a terminal:
git clone https://github.com/NVlabs/ffhq-dataset
cd ffhq-dataset
python download_ffhq.py -i
Finally, caption the FFHQ dataset with the interrogator extension from step 1, folder by folder.
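Once downloaded (the script saves the 1024x1024 PNGs under images1024x1024/, if I recall its layout correctly), a short sketch like the following can pull a random 10k subset into a regularization folder. The "reg/1_woman" repeats_class folder name follows the kohya convention and is only an example, so adjust it to your trainer:

```python
# Sketch: copy a random 10k subset of the downloaded FFHQ images into a
# regularization folder ("reg/1_woman" is an example repeats_class name).
import random
import shutil
from pathlib import Path

random.seed(0)  # reproducible subset
src = Path("ffhq-dataset/images1024x1024")
dst = Path("reg/1_woman")
dst.mkdir(parents=True, exist_ok=True)

all_images = sorted(src.rglob("*.png"))
for img in random.sample(all_images, k=min(10_000, len(all_images))):
    shutil.copy2(img, dst / img.name)
```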
5. Use multiple sample prompts when training
This will save you a massive amount of time generating test images. Copy the exact caption data from 2-4 of the training images to use as sample prompts. Use the same seed, CFG, and resolution for each sample image, and do not set any negative prompt. This is a good way to tell when a model is becoming overtrained, and also when the caption data needs cleaning up. A sample image should roughly reproduce the major aspects of its training image when prompted with that image's exact caption, without bleeding in the "unique" aspects of other training images that are not in the sample prompt. 768x1024 is the highest sample resolution recommended; higher resolutions are likely to run into swap space and slow training dramatically.
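In kohya-ss sd-scripts, for example, sampling is driven by a text file passed as --sample_prompts (together with something like --sample_every_n_steps=250). Each line is one prompt plus the inline options the sampler understands; as far as I know those are --w/--h (resolution), --d (seed), --l (CFG scale), and --s (steps). The captions below are placeholders ("ohwx" is a stand-in trigger token); use your real training captions verbatim:

```
photo of ohwx woman, smiling, red jacket, city street --w 768 --h 1024 --d 42 --l 7 --s 28
photo of ohwx woman, studio portrait, black background --w 768 --h 1024 --d 42 --l 7 --s 28
```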
6. Don't be afraid to start over.
If the samples are bad, delete all the epochs from the training run and try again with new settings. Keep the samples from the previous run for future reference; they will help you refine the best settings for the LoRA. It's better to start training again right away than to try to "prompt your way out" of a poorly trained model.
Overall, SDXL LoRAs are more difficult to create than SD1.5 LoRAs, but the end results are worth it. Porting a LoRA from SDXL to SD1.5 is also very easy: simply downscale the dimensions of the training data by 50% and train with your preferred SD1.5 LoRA settings. The end result will not be as good as the SDXL LoRA, but it is significantly easier than migrating a LoRA from SD1.5 to SDXL.