Train a realistic character LoRA
This guide describes how to create a realistic, high-quality character LoRA (Low-Rank Adaptation) for Stable Diffusion XL (SDXL). The character model in this document is named "myLoRA" (replace this name with your own). It is assumed that SDXL has no concept of this realistic character.
Note: a character LoRA cannot be trained on tags that are unseen (unknown) to SDXL. Teaching SDXL new tags requires a separate LoRA, which is out of scope for this guide.
Prerequisites
This guide assumes Kohya's GUI and Stable Diffusion WebUI Forge are already up and running correctly. Both interfaces have a steep learning curve, and working knowledge of them is required before using this guide. In practice, this means a few LoRAs have already been trained and tested. An understanding of colors, pixels, images, and other related terms must also be in the toolbox.
Requirements
A computer with a decent GPU and enough memory to hold at least one image per training step is required. In practice, this means a modern Nvidia GPU (12 GB of VRAM or more) or an Apple M2 (64 GB of unified memory or more).
The following software is also required:
GIMP
ImageMagick
Kohya's GUI
Stable Diffusion WebUI Forge
A text editor
A file manager
A custom SDXL checkpoint (optional)
Rembg (optional)
Collect the images
Ensure the proper permissions to use pictures (images) of the character are in place. Switch to PNG format as early as possible, since some editing of the images is required; ImageMagick can perform batch operations (such as converting to PNG) on entire directories, as shown below. Remember while collecting the dataset that garbage in is garbage out. Keep an eye on both the details worth capturing in the LoRA and the general aspects of the character.
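For example, a whole directory of JPEG files can be converted to PNG in one ImageMagick pass (the source extension is an assumption; mogrify writes the PNG files next to the originals):

mogrify -format png *.jpg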
A few rules apply during collecting:
Quality over quantity
Use high-resolution images
Avoid images with blur or movement
Avoid images with saturation or color issues
Avoid images with perspective distortion
Include full-body, 3/4, 1/2, and close-ups
Diversity (facial expressions, poses, and camera angles) helps
The face is crucial to capture properly in the model. Make sure 1/3 to 1/2 of the images are close-ups of the face; the rest should be a mix of full-body to 1/2 shots. If close-ups are scarce, use GIMP to crop custom close-ups from the available material. At least 64 images are needed in total; consider going up to 256 images for a more comprehensive character representation.
The background of the character may recur across the images. This poses an issue for training, as the network might incorporate it into the model. Use Rembg to remove the background from these images and replace it with a random color (H: random, S: 0.5, V: 0.5). Rembg isn't perfect and requires some post-editing with GIMP and the original image.
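A minimal sketch of this step, assuming Rembg's command-line interface and placeholder file names (hsb is ImageMagick's notation for HSV):

rembg i input.png cutout.png
HUE=$((RANDOM % 360))
magick cutout.png -background "hsb(${HUE}, 50%, 50%)" -flatten input_new.png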
Prepare the data
Preparing the data involves optimizing the images, resizing and cropping them, and adding the required tags.
Edit the images
No image is perfect, so use ImageMagick (for multiple images) or GIMP (for individual images) to address issues that couldn't be avoided during collection. Apply sharpening, brightness, contrast, or color tuning to bring every image into the same quality range, as in the sketch below.
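For example, a hypothetical batch adjustment with ImageMagick (the values are illustrative, so tune them per dataset and work on copies, as mogrify overwrites files in place):

mogrify -sharpen 0x1.0 -brightness-contrast 5x10 *.png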
All images must be sized at 1024 x 1024 pixels and have 3 channels (RGB). This involves:
Cropping the image to only have the character in it
Resizing the image to fit within 1024 x 1024 pixels
Extending the canvas size to exactly match 1024 x 1024
Resizing and extending can be done batch-wise with ImageMagick, as shown below. Cropping, however, requires manual labor in GIMP.
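A sketch of the batch step, assuming the cropped PNG files are in the current directory (gray50 is an arbitrary padding color; -alpha off enforces 3-channel RGB output):

mogrify -resize 1024x1024 -background gray50 -gravity center -extent 1024x1024 -alpha off *.png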
Using buckets in Kohya's GUI
Alternatively, with an Nvidia GPU, the "buckets" feature in Kohya's GUI can automate the resizing and cropping process. This feature sorts images of different resolutions into groups (buckets), which helps optimize the training process and improve the quality of the LoRA.
Steps to use buckets:
In Kohya's GUI, navigate to the "Parameters" section and enable the "Enable buckets" option.
Set the bucket parameters "Minimum bucket resolution" and "Maximum bucket resolution" to 256 and 2048, respectively.
Create the tags
Kohya's GUI can automatically tag (caption) the images. WD14 captioning provides a head start, but it isn't perfect. Use a text editor (a batch-edit sketch follows the list) to:
Include tags worth switching on/off in SDXL (prompt)
Include tags to never appear in SDXL (negative prompt)
Keep tags like "simple background" or "border" (negative prompt)
Remove unnecessary tags
Remove double tags
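For instance, an unwanted tag can be stripped from every caption file in one command-line pass (the tag "blurry" is hypothetical; on macOS, use sed -i '' instead of sed -i):

sed -i 's/, blurry//g' *.txt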
Use the following format in the captions:
subject, switch on/off tags, negative tags, pose, position of hands, camera shot/view
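A hypothetical caption following this format (every tag here is illustrative):

myLoRA, red jacket, glasses, simple background, standing, hands in pockets, close-up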
Don't forget that proper tags are at least as important as proper images.
Train the model
Setup the directory
Create a directory for training using the command line interface or file manager. Create three subdirectories:
img
log
model
Since there are no regularization images, a regularization directory does not need to be created.
Create a subdirectory in "img" whose name combines 1 repeat (so an epoch will have no repeats) and the name of the model. For example:
1_myLoRA
Copy the images and captions (*.txt) to this new subdirectory.
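On the command line, the whole layout can be created in one go (the base path and the dataset directory are assumptions):

mkdir -p ~/SDXL/myLoRA/img/1_myLoRA ~/SDXL/myLoRA/log ~/SDXL/myLoRA/model
cp dataset/*.png dataset/*.txt ~/SDXL/myLoRA/img/1_myLoRA/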
Setup the training process
Go to Kohya's GUI and switch to the LoRA tab. Enter the full path of the JSON configuration file (directory plus file name) and select "Save". For example:
/Users/user/SDXL/myLoRA/myLoRA.json
Then:
Select SDXL as the base model
Go to the Folders tab and enter the image, output (model), and logging directory
Enter the output name
Go to the Parameters tab
Train for 64 epochs
Caption Extension: .txt
Save precision: bf16
LR Scheduler and Optimizer: Adafactor
Cache latents
Network rank: 128
Network alpha: 32
Don't upscale bucket resolution: checked
Apple settings:
Train batch size: 1
Mixed precision: no
CrossAttention: none
Nvidia settings:
Train batch size: 2
Mixed precision: bf16
Full bf16 training
Gradient checkpointing: yes
Memory efficient attention: yes
CrossAttention: xformers
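For reference, the Nvidia settings above map roughly onto the following kohya sd-scripts invocation. This is a sketch, not the exact command the GUI generates; the checkpoint file and all directories are assumptions:

accelerate launch sdxl_train_network.py \
  --pretrained_model_name_or_path sd_xl_base_1.0.safetensors \
  --train_data_dir ~/SDXL/myLoRA/img --output_dir ~/SDXL/myLoRA/model \
  --logging_dir ~/SDXL/myLoRA/log --output_name myLoRA \
  --caption_extension .txt --max_train_epochs 64 --train_batch_size 2 \
  --network_module networks.lora --network_dim 128 --network_alpha 32 \
  --optimizer_type Adafactor --lr_scheduler adafactor \
  --mixed_precision bf16 --full_bf16 --save_precision bf16 \
  --cache_latents --gradient_checkpointing --xformers \
  --enable_bucket --min_bucket_reso 256 --max_bucket_reso 2048 --bucket_no_upscale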
The number of training epochs should be high initially, so the saved models can be analyzed to find the sweet spot (past undertraining and before overtraining occurs).
Select "Start training", wait for the first step to finish, and then select "Start tensorboard". Wait for the training to finish: this may take hours or even a day. Be patient, as this will help in finding the 'best' epoch and model.
Test the model
Copy all created models to the LoRA directory of Stable Diffusion WebUI Forge. Start Stable Diffusion WebUI Forge and load the SDXL checkpoint you want to test with. The base model is fine, but a custom SDXL checkpoint might give better results. However, a new version of a custom checkpoint doesn't guarantee better output with the LoRA.
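For example, the copy step might look like this on a typical install (both paths are assumptions):

cp ~/SDXL/myLoRA/model/myLoRA*.safetensors ~/stable-diffusion-webui-forge/models/Lora/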
Go back to TensorBoard and use the Smoothing control, some experience, and the three graphs:
Loss/average
Loss/current
Loss/epoch
to find the model(s) to test with Stable Diffusion WebUI Forge. If the dataset is trainable and the training parameters are in order, look for local minima in the loss curves to select the models to test.
Use prompts from the dataset (captions), the checkpoint's standard negative prompts, custom negative prompts, and ADetailer to get results that can be validated. An X/Y plot with fixed seeds and different models helps to judge the details and quality of each candidate. Take time and experiment to find the ideal model.
Don't give up on the other models selected during testing. Give it a day, and test again. Other details and findings might emerge.
Next step
It's worth training another network (or two) with updated parameters to fine-tune the quality of the LoRA even further. Start with a lower epoch count (to minimize training time) and use 3/4 or 1/2 of the network rank (and a matching alpha). Also reevaluate the dataset: remove or add images and/or tags to bring the quality of the dataset to a new level. Note the changed parameters of each experiment and compare them with previous experiments. This helps in finding the optimum parameters.