Putting down this markdown for my future self as well as for others. If you have more tips, put them in the comments and I'll add them to the article.
I've used the following to train SDXL, Flux 1D, Chroma, and now Klein 9B, so for these I can vouch that this approach works.
I'm attaching a set of files from one of my trains for context. The set is for Klein 9B but the principle is the same for all of them.
This is how to train a character with https://github.com/tdrussell/diffusion-pipe
If you want to read more about what each option does and what else is available, read the main_example.toml and dataset.toml files from the examples dir: https://github.com/tdrussell/diffusion-pipe/tree/main/examples
Now, on to the actual points.
Prep work
Use short captions + a trigger word. Don't caption anything about the character except things you want to be able to change later; anything you want the model to learn about the character, keep out of the caption.
So for example something like: 'realAnn wearing a red dress sits in a chair. The background is a living room with a large window.'

Images: use at least 20 images, ideally in different poses (standing, sitting, lying on side, etc.), and caption the pose. If you only have standing photos, the model will struggle with anything else you prompt for.
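To make the dataset layout concrete, here's a rough sketch (diffusion-pipe reads a .txt caption with the same base name next to each image; the filenames and captions below are made up):

```text
realAnn/
├── img001.jpg
├── img001.txt   # "realAnn standing in a park, hands in pockets. Overcast sky."
├── img002.jpg
├── img002.txt   # "realAnn sitting cross-legged on a sofa, smiling at the camera."
└── ...
```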
Dataset.toml config
For character LoRA training, only ever use one resolution. If you enable bucketing, diffusion-pipe will create a bunch of resized images and feed them to the model, which makes convergence harder since the character looks different at different sizes.
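As a minimal sketch of what that looks like in dataset.toml (key names taken from the examples dir; double-check them there, and the path is a placeholder):

```toml
# One fixed resolution, aspect-ratio bucketing off
resolutions = [1024]
enable_ar_bucket = false

[[directory]]
path = '/home/me/datasets/realAnn'  # placeholder path
num_repeats = 1
```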
Leave the rest as is.
Training parameters
Set gradient accumulation to 2
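In the main training toml that's a single line (key name as in main_example.toml; verify it there):

```toml
gradient_accumulation_steps = 2
```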
Learning rate and LoRA rank:
These are the two most important things to manage, and they are closely linked. A higher rank means the LoRA learns more from each image; a higher LR means it learns faster.
I've seen many people try to train 64-, 128-, or even 192-rank LoRAs. That will never work, because at those ranks the LoRA will learn that the lighting and furniture in an image are related to your character, so you get random artifacts sprouting all over the place when you want to use it. Mostly people crank the rank because their character isn't converging (usually because they're using bucketing), so raising the rank won't fix their issue.
Now, for actual settings that work and make sense, I use configs capped at 2000 steps:
a) rank 16, LR 2e-4, 512 res: super fast, finishes in 30-60 minutes
b) rank 32, LR 8e-5, 1024 res: full resolution, 2-4 hours
Running scenario a) lets me judge whether my dataset works for the character I'm trying to build or whether I need to adjust it; then I just run b) overnight and enjoy it the next day.
My best result so far has been:
```toml
[adapter]
type = 'lora'
rank = 8
dtype = 'bfloat16'

[optimizer]
type = 'AdamW'
lr = 2e-4
betas = [0.9, 0.99]
weight_decay = 0.01
```
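For reference, launching a run looks roughly like this; this is a sketch from memory of the repo README, so check the exact invocation there (9b-mixed.toml is the config from the attached zip):

```bash
# Launch sketch -- flags per the diffusion-pipe README; verify against the repo
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 deepspeed --num_gpus=1 \
  train.py --deepspeed --config 9b-mixed.toml
```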
Experimental:
I've seen some people suggest that rank 4 is enough, or even preferable, for a character LoRA, because all you actually want to capture is the relationships between the different parts of the body. This has some upsides:
- it won't pick up on any style elements because it doesn't have the params to do so
- you can run it with something like optimizer = Prodigy and lr = 1 (Prodigy adapts the step size itself, so lr is just a multiplier), or a fixed lr = 2e-4, because at rank 4 it will basically never overfit
- you can try this with gradient accumulation at either 1 or 2, again since it can't overfit:
```toml
[adapter]
type = 'lora'
rank = 4
dtype = 'bfloat16'

[optimizer]
type = 'Prodigy'
lr = 1
betas = [0.9, 0.99]
weight_decay = 0.01
```
Tips:
- You can stop the a) training at step 500 to confirm the model is actually learning, then add --resume_from_checkpoint to continue (see the sketch after this list). This is to make sure you didn't fumble something from the start.
- Go through the pain of installing flash-attention; it can be frustrating, especially on 5xxx archs, but it makes everything go ~30% faster after.
- If you do want to pull some aesthetic from your dataset (like photo grain and so on), you can bump the LoRA rank to something like 128 but lower the learning rate to something like 5e-5 / 8e-5 and use only 1024 resolution -- though be aware this will go much slower.
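For the first two tips, roughly (a sketch, not exact commands; the flag and package names are the ones mentioned above, so verify against the repo README and the flash-attn docs):

```bash
# Resume from the latest checkpoint (flag per the tip above)
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 deepspeed --num_gpus=1 \
  train.py --deepspeed --config 9b-mixed.toml --resume_from_checkpoint

# flash-attention install (can take a long time to compile)
pip install flash-attn --no-build-isolation
```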
The zip contains:
- 9b-mixed.sh: a bash script I use on Ubuntu to make re-running the training easier
- 9b-mixed.toml: the training configs
- 9b-mixed-dataset.toml: the dataset config
