
Character training settings and tips (sdxl, chroma, flux, zit, zim)


Feb 3, 2026

(Updated: 18 hours ago)

training guide

I put all my best training jobs in the resources section of the article for quick downloads.

Below are some tips I've found to work well. I'm still learning myself and am writing this mostly so I can reference it from time to time, so feel free to recommend improvements. If you have more tips, put them in the comments and I'll add them to the article.

This is also an evolving article: as I learn more I will update it (as time allows).

I used the following to train SDXL, Flux.1 Dev, Chroma, and now Klein 9B, so for those I can vouch that this approach works.

Attaching a set of files from one of my training runs for context. The set is for Klein 9B, but the principle is the same for all of them.

This is how to train a character with https://github.com/tdrussell/diffusion-pipe

If you want to read more about what each option does and what else is available, read the main_example.toml and dataset.toml files in the examples dir: https://github.com/tdrussell/diffusion-pipe/tree/main/examples

Now, on to the actual points.

  1. Prep work

    1. Use short captions plus a trigger word. Don't caption anything about the character except things you want to be able to change later; anything you want the model to learn about the character, keep out of the caption.
      For example: 'realAnn wearing a red dress sits in a chair. The background is a living room with a large window.'

    2. Images: use at least 20 images, ideally in different poses (standing, sitting, lying on side, etc.), and caption the pose. If you only have standing photos, the model will struggle with anything else you prompt for.

    3. For face consistency: crop out around 8 of the best faces in your dataset, then upscale them to a higher resolution with something like SeedVR (or any other tool you prefer). Copy those crops, along with around 8 of the most critical images you want to anchor your character with (where the proportions and face all look right), to a separate folder called 'anchors' or similar. If chest size/shape matters a lot to you, repeat the process with upper-body crops.
      Here's the key thing: in your dataset config you will set the number of repeats for this anchor directory to double that of the normal dataset. This does wonders for character consistency.

    4. If you want to be able to prompt expressions for your character, take one of your best faces and use Nano Banana to generate at least one photo of each expression.
      If you're in the EU it might refuse, so I recommend using a US VPN and a prompt like this one:
      "I'm working on an AI influencer and need some various facial expressions; give me a face portrait with her winking at the viewer while smirking."
      Now, importantly: you want to caption these special expressions. A good caption will look something like this: "realAnn winking at the viewer while smirking. Close-up head portrait. Grey background".

  2. Dataset.toml config
    Create two directory entries:
    i) the whole dataset gets a repeat of 1
    ii) the anchors dataset gets a repeat of 2
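The two directory entries above might look roughly like this in diffusion-pipe's dataset.toml. The paths are placeholders, and the global settings here are just one plausible choice; check examples/dataset.toml in the repo for the full set of options:

```toml
# Global dataset settings (see examples/dataset.toml for all options)
resolutions = [1024]
enable_ar_bucket = true

# i) the whole dataset: repeat of 1
[[directory]]
path = '/data/realAnn/dataset'   # placeholder path
num_repeats = 1

# ii) the anchor crops: repeat of 2
[[directory]]
path = '/data/realAnn/anchors'   # placeholder path
num_repeats = 2
```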

  3. Training parameters

    1. Effective batch size (EBS):
      - Using a larger effective batch size normalizes the model's understanding of the concept (or character). If you've trained LoRAs that gave you wildly different proportions depending on the prompted camera angle or pose, this is what I'm talking about.
      - In practice, if you have a solid dataset of, say, the same person at the same stage of their life (face and body are consistent), you can try setting gradient accumulation = 1; 2 is the safe baseline.
      - If your dataset is inconsistent (face and body vary across poses and angles), use gradient accumulation = 4 to help smooth it out. Mind you, the more you increase the effective batch size, the more you get a general idea of the character and the fewer very specific character details.
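In diffusion-pipe these knobs live in the main config file. A minimal sketch, assuming the key names from the repo's main_example.toml:

```toml
# effective batch size = micro_batch_size_per_gpu
#   * gradient_accumulation_steps * number of GPUs
micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 2  # safe baseline; 4 for inconsistent datasets, 1 for very clean ones
```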

    2. Learning rate and LoRA rank:
      Rank determines how much a LoRA can store; realistically you can do 16 / 32 / 64 / 128.
      The question is how much from your dataset you actually want to infuse into your gens.
      Is it just the geometry of the character (face, body proportions)? 8 is enough.
      Want to capture the skin textures and some of the lighting (general mood) from your dataset? Use 32 or 64 (in my experiments you need at least 50 images for 64).
      The trade-off is that the larger the LoRA rank, the sooner it will start memorizing your actual photos, potentially losing generalization and degrading before your character has converged. On the flip side, if you use rank 8, you can hammer at it for 15,000 steps and the model will still work.

      For the Chroma learning rate I've been using Prodigy lately and have been getting consistently good results with it, so I recommend setting it like this (50+ images):

```toml
[adapter]
type = 'lora'
rank = 64
dtype = 'bfloat16'

[optimizer]
type = 'Prodigy'
lr = 1
betas = [0.9, 0.99]
weight_decay = 0.01
```

Tips:

  1. You can stop the training at step 500 to confirm the model is actually learning, then add --resume_from_checkpoint to continue. This is to make sure you didn't fumble something from the start.

  2. Go through the pain of installing flash-attention; it can be frustrating, especially on 5xxx-series cards, but afterwards it makes everything go about 30% faster.
