Why Write This Now?
I realize that for most, SD 1.5 is old news. For many it's like comparing "Pong" to "Grand Slam Tennis 2." I'm still using it because I personally prefer TI's to LoRAs (cue the angry troll mafia), and TI's don't like Pony or XL. I also can't run XL or Pony locally, not having a couple of grand lying around to buy a new card, thank you adult responsibilities... This is a personal preference and situation; this article is not about saying TI's are better, or that one model is better than another. The intent is to share what I've learned about TI generation for anyone else interested in still experimenting with the tech.
My Setup
I use an EVGA RTX 2060 6GB card for training... I'm not proud. My UI of choice is Automatic1111, and until last month I was using version 1.6 due to the bugs in TI training; those have been fixed in the latest versions. Sometimes I run the UI in Chrome and sometimes in Firefox. Chrome has been giving me memory errors with SD lately, and Automatic doesn't like the NoScript extension in Firefox. I've also been using ThinkDiffusion to train on a virtual machine to speed things up. I do train and generate with xformers enabled; it only makes the card squeal a little.
In the Beginning Jernau Taught the hMonk
I learned how to generate TI's by working with @JernauGurgeh. His help and advice were essential, and I had my first success following his guide (in the comments on the model), with some helpful troubleshooting from him via Discord. Several other TI authors have had the same experience, and I feel his influence on the community here is still strong even after he left.
Summing it up, Jernau's method uses 15 images (sometimes more), focusing only on the subject's face, and follows the settings in his post linked above. I feel the biggest takeaway from it is setting Gradient Accumulation Steps equal to the number of images, with Batch Size always at 3.
Jernau's method evolved a bit after he left Civit; he had a guide available to subscribers on Ko-fi and then Patreon before he had to take down his pages there. I'm not going to go into how his method developed, in case he's able to find a new home after his time off.
The reason I liked his method, and sometimes still use it, is that it results in a somewhat more flexible model. The model trains faster (10-12 minutes on my rock vs. 20+ with other methods), and the subject usually appears at a lower step count than I've seen with the other methods mentioned in this article (Jernau's method usually had them coming in at 125-135 vs. 140+ on the newer methods). I did have issues with it when the subject liked to wear a ton of makeup, and certain face shapes just wouldn't train at those settings, which led to my experimenting.
Data Sets
My work initially differed from Jernau's in that I varied my data sets by including more half- and full-body shots. I used this approach for all of my early work, from February until late March 2024.
Captions...
In late March I started experimenting. First, I stopped using BLIP for the initial captions, and started using the WD-14 tagger. I found that with this tagging model I didn't have a lot of editing to do afterwards. With only a couple of exceptions, all of my models released after April 1 were captioned with WD-14 with minor edits by me.
I always have it add "a woman" to the tags and remove "1girl, solo, red carpet, magazine cover." That's mostly personal preference; I feel the important part is removing "1girl" and "solo," since the couple of times I forgot to, I couldn't get the subject to appear in a group shot without some serious token weights. This is the current captioning method I'm using. I may do some more experimenting with removing certain other tags.
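If it helps, here's a minimal sketch of that tag cleanup in Python, assuming WD-14 wrote one comma-separated .txt caption per image. The folder path and tag lists just mirror my edits above; they aren't part of any tool.

```python
from pathlib import Path

# Tags I strip from the WD-14 output, plus the one I always add.
REMOVE = {"1girl", "solo", "red carpet", "magazine cover"}
ADD = ["a woman"]

dataset = Path("datasets/example_subject")  # hypothetical data set folder

for caption in dataset.glob("*.txt"):
    tags = [t.strip() for t in caption.read_text().split(",")]
    tags = ADD + [t for t in tags if t and t not in REMOVE]
    caption.write_text(", ".join(tags))
```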
Data Set Size
It was also around April that I started playing with larger data sets. Some TI's had 18 images, most 15-21, and Agatha Vega has 54... images were cropped to 512x512. With my newest work, I rarely use only 15 images; it's almost always a data set of 18-24 images per TI. Currently, I try to make 1/3 to 1/2 of the data set closeups of the face, with the rest a mix of full-body and half-body shots. I'm also starting to experiment with 768x768 images.
File Type
For data set preparation, I've started just using JPEG instead of converting to PNG, and I'm not really finding a difference. Just be sure not to use WEBP format images; they don't play well with the trainer.
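For anyone scripting their prep, here's a minimal sketch using Pillow that center-crops to a square, resizes to the training resolution, numbers the files, and saves as JPEG while skipping WEBP. The folder names are illustrative.

```python
from pathlib import Path
from PIL import Image

SIZE = 512  # or 768 if you're experimenting with larger training images
src, dst = Path("raw"), Path("dataset")
dst.mkdir(exist_ok=True)

for i, path in enumerate(sorted(src.iterdir()), start=1):
    if path.suffix.lower() == ".webp":  # WEBP doesn't play well, so skip it
        continue
    img = Image.open(path).convert("RGB")
    # Center-crop to a square, then scale down to the training resolution.
    side = min(img.size)
    left, top = (img.width - side) // 2, (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((SIZE, SIZE), Image.LANCZOS)
    img.save(dst / f"{i:02d}.jpg", quality=95)  # JPEG; no need to convert to PNG
```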
The Settings
Initial Token Size for the Embedding
Jernau's method uses 8 and I've stuck with that. I didn't see any real benefit from increasing it, but I plan to experiment more since my settings have changed so much from that method. Everything I've published here is an 8-token TI, and the SafeTensor versions are saved as fp16 to make them smaller.
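To make the setting concrete: for SD 1.5, a TI is just N vectors matching the text encoder's 768-dimension width, so "token size 8" means an 8x768 tensor. A quick sketch of the sizes involved:

```python
import torch

num_vectors = 8   # the "initial token size" setting
clip_width = 768  # SD 1.5's CLIP text encoder dimension

embedding = torch.zeros(num_vectors, clip_width)  # the whole TI, before training
print(embedding.shape)  # torch.Size([8, 768])
print(f"fp32: {embedding.numel() * 4} bytes, fp16: {embedding.numel() * 2} bytes")
```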
Learning Rate
Jernau's method has the Learning Rate set at 0.004. This works, until a subject gets stubborn. I've been using 0.003 and lower recently. Mostly, I start at 0.004; if that fails, I start working the learning rate lower. If it hits 0.001 and still fails, combined with some other tweaks I'll talk about later, then I revise the data set.
Batch Size and Gradient Accumulation Steps
These settings go hand-in-hand in my experience. A batch of 3 and 15 gradient steps works great for certain subjects, depending on what you want to reproduce. I also had luck with a batch size of 2 and gradient steps of 18, and several other variations. Mostly, I set gradient steps equal to the number of images in the data set, with batch size either 2 or 3 depending on which one the gradient steps were a multiple of (e.g., if gradient was 16, batch was 2; if gradient was 21, batch was 3).
Recently, I like the formula Batch Size * Images in Data Set = Gradient Accumulation Steps. For example, if my data set is 18 images and I use a batch size of 3, Gradient Accumulation Steps = 54. For training locally, I always use a batch size of 1-3, since that's all my card will handle. Some services that allow for virtual training have enough power to raise that batch size; I haven't played with anything higher than 3 yet.
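The rule as a one-liner, if that's easier to read than prose (the function name is just mine):

```python
def gradient_steps(batch_size: int, dataset_images: int) -> int:
    """Current rule of thumb: Gradient Accumulation Steps = Batch Size * images."""
    return batch_size * dataset_images

print(gradient_steps(3, 18))  # 54, the example above
print(gradient_steps(2, 21))  # 42, same rule at batch size 2
```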
Failed TI's
If I have a TI fail to reproduce the subject, the first place I'm going to adjust is Batch Size/Gradient Steps. I'll start working it towards 1, adjusting the Gradient Steps as I go. If I hit Batch Size 1 and it hasn't worked yet, I'll go back and work a combination of Learning Rate and Batch Size adjustments. Honestly, most of the time it's been lowering the Learning Rate that leads to success.
There is a bit of a caveat here. 90% of the failures I've seen have been due to a data set issue, not settings. It might be a single bad image throwing it off, or something I missed in a caption file, but almost every failed TI I've had has been fixed by either swapping an image or fixing a bad caption. The other 10% have been something in the settings: either I messed up when setting up the training run (I have a bad habit of forgetting to swap checkpoints to train), or, as happened with Gabbie Carter (version 1 and the unposted version 2 sometimes generate her as African), the combination of Learning Rate, Batch Size, and Gradient Accumulation Steps was off.
Honestly, if you're trying to train and not getting good results with a similar setup, your data set is probably where the problem is.
PickleTensor vs SafeTensor
If the base TI trainer is giving you Pickles and you want SafeTensors for easier posting on Civit (or other sites), SwarmUI has a built-in converter. I've been using it to convert mine while encoding them as fp16 to save space.
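If you'd rather script the conversion than use SwarmUI, here's a minimal sketch assuming the safetensors package. A1111's .pt embeddings keep their vectors under string_to_param; the "emb_params" key is a common convention for SafeTensor TIs, which I'm treating as an assumption here.

```python
import torch
from safetensors.torch import save_file

# Load the trainer's PickleTensor output (it's my own file, so unpickling is fine).
data = torch.load("example_subject.pt", map_location="cpu", weights_only=False)
vectors = data["string_to_param"]["*"]

# Cast to fp16 to roughly halve the file size, then write it as a SafeTensor.
save_file({"emb_params": vectors.to(torch.float16).contiguous()},
          "example_subject.safetensors")
```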
My Regular Mistakes
I have a few that I seem to make on a regular basis. Things to watch for so you don't get frustrated:
Forgetting to swap from the models I use to generate images normally to a base model when training.
I do this a lot... version 1 of Katie McGrath was one of these; for one step it would only generate a picture of broken concrete with hair.
Forgetting to change both the name of the embedding and data set path.
Throws my whole day off when I get partway through and realize I just trained 10 models with the wrong names/triggers...
Forgetting to caption/number/crop the images in a data set.
Despite what others have said, training without captions has never worked for me.
If you are hand captioning, be sure to separate tokens with a "," or another separator.
Version 1 of Haley Reed failed because I forgot to do this; it would overpower my background and clothing prompts with whatever was in the data set.
Training a new model with an old trigger.
I'm just lucky I keep a backup folder.
Summary
So you read all that and are ready for the basic version:
Follow Jernau's guide here in the comments for the basic settings (which boxes to have checked on the training page)
Gradient Accumulation Steps = Batch Size * Number of Images in the Data Set
Other methods are equally viable, play around to get the results you want.
Learning Rate = Something between 0.001 and 0.004 depending on the quality of the data set
Initial Token Size = I like 8, others use 10 or 12; experiment with what gets you the look you're after.
For the love of all of our sanity, name it something other than a person's name...
If that name exists in the checkpoint, it will throw things off (e.g., Elizabeth Olsen is trained into almost every checkpoint I've tried, and if you name your TI "Elizabeth Olsen" it causes problems).
Triggers should be something that won't be in the checkpoint like El1Ols3nSc4rW1tch (not that I would ever do something that would piss off the Rodent Empire who owns the rights to that character...)
Have fun, and realize this is art; sometimes it won't be 100% perfect to everyone's eyes, but if it achieves the look you were going for, it works.
Examples of the Differences
This whole article happened when I decided to update one of my most popular and earliest models with the new training style. I was prepping the data and realized it would be fun to see how she turned out under each of the major changes to my training style. All of these use the same 15-image set, but the captions and training settings differ.
Christy White Version 1.0 and 1.5 were both done with the original training style influenced heavily by Jernau. It was trained with a Learning Rate of 0.004, Batch Size of 3, and Gradient Accumulation Steps of 15. Being totally honest, while it looks like her, it isn't the best reproduction. I haven't been happy with it since I put it out.
Christy White Version 2 uses the same settings, but swaps the caption files out for the WD-14 captions with my edits.
Pretty similar to Versions 1 and 1.5
Christy White Version 3 uses the same captions, but it was trained with a Learning Rate of 0.003, Batch Size of 3, and Gradient Accumulation Steps of 45.
The face shape changed and, I feel, it's a bit more accurate.
Christy White Version 4 uses the same captions and settings, but swaps from 512x512 images to 768x768.
This doubled the training time.
Unlike the other versions, tattoos appeared in many of the test generations.