2024/10/21 - Updated to add the captioning tool I now use and to clean up some small miswordings.
I was asked for tips on how I make my embeddings. This may be nothing new, but it is what I use. For now this applies to SD 1.5 training only, since I have not tried SDXL or Flux training. It does not sound like Flux embeddings are possible at this time; whether that is due to the T5 encoder or other issues I do not know. This is a rough start and I may add images to illustrate later if there is interest. I will try to rewrite or adjust if and when I try other training tools.
Dataset
When curating a dataset for an embedding I try to end up with 30-36 images for training. I have made models with as few as 18 or 24 images, but I feel that 30 gives the best result and too many images may harm the learning. I always make my dataset a multiple of my batch size, though - see the training section later.
When curating, I try to select images not based on how much I like an image but on how well I think the system will be able to parse it and capture the subject. I try to avoid things like hands on the head or chin, complicated poses, or crops that would leave something like a knee in the picture with no context, or a bent limb that leaves the frame and returns without the joint or continuation being visible. I do not use greyscale or black and white images unless I want the embedding to be that way. I have learned the hard way that it does not take many uncolored images to push the embedding in that direction.
I usually have a majority of face or upper-body photos in that selection, plus a few full-body or at least thigh-up photos. I do not feel that full-body legs are very helpful for embeddings unless you specifically want that; the subject frequently ends up too zoomed out. I only want the training to pick up body proportions, and it seems to do fine extrapolating from sitting or thigh-up photos.
When I crop, I use basic clone and paint tools to remove things like text on clothing, people in the background, people that have been partially cropped out, or problematic body parts. Sometimes jewelry or earrings too, but not always. This editing does not have to be perfect; I am no photo editor and only use the "paint.net" app (https://getpaint.net/) on Windows. I now try to crop to at least 768x768 to give me more options for newer models, but for SD 1.5 everything gets trained at 512x512 in the training settings.
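If you want to check a whole folder rather than eyeballing each image, a small script can flag anything below that 768 px target. This is only an optional sketch with a made-up folder name, not part of my actual workflow:

```python
# Optional check: list any dataset images whose shortest side is under 768 px.
from pathlib import Path
from PIL import Image

DATASET_DIR = Path("dataset")  # hypothetical folder of curated images

for path in sorted(DATASET_DIR.iterdir()):
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    with Image.open(path) as img:
        width, height = img.size
    if min(width, height) < 768:
        print(f"{path.name}: {width}x{height} is below the 768 px target")
```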
I send images through the a1111 webui extras tab using "Batch from Directory" to caption with BLIP and convert to PNG. I do not upscale or crop here, as that work has already been done. After BLIP captioning I now use a tool named TagGUI (https://github.com/jhc13/taggui) to edit and standardize my captions across a group of datasets; it lets me easily search and replace captions across thousands of images if needed and shows me little-used captions. Previously I used an app called Caption King (https://github.com/Jukari2003/Caption-King) to view the BLIP caption of each image and rewrite it. That tool provides crop and resize tools, but I did not use them, so it was only a way for me to easily edit the caption text for a batch of images while moving back and forth through the dataset to compare. You can also do this manually, of course, and there may be better tools available than these two.
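For bulk edits, a few lines of Python can do a rough version of the search-and-replace that TagGUI provides, assuming the captions live in .txt files alongside the images; the folder name and phrases here are made up for illustration:

```python
# Replace one caption phrase with another across every .txt caption file.
from pathlib import Path

CAPTION_DIR = Path("dataset")  # hypothetical: .txt captions next to the images
OLD, NEW = "floral wallpaper", "flowers in background"

for txt in CAPTION_DIR.glob("*.txt"):
    text = txt.read_text(encoding="utf-8")
    if OLD in text:
        txt.write_text(text.replace(OLD, NEW), encoding="utf-8")
        print(f"updated {txt.name}")
```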
For the caption text, I use the generated BLIP captions only as a baseline to see what the model has picked up and focused on. I rewrite all captions from scratch as small elements describing the things I want the training to exclude from the embedding, using the following format with each element separated by a comma (an example caption is shown after the guidelines). The order of these elements does not matter because I shuffle them during training. I do not worry about the overall length of a caption, only about keeping each element simple, though I would imagine it is best to keep captions under 75 tokens.
Caption Guidelines
basic subject: for me this is generally "a man" or "a woman" - this seems to be the exception to excluding things from captioning, and may work because of the training parameters later. I have not tried excluding it.
clothing: "wearing a white crop top", "wearing blue overalls" - I do not know that every item needs a "wearing a" but it has worked for me.
accessories - generally but not always "wearing": "wearing a gold necklace", "wearing earrings", "red lipstick", "purple nail polish". Sometimes "tattoo" as an accessory if needed.
environment and background: "white background", "fence in background", "window in background", "blue couch", "building in background" - I caption the most here. I try to use colors especially with backgrounds but otherwise not overly describe each element. This is an area where the automatic captions are helpful as I try to describe what the model would think an element is and not what I see it as. "Floral wallpaper" to me may become "flowers in background". A lot of cooking equipment in the background may only be "kitchen in background" rather than "stove in background, sink in background, coffee maker in background" and so on.
BLIP often wants to describe things not actually in the scene. I try not to describe anything that cannot be seen - if I cannot see their pants or shoes, I do not describe them.
Generally I do not describe body parts, hair, facial features or nudity. If there is a hairstyle in a specific image that is an outlier in the dataset that I do not want to trigger, I may describe it.
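Putting these guidelines together, a finished caption might look something like this (a made-up example, not taken from a real dataset):

```
a woman, wearing a white crop top, wearing a gold necklace, red lipstick, window in background, blue couch
```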
Training
I currently train using the automatic1111 webui training tab. These parameters owe a large debt to the guide by JernauGurgeh which sadly no longer seems to be posted.
Make an empty embedding with no initialization text (text field is blank) and 4 vectors per token. You can also use 2 vectors, but I have switched to 4 as it seems to be more flexible.
Use an embedding learning rate of 0.004, with gradient clipping disabled.
Use a batch size of 3 and set gradient accumulation steps equal to the number of images in the dataset: 30 for 30 images, or 18/21/24/27/33/36, etc. As said above, I make my dataset a multiple of the batch size. If you have a stronger GPU you could try increasing the batch size, but keep your dataset a multiple of whatever value you select. My understanding is that this lets every image be trained 3 times per step when the batch size is 3. If you see people talking about "epochs", this would be 3 epochs per step, so step 120 at batch size 3 would be 360 epochs.
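Here is that step/epoch arithmetic written out, under my interpretation that one step processes batch size times gradient accumulation images:

```python
# Epochs seen after a given number of steps, assuming each step processes
# batch_size * grad_accum images.
def epochs_for(steps: int, batch_size: int, grad_accum: int, dataset_size: int) -> float:
    images_seen = steps * batch_size * grad_accum
    return images_seen / dataset_size

# 30 images, batch size 3, gradient accumulation 30:
print(epochs_for(1, 3, 30, 30))    # 3.0 epochs per step
print(epochs_for(120, 3, 30, 30))  # 360.0 epochs at step 120
print(epochs_for(200, 3, 30, 30))  # 600.0 epochs at the 200-step cap
```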
For the prompt template, make a text file containing the following line: "a photo of [name], [filewords]" - this makes the training prompt the embedding name followed by everything in the caption text of the image. For A1111 this file goes in the "textual_inversion_templates" folder and can then be chosen as the template. Other tools may have other template formats; this is only what works for me.
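For example, with a hypothetical embedding named "mysubject" and the sample caption shown earlier, the prompt used during training would expand to roughly:

```
a photo of mysubject, a woman, wearing a white crop top, wearing a gold necklace, red lipstick, window in background, blue couch
```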
Train SD 1.5 at 512x512 and leave the "Do not resize images" option unchecked. My dataset may be 768 or 1024, but I still want SD 1.5 to resize and train at 512.
Max steps is 200. Worst case, you can always go back and train more.
I do not save any images during training, but I save a copy of the embedding every 10 steps. Again, if you are not happy with your results you can pick one of those 10-step copies from along the way and train it further, saving after every step.
I do not use PNG alpha or read parameters from txt2img. I do select "Shuffle tags by ','" and set tag dropout to 0.1, which I believe randomly removes some tags for variety. I select "deterministic" sampling.
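As I understand those shuffle and dropout options, each training step sees the caption with roughly this treatment applied; a minimal sketch of the idea, not the webui's actual code:

```python
# Roughly what tag shuffling plus a 0.1 tag dropout do to a caption each step.
import random

def shuffle_and_drop(caption: str, dropout: float = 0.1) -> str:
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    kept = [t for t in tags if random.random() >= dropout]
    random.shuffle(kept)
    return ", ".join(kept)

caption = "a woman, wearing a white crop top, window in background, blue couch"
print(shuffle_and_drop(caption))  # e.g. "blue couch, a woman, window in background"
```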
Evaluation
When done, I take every saved result from step 100 to 200 and start testing those 11 embeddings with simple prompts on a single checkpoint against a few random seeds. I use xyz grids for this and currently start my testing with "Unstable Illusion Final" or sometimes "Analog Madness v7", which to me can have better adherence for words like "portrait" than Unstable does. The prompt for this is often just the embedding name with no further text.
The goal is to eliminate the versions that have obvious flaws - strange eyes, poor proportions, or a saved step that fixates on a concept from a couple of images, like the subject being on a bicycle, or creates animal ears where none were present in the dataset.
After I have removed the first round of failures, I start adding to the prompt with a simple outfit and background. When I can no longer easily eliminate versions, I switch to a single random seed but test against multiple checkpoints - the same checkpoints I use to post my previews. These are not always the latest checkpoint versions, but they are what work for me. There is overlap between these models because of cross-merging, but to me it is a good subset of popular 1.5 models.
Sometimes a clear winner emerges early; sometimes it takes time. As part of this I may test more complicated prompts and also test against LoRAs. There is no standard test set for this; it is only gut feel for how the embeddings look to me.
If I end up unable to choose a single version, I narrow it down to 2 or 3 finalists, move to the "embedding inspector" extension (https://github.com/w-e-w/embedding-inspector), and start merging. I often try strength variations of 0.3, 0.5, and 0.7; 1.0 often seems too strong, but it may work well for a good contender that just needs 0.3 of an earlier version mixed in. I admit these numbers are not science. Sometimes this only highlights that one of the single steps was actually the better choice.
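If you prefer to do that kind of blend outside the extension, a few lines of torch can approximate it. This is only a sketch: it assumes the usual A1111 .pt embedding layout with a "string_to_param" dict and uses made-up filenames, so check it against your own files:

```python
# Blend two saved embedding checkpoints: 0.7 of one plus 0.3 of the other.
import torch

a = torch.load("mysubject-170.pt", map_location="cpu")  # hypothetical filenames
b = torch.load("mysubject-140.pt", map_location="cpu")

merged = dict(a)
merged["string_to_param"] = {
    key: 0.7 * tensor + 0.3 * b["string_to_param"][key]
    for key, tensor in a["string_to_param"].items()
}
torch.save(merged, "mysubject-merged.pt")
```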
Finally, I make simple previews and convert to a .safetensors file with a "Safetensors converter" tool from this site that I can no longer find posted. Use any conversion method that works for you.
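If you cannot find a converter, the safetensors library can do the conversion directly. Again a sketch under the same assumed .pt layout; the "emb_params" key is one common convention for .safetensors embeddings, so confirm your UI loads the result:

```python
# Convert a .pt embedding to .safetensors.
import torch
from safetensors.torch import save_file

data = torch.load("mysubject-merged.pt", map_location="cpu")
vec = next(iter(data["string_to_param"].values())).detach().cpu().contiguous()
save_file({"emb_params": vec}, "mysubject.safetensors")
```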
Example
I have attached the dataset and both the BLIP and manual captions used in training for my old "Ernesto Nobody" embedding. While it was not an especially popular embedding, it was built entirely from synthetic images I generated, so I feel free to share the source images. It is also an embedding where no full-body images were used, and the captions demonstrate my overall methods. The final model chosen was step 140 using the training parameters described above, except that this was only a 2-vector embedding rather than 4.
The file contains the dataset images, the automatic BLIP captions, and the manual captions for comparison.
This whole process can easily take half a day to most of a day for me for a single model:
Dataset curation and editing: 1-4 hours depending on subject availability and necessary edits
Captioning: 1 hour
Training: 1 hour
Model testing: 1-4 hours