How to Train Textual Inversions/Embeddings

Unlike txt2img generation, training is a very obscure process. Many of the features are poorly documented, experts disagree on some of the technical aspects and, most importantly, it's very difficult to get clear feedback on any given approach or setting. It's not feasible to run the entire process many times, tweaking one detail at a time, to try to learn from the output.

As a result, people have different approaches, based on personal experimentation and assumptions.

There are three main factors that determine the quality of an embedding: the images used, the descriptions of those images, and the settings for the training.

Of these, the settings are the most difficult aspect to know how to get right. I watched this video by user Aitrepreneur, tried out the settings, and got very good results immediately. I continue to use those settings, with only minor changes which I explain below.

I'll walk you through the whole process the way I approach it, but check out other people's approaches as well and find something that works for you.

1) Select subject

Not all facial features translate well to SD. Not all faces will look right in every setup. For some celebrities, good textual inversions are already available. For others, good pictures are not available (see next step). Choose a subject you believe will be worthwhile and feasible.

2) Select images

You are primarily training facial features, but you must also give SD some idea of the person's overall look, body shape and posture. Choose primarily close-ups of the face but include some cowboy shots (waist up). Full body rarely works, because the face is too small.

All pictures should be high-resolution, sharp and clear with facial features clearly visible. It's OK to have one or two black-and-white pictures (which should be labeled as such in the description), the rest should be in color.

(These requirements make it easy to train a good embedding of a current Instagram model and hard to train one of a model from the 90s or earlier.)

Don't use pictures that are confusing: the subject blending into the background, too much going on, other people in the frame (except well in the background), too much weird stuff lying around, etc.

There should be no watermarks, writing or other superimposed elements (Lama Cleaner, which I recommend in the next step, is great for removing those). You want natural-looking photographs (unless you're training some other concept, obviously). Occasional writing on clothing or billboards is OK.

Select a variety of angles, facial expressions, backgrounds and lighting conditions. Include images that stand out and look different from the others. If you show SD only one thing, it will reproduce only (or primarily) that thing when prompted.

Select images that represent the subject in the way you want reproduced (again, with some variety). If you strongly dislike certain outfits or facial expressions, don't include those pictures.

Too few images provide insufficient variety, while too many will blur the concept. I don't know what the ideal number is. I use between 12 and 24. (There is an optional additional recommendation, "number of images = batch size x gradient accumulation steps", that I will explain below under settings.)

3) Modify the images

Remove watermarks and other unwanted elements (e.g. perhaps you can salvage an otherwise great picture by removing or smearing the mess in the background, but don't make it look too unnatural). I recommend Lama Cleaner for this. If the subject is surrounded by other people, crop the picture.

You can go much further here, upscale or even use img2img to create your perfect training database. This can be a great idea for training exotic or fantasy concepts. For training a person, it's not necessary.

4) Resize the images

Images used for training must have a 1:1 aspect ratio. SD 1.5 seems to train best on 512x512 pixels.

Downsizing the images will reduce their quality and make faces more pixelated. This is frustrating but unavoidable. It only makes it more important to select a high-resolution base image.

This website allows you to easily resize the pictures as needed.

Once resized, go over them again. Remove those in which the face is significantly less clear than in the others.
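
If you prefer to script the resizing locally instead of using a website, here is a minimal sketch using the Pillow library (the folder names are placeholders; note that a simple center crop may cut off faces, so you may still want to crop some images by hand first):

from pathlib import Path
from PIL import Image

SRC = Path("selected_images")   # placeholder: folder with the chosen high-resolution pictures
DST = Path("resized_images")    # placeholder: output folder for the 512x512 training images
DST.mkdir(exist_ok=True)

for file in sorted(SRC.iterdir()):
    if file.suffix.lower() not in (".jpg", ".jpeg", ".png", ".webp"):
        continue
    img = Image.open(file).convert("RGB")
    # Center-crop to a square, then downscale to 512x512 (the size SD 1.5 trains best on).
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img = img.resize((512, 512), Image.LANCZOS)
    img.save(DST / (file.stem + ".png"))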

5) Describe the images

Once you have the necessary number of resized images in one folder, describe them.

SD can help you with this. Click on "Preprocess images", enter the path to the image folder under "source directory". Create a new empty folder (called "processed", or whatever you want) and enter the path under "destination directory". Check "Use BLIP for caption". Click "Preprocess". Depending on your hardware, this will take about 30 seconds or so.

BLIP is an img2txt program integrated into SD: instead of providing a text and getting an image, you provide an image and get a text. For each image, it writes a text file with the same name as the image, containing a description, into the destination folder alongside the image itself. This format is a requirement for training.
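
If you want to double-check that every image ended up with a matching caption file, a quick sketch like this (the folder name is a placeholder) will list the pairs:

from pathlib import Path

folder = Path("processed")  # placeholder: the destination directory from the preprocessing step
for img in sorted(folder.glob("*.png")):
    caption = img.with_suffix(".txt")
    if caption.exists():
        print(img.name, "->", caption.read_text(encoding="utf-8").strip())
    else:
        print(img.name, "-> MISSING CAPTION")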

Open each text file one by one. BLIP's output gives you a good starting idea of format, phrasing and length, but it will need editing.

If you are training an embedding of a woman, each file should start with "a woman". Keep this consistent (not "a young woman", "a girl", "an Asian woman" etc.).

I use the following format:

a woman, with {hairstyle}, wearing {outfit}, {facial expression}, {pose}, {additional details}, {background}
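
A filled-in caption (the details here are invented for illustration) might look like this:

a woman, with a loose brown ponytail, wearing a green summer dress, smiling, sitting on a park bench, holding a coffee cup, trees in the background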

Be very careful what you describe and how! Your goal is not to describe every little detail. Make sure SD understands what it is looking at.

BLIP will also give you an indication of misconceptions. It may misidentify certain elements or describe elements that aren't there. Remove or correct those. Also remove redundancies (BLIP will sometimes produce phrases like "wearing a hat with a hat on her head and a white hat").

Here you also have to decide what other elements (facial expressions like "smiling" or "making a funny face", outfits, even backgrounds) you want to be part of the embedding. Don't mention what you want SD to learn implicitly. Do explicitly mention anything that is present in the picture but that you don't want included in the embedding.

This is not a clear-cut process. SD often makes associations that are difficult to anticipate or control. Try to exercise control over what you consider very important details.

For example, if you are training an embedding of a brunette who sometimes dyes her hair blonde, and you want the txt2img results to be brunette, one option is to include only pictures with brown hair. Otherwise, use both, but never write "a woman, with brown hair", and always write (if blonde in the picture) "a woman, with blonde hair". That way, SD will understand that "blonde hair" is not a part of the concept you are training, and it will learn that brown hair is part of the concept implicitly by looking at the pictures.

(By this logic, SD should assume that "a woman" is not part of the concept either, unless of course it's programmed to always consider the first element integral to the concept. Perhaps results could be improved by changing this part of the description somehow, but I haven't tried.)

Leaving out descriptions is one of the easiest ways to mess up an embedding. I once had wood panels (or what looked like them) in the background of three of the pictures, didn't mention them in the descriptions, and now only get pictures of the subject standing in a wood cabin.

At the same time, overly long descriptions become confusing (SD won't be able to match each element of the description to the right element of the picture). This, too, will mess up the embedding. There is no good way of knowing what the right amount of description is without trial and error (and even then, the whole process remains obscure).

Overall, keep the description as clear and concise as possible and hope for the best.

6) Create the embedding file

You can do this at any time, but at this stage at the latest. Click on "Create embedding". Enter a unique name (not something that might naturally occur as a prompt).

Set the number of vectors, representing the amount of information you want SD to gather and embed. This is another obscure balancing act. You want the core concept (what the person looks like) and not an exact replica of a training picture. Higher vector numbers also require more pictures. For 12 to 24 pictures, I use 5 vectors per token.

Here are the vector counts Aitrepreneur recommends for other numbers of images:

10 images: 2-3 vectors
11-30 images: 5-6 vectors
40-60 images: 8-10 vectors
60-100 images: 10-12 vectors
100+ images: 12-16 vectors

Click "Create embedding".

7) Choose training settings

Click on the "Train" tab. Select the embedding you have created (you may need to hit refresh).

Aitrepreneur recommends the following progressive learning rate:

0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005

This means that SD will "draw a lot of conclusions" from its first ten looks at the dataset, and draw progressively fewer from the passes thereafter. I find this intuitive: Let SD quickly grasp the general idea and then slowly fine-tune the details (assuming this analogy works). (This progression is also similar to the Karras noise schedule, which I think tends to work the best.)
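To make the syntax concrete, here is a small sketch of how I understand the webui reads the schedule: each "rate:step" pair applies until that step, and the final rate (with no step attached) applies for the rest of the run.

schedule = "0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005"

def learning_rate_at(step, schedule):
    # Walk the pairs in order and return the first rate whose "until" step
    # has not been passed yet; a bare trailing rate covers everything after.
    for part in schedule.split(","):
        part = part.strip()
        if ":" in part:
            rate, until = part.split(":")
            if step <= int(until):
                return float(rate)
        else:
            return float(part)

for step in (1, 15, 100, 1000, 3001):
    print(step, "->", learning_rate_at(step, schedule))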

"Batch size" lets SD look at several pictures at once, speeding up the process. In theory, you want the largest batch size your GPU can handle.

Gradient accumulation lets SD "skim over" the pictures and piece the information together over several passes, reducing VRAM usage. In theory, you want the lowest gradient accumulation your GPU can handle.

I say "in theory", because there are other points of view on this.

I don't know whether "overfitness" can be caused by a too high number of steps. If an embedding or LoRa creates too samey results, I suspect it's usually due to something else (perhaps including a too low number of training steps).

Try out different embeddings created with different settings, and see which ones give you the results you want. For other examples, mv_ai recommends GA:17 at 150 steps, Alyila recommends BS/GA: 1/1 at 1500 steps and JernauGurgeh has increased GA from 2 to 15 and uses 120 - 170 steps.

This paper claims that there is a benefit to keeping the batch size times gradient accumulation steps equal to the number of images.

I explore that concept ("full-cycle learning") further in this article, where I also compare results for low and high gradient accumulation as well as low and high number of training steps.

Here are the batch size x gradient accumulation combinations I use (batch size x gradient accumulation steps = number of images):

6 x 2 = 12
7 x 2 = 14
8 x 2 = 16
9 x 2 = 18
7 x 3 = 21
8 x 3 = 24 (or 12 x 2 = 24, which doesn't always work due to VRAM constraints)
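
If you want to enumerate the possible combinations for a given image count yourself, here is a trivial sketch (the cap on batch size is an arbitrary stand-in for whatever your VRAM allows):

def bs_ga_pairs(num_images, max_batch_size=12):
    # All (batch size, gradient accumulation) pairs whose product is exactly the image count.
    return [(bs, num_images // bs)
            for bs in range(1, max_batch_size + 1)
            if num_images % bs == 0]

print(bs_ga_pairs(24))  # [(1, 24), (2, 12), (3, 8), (4, 6), (6, 4), (8, 3), (12, 2)]
print(bs_ga_pairs(18))  # [(1, 18), (2, 9), (3, 6), (6, 3), (9, 2)]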

In your "textual_inversion_templates" folder, create a txt file called "custom_subject_filewords.txt". (You only need to do this once ever, not once per embedding you train.)

This should be the text inside it:

a photo of [name], [filewords]

Then, select this file under "Prompt template".
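
During training, [name] is replaced by the name of your embedding and [filewords] by the caption from the matching text file. For example, with an embedding called "subj3ct" and a caption file reading "a woman, with a ponytail, sitting in a chair", the training prompt becomes:

a photo of subj3ct, a woman, with a ponytail, sitting in a chair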

Set number of steps. I use 3000.

"Save an image to log directory" will create preview images after a certain number of steps. I use 50.

"Save a copy of embedding" will create "savepoints" after a certain number of steps. I use 50.

"Read parameters from txt2img when making previews". I've never tried this. It allows you to make the preview images look better, but that's not your goal. The purpose of the previews is to monitor the training process "unadulterated".

"Shuffle tags" will mix up the prompts during training. For example, if you have "a woman, with a ponytail, sitting in a chair", it will sometimes be read as "a woman, sitting in a chair, with a ponytail". Some people recommend this, saying it improves training. I don't use it, because I worry it will make the prompts too confusing. It also requires a really precise use of commas in descriptions.

"Drop out tags" set to 0.1. This will slightly vary the prompts.

"Deterministic" sampling method.

Before you click "Train Embedding", SELECT THE RIGHT MODEL in the drop-down menu "Stable Diffusion checkpoint" at the top of the UI. Don't train for a bit on one model, and then switch to another during training (unless you deliberately want to create "interesting" results).

Some people train realistic textual inversions on realistic SD models. I use the vanilla SD 1.5 model. My assumption is that other models all modify the vanilla model in some way, and that a textual inversion trained on the vanilla model will carry over across those modifications. That would mean the embedding works well with a variety of checkpoints (and this is usually true, to an extent, for the textual inversions I have trained). In practice, not every textual inversion looks good with every checkpoint, and some may require tweaking.

Double-check everything, then click "Train Embedding".

If you get an error message (e.g. "training finished after 0 steps"), this may be because your batch size is too large (and/or GA too low). I also often get the "variable already in use" error (displayed in command prompt), after having used txt2img with another checkpoint. If this happens, I just run txt2img once (with some very simple prompt) on the vanilla model, and then it will work.

You can safely press "interrupt" at any time; this merely pauses the training. If you use SD for other things before resuming, you may get the above error (again).

If SD is shut down in the middle of training (without manually pausing it), you can resume the training later, but SD may pick up from an earlier "savepoint". In that case, go to the "textual_inversion" folder, find the most recent version of the embedding and copy it into "embeddings". Then select that version in the "Train" tab. This will create a new training session in a new folder and with a new file name. For example, if training is forcefully interrupted at step 2517 and, on restart, SD wants to resume from step 2000, you can copy and use the step-2500 file instead. The final output file will then be called "subj3ct-2500-3000". You can simply rename it to "subj3ct" and get an embedding that (I think, correct me if I'm wrong) is equivalent to one trained to 3000 steps in one go the regular way.

8) Monitor the training

If you use the "save an image to log directory" option, you will get preview images during training.

It's good to learn what to expect and to recognize what a "good" result looks like.

You will get some bad results at any stage during the training just through bad RNG. The vanilla model is bad at anatomy and will mess up many preview images in ways other checkpoints using the final embedding will not. Faces at a distance will always be flawed.

In the first few images, you should get the sense that SD has a general grasp on the concept. If you train the embedding of a woman, and get pictures of cars, something is wrong. Instead you should get an amateurish, even childish, but recognizable representation ("I know who that's supposed to be") of the subject.

In later images, you should see more detailed and more accurate facial expressions and an overall look that is more faithful to the training images. This is like watching a child artist grow up and become more precise with the shapes and more confident with more complex compositions. If this doesn't happen, then the training isn't going well. Once you have a good understanding of how preview images relate to final images, observing the learning process through the preview images may be a better way to decide how many steps you want your trainings to take than comparing final images (while keeping the limitations of the previews in mind).

When generating previews with the vanilla model, remember that you are not looking at final results. It's like an abstract representation that makes the training progress more transparent (when compared to loading up performance enhancers that will actually obscure the quality of the embedding). In a sense, you are looking at the training progress "in code". You get used to it, though. Your brain does the translating. I don't even see the code. All I see is blonde, brunette, redhead...

You can check the prompt on which each image was trained (this is easiest for the most recent image, for which the prompt is displayed in the UI). Check which elements of the description SD represents and how (and how prominently) and which one it doesn't. Pay attention to mis- or overrepresentations. This will help you improve your descriptions for the future. If every picture includes the same unintended prop, you've made mistakes in selecting and/or describing the images.

Depending on your GPU and the exact settings, a training process of 3000 steps will take several hours (it takes about 2 hours on my PC). Obviously, you can do other (not GPU-intensive) things in the meantime.

Once it is finished, the embedding will be saved into your "embeddings" folder.

9) Test the embedding

Now you can use it. Try out whichever styles you are most familiar with at first and see how the pictures compare to your previous ones. Try out other styles, checkpoints and whatever else comes to mind.

Any given setup will give you one "interpretation" of the subject. This includes certain poses, camera angles, type of outfit and an overall "feel".

However, a textual inversion of a person, even at a high weight (0.8 to 1), should not change (let alone set in stone!) the overall composition, lighting, color palette or environment (background). That's the whole point of using a textual inversion rather than a LoRA of a person: high fidelity to the subject's features combined with high flexibility in the subject's overall representation.

I still get certain "default preferences", which I mention when I post the embedding (e.g. "Ichika likes to wear a maid outfit by default"). That's OK (but you should still aim to minimize this effect) as long as you can still easily prompt her to wear a burqa instead.

10) Test complementary resources

There may be other embeddings or LoRAs that can help tweak or improve the look that you or other users expect from the embedding. Do some testing and note the especially helpful ones. Similarly, if there are commonly used resources that do not work well with this embedding, those may be worth noting, too.

11) Use the training data to create unique images

A useful exercise during testing, that can also give you great final images, is to copy and paste the image descriptions into txt2img (i.e. "a photo of subj3ct, a woman, wearing this, doing that"), but this time with your usual setup. This is essentially SD's "final exam": You've trained it to create any kind of image involving the subject, but these in particular should give you the kind of results you were looking for in the first place. You'll often get cool details and poses you haven't seen before.

12) Consider sharing the embedding
