Hello all!
I'm back today with a short tutorial about Textual Inversion (Embeddings) training as well as my thoughts about them and some general tips.
My goal was to take all of my existing datasets that I made for Lora/LyCORIS training and use them for the Embeddings.
Before that, I had to find the settings that work best for me. There was some trial and error, there was some feedback from early birds ("alpha testers" if you will :P) and I've pinpointed the settings which I will share in just a moment.
With those settings, I have trained over 800 TIs. The quality ranges from excellent to good (I attribute the differences to the quality of the dataset rather than the training itself). I have to say that I am very pleased in general with how it turned out.
There are some Embeddings that I will have to retrain after carefully inspecting (and changing) the datasets, once I find out more (namely: whether this is indeed a dataset issue, a "difficult" face issue, or maybe actually a training settings issue, which I think is the least likely scenario but which I, of course, cannot dismiss :P).
Some of the Embeddings I have already posted here (and more will be coming, hopefully maybe even today) but all are available on my coffee page as early access (https://www.buymeacoffee.com/malcolmrey - a shameless plug, I know :P). I also make some news posts there (you can find out for example why I haven't posted anything in a month or so).
First, I will describe my training process and then I will talk about my findings, experimentations, and general feelings about Embeddings.
1. Introduction
You may know me as a mainly LyCORIS model trainer, who also moved over to Loras (and of course some Dreambooths here and there). But I was never really a fan of Embeddings.
If you look way back in my models list (or use the search filter) you will find out that I have made a Textual Inversion (or two) very early on.
My feelings about it were very mixed at the time. Those were times before Lora/LyCORIS, so the advantage of way smaller filesize compared to Dreambooth was a godsend, but the pros ended there.
The quality was subpar compared to Dreambooth, and its other main benefit - the idea that you could use an Embedding with any base model you wanted - worked quite poorly at that time.
In my experience, the embedding worked okayish on the model it was trained on but lost the likeness on other base models.
This was over a year ago and much has changed (for the better!), I will share my current thoughts about the Embeddings in the summary part :)
So, as we have established, I was no expert (I still am not, just a person who found some decent settings :P), which meant that I had to read and watch a lot of tutorials to see what was available and what I could play with.
There are pretty much two main ways (there are others, but those two seem mainstream) of training - through the A1111 WebUI (which is what I used originally a year ago) and with Kohya SS.
I have talked with some people who make quality Embeddings and they seem to lean towards the A1111 WebUI - not only do they get good results in A1111, they also report getting bad results in Kohya SS.
I found that interesting as I was looking more into those tools and best practices written down by various creators.
I had two main goals - to have a good quality as a result and be able to automate it. I wasn't going to make 800 embeddings manually :)
2. The Training Tools
I decided to go with Kohya SS (knowing that it might be harder to find good training parameters). Eventually, through trial and error, I've found some.
I think the reason people get better results in A1111 is that it requires less time per training run, so it is easier to train and test more variants.
Since this is Kohya, I can give you the training script directly so you can just tweak the paths and run it :)
3. Training Script / Params
You can find the script directly on rentry: https://rentry.co/bz29cg
but I also paste it here directly:
accelerate launch --num_cpu_threads_per_process=2 "./train_textual_inversion.py" \
--enable_bucket \
--min_bucket_reso=256 \
--max_bucket_reso=2048 \
--pretrained_model_name_or_path="/home/malcolm/sd/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors" \
--train_data_dir="/home/malcolm/sd/embeddings/data/$1" \
--resolution="512,512" \
--output_dir="/home/malcolm/sd/embeddings/output/$1" \
--save_model_as=safetensors \
--vae="/home/malcolm/sd/stable-diffusion-webui/models/VAE/vae-ft-mse-840000-ema-pruned.ckpt" \
--output_name="$1-ti" \
--lr_scheduler_num_cycles="1" \
--max_data_loader_n_workers="0" \
--gradient_accumulation_steps=15 \
--no_half_vae \
--learning_rate="0.005" \
--lr_scheduler="constant" \
--train_batch_size="1" \
--max_train_steps="1000" \
--mixed_precision="bf16" \
--save_precision="bf16" \
--seed="1234" \
--cache_latents \
--optimizer_type="AdamW" \
--bucket_reso_steps=64 \
--save_every_n_steps="200" \
--xformers \
--bucket_no_upscale \
--noise_offset=0.0 \
--token_string="$1" \
--init_word="woman" \
--num_vectors_per_token=5 \
--use_object_template
I have this saved in a file called "train_ti_generic.sh", so I can then make another script with the following content:
./train_ti_generic.sh evagreen
./train_ti_generic.sh evaheredia
./train_ti_generic.sh evalongoria
It will train three models for me without my manual input (assuming of course that the dataset and output folders are prepared already).
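For example, assuming you saved those three lines as "train_batch.sh" (the name is just my example), running everything on Linux would look like this:
# one-time: make both scripts executable
chmod +x train_ti_generic.sh train_batch.sh
# run from the main kohya directory - it trains the three embeddings one after another
./train_batch.sh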
If you are using Windows, you should remove the trailing \ characters, put the whole command on one line, and change the extension to .bat (you will also need %1 instead of $1, as mentioned below).
The file should go into the main kohya directory.
If you have trouble running it (be it Linux or Windows) - just comment under this article and I will clarify anything that is still not clear :)
I will now explain some of the options:
a) --pretrained_model_name_or_path="/home/malcolm/sd/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors"
It is very important to use the original 1.5 model for training. With Lora/LyCORIS (and of course Dreambooth) you generally use a good-quality (finetuned/merged) model to get better results, but with an Embedding you need to use a vanilla model.
This is already mentioned in many guides, and it is also what I have experienced myself. Training on a custom model makes the TI good on that model but way worse on others.
In the summary section I will get a bit more into the details of why that is, but for now, just remember to use the base model :)
b) --train_data_dir="/home/malcolm/sd/embeddings/data/$1"
The $1 is a shell parameter (on Windows it would be %1), so for the first training it would be "/home/malcolm/sd/embeddings/data/evagreen".
Inside the "evagreen" folder I have the standard Kohya structure, so in my case, the folder name is "100_sks woman"
To be honest the "sks" token in case of embedding is not important as we will be triggering it by the filename anyway. The "100" is more important as it is used in kohya steps/epochs computation. The "woman" is a class token, in the case of a male I would go with "man" or "person" (as I do in Lora/LyCoris)
Inside that folder are 512x512 images (you can see my Dreambooth training guide/article where the first section is focused on getting and preparing the training data).
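To make the expected layout concrete, this is roughly what it looks like for Eva Green (the image file names are just examples; they do not matter since we are not using captions):
/home/malcolm/sd/embeddings/data/evagreen/
  100_sks woman/
    001.png
    002.png
    ... (the rest of the 512x512 images)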
c) --use_object_template
This makes the training ignore the caption files. Yes, you do not need to CLIP/BLIP your images and prepare the text files! For me, it is a godsend because I do not caption my lycoris/lora trainings either (I use the instance/class token).
My guide is mainly about training people; if you want to train a style, then you most definitely would want to caption your images, so this option should not be used in that case.
d) --output_dir="/home/malcolm/sd/embeddings/output/$1"
This is the path to your output folder where the Embeddings will be saved. I keep the embeddings in subfolders named after the person, and kohya expects the folder to already be there (I run an init script per name that creates those folders for me automatically - see the sketch below).
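That init script is nothing fancy; a minimal sketch (the name "init_ti_dirs.sh" is just an example, and it assumes the same paths as above) could be:
#!/bin/bash
# usage: ./init_ti_dirs.sh evagreen
# creates the dataset folder (to be filled with 512x512 images) and the output folder
mkdir -p "/home/malcolm/sd/embeddings/data/$1/100_sks woman"
mkdir -p "/home/malcolm/sd/embeddings/output/$1"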
e) --resolution="512,512"
Since the images are 512x512 I set this option accordingly.
As a side note: I prepared a 768x768 dataset and trained it using 768,768. The training took longer and required more VRAM, but when I gave the 512 and 768 models to friends, they were unable to tell me which was which. That means it really does not matter for SD 1.5, which was mainly trained on 512.
f) --bucket_no_upscale
Since my images are 512x512 already and I use 512,512 there is no need to upscale anything.
As a general rule - I know certain trainers allow you to train on source images of varying resolutions, BUT I prefer to crop the images the way I want.
Do you remember when 1.4 and 1.5 were frequently (to a varying degree) outputting generations with faces not fully in the shot? Well, if you have to process millions of source images (LAION) and you need them in a specific resolution - you will have to crop them automatically, and guess what, sometimes that will mean a face/off scenario :-)
Newer models (finetuned) didn't have that problem in general because the datasets were well-curated and those images were all neat.
To this day I still believe that 80% of the success stems from good datasets.
g) --gradient_accumulation_steps=15 \
--lr_scheduler_num_cycles="1" \
--learning_rate="0.005" \
--lr_scheduler="constant" \
--train_batch_size="1" \
--max_train_steps="1000" \
--optimizer_type="AdamW" \
--num_vectors_per_token=5 \
This is the meat. Those are the training parameters.
If your card struggles with AdamW (for example, has less VRAM) then you can go for "AdamW8bit" instead (see the snippet under point j below).
Those are probably not the BEST parameters, but they are quite good in my opinion. The best that I could come up with. I played with different learning rates, schedulers, and training batch sizes so you don't have to (unless you want to :P).
h) --save_every_n_steps="200"
During testing, I was saving every 50 steps, but once I got good results I switched it to 200. As you can see in the previous point, I have max_train_steps at 1000 and I consider that the final result and the go-to Embedding.
However, you will also get the 200, 400, 600, and 800 step versions if you keep this parameter. The 800 one is fairly good too, and perhaps you may want to use it if you desire less resemblance for some reason. It is up to you.
During the testing phase, I went as high as 2000 steps, but around 1300 I would occasionally (rarely) get some overtraining artifacts, and those would increase at 1500 and beyond.
i) --token_string="$1" \
--init_word="woman" \
So, for me the token_string would be "evagreen", and the init word is the same class word as in the folder structure (woman, man, person). It seems the token string is not that important, though, since we will be triggering by the filename anyway.
j) --mixed_precision="bf16" \
--save_precision="bf16" \
I was training on a better card (a 3090) so I could go with "bf16", but if your card cannot use it then just use the regular "fp16".
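For the lower-spec variant (this point plus the AdamW8bit note from point g), the only lines you would change in the script are, roughly:
--mixed_precision="fp16" \
--save_precision="fp16" \
--optimizer_type="AdamW8bit" \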
k) --no_half_vae \
--vae="/home/malcolm/sd/stable-diffusion-webui/models/VAE/vae-ft-mse-840000-ema-pruned.ckpt" \
Honestly, I haven't played with variations here; I just hooked up the official VAE. I don't remember the reason for --no_half_vae (was it a default?).
l) --output_name="$1-ti" \
This is the filename for the embedding, for me, it would be "evagreen-ti" and this is what I would be using in the prompts: "photo of evagreen-ti".
I had a discussion with a friend about it and I have tested it: there is no need to make it into some 'sks' or 'ohwx' token or some special variation like 'evgrn' or '3v4gr33n'. You can use the filename just fine and not worry about anything.
m) The rest of the parameters are either defaults or recommended. No special logic behind them.
4. The Training Itself
Training with this script/params takes me around 20+ minutes on a 3090. It would probably take around 40 minutes on a 2080.
So, yeah. It takes a while. It takes longer than training using A1111.
When reading guides and talking with friends, I heard about training times of 8-15 minutes.
When I was tweaking kohya settings to get those times (steps, gradient accumulation, and batch size - those are the main parameters that can affect the training time) - I was getting worse results.
So, yes - it takes longer but the results are good. And since I provide a script that you can batch to run multiple trainings overnight (or when you are away from the computer; see the sketch below) - that time is no longer such a big issue anyway :)
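If you want to let it run unattended and keep the logs, a simple sketch (again assuming the batch script is called "train_batch.sh") would be:
# run the whole queue in the background, detached from the terminal, and keep a log
nohup ./train_batch.sh > training.log 2>&1 &
# check on the progress later with:
tail -f training.log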
5. Summary
Here I will describe my general feelings about Embeddings after having trained 800 of them (plus hundreds more while trying to find the best params), how to use them, how to combine them to get great results, and how other models affect them.
a) Feelings
As I said at the beginning, I was not impressed with Embeddings. But times have changed.
A little background: Dreambooth/Lora/LyCORIS add new data during training, so the training itself improves the trained concepts. Embeddings, on the other hand, do not add any new data; what they do is guide the model toward the trained concept as best as they can.
This means that if you use a LoRA of person X on a base model that has no idea what person X looks like - you will in general get good results. However, if you use an Embedding of person X on a model that does not know what person X looks like, you may have trouble generating that person.
If the base model does not know how to generate a person similar to person X - the Embedding will not be able to guide the generations toward that person.
A very simple example would be: someone in Asia made a new model from scratch and used only photos of Asian people. You would not be able to generate photos of white or black people because that model would not know how to do it.
If you were to make a dreambooth and train a black person - then you would be able to generate that black person, because your output model would learn the facial features as well as the skin color.
But if you were to use an Embedding - since the Embedding does not add any new data (for example, skin color data) - it would try to guide toward our person as much as possible, but the base model would struggle to understand what black skin color is.
So, the current base models that we have are much better than vanilla 1.5 and they know a lot more stuff about people (anatomy, etc) in general but also about more people (in particular, famous people).
This means that the Embeddings are more powerful on those models and can get amazing results there.
To truly show that this is how Embeddings work, there is a very easy test to do. Pick a photorealistic model and use an embedding of a person. You may get decent results out of the box. But then attach a LoRA or LyCORIS of that person AND DO NOT use the trigger for that lora/lycoris.
You will see that the results with the Embedding are much better than before. That is because the Embedding always knew what we wanted to achieve, and now that we have added additional material that gets closer to that goal, the Embedding picks up that new material and generates a closer representation.
Remember my article about turning it to 11 (by using multiple loras/lycoris of the same person together)? You can turn it to 12 with Embeddings :-)
I do have default/suggested settings for mixing my models with those Embeddings and the results are IMHO really great. I will make another article about it rather soon and share a lot of examples there :)
b) How to use them / how to combine them
Well, you pretty much add it to the prompt normally. So my evagreen-ti.safetensors would go into the prompt as "photo of evagreen-ti". You can of course change the weights as with other models, so you could go for "photo of (evagreen-ti:0.8)" if you feel like the base effect is too strong.
As mentioned above, you can combine them. For famous people, you can combine it with a real name - as the base models know at least something if not everything about certain people, so a prompt like "photo of eva green evagreen-ti" could potentially work better if the model knows who Eva Green is.
And then you can combine it with Loras/Lycoris for example here are some parts from my wildcard prompts:
sophieturner-ti <lora:lora-small-sophie-turner-v1:0.35> <lora:locon_sophie_v1_from_v1_64_32:0.35> sophie turner,
emilia clarke emiliaclarke-ti <lora:locon_emiliaclarke_v1_from_v1_64_32:0.3> <lora:lora-small-emilia-clarke-v1:0.3>,
madonna madonna-ti <lora:locon_madonna_v1_from_v1_64_32:0.25> <lora:lora-small-madonna-v1:0.2>,
sandra bullock sandrabullock-ti <lora:locon_sandra_v1_from_v1_64_32:0.3> <lora:lora-small-sandra-bullock-v1:0.2>,
In this case, I'm using the default weight of the Embedding and a bit of lora and lycoris (notice that thanks to the embedding, the loras/lycoris are at very low strengths).
There is one MAJOR benefit right away that people who use multiple loras in one prompt can already see incoming :)
For example, if you make a simple prompt of "photo of sks woman <lora:locon_emiliaclarke_v1_from_v1_64_32:1>" or "photo of sks woman <lora:lora-small-emilia-clarke-v1:1>" then everything will be fine (well, the quality might vary).
If you combine it according to my "Turn it to 11" article then you would get something like this: "photo of sks woman <lora:locon_emiliaclarke_v1_from_v1_64_32:0.7> <lora:lora-small-emilia-clarke-v1:0.7>".
Please note that it is still one prompt, but the weights are no longer at 1.0 each and now sum up to 1.4. If you push the weights higher and higher - you will experience overtraining artifacts.
This is not limited to only those models. Once you start adding more and more loras (clothing, positions, poses, backgrounds, items, specialty, etc) you notice that the cohesion of the output image suffers and the likeness of the face goes away.
There are various loras out there, some undertrained and some overtrained. Some loras were trained for an outfit but the face got trained along with it, so using them on someone else can break the likeness (thank god for ADetailer in those cases).
But in general - the more loras you add, the more explosive the combination is, and you always have to fight with the weights. Sometimes lowering them keeps the concept in the output without negatively impacting the rest.
I view those loras as "flexible". Sometimes you can have up to 8-10 loras in one prompt and it all works perfectly fine but sometimes you get 3 loras and it goes downhill already.
So, that MAJOR benefit I mentioned is the following: you can use the Embedding and lower the weights of the person loras/lycoris, which makes room for additional loras without (or with less of) the risk of breaking the output.
In some cases, the Embedding will be all you need, which leaves the person loras out of the prompt entirely and makes room for other loras to be added :)
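For example (taking the Sophie Turner line from above and adding some made-up placeholder loras for an outfit and a pose, just to illustrate the idea):
sophieturner-ti <lora:lora-small-sophie-turner-v1:0.35> <lora:locon_sophie_v1_from_v1_64_32:0.35> sophie turner, <lora:some-outfit-lora:0.7>, <lora:some-pose-lora:0.6>, ...
The person part stays at low strengths thanks to the embedding, which leaves plenty of headroom for the extra loras.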
6. Ending
Here is a new post with some of the embedding samples: https://civitai.com/posts/887720
Here is an Imgur collection of those and some additional samples: https://imgur.com/a/JunCQIN
In other news:
I have finished Serenity V2 and I will be uploading it quite soon (most likely this week). The embeddings work nicely on many photorealistic models, but they will shine on models that were already fine-tuned with some celebrity datasets (and yes, Serenity v2 had additional fine-tuning, so it is a perfect match).
I will now also be focusing on SDXL training, as this is the area where my content is lacking the most (but I will keep making other kinds of models, so no worries!)
Cheers!
If you found this material interesting, educational, or just helpful - you can always support me in my passion over on my coffee page :) (which will also grant you access to the early access models - for example, all 800 embeddings, early access to Serenity v2, and much more :P)