The Hidden Truth About Textual Inversions in SDXL: Why Your Embeddings Need to Be Model-Specific

Have you ever spent hours crafting the perfect textual inversion (embedding) in Stable Diffusion XL (SDXL), only to be baffled by wildly inconsistent results when you use it across different models? I certainly have. For months, I was frustrated by embeddings that sometimes worked like magic and other times, well, not so much. It turns out I was missing a crucial piece of the puzzle: embeddings aren't universally compatible across SDXL models.

Specifically, I've been experimenting with SDXL models and finetunes derived from the Initium2 base model, an alternative model built on the SDXL architecture. This makes model-specific embeddings even more important, because Initium2 has its own unique training and latent space. I also recently ran into a particularly difficult issue when attempting to load a specific Stable Diffusion model, NoobAIXL_vPred10Version_1095888.safetensors, into Automatic1111. That challenge took me on a deep dive into model compatibility and the nuances of various SDXL training methods, and it taught me even more about why embeddings must be model-specific, which I'll also explain in this article. Finally, I've been experimenting with ways to create my own textual inversion embeddings: while many people use image-based training methods, I've been creating mine directly from text inputs with a tool called stable-diffusion-webui-embedding-merge.

This article is about sharing my "aha" moment and what I've learned about using embeddings in the SDXL space, and why you need to start creating model-specific ones if you want reliable results!

The Secret Inside SDXL: Two CLIP Encoders, Not One

SDXL is a powerful image generation model that achieves impressive results thanks to its architecture. One key aspect is that, unlike older Stable Diffusion models, SDXL uses not one but two CLIP text encoders: the smaller OpenAI CLIP ViT-L/14 and the larger OpenCLIP ViT-bigG/14. When you train an embedding on SDXL, it doesn't just create one set of learned vectors; it creates two, one set for each CLIP encoder.

It's the larger encoder (ViT-bigG/14) that primarily shapes the image generation and matters most for our embeddings. This is the main area we want to target when making text shortcuts for specific art and style concepts!
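
You can see the two sets of vectors for yourself by opening an SDXL embedding file and listing its contents. Here's a minimal sketch, assuming a hypothetical embedding saved in the common "clip_l"/"clip_g" safetensors layout (the file name is made up):

```python
from safetensors.torch import load_file

emb = load_file("my_style.safetensors")  # hypothetical embedding file
for key, tensor in emb.items():
    # Typical SDXL embedding layout:
    #   clip_l -> (n_tokens, 768)   vectors for the smaller ViT-L/14 encoder
    #   clip_g -> (n_tokens, 1280)  vectors for the larger OpenCLIP encoder
    print(key, tuple(tensor.shape))
```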

Fine-tuning: Reshaping the Landscape of the Model

Now, here's where things get interesting. Many of us don't just stick with the base SDXL model; we often use fine-tuned versions. Think of these models as specialized tools, each designed for a particular style or purpose. But here's the catch: when a model is fine-tuned, it doesn't just adjust the UNet (the image-generation part). It often also modifies the weights and biases of the CLIP encoders.

This means each fine-tuned model, even if it's based on the same SDXL base, ends up with a reshaped version of the model's latent space. Think of this space as a conceptual landscape where words and images are mapped to vectors. These changes affect how CLIP interprets the text prompts you give the model, and they affect your embeddings in a big way. Each finetune creates its own version of these CLIP spaces, so they all become individual "worlds," even when they share the same original base model! That said, the extent of these CLIP modifications varies across SDXL finetunes. Some, like PonyXL, may make more subtle changes to the CLIP encoders, while others, like Illustrious XL or NoobAI XL, are designed to significantly shift CLIP's understanding of text.
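
If you want a rough sense of how much a finetune has actually moved the CLIP encoders, one way is to compare the text-encoder tensors between the base checkpoint and the finetune. This is only a sketch: the file names are placeholders, and the "conditioner.embedders." key prefix is an assumption based on the usual single-file SDXL layout.

```python
import torch
from safetensors import safe_open

BASE = "sd_xl_base_1.0.safetensors"      # assumption: base checkpoint on disk
FINETUNE = "some_finetune.safetensors"   # assumption: a finetune of that base

with safe_open(BASE, framework="pt") as base_f, safe_open(FINETUNE, framework="pt") as ft_f:
    # In the single-file SDXL layout, CLIP weights live under "conditioner.embedders.*"
    clip_keys = [k for k in base_f.keys() if k.startswith("conditioner.embedders.")]
    ft_keys = set(ft_f.keys())
    drifted = 0
    for k in clip_keys:
        if k in ft_keys:
            a = base_f.get_tensor(k).float()
            b = ft_f.get_tensor(k).float()
            if a.shape == b.shape and not torch.allclose(a, b, atol=1e-6):
                drifted += 1
    print(f"{drifted}/{len(clip_keys)} CLIP tensors differ from the base model")
```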

Why Your Embeddings Are Actually Model-Specific

This is the moment where the penny dropped for me. Textual inversions (embeddings) aren't independent entities. They are model-specific shortcuts within the latent space, designed to produce specific visual concepts or effects. The vectors created when making an embedding are a shortcut for that particular model, tailored to its latent space and its interpretation of the text being embedded. When you train an embedding, you are teaching the model how to interpret a text concept into an image.

When you make an embedding, the model learns how to map your text into its own "space." It's like crafting a key for a specific lock. If the lock (the model and its CLIP encoders) is changed, the original key (the embedding) no longer works. If you try to use an embedding trained on one model with another model (even a different finetune of the same base), you'll run into problems: the embedding simply isn't calibrated to the new model's latent space. This mismatch is what leads to inconsistent or even unpredictable results.
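
In practice, the key-and-lock pairing looks something like this. A rough sketch using diffusers, where the model file, embedding file, and token name are all placeholders: the same embedding file is loaded once per text encoder of the model it was made for.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from safetensors.torch import load_file

# assumption: these files exist locally and the embedding was made for this finetune
pipe = StableDiffusionXLPipeline.from_single_file(
    "my_finetune.safetensors", torch_dtype=torch.float16
).to("cuda")

state = load_file("my_style.safetensors")
pipe.load_textual_inversion(state["clip_l"], token="my-style",
                            text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
pipe.load_textual_inversion(state["clip_g"], token="my-style",
                            text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)

image = pipe("a portrait, my-style").images[0]
```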

My Mistake: The Universal Embedding Myth

For the longest time, I thought I could train an embedding on any model and it would work equally well across all other SDXL models. My misunderstanding was believing that the "concept" is what carries over between models. I had been working on the assumption that embeddings exist in a vacuum and that the output depends only on the vectors. It doesn't! The output depends on the vectors and on how that model's space is arranged! It turns out I was creating my embeddings incorrectly, without realizing that each model is a new environment with different "rules."

The Solution: Model-Specific Embedding Creation

The solution, as it turns out, is quite simple: create your embeddings on the exact SDXL model you intend to use them with. It's a bit more work, yes, but the results are absolutely worth it. For example, if I'm making artwork with the Illustrious base model, I'll make an embedding for that specific model. If I'm working with an ANZCH finetune for ponies, I'll make a different embedding specifically for that model, and another one for the ANZCH illustrative model! Similarly, if I'm working with PonyXL, I still make embeddings specific to that model, even if its CLIP changes are less pronounced. Conversely, being model-specific is especially important with models like Illustrious XL or NoobAI XL because of their significant CLIP alterations.

While some people "train" embeddings by feeding the model images, the tool I use generates embeddings directly from a string of 72-75 tokens, converting those tokens into vectors that act as a kind of textual inversion. The tool creates embeddings based on the active model, so I make sure to load each target model before creating any embeddings with it. Because the vectors are created in the latent space of the active model, the embeddings are model-specific by default.

This matters even more when you're using a different base such as Initium2. Embeddings made for the standard SDXL base model will likely not work well with Initium2 or its finetunes, because their text encoders and latent spaces differ. Likewise, if you build a model from an Initium2 base and use standard SDXL embeddings, those will also be less effective due to the differences in model structure.
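
To give a sense of the general idea (this is not the extension's actual code, and the file names and prompt text are placeholders): a text-to-embedding tool can look up the prompt's token vectors in the currently loaded model's two embedding tables and save them as an embedding file, which makes the result model-specific by construction.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from safetensors.torch import save_file

# assumption: the *target* model is loaded first, just like loading it in the UI
pipe = StableDiffusionXLPipeline.from_single_file("my_finetune.safetensors")

text = "watercolor, soft lighting, pastel palette"  # the text to turn into a shortcut

def text_to_vectors(tokenizer, text_encoder, text):
    # Look up each token's vector in this model's own embedding table.
    ids = tokenizer(text, add_special_tokens=False, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        return text_encoder.get_input_embeddings()(ids).clone()  # (n_tokens, hidden_dim)

embedding = {
    "clip_l": text_to_vectors(pipe.tokenizer, pipe.text_encoder, text),
    "clip_g": text_to_vectors(pipe.tokenizer_2, pipe.text_encoder_2, text),
}
save_file(embedding, "watercolor_shortcut.safetensors")
```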

This is how we make shortcuts tailored to that model's latent space, rather than relying on a universal concept. The same goes for model families like NoobAI, whose variants (e.g. the Epsilon version) use different training setups, and even different VAEs, and so end up with differently conditioned CLIP encoders. It's important to be aware of these differences when you create an embedding. Knowing these specific details about each model is not about being pedantic; it's about ensuring that when I release an embedding for a model, I have actually built it as a shortcut for that specific model.

LoRAs and Baked Models

While standard LoRAs primarily affect the UNet, LoRA-CLIPs directly modify the CLIP encoders. If a LoRA-CLIP is "baked into" a model, those CLIP changes become a permanent part of that model. Baking a LoRA into a model can be viewed as a "mini-finetune": the LoRA's weight adjustments become a permanent part of the model. Each time you bake a new LoRA (or a combination of LoRAs) into a model, you're essentially creating a new version of the model with its own unique characteristics. The same is true when fusing a LoRA into a model. While a LoRA trained and fused on PonyXL may not radically change that model's CLIP understanding, fusing a LoRA trained on PonyXL onto other models like Illustrious may cause more radical changes. This reinforces the need for model-specific embeddings.
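
Conceptually, "baking" is just adding the LoRA's low-rank update into each affected weight. A minimal sketch of that idea (the shapes and scale below are toy values, not any specific tool's format):

```python
import torch

def merge_lora_into_weight(weight: torch.Tensor,
                           lora_down: torch.Tensor,
                           lora_up: torch.Tensor,
                           scale: float = 1.0) -> torch.Tensor:
    """Return the permanently 'baked' weight: W' = W + scale * (up @ down)."""
    return weight + scale * (lora_up @ lora_down)

# Toy example (hidden size 1280, LoRA rank 8) -- shapes and scale are assumptions.
w = torch.randn(1280, 1280)
down = torch.randn(8, 1280) * 0.01
up = torch.randn(1280, 8) * 0.01
w_baked = merge_lora_into_weight(w, down, up, scale=0.8)
print(torch.norm(w_baked - w))  # how far this "mini-finetune" moved the layer
```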

Epsilon vs. V-Prediction

While standard models use epsilon prediction (eps-pred) for denoising, which means predicting the noise itself, some use v-prediction (v-pred), which predicts a "velocity" term that blends the noise and the original image. The difference between these is not in the CLIP encoders but in the denoising (UNet) side of the model. This means an embedding may still work, but it will be less effective and may not create the results you expect, because the model's pathway for denoising the image is mismatched with the guidance the embedding provides. It's important to be aware of this when working with models, and you should still make an embedding for the specific model you're working on for best results. Keep in mind that NoobAI publishes both Epsilon-prediction and V-prediction versions of its models, so check which one you actually have.
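
In diffusers terms, the prediction type is a scheduler setting rather than anything in CLIP. A minimal sketch, using the v-pred checkpoint name mentioned earlier as a stand-in; the scheduler choice here is just an example:

```python
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

# assumption: this v-pred checkpoint is on disk (file name taken from the article)
pipe = StableDiffusionXLPipeline.from_single_file(
    "NoobAIXL_vPred10Version_1095888.safetensors", torch_dtype=torch.float16
)
print(pipe.scheduler.config.prediction_type)  # "epsilon" unless the config says otherwise

# Swap in a scheduler configured for v-prediction so the denoising math matches.
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, prediction_type="v_prediction"
)
```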

The Case of the Stubborn Model

This model-specific challenge was highlighted recently when a particular model failed to load because of a structural incompatibility with the tool I was using, combined with choices the authors made when training the model. Troubleshooting it reminded me how important it is to choose a model that is compatible with your tool, and to understand the inner workings of that model. In this case, the tool could not find the model's CLIP text encoder or load it properly, due to the training method used for that model. Models that use the standard Epsilon-prediction method are compatible with a wide range of tools, but models that modify this, or that use a completely different architecture, may not be, because of how their latent spaces are structured.
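
One quick sanity check before pointing a tool at a checkpoint is to list its tensor keys and confirm both CLIP text encoders are where the tool expects them. A rough sketch (the key prefixes are the usual single-file SDXL layout and are an assumption for anything else):

```python
from safetensors import safe_open

# assumption: single-file SDXL key layout ("conditioner.embedders.0/1.*")
with safe_open("NoobAIXL_vPred10Version_1095888.safetensors", framework="pt") as f:
    keys = list(f.keys())

clip_l_keys = [k for k in keys if k.startswith("conditioner.embedders.0.")]
clip_g_keys = [k for k in keys if k.startswith("conditioner.embedders.1.")]
print(f"ViT-L tensors: {len(clip_l_keys)}, larger-CLIP tensors: {len(clip_g_keys)}")
if not clip_l_keys or not clip_g_keys:
    print("Expected CLIP keys not found -- the checkpoint layout differs from what the tool expects.")
```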

Practical Tips for Model-Specific Embeddings

Double-Check Training Models:

When downloading pre-trained embeddings, verify that they were trained on the exact SDXL model you're planning to use.

Experiment and Take Notes:

Test your embeddings on different models, and keep detailed notes of which embeddings work best with each model. This way you can build an understanding of how each model alters the latent space.

Don't be afraid to have multiple embeddings of the same concept:

If your workflow involves a lot of switching between models, why not make a version of the same embedding for each model for ease of use?

Train, Train, Train:

It can take several attempts at embedding training on the correct model to get the results you want. When merging embeddings, be aware that the result will contain vectors from different models, so you may prefer to create embeddings for each individual model instead. This means that if you are working with Initium2 finetunes, or building your own model from Initium2, you will need to train your embeddings on the specific model you are working with for best results. Also avoid complicated math expressions in the embedding text, as these can lead to a "Tensors Don't Match" error on some models; a quick shape check like the sketch below can help you spot mismatches early.
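
Here's a minimal sketch of that kind of check (the file name is a placeholder): it simply verifies that an embedding's vector widths match the standard SDXL text-encoder sizes before you try to use or merge it.

```python
from safetensors.torch import load_file

emb = load_file("merged_embedding.safetensors")  # hypothetical merged embedding

expected = {"clip_l": 768, "clip_g": 1280}  # standard SDXL text-encoder widths
for key, width in expected.items():
    if key not in emb:
        print(f"missing {key} vectors")
    elif emb[key].shape[-1] != width:
        print(f"{key}: width {emb[key].shape[-1]}, expected {width} -- likely to mismatch")
    else:
        print(f"{key}: {emb[key].shape[0]} vectors, width {width} -- looks fine")
```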

Conclusion: Consistency Through Model-Awareness

The most important takeaway is that textual inversions (embeddings) in SDXL aren't designed to work universally. They must be created or trained specifically for the model you're targeting. By understanding how fine-tuning changes the CLIP encoders and latent space of SDXL models, you can improve the quality and consistency of your image generation. Now that I'm learning the intricate details of different SDXL models, I realize I need to completely redevelop my extensive embedding library to be model-specific. This isn't about chasing popularity or trends; it's about my commitment to open-source principles and ensuring that the embeddings I make are compatible across a wide variety of platforms, including ComfyUI, Automatic1111, Re:Forge, Forge, and others. I believe users should not be constrained by specific tools or platforms, and should be empowered to use the tools and models that best suit their needs. By understanding these differences, my goal is to give users what they need to get the best results from their AI art journey, no matter what platform they use. So, let's get to work making model-specific embeddings!

If you're an Automatic1111 user, this is the tool I've been using for the past year or two: https://github.com/klimaleksus/stable-diffusion-webui-embedding-merge. SD 3 / SD 3.5 support is theoretically planned for the future. Do yourself a favor: explore, test, and find your own way through how these things work. No single way of figuring it out is perfect!

Contact Us:

Our Discord: https://discord.gg/HhBSvM9gBY

Earth & Dusk Media https://discord.gg/5t2kYxt7An

Backups: https://huggingface.co/EarthnDusk

Send a Pizza: https://ko-fi.com/duskfallcrew/

WE ARE PROUDLY SUPPORTED BY: https://yodayo.com/ / https://moescape.ai/

JOIN OUR DA GROUP: https://www.deviantart.com/diffusionai

JOIN OUR SUBREDDIT: https://www.reddit.com/r/earthndusk/

