Can we bundle together lots of Textual Inversion models into one?

For example, one way I've been increasing the variety of faces in my art is to train a bunch of Textual Inversions on different faces, then use the Dynamic Prompts extension for Automatic1111 to pick a TI embedding at random, giving a random face. I'd love to offer this to others, but hosting tens or hundreds of Textual Inversions, then asking people to download them one by one and create a wildcard prompt file for them all (for Dynamic Prompts), is a bit tedious.

Not sure if others are doing something similar or there's maybe only a "niche" need for this...
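
The wildcard file mentioned above can at least be generated automatically. Here is a minimal sketch, assuming the embeddings sit in one folder and that each embedding is invoked in the prompt by its file name without the extension (as Automatic1111 does). The function name and the paths in the usage note are hypothetical, not part of any extension.

```python
from pathlib import Path

# File extensions commonly used for Textual Inversion embeddings.
EMBEDDING_SUFFIXES = {".pt", ".bin", ".safetensors"}


def write_wildcard_file(embeddings_dir, wildcard_path):
    """Write one embedding name per line, the layout Dynamic Prompts expects for wildcards."""
    names = sorted(
        p.stem
        for p in Path(embeddings_dir).iterdir()
        if p.is_file() and p.suffix.lower() in EMBEDDING_SUFFIXES
    )
    Path(wildcard_path).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names
```

Calling something like `write_wildcard_file("embeddings/faces", "wildcards/faces.txt")` would then let a prompt such as `portrait of __faces__` pull in a random face, assuming the file ends up in the extension's wildcards folder.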

4 Answers

Hey! You can use the Auto1111 extension embedding-inspector to merge Textual Inversion .pt files, but I'm not sure whether the distinct faces would survive. It might just mash them all together. I'd need to test!
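
To illustrate why the faces might not survive a merge, here is a rough sketch that assumes merging means taking a weighted average of the embedding vectors. That is an assumption for illustration only; I haven't checked how embedding-inspector actually combines files. The vectors are toy values, and `merge_embeddings` is a hypothetical helper.

```python
def merge_embeddings(vectors, weights=None):
    """Weighted average of equal-length embedding vectors (plain lists of floats)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]


# Toy 4-dimensional stand-ins for two face embeddings
# (real embeddings have hundreds of dimensions per vector).
face_a = [1.0, 0.0, 2.0, -1.0]
face_b = [0.0, 1.0, 0.0, 1.0]
merged = merge_embeddings([face_a, face_b])
print(merged)  # a single vector halfway between the two inputs
```

Under that assumption the result is one in-between embedding, not a bundle that still contains both faces, which is why testing would be needed.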

I'm also interested in this concept. The more I read about Textual Inversion, the more it sounds like the method works best with a narrow focus. So using it to make an embedding for one yoga pose is plausible, but it doesn't seem like you could make an embedding with the keyword "yoga_pose" and get one of dozens of random poses. Everyone is just doing faces, faces, faces, and it's not clear to non-AI experts what TI is actually capable of doing.

Why not just train different yoga poses (pose1, pose2, pose3, etc.) and have the AI randomly decide which one to use?


girl in yoga (pose1 | pose2 | pose3 | pose4), doing stretches, light blue top, black leggings
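
For what it's worth, Dynamic Prompts writes this kind of choice with curly braces, `{pose1|pose2|pose3|pose4}`, rather than parentheses. The sketch below mimics that random pick in plain Python so you can see what ends up in the final prompt. It is not the extension's code, and `expand_variants` is a hypothetical helper that only handles simple, non-nested groups.

```python
import random
import re

# Matches a simple {a|b|c} group with no nested braces.
VARIANT = re.compile(r"\{([^{}]*)\}")


def expand_variants(prompt, rng=random):
    """Replace each {a|b|c} group with one randomly chosen option."""
    def pick(match):
        options = [opt.strip() for opt in match.group(1).split("|")]
        return rng.choice(options)

    return VARIANT.sub(pick, prompt)


prompt = "girl in yoga {pose1|pose2|pose3|pose4}, doing stretches, light blue top, black leggings"
print(expand_variants(prompt))
```

Each run prints the prompt with exactly one of the four pose embeddings filled in, so only one embedding is ever loaded per image.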

Gonna pop in here to say that, because of the way Stable Diffusion processes embeddings, this may not work as well as you hope. Think of TIs as neat little packed-up prompts with their own special weights attached. Every embed you feed into a prompt stacks in at runtime and increases the size of your final input prompt, and this is where the cracks start to appear. In case you haven't noticed already, prompts are very front-loaded, in that the first tokens have far more effect on the output than, say, token #65 or token #87. That's why, when you're working with style or "setup" type embeds, it's a good idea to put them first so they can lay down a nice thick layer of style before your actual scene setup gets processed. It's all a balancing game, but the more embeds you throw at it, the longer your input prompt gets and the less success you'll have at creating a worthwhile output. While merging and combining TIs can work, it usually has limited success past the first couple.
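
To put rough numbers on that, here is a small sketch of how stacked embeds push the rest of the prompt back. It assumes each embedding vector takes up one token slot and that Automatic1111 processes prompts in chunks of 75 tokens. It ignores separators such as commas, and the example counts are made up.

```python
CHUNK_SIZE = 75  # usable tokens per chunk (CLIP's 77 minus the start and end tokens)


def prompt_layout(embedding_vectors, scene_token_count):
    """Estimate where the scene text starts and how many chunks the prompt spans.

    embedding_vectors: vectors per embedding, in prompt order (embeds placed first)
    scene_token_count: tokens used by the rest of the prompt
    """
    embed_tokens = sum(embedding_vectors)
    total = embed_tokens + scene_token_count
    chunks = max(1, -(-total // CHUNK_SIZE))  # ceiling division
    return {
        "scene_starts_at": embed_tokens + 1,
        "total_tokens": total,
        "chunks": chunks,
    }


# Hypothetical numbers: five embeds of 8 vectors each, plus a 40-token scene description.
print(prompt_layout([8] * 5, 40))
```

With those made-up numbers the scene description doesn't begin until token 41 and the prompt spills into a second chunk, which is the kind of crowding I mean.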