Embedding, Checkpoint, or Hypernetwork?
What is a good way to determine the kind of model you need? At what point is Dreambooth excessive or Textual Inversion insufficient?
Also, is a Hypernetwork a good middle ground? Is that how that works?
6 Answers
Hi kinomi, you're likely to get multiple opinions on this, but from my perspective here's how I decide when making them:
Textual Inversion embeddings are great for adding concepts to models, so if you have a model that you like and want to add something specific to it, this is the best solution.
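To make the "add a concept without changing the model" idea concrete, here is a toy sketch in plain Python. It is not how any real trainer is written: the 2x2 "model", the target and every name are hypothetical, and real Textual Inversion optimizes a token vector against the diffusion loss rather than this simple squared error. The one thing it does show faithfully is that only the new vector is updated while the model weights stay frozen.

```python
import copy

def forward(weights, vec):
    """Apply the frozen toy 'model' (a small matrix) to an embedding vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def train_embedding(weights, target, steps=500, lr=0.1):
    """Gradient descent on the embedding only; `weights` is never modified."""
    vec = [0.0] * len(weights[0])
    for _ in range(steps):
        out = forward(weights, vec)
        err = [o - t for o, t in zip(out, target)]
        # Gradient of 0.5 * ||W v - target||^2 with respect to v is W^T err.
        grad = [
            sum(weights[i][j] * err[i] for i in range(len(weights)))
            for j in range(len(vec))
        ]
        vec = [v - lr * g for v, g in zip(vec, grad)]
    return vec

# Hypothetical frozen model and the output we want the new "token" to produce.
frozen_weights = [[1.0, 0.5], [0.0, 1.0]]
snapshot = copy.deepcopy(frozen_weights)
target = [2.0, 1.0]

new_token_vec = train_embedding(frozen_weights, target)
print(new_token_vec)               # the learned embedding
print(frozen_weights == snapshot)  # the model itself is unchanged
```

This is why an embedding file is tiny compared with a checkpoint: it stores only the learned vector, not the model.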
Hypernetworks are great for adding something that affects the entire image. I think they are best used for scenes or specific settings that shape the image background and content. They are also usable with multiple models, which makes them very flexible, but they usually have an effect on everything in the image.
Dreambooth is used to create full models (checkpoints). If you want to get away from the vanilla SD1.5/2.1 base model for a collection of different concepts or themes, I'd recommend looking for some good custom models to add to your SD art creation toolkit. Dreambooth models are trained from a base model plus an additional collection of images, referred to as the dataset. Most people create models around a specifically themed dataset so they can generate images in that theme. Some people also mix (merge) models to combine concepts or themes from multiple datasets. These mixes are great, but be aware that the information in a mixed model gets slightly diluted with every model that is mixed into it.
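The dilution from mixing can be illustrated with a small sketch. It assumes a plain weighted-sum merge, which is one common way merges are done; the three "models" are made-up one-layer toy dictionaries, and all names are hypothetical.

```python
def weighted_merge(model_a, model_b, alpha):
    """Weighted-sum merge: (1 - alpha) * A + alpha * B for every shared key."""
    return {
        key: [(1 - alpha) * a + alpha * b
              for a, b in zip(model_a[key], model_b[key])]
        for key in model_a
    }

# Toy "models": each one carries its own theme in a separate slot.
style_a = {"layer.weight": [1.0, 0.0, 0.0]}
style_b = {"layer.weight": [0.0, 1.0, 0.0]}
style_c = {"layer.weight": [0.0, 0.0, 1.0]}

merged_ab = weighted_merge(style_a, style_b, 0.5)
merged_abc = weighted_merge(merged_ab, style_c, 0.5)

print(merged_ab["layer.weight"])   # style A is at half strength
print(merged_abc["layer.weight"])  # style A is at a quarter strength
```

Each additional 50/50 merge halves the share of whatever was already in the mix, which is the "slightly diluted" effect described above.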
There seems to be a lot of confusion around this area, and the advice differs greatly depending on what you want to achieve. Hypernetworks, Textual Inversion embeddings and checkpoint models all offer ways of obtaining the images you wish to create, and the good news is that they should all be "inter-operable", meaning they can all be used together.
While some people swear by one method or another, the truth is that they all act as different tools in your toolkit. You can create a checkpoint, a Textual Inversion embedding and a Hypernetwork all from the same dataset, using a particular Stable Diffusion model as a base. This will "reinforce" a particular theme or style greatly. A tightly focused model like that lets you dial in a theme very strongly, with the drawback that you might be restricted to that particular style or theme.
Others might be proponents of the "swiss army knife" model, which promises the ability to create many varied images across a wide variety of themes and styles. These often require intense curation effort and extra training expense because of the additional time taken to create them. The disadvantage is the same as with the vanilla SD models: the AI's attention can "wander" and give you far more variability than you might like.
This is the reason things like Dreambooth have become so popular. Those models are created by training on styles and concepts, like particular people or objects. Textual Inversion also became popular because it draws out concepts already in the model: after analyzing the images we train it on, it creates vectors that point to things the model already knows. Hypernetworks are yet another useful way to train in concepts, using not only the text but also the images.
Most people are searching for a reliable way to get consistent characters, while Stable Diffusion excels at giving you a new and unique image every time. Many features on the inference side offer ways to control that and stop it from changing things so much. Seed variance, cross attention, prompt editing and denoise scaling all help to control this randomisation, but you will still see clothes change more than you might like if you are creating an original character that you want to repeat in many different settings or situations.
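On the seed point, here is a minimal sketch of why fixing the seed helps. It assumes the usual setup in which generation starts from seeded random noise. It uses Python's standard `random` module as a stand-in for the real noise generator, and `make_noise` is a hypothetical helper, not a function from any SD tool.

```python
import random

def make_noise(seed, size):
    """Draw the starting 'latent noise' from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

same_a = make_noise(42, 4)
same_b = make_noise(42, 4)
other = make_noise(43, 4)

print(same_a == same_b)  # True: same seed, same starting noise
print(same_a == other)   # False: a new seed gives a new starting point
```

The same seed with the same prompt and settings starts from the same noise, so results stay close. Changing the prompt while keeping the seed still changes the image, which is why clothes and details can drift.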
There is no real right or wrong answer, as all approaches are valid when people have varied use cases. Personally, I believe that having as many tools as possible at your disposal gives you a much wider ability to create the images you want. People who focus on one tool risk ignoring a feature that could give them what they want. All the available methods have value and are essential for specific use cases. The datasets people use matter most, as junk in = junk out. As new Stable Diffusion versions are released, authors will train their models again and again, leading to higher quality and higher resolution outputs.
Aesthetic Gradients were also a great method and still are in SD1.5, but currently it looks like they might get phased out. Nobody really knows why, and we can only speculate. Textual Inversion will only go so far and can be somewhat limiting, and Hypernetworks have their own limitations too. As Elldreth mentions above, you can do model merging, but even that has its own limitations. All tools have strengths and weaknesses, so learning them will give you the best chances.
TLDR: try out all the methods and see what works for you, as you might find that using all of the above is what you need to get your desired style or concept out in the end. (You can use all three at once, if you like.)
Dreambooth + LoRA takes a lot of time to cover all the elements. You can get a similar result with Textual Inversion or a Hypernetwork. For fast training and a nice result, go with a Hypernetwork.