"Fear the curses that hide in your training" - Disclaimer: I won't teach you to make images like this one, don't worry.
This aims to cover a lot of misleading information in the community and to provide solid information for whoever seeks to train a better LoRA. I will aim to cover the different hyperparameters and how to fix the different problems you might encounter while training your concept/characters/style.
This guide will have three different ways to be read. The subsections will be separated into [ESSENTIAL], [BEGINNER] and [ADVANCED].
The [ESSENTIAL] sections are for those of you who only want to train your LoRA and be done with it. The [BEGINNER] sections will provide insights on concepts to know when preparing a LoRA, and the [ADVANCED] sections will explain the inner workings of model training and aim to provide a deeper understanding of Stable Diffusion.
2. Understanding Stable Diffusion models
[ESSENTIAL] Understanding how Stable Diffusion understands "concepts"
A core idea to grasp is that Stable Diffusion already has knowledge of a vast array of concepts due to its extensive training on diverse datasets. When training a LoRA, it's important to take advantage of this and differentiate between "New Concepts (NC)" and "Modified Concepts (MC)."
New Concepts (NC)
These are concepts or elements that are not present or are inadequately represented in the original training of Stable Diffusion. They could be unique subjects, styles, or items the model hasn't encountered before. Training with NCs involves introducing entirely new information to the model. The goal is to expand the model's "understanding" to include these novel elements. You usually add "Activation tags" (see the Dataset & Captioning section) to represent them in your dataset.
Modified Concepts (MC)
MCs refer to concepts the model already recognizes but might not represent accurately or in the desired manner. These could be variations of existing subjects, styles, or interpretations. Training MCs involves adjusting or refining the model's existing knowledge. The aim here is not to introduce new knowledge but to tweak and refine the model's existing understanding.
Training a LoRA model involves understanding Stable Diffusion's base knowledge (i.e., what the model already knows well) and what it lacks or misinterprets. Using this knowledge, you will need to curate your training dataset to address these gaps or inaccuracies, whether they fall under NC or MC. Then, you can use activation tags strategically to introduce the new concepts. By approaching LoRA training with this understanding of NC and MC, you can more effectively guide Stable Diffusion to align with your specific vision.
[BEGINNER] LoRA types
As of the writing of this tutorial, there are quite a few types of LoRA: Standard (lierla/LoRA), LoCon, LoHa, DyLoRA, LoKr...
My advice for you: the non-standard types are not worth it. They will take longer to train, for more mediocre results in most cases. For this reason, I will not write about the other types.
Stay with the standard LoRA type.
If you really wish to make a LoHa or LoCon, remember to keep the dimension of your LoRA under 32.
[ADVANCED] Understanding Stable Diffusion models
It will give you a great advantage to understand how Stable Diffusion works for making great LoRAs. Let's learn a bit about the different concepts. In essence, you only need to understand a few concepts that are part of the model:
Latent Space & VAE (Variational Autoencoder)
Imagine you have an image, say 512x512 pixels. To represent this image, you need to consider 3 color channels (RGB) for each of the 512x512 pixels. That's a total of 3 x 512 x 512 = 786,432 individual values. Handling such a massive amount of data for every image would be highly inefficient and computationally expensive. To tackle this, data scientists developed methods to reduce the data size of these images, and one effective solution is the latent space.
Imagine a vast library where every possible image you can think of is stored in a compact form. Latent space is like this library for Stable Diffusion. It's a mathematical space where complex data (like images) are transformed into a simpler, compressed form. This allows the model to efficiently work with and manipulate images.
The tool used to make this compression possible is the Variational Autoencoder (VAE). The VAE learns to compress images down into this latent space and reconstruct them back to their original form.
In the case of Stable Diffusion, for an image of size 512x512, the VAE compresses it to 64x64x4 = 16,384 values, making the process far more efficient.
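To make the arithmetic concrete, here is a small sketch. It assumes an RGB input (3 channels) and the 8x reduction per side with 4 latent channels implied by the 64x64x4 figure above; the function names are only illustrative.

```python
def pixel_values(width, height, channels=3):
    """Number of raw values in an image (RGB assumed by default)."""
    return width * height * channels


def latent_values(width, height, downscale=8, latent_channels=4):
    """Number of values in the latent: each side shrinks by `downscale`."""
    return (width // downscale) * (height // downscale) * latent_channels


pixels = pixel_values(512, 512)
latents = latent_values(512, 512)
print(f"pixel space: {pixels} values")
print(f"latent space: {latents} values")
print(f"reduction factor: {pixels / latents:.0f}x")
```

Passing other sizes (for example 1024x1024) shows that the reduction factor stays the same under these assumptions, since only the spatial dimensions change.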
Text Encoder, Tokenizer, and Embeddings
In Stable Diffusion, images are generated based on text descriptions (prompts). This process involves three critical components: the Tokenizer, Text Encoder, and Embeddings.
Tokenizer: Imagine you're given a complex sentence to understand. The first step is to break it into smaller, more manageable pieces. This is precisely what the Tokenizer does. It takes a prompt and splits it into smaller units called tokens. These tokens can be words, parts of words, or even punctuation. This segmentation aims to simplify the text, making it easier for the model to process.
Example: "Poiuytrezay" is separated into "po", "iu", "y", "tre", "z", and "ay"
Text Encoder: Once the text is tokenized, the next step is translating these tokens into a language the model can understand. That's the role of the Text Encoder. It works like a translator, converting tokens into numerical vectors. These vectors, known as embeddings, are a form of numerical representation that captures the essence of each token, including its meaning and context within the sentence.
Embeddings: Embeddings are where the magic happens. They are vectors that represent the tokenized text in a numerical form. Think of them as coordinates in a vast space, where each point (or vector) not only signifies a specific word or phrase but also its relationship to other words and phrases. These embeddings are crucial because that's how the model 'understands' the text input in a mathematical sense. I highly recommend reading this article (2 minutes read) to understand embeddings as a whole better.
Example: "Poiuytrezay" is encoded by the Text Encoder into "po #628", "iu #14292", "y #88", "tre #975", "z #89", "ay</w> #551", where the numbers represent the embedding ids within Stable Diffusion.
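The splitting step can be illustrated with a toy tokenizer. This is a simplified greedy longest-match sketch, not the actual BPE algorithm used by the real tokenizer, and the mini-vocabulary below is hypothetical: it only contains the pieces and ids quoted in the example above.

```python
# Hypothetical mini-vocabulary built from the pieces and ids quoted in the example.
TOY_VOCAB = {"po": 628, "iu": 14292, "y": 88, "tre": 975, "z": 89, "ay</w>": 551}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split a word into (piece, id) pairs, always taking the longest known piece."""
    text = word.lower() + "</w>"  # "</w>" marks the end of the word
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece in vocab:
                tokens.append((piece, vocab[piece]))
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[start:]!r}")
    return tokens


print(toy_tokenize("Poiuytrezay"))
```

Each id is then used to look up an embedding vector, which is what the model actually receives.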
The UNET is a critical component, acting as the primary engine for image generation. Without going into details, all you need to know is that it mixes the embeddings and latent-encoded images into a mathematical soup that outputs a "noise" prediction. This noise prediction can then be removed from the image to "denoise" it. When training a LoRA, this will be the main component you will train, because it is the one making the predictions.
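As a rough illustration of "predicting noise and removing it", here is an idealized single-step sketch on a tiny list of numbers. The blending formula is the standard diffusion forward process; the `alpha_bar` value, the toy "latent" and the perfect noise prediction are assumptions standing in for the real UNET, which only approximates the noise and denoises over many steps.

```python
import math
import random


def add_noise(latent, noise, alpha_bar):
    """Forward process: blend the clean latent with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(latent, noise)]


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise using a noise prediction (the UNET's job)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * n) / a for x, n in zip(noisy, predicted_noise)]


def mse(pred, target):
    """Training loss: mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
latent = [0.5, -1.0, 0.25, 2.0]             # toy stand-in for a latent image
noise = [rng.gauss(0.0, 1.0) for _ in latent]
noisy = add_noise(latent, noise, alpha_bar=0.7)

# With a perfect prediction the loss is zero and the latent is recovered.
recovered = remove_noise(noisy, noise, alpha_bar=0.7)
print(mse(noise, noise), recovered)
```

Training pushes the prediction toward the true noise by minimizing this loss; the LoRA weights are what get updated to make that happen.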
Here is the final process to train a LoRA:
3. Preparation for training
[ESSENTIAL] Dataset & Captioning
There are tons of guides that already cover how to get a good dataset. For this reason, I won't cover this part. This one is a great start, however, it gathers images from screencaps of anime episodes. I'd recommend going with Grabber instead.
"A bad apple spoils the bunch." Remember this when tagging your dataset. If you tag "blue eyes" on a character with red eyes, it will try to mix the two. This will cause the tag "red eyes" to output a mix of red and purple eyes (or even sometimes blue eyes). False Positives (concept tagged but not present) are bad for Stable Diffusion, whereas False Negatives (concept not tagged while present) are not as bad. Please prioritize false negatives over false positives when tagging.
An important concept I will use in this tutorial is the "ACTIVATION TAG" concept. Activation tags act as triggers for specific features or elements in the generated images. When an activation tag is included, it prompts the model to produce or emphasize the corresponding visual elements associated with that tag.
/!\ The activation tag should be a unique tag that represents your concept/character/style and should be present as the FIRST element in the caption files of your dataset. /!\
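A minimal sketch of enforcing this rule on a caption, assuming captions are comma-separated tag lists (as in booru-style caption files). The function name and the sample caption are hypothetical.

```python
def put_activation_tag_first(caption, activation_tag):
    """Return the caption with the activation tag as its first element."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    # Drop any existing occurrence so the tag is not duplicated.
    tags = [t for t in tags if t != activation_tag]
    return ", ".join([activation_tag] + tags)


print(put_activation_tag_first("1girl, blue hair, jinxdef, twin braids", "jinxdef"))
```

The same function could be applied to every caption file in a dataset folder before training.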
[ESSENTIAL] Training Script / UI
Every training script/UI exposes the same parameters, with small differences in the features offered. In this tutorial, I will be commenting on the main parameters of kohya-ss by bmaltais, but know that the knowledge I will be discussing applies to every other trainer.
I will not go into the details to install the scripts. If it is your first time, go to https://github.com/bmaltais/kohya_ss and follow the installation instructions.
[BEGINNER] Understanding LoRA, Dreambooth, and TI (text inversion)
In deep learning, adapting a model to a specific task typically involves a process known as "fine-tuning." This approach leverages a model that has already learned general features from a large dataset and adapts it to perform well on a more specialized task or dataset. However, in the context of models like Stable Diffusion, traditional fine-tuning can be resource-intensive and not feasible for most users.
For us, especially those with limited computational resources, alternative techniques are employed to specialize the model, each with its unique approach:
DreamBooth: not too different from regular fine-tuning, but usually for a single concept. The main difference from regular fine-tuning is prior preservation (see section below).
Textual Inversion: if you followed the previous [ADVANCED] section, textual inversion is a way to help the model learn new associations by creating a new embedding. This embedding allows the model to generate images that better align with unique or niche concepts.
LoRA: this is a technique that modifies only a small portion of the model's weights instead of the whole. It is less proficient than Dreambooth but uses fewer resources. Currently, LoRA is applied to Dreambooth, which means all regularization techniques used in Dreambooth also apply to a LoRA (see section below).
Pivotal Tuning: think of it as classical LoRA combined with TI. You get new embeddings and the training for them. This is a powerful way to train a LoRA because it removes the need to use "rare tokens" (see Dreambooth section) by preventing "concept bleeding". The downside is that you must train the new embeddings representing your concepts from zero (you can't use the model's prior knowledge, which means it's harder to get right).
Hypernetwork: obsolete, no point to even know about it
[ESSENTIAL] Source model for training
Always use a bf16/fp16 pruned model for training. The reason is that the model will be smaller, and thus use less VRAM.
As for your choice, you basically only have 4: NAI, SD1.5, SD2.1, SDXL. These are all base models. Do not use any other checkpoints, as it will make your model less versatile and some concepts might not be rendered as well as with the base model.
As a rule of thumb:
- For realistic use SD1.5, SD2.1, or SDXL (your choice)
- For anime/cartoon use NAI (animefull-final-pruned) or SDXL
During training, your samples will all look ugly. The magic will only happen when you use specific checkpoints in A1111.
Note: using a more downstream model might not be as bad if you use pivotal tuning.
[BEGINNER] Dreambooth in details, Regularization Images, and how it applies to LoRA
What is Dreambooth?
Simple Explanation: Dreambooth is a method that fine-tunes parts of an AI model, specifically the Unet and text encoder, using rare tokens and regularization images to maintain the model’s existing knowledge. The resulting Dreambooth model is a checkpoint.
Detailed Explanation: Text-to-image models, like Stable Diffusion, are initially trained on vast datasets to learn a wide range of concepts. This initial training forms the model’s ‘prior’ - a base knowledge about various objects and ideas (like different animals, vehicles, etc.). In this situation, introducing new concepts can be tricky. Traditional fine-tuning - which involves adjusting model parameters based on new data - might make the model forget some of its original training (known as ‘language drift’ or ‘catastrophic forgetting’). Dreambooth offers a solution to add new concepts while mitigating the loss of the model’s prior knowledge using class-specific prior preservation and rare tokens.
For instance, teaching the model about a new celebrity could inadvertently alter its understanding of other celebrities (if the new celebrity has a chipped tooth that is not tagged, it might alter all teeth to be chipped).
Regularization images / Class-specific prior preservation
Dreambooth uses a special method to keep the model’s original knowledge intact. It involves training the model with images that belong to the same "class" as the new concept but are already part of the model’s knowledge. Here, a 'class' refers to the broad category or group that an entity (like an object, concept, or data point) belongs to.
For example, if you’re teaching it about a specific type of dog (e.g. a golden retriever), you also include general images of dogs generated with the initial model. This helps the model remember its original training about dogs while learning about the new specific type.
Note: your regularization images should be generated using the same model you are training, with no negatives, using the same VAE and resolution as your model, with the DDIM (or DDPM) sampler, and ideally, the same seed.
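In terms of the training objective, prior preservation amounts to adding a second loss term computed on the regularization (class) images. The sketch below assumes the usual form "instance loss + weight x prior loss"; the function names and the default weight are illustrative, and the lists stand in for noise predictions and targets.

```python
def mse(pred, target):
    """Mean squared error between a noise prediction and the true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def prior_preservation_loss(instance_pred, instance_noise,
                            class_pred, class_noise,
                            prior_loss_weight=1.0):
    """Combine the loss on your images with the loss on regularization images."""
    instance_loss = mse(instance_pred, instance_noise)  # learn the new concept
    prior_loss = mse(class_pred, class_noise)           # keep the class intact
    return instance_loss + prior_loss_weight * prior_loss


print(prior_preservation_loss([1.0, 1.0], [0.0, 0.0],
                              [2.0, 2.0], [0.0, 0.0],
                              prior_loss_weight=0.5))
```

The prior term penalizes the model whenever its predictions on generic class images drift, which is how the regularization images protect existing knowledge.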
When introducing a new concept, Dreambooth suggests using uncommon or rare tokens - unique identifiers the model doesn’t strongly associate with anything it already knows. Because of their weak prior associations, it will make the language drift less impactful (since it will lose an association that probably made little sense). This prevents the new training from interfering with the model’s existing capabilities. Examples of such tokens might be: "olis", "bnha", or "hta". To represent your new concepts with a rare token, you might then use the rare token with the class (e.g. "a olis girl", where olis could be any new character you would like to train).
How does Dreambooth apply to LoRA?
LoRA needs to work in tandem with an existing fine-tuning method. Among the most commonly used methods are Dreambooth and Textual Inversion. However, since LoRA isn’t compatible with Textual Inversion, it is predominantly applied in conjunction with Dreambooth. The main distinction when using LoRA with Dreambooth lies in how it refines the model: LoRA reduces the overall number of parameters that need adjusting and focuses specifically on fine-tuning the Attention layers of the model. This means it does not alter the entire Unet and text encoder of the model.
Practically speaking, the principles guiding Dreambooth also apply to LoRA. Omitting rare tokens or regularization images in LoRA training could lead to 'language drift' – where the model starts losing its original training accuracy. While including these elements isn't mandatory, it’s important to be mindful of this potential issue during training.
A key challenge when working with LoRA is integrating multiple concepts effectively. Some LoRA models end up generating images that resemble a patchwork of different elements from their training dataset (aka. a "collage"). The best LoRAs will not only produce high-quality images but also maintain the broad capabilities of the original model. For instance, if you're adapting a model to a specific celebrity, the aim would be to enable variations of this celebrity (like hair colour changes) without losing the model's ability to recognize or portray other celebrities accurately.
4. Training parameters
Here is a base config for kohya-ss for you to tweak: https://gist.githubusercontent.com/Poiuytrezay1/14d146290a08afedc8ea07d7c79f2049/raw/06ffed6ed9ad75864f4e80d4560fd25d9c4815ac/base.json (Right click -> save as -> import in kohya-ss).
[ESSENTIAL] Training parameters to tweak
The training parameters will mostly be the same for all your runs. Only a few parameters will need actual tweaks:
Clip Skip: use 2 for NAI, otherwise use 1
Learning Rate (LR) - only if you don't use a dynamic optimizer like DAdaptation or Prodigy - the rate at which your model will be trained, see the section below
Network Dimension / Alpha: see the section below, TL;DR keep it between 4 and 32 (resp. 1 and 16)
Keep N tokens: this represents the number of Activation Tags in your dataset. It keeps the first N tags in place when shuffling captions (here, "token" is misleading: it refers to TAGS, not tokens). Set it to 1 if you have only 1 activation tag.
The easiest way to go is to grab an existing configuration and only tweak those parameters.
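To picture what "Keep N tokens" does together with caption shuffling, here is a minimal sketch. It is not the trainer's actual code: the function name is hypothetical and captions are assumed to be comma-separated tags.

```python
import random


def shuffle_caption(caption, keep_tokens=1, rng=None):
    """Shuffle the tags of a caption while leaving the first `keep_tokens` in place."""
    rng = rng or random
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(rest)  # only the tags after the activation tag(s) move
    return ", ".join(fixed + rest)


caption = "jinxdef, 1girl, blue hair, twin braids, smile"
print(shuffle_caption(caption, keep_tokens=1, rng=random.Random(0)))
```

Whatever the shuffle order, the activation tag stays first, which is why it must be the first element of every caption file.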
[ESSENTIAL] Mandatory training parameters
For a better guide explaining all parameters individually, refer to https://rentry.co/59xed3 (good write-up) or https://github.com/bmaltais/kohya_ss/wiki/LoRA-training-parameters. Because those two persons have already made good write-ups, I will only give you "recommended" values and let you experiment with "out-of-the-box" training parameters. All "optional" parameters will be left empty unless useful, and I will discuss some of them in the [ADVANCED] section:
Train batch size: Recommended values are between 1-4, depending on how much VRAM you have available. The ideal batch size should be a divisor of the number of images in each bucket. Avoid high values as batch normalization is not implemented.
Epoch: Between 10-100. As a general guideline, 200 epochs may replicate your images as carbon copies. Aim to slightly overfit your dataset and select earlier epochs for optimal results.
Save every N epoch: Set this to 1. It helps in selecting the best epoch later.
Mixed/Save precision: Use fp16 (or mixed precision bf16) for nearly identical quality to fp32, with the benefit of reduced VRAM usage.
Cache latents: ON, this is an optimization to speed up the training at the cost of higher VRAM usage. If VRAM is limited, consider turning it off.
Cache latents to disk: OFF
LR Scheduler: Use "cosine" as an all-rounder. In short, a scheduler is used for faster convergence, but can be catastrophic if done incorrectly.
Optimizer: will be discussed further in the [ADVANCED] section. There are two types: adaptive and non-adaptive. Adaptive optimizers consume more VRAM but simplify learning rate adjustments. If VRAM allows, "Prodigy" is recommended (with mandatory additional arguments; see the Prodigy section for details).
LR Warmup: Set this to 0%. If using Prodigy, consider 2-5% of steps. The exact impact of this setting is complex and not fully explored here.
Max resolution: Stick to your model's native resolution (e.g., 512,512 for NAI, 1024,1024 for SDXL) unless targeting higher details. In that case, increasing resolution may improve details at higher resolutions but potentially worsen lower resolutions.
Stop text encoder training: 0. This is a niche setting and typically not needed.
Enable buckets: ON (keep default params for min/max resolution and steps)
Gradient checkpointing: ON, huge VRAM improvement with zero impact on the quality, might make the training slower.
Shuffle caption: ON
Flip augmentation: OFF for most cases, can switch to ON if your concept is symmetrical and you need a bigger dataset
Min SNR gamma: 5, optimization to converge faster. This parameter would require a whole section to explain in detail.
Noise offset type: Multires (called Pyramidal sometimes), which is better for learning contrast. Same as Min SNR gamma, a fuller explanation would need an entire article.
Multires noise iterations: 6-10 are good values
Noise discount: I haven't done enough experiments on this. I usually use 0.2-0.4. If your images are very dark when sampling, lower this parameter.
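Batch size, epochs and repeats together determine how many optimizer steps a run will take. The sketch below uses the common approximation "images x repeats, divided by batch size, times epochs"; the folder sizes are made up, and with bucketing enabled the real count can be slightly higher because each bucket is batched separately.

```python
import math


def total_steps(folders, batch_size, epochs):
    """Estimate the number of training steps.

    folders: list of (image_count, repeats) pairs, one per dataset folder.
    """
    samples_per_epoch = sum(count * repeats for count, repeats in folders)
    steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset: 40 images repeated 5 times, 20 images repeated 2 times.
print(total_steps([(40, 5), (20, 2)], batch_size=4, epochs=10))
```

This makes it easy to see how raising the batch size or lowering the repeats changes the length of a run before starting it.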
[ESSENTIAL] Overfit and underfit
Overfitting occurs when a model learns the training data too well. Underfitting is the opposite, where the model fails to capture the concepts.
In Stable Diffusion, you will have signs of overfitting when your images become saturated, full of artefacts, or plain weird. An underfit LoRA is easier to catch because it won't be able to reproduce your concept consistently. To illustrate an overfit case, you can set the weights of your LoRA to a high value. With weights of 2, we see artefacts and oversaturation appear in this Jinx LoRA:
Saturated and full of artifacts, signs of overfitting.
[BEGINNER] How to avoid overfitting?
Techniques like regularization, including dropout and max norm (see below section), or using more diverse training data can help prevent overfitting. Other methods include training until the model overfits, then going back in the timeline.
Note: There are many methods to detect overfitting, one of which is discussed above. Another makes use of cross-validation and analysis of the loss graph. You can also try raising the weight of your LoRA a bit (>1.2) and see if it generates a saturated image.
[BEGINNER] Learning Rate
The learning rate is a key parameter in the training of neural networks. It determines the size of the steps the model takes during learning. Think of it as the pace at which the model learns from the data. Too high, and the model might overshoot the optimal solution; too low, and it could take too long to train or get stuck in a sub-optimal solution.
For a Stable Diffusion LoRA, the optimal LR typically falls between 1e-4 and 4e-4. For further reading, I recommend https://www.jeremyjordan.me/nn-learning-rate/, which explains a bit more about LR schedulers and learning rates.
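To show how a scheduler reshapes the learning rate over a run, here is a sketch of a cosine schedule with an optional linear warmup. It follows the textbook cosine-annealing formula; the exact curve used by a given trainer may differ, and the function name is hypothetical.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step` for a cosine schedule with optional linear warmup."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from a small value up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000, base_lr=1e-4))
```

The rate starts at the configured value, decays slowly at first, faster in the middle, and flattens out near zero at the end of training.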
[BEGINNER] Network Rank (or Dimension) & Alpha
These parameters are purely linked to LoRA. The higher the rank, the larger the file (from a few MB to almost a GB), and the closer you get to a classical "Dreambooth" training.
Let's try to understand it intuitively:
Imagine an orchestra. Each section (strings, brass, percussion, etc.) adds a different dimension to the music. A LoRA will try to adjust the sections of this orchestra to adapt the sound. This might be by modifying the instruments or adding new musicians. The "Network Rank" is akin to the granularity at which the sections will be changed.
Low Rank: This is like making subtle adjustments in each section (say by adding a small number of musicians). Each musician can subtly alter the sound of the section, but the overall composition remains largely recognizable. A low rank in LoRA means fewer elements are modifying the existing weights of the neural network, leading to a simpler, more controlled adaptation.
High Rank: Imagine adding many more musicians to each orchestra section. More musicians allow for a richer and more nuanced performance from each section, similar to a higher rank in LoRA, introducing more elements to modify the network's weights. This increases the complexity and flexibility but might kill the overall composition.
In this metaphor, "Network Alpha" can be thought of as how intensely each new musician plays their instrument.
In Stable Diffusion, a lower rank would mean more conservative adjustments, keeping the model closer to its original state. Low rank will always be preferred because you want to keep the original model's adaptability. As a rule of thumb, the more images you have, the lower the rank.
Network Rank: 4-32 (going above 32 is not necessary for most cases)
Network Alpha: 1 or half the rank (16 if your network rank is 32)
Warning: Lower ranks and alpha require a higher learning rate to achieve equivalent detail. Adaptive optimizers should take care of this for you.
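Outside the metaphor, rank and alpha can be read as two numbers: the rank sets how many extra parameters each adapted layer gets, and alpha/rank is the factor by which the LoRA update is scaled. The sketch below assumes the standard LoRA formulation (two matrices of shape rank x in and out x rank, scaled by alpha/rank); the 768x768 layer size is only an illustrative example.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA pair: (rank x in) + (out x rank)."""
    return rank * (in_features + out_features)


def lora_scale(alpha, rank):
    """Multiplier applied to the LoRA update before it is added to the weights."""
    return alpha / rank


full = 768 * 768  # parameters of the original layer (illustrative size)
for rank in (4, 8, 32, 128):
    added = lora_param_count(768, 768, rank)
    print(f"rank {rank}: {added} params ({added / full:.1%} of the layer)")

print(lora_scale(alpha=16, rank=32))  # alpha = half the rank
print(lora_scale(alpha=1, rank=32))   # alpha = 1 gives a much smaller update
```

A smaller scale means each training step changes the output less, which is why a low alpha usually calls for a higher learning rate.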
[BEGINNER] Using Prodigy optimizer
Prodigy offers an adaptive approach to learning rate. It is a direct upgrade over DAdaptation and is a clear winner over the rest of the optimizers, but at the cost of significant VRAM usage.
Set Learning Rates to 1: This is paramount. Prodigy will dynamically adjust the learning rate during training.
Extra Arguments: In the "Optimizer Extra Arguments" field, input the following settings for a good starting point:
"decouple=True" "weight_decay=0.01" "d_coef=0.8" "use_bias_correction=True" "safeguard_warmup=True" "betas=0.9,0.99"
Understanding Prodigy Parameters:
d_coef (range: 0.1 to 2, recommended: 0.8): this is the only parameter you should change. This parameter affects the rate of learning rate changes. Generally, keep it under 1. For smaller datasets, consider higher values. If your model overfits without learning anything, you should lower it.
weight_decay (recommended: 0.01): This represents a percentage of decay in learning your dataset. It adds a penalty that encourages the model to keep its weights small, which helps prevent overfitting. By doing this, weight decay helps simplify the model, making it less likely to fit noise in the training data and more likely to generalize well to new, unseen data. It's a common technique in many optimization algorithms.
Some tutorials might recommend values like 0.1 or even 0.5, but in my experience, this is inefficient. This means losing 10% or 50% of your training every step (which you might realize is unwise). You can go as high as 0.05, but you shouldn't have to change anything. If your model overfits without learning anything, you might try upping it a bit.
safeguard_warmup: Set this to True if you use a warmup greater than 0. False otherwise.
decouple: Keep this set to True. You can read about it in the Prodigy repository (https://github.com/konstmish/prodigy) if you wish to know what it does.
betas: Leave this at the default "0.9,0.99" for Stable Diffusion. Understanding betas requires a deeper dive into optimizer mechanics, which I will not do in this guide.
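To see what the extra-arguments string turns into, here is a sketch that parses it into keyword arguments. This is not the trainer's actual parser, only an illustration of the "key=value" convention; the function name is hypothetical.

```python
import ast


def parse_optimizer_args(arg_string):
    """Turn '"key=value" "key=value" ...' into a dict of Python values."""
    kwargs = {}
    for item in arg_string.split():
        key, value = item.strip('"').split("=", 1)
        # literal_eval converts "True" -> True, "0.01" -> 0.01, "0.9,0.99" -> (0.9, 0.99)
        kwargs[key] = ast.literal_eval(value)
    return kwargs


extra_args = ('"decouple=True" "weight_decay=0.01" "d_coef=0.8" '
              '"use_bias_correction=True" "safeguard_warmup=True" "betas=0.9,0.99"')
for name, value in parse_optimizer_args(extra_args).items():
    print(f"{name} = {value!r}")
```

Seeing the parsed values helps catch typos, such as a missing "=" or a boolean written in lowercase, before launching a long training run.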
[ADVANCED] Other parameters to tune
Two more parameters are incredibly useful and need tuning. Both prevent overfitting:
Scale weight norms
Recommended is 1. This technique involves keeping an eye on the size of the weights in the network and shrinking them if they get too big. Think of it like maintaining weights in a balanced, manageable range. This is important because how weights compare to each other is more important than how big they are individually. Using this method, you get helpful information like the average size of the weights and how often they're adjusted. This approach is particularly beneficial in avoiding overfitting, preventing the model from capturing noise and other defects from the training data. Monitoring weight scaling can signal overtraining.
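The core idea can be sketched as a max-norm constraint: measure the size (norm) of a group of weights and shrink the whole group if it exceeds the limit. This is a simplified illustration on a flat list of numbers, not the trainer's implementation; the function name is hypothetical.

```python
import math


def scale_to_max_norm(weights, max_norm=1.0):
    """Shrink `weights` so their L2 norm is at most `max_norm`.

    Returns the (possibly scaled) weights and whether scaling happened.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights), False
    factor = max_norm / norm
    # Every weight is multiplied by the same factor, so their ratios are preserved.
    return [w * factor for w in weights], True


print(scale_to_max_norm([3.0, 4.0], max_norm=1.0))  # norm 5 -> scaled down
print(scale_to_max_norm([0.3, 0.4], max_norm=1.0))  # norm 0.5 -> untouched
```

Because all weights in the group are scaled together, how they compare to each other is unchanged; counting how often scaling happens gives the kind of overtraining signal mentioned above.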
Network dropout
Recommended is 0.1-0.5. During training, dropout randomly "drops" a subset of neurons (along with their connections) from the network at each training step. This means that a randomly selected set of activations is set to zero, forcing the network to learn more robust features that are not reliant on any single neuron. By doing this, dropout encourages the network to distribute the learning across all neurons, leading to a more generalized model that performs better on unseen data.
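Here is a minimal sketch of standard "inverted" dropout on a list of activations, assuming the usual convention of rescaling the surviving values by 1/(1-p) so their expected sum is unchanged. The function name is hypothetical and a seeded generator is used only to make the example repeatable.

```python
import random


def dropout(activations, p, rng):
    """Zero each activation with probability p and rescale the survivors."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - p)
    return [a * keep_scale if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)
out = dropout([1.0] * 10, p=0.3, rng=rng)
print(out)
print("dropped:", out.count(0.0), "of", len(out))
```

A different subset is dropped at every training step, so no single activation can be relied upon; at inference time dropout is simply turned off.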
[ADVANCED] Block training
5. Training, Testing, and Troubleshooting
[BEGINNER] How to choose the "best" epoch
Two possible ways:
As you train your LoRA model, it's beneficial to use sampling. Sampling means generating a few images using the state of your LoRA at different epochs/steps. Once you have your samplings, choose the epochs that best represent your concepts. This method relies on your judgment and understanding of the content you wish to generate.
Open Tensorboard (a button is available on bmaltais UI), and closely examine the loss graph across all epochs. Your goal is to identify epochs corresponding to 'local minima': points where the loss temporarily reaches a low before increasing again. The graph below shows a somewhat erratic loss pattern. The epochs marked in red represent the ones I've selected. This approach provides a more data-driven way to pinpoint the best epochs.
Either way, you should have a few epochs to test in A1111. For this, generate a grid using all your concepts and the epochs you want to try. Eliminate the earlier epochs that are underfit and the later epochs that are overfit. Then, amongst the remaining ones, choose the one that seems to work the best, and start testing it on a more complex task.
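If you export the average loss per epoch, the "local minima" search can be sketched in a few lines. The loss values below are made up for illustration, and the function name is hypothetical; treat the result as a shortlist to test visually, not as a verdict.

```python
def local_minima_epochs(epoch_losses):
    """Return the 1-based epochs whose loss is lower than both neighbours."""
    epochs = []
    for i in range(1, len(epoch_losses) - 1):
        if epoch_losses[i] < epoch_losses[i - 1] and epoch_losses[i] < epoch_losses[i + 1]:
            epochs.append(i + 1)  # epochs are numbered from 1
    return epochs


# Hypothetical average loss per epoch (epoch 1 first).
losses = [0.120, 0.100, 0.110, 0.090, 0.095, 0.080, 0.085]
print(local_minima_epochs(losses))
```

The first and last epochs are never returned because they have only one neighbour to compare against.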
[BEGINNER] Testing and Fixing problems
When training a LoRA model, as outlined in knxo's guide, you have three approaches: fluid, semi-static, and static. Subjectively, semi-static is often the best. This method involves a careful balance between tagging relevant concepts and pruning irrelevant ones. The process is straightforward: keep pruning tags representing "parts" of the concept until the activation tag accurately represents the desired concept, character, or style.
Note: this principle applies equally to SD1.5 and SDXL models.
Let's examine a case with a Jinx LoRA. After preparing and tagging a dataset for Jinx, I conducted an initial LoRA training. However, I encountered a few issues, as illustrated in the following grid:
Arcane Jinx's outfit is not depicted accurately in the rightmost image (which is the image that will be generated by default!): the stomach and shoulders are covered, and the top includes unintended pink colors. To diagnose this, let's analyze the image tags:
You can see that many images have the tag "navel". The frequent tagging of "navel" suggests that its absence in a prompt is important and that the LoRA has to cover the navel in the generated image if not tagged. This is due to "navel" being a concept that the base model understands, causing the LoRA to emphasize the distinction between tagged and untagged instances. If I want the navel to be shown on the outfit every time the activation tag is used, I probably need to prune "navel" from the dataset.
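This kind of diagnosis can be partly automated by counting how many captions contain each tag and then pruning the suspicious ones. The sketch below assumes comma-separated captions; the sample captions and function names are hypothetical.

```python
from collections import Counter


def tag_frequencies(captions):
    """Count in how many captions each tag appears."""
    counts = Counter()
    for caption in captions:
        tags = {t.strip() for t in caption.split(",") if t.strip()}
        counts.update(tags)
    return counts


def prune_tag(caption, tag):
    """Remove one tag from a caption, keeping the order of the others."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    return ", ".join(t for t in tags if t != tag)


captions = [
    "jinxarcane, 1girl, navel, collarbone",
    "jinxarcane, 1girl, navel",
    "jinxarcane, 1girl, smile",
]
print(tag_frequencies(captions).most_common())
print([prune_tag(c, "navel") for c in captions])
```

Tags that appear in most captions and describe a part of the concept itself are the first candidates for pruning.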
I generated an image with the "navel" tag to test this, resulting in a more accurate outfit representation. Similarly, the unexpected jacket-like covering on the arms was traced to the "collarbone" tag. Adjusting these tags led to an outfit closely matching my expectations. Here is a before/after for comparison:
Notice how the right image is way better, but still far from perfect. The unusual pink addition to the outfit presents a more complex challenge, as no apparent tags seem to influence this aspect. This requires a deeper understanding, which I will explore in the next [ADVANCED] section.
Your goal will be to iteratively find and fix problems this way. Once you have no more problems, your LoRA will be ready.
What about when pruning tags isn't enough?
Indeed, some problems might arise that tag pruning cannot fix. For example, some concepts might be more challenging to train than others, or your dataset might lack the data to teach the concept. You have a few solutions in this case:
Add specific tags in your dataset aimed at tackling the problem you encounter. This needs to be done exhaustively and flawlessly. Remember whether the tag you add is a modified or a new concept. A new concept requires a lot of images!
Prune images that pollute the training while not providing much. This is harder to catch but can be done by thoroughly testing the LoRA, or training on parts of the dataset. A bad image can have a tremendous impact on the final LoRA.
Add new images for the specific concept you want to train in a new folder. Suppose you are training a character with a particular haircut that SD has difficulty reproducing. In that case, you might benefit from searching for images of characters with this haircut and adding them to the dataset (even if they are different characters!)
Change the repeats for the concept that is lacking. If you do this, don't put too many repeats (five is the absolute maximum), otherwise the LoRA will overfit on these images.
Hopefully, this will help you tackle your own LoRAs' problems.
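When adjusting repeats, it helps to check how much of the training each concept ends up representing. The sketch below computes each folder's share of the samples seen per epoch; the folder names and image counts are hypothetical.

```python
def dataset_shares(folders):
    """folders: {name: (image_count, repeats)} -> {name: share of samples per epoch}."""
    weighted = {name: count * repeats for name, (count, repeats) in folders.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


before = {"jinxdef": (60, 1), "jinxarcane": (20, 1), "jinxrnd": (20, 1)}
after = {"jinxdef": (60, 1), "jinxarcane": (20, 3), "jinxrnd": (20, 3)}

for label, folders in (("before", before), ("after", after)):
    shares = dataset_shares(folders)
    print(label, {name: f"{share:.0%}" for name, share in shares.items()})
```

Raising the repeats of the smaller folders evens out the shares, but the same few images are then seen more often, which is why repeats should stay low.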
[ADVANCED] Multi-concepts and balancing datasets
In the previous section, we saw that the Arcane Jinx concept had some "pink" artefacts on the outfits. This problem is due to concept bleeding. Let me explain.
This LoRA was composed of 5 different outfits whose tags were:
jinxrnd. Let's look at the tokens of the closest outfits (
You can see that both embeddings share two tokens,
x. These two tokens will be trained alongside the remaining parts
arcane when training. In other words, the "jinx" tag is trained by all outfits, whereas the custom tokens
arcane are trained on a unique outfit. To illustrate this, I generated an image with only the tag "jinx", highlighting the "fusion" that occurs:
Not once were my images tagged jinx, yet the LoRA correctly generated an image of Jinx. This notion is important because it means a tag other than my activation tag was trained! Now, why is it only generating the outfit associated with jinxdef and not the other ones? Simply because that outfit made up almost 60% of the dataset.
In short, the tag
def concepts. This means that, when training jinxdef, it will bleed into jinxarcane. For my use case, this bleeding is fine because parts of Jinx (like the blue hair, twin tails, eyes, etc.) are similar. But for other parts, like the outfits, this bleeding is destructive. Let me illustrate with an example:
These are two images generated for the tag
def with their respective attention heatmap (see DAAM section to reproduce this). As you can see from the heatmap,
def both impact the outfits and face. While this is fine for
def to change the face, it is not acceptable for
jinx to change the outfits, because this means that other activation tags, such as jinxarcane, will see their outfits slightly modified by the jinx part of the tag. The pink artefacts we saw on the jinx arcane outfit in the previous section were due to the jinxdef outfit bleeding into it.
Note: you can also take advantage of this bleeding. For example, if you had a "school" uniform, you could use the activation tag "jinxschool", which would make use of the "school" tag to generate the image (essentially modifying the concept of school to fit the new outfit).
For more complex concepts, this "concept bleeding" is essential to understand because they are composed of multiple new simple concepts that have potential bleeding from every other tag. If improperly tagged, a difficult concept might never come to life.
Now, how do we fix this? Three solutions:
use exclusive activation tags: e.g. I could tag jinxdef as jinx and jinxarcane as arcane, but this would mean modifying the concept of "arcane" to represent Jinx Arcane. You can also use "rare" tokens to combat this.
modify the number of repeats to balance the dataset. In this case, I have four times as many jinxdef images as jinxarcane images. By setting the repeats for jinxarcane to 3, I balance the jinx token between the two outfits and no longer get the problem
use pivotal tuning to create a new embedding
In my case, I want
jinxarcane to benefit from each other. Therefore, I will use the second solution. As for the other tags, I should use the third solution because they are entirely new concepts, but I chose to use the first solution instead.
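The effect of the second solution can be sketched with a little arithmetic. The image counts below are hypothetical (only the four-to-one ratio comes from the text), and training_share is an illustrative helper, not part of any trainer. It shows how much of each epoch a shared token such as jinx sees from each outfit, before and after changing the repeats.

```python
def training_share(image_counts, repeats):
    """Return each concept's share of the samples seen in one epoch."""
    effective = {
        name: image_counts[name] * repeats.get(name, 1)
        for name in image_counts
    }
    total = sum(effective.values())
    return {name: count / total for name, count in effective.items()}


# Hypothetical counts: four times as many jinxdef images as jinxarcane images.
counts = {"jinxdef": 40, "jinxarcane": 10}

# One repeat each: the shared token is dominated by jinxdef.
before = training_share(counts, {"jinxdef": 1, "jinxarcane": 1})

# Three repeats for jinxarcane: the two outfits are much closer to balanced.
after = training_share(counts, {"jinxdef": 1, "jinxarcane": 3})

for name in counts:
    print(f"{name}: {before[name]:.0%} -> {after[name]:.0%}")
```

With these made-up counts, jinxdef drops from 80% of the samples to roughly 57%, which is why the shared token stops leaning so heavily towards a single outfit.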
[ADVANCED] Using DAAM to troubleshoot tags
Note: At the time of writing, this method is being implemented in kohya-ss, so it should soon be available during training. Alternatively, you can access this feature now by installing the DAAM extension for A1111 from https://github.com/toriato/stable-diffusion-webui-daam.
DAAM is an insightful tool for visualizing how different tags affect your generated images. It provides a heat map highlighting the areas of an image most influenced by specific tags. This visualization can be invaluable for trainers, as it reveals precisely what aspects of an image each tag impacts. Here is an example to help you understand:
This example demonstrates the clarity DAAM brings to understanding tag impact. The heat map uses color coding to indicate tag influence: the redder the area, the stronger the tag's impact. Regions with a blue tint show minimal or no influence. For a trainer, it shows what your tags have learned and points you in the right direction to fix different problems.
Here is an example of such a problem:
In this case, you can observe an unintended effect: the blue hair tag also alters the eye color, as indicated by the red tones around the eyes on the heat map. Ideally, blue hair should only affect the hair color.
When fine-tuning your LoRA model, you can use DAAM to analyze the heat maps of your tags (MCs and NCs). Identify any tags that are influencing unrelated concepts. To remedy this, you might need to adjust the tagging of the specific concepts that your tag bleeds into or modify the weight of specific tags. This targeted approach ensures that each tag influences only its intended aspects of the image.
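If you want a number rather than an eyeballed heat map, one possible approach is to measure how much of a tag's attention lands inside the region it is supposed to affect. The sketch below is only an illustration under assumptions: it supposes you have exported a heat map as a 2D array and drawn a rough mask of the intended region yourself. The attention_share helper and the toy values are hypothetical and are not part of the DAAM extension.

```python
import numpy as np


def attention_share(heatmap, region_mask):
    """Fraction of a tag's total attention that falls inside region_mask."""
    heatmap = np.asarray(heatmap, dtype=float)
    region_mask = np.asarray(region_mask, dtype=bool)
    total = heatmap.sum()
    if total == 0:
        return 0.0
    return float(heatmap[region_mask].sum() / total)


# Toy 4x4 heat map for a "blue hair" tag (made-up values).
# The top two rows stand for the hair; the hot cell in row 2 stands for the eyes.
heatmap = np.array([
    [0.9, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.9],
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# Hand-drawn mask of the region the tag is supposed to affect.
hair_mask = np.zeros((4, 4), dtype=bool)
hair_mask[:2, :] = True

hair_share = attention_share(heatmap, hair_mask)
bleed_share = 1.0 - hair_share
print(f"inside hair: {hair_share:.0%}, elsewhere: {bleed_share:.0%}")
```

A large share outside the intended region suggests the tag is bleeding into other concepts and that the tagging around it deserves a second look.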
[ESSENTIAL] Anti-AI filters
An essential aspect to consider when collecting images for Stable Diffusion is the presence of anti-AI filters. They are designed to exploit vulnerabilities in the model, causing it to misinterpret data in ways that are not obvious to humans. The most common are usually self-made Gaussian blur, repeating motifs, or adversarial attacks.
Including images with anti-AI filters in your training dataset is catastrophic for training! It is vital to ensure you don't have any in your dataset. Detecting these filters is nearly impossible. Instead, here is a script you can use for automatically modifying and cleansing your images of any potential anti-AI filters. To use it, install numpy and opencv-contrib-python with pip. Usage is: python cleanup_antiai_filters.py --input_file_or_dir="directory/where/you/have/your/images". Be aware that this script makes slight modifications to the images, so use it judiciously.
Note: an alternative approach to ensure a clean dataset is to avoid collecting images published after a certain date (images from before 2022 are mostly fine).
[BEGINNER] Training a slider / LECO
A "slider" is a unique adaptation that allows for the manipulation of specific concepts along a spectrum. Think of it like a control dial that can increase ('positive') or decrease ('negative') the presence of a chosen concept in an image.
Concept of the Slider
Positive Direction: When the slider is moved in the positive direction, it amplifies the chosen concept in the generated images. For instance, if the concept is to control the amount of 'details', the positive side would produce highly detailed images.
Negative Direction: Conversely, moving the slider in the negative direction reduces or 'erases' the concept. Using the same example, the negative side would result in more simplistic images with less detail.
Note: you don’t need a separate dataset to train a slider. It's all about using prompts and the model's existing knowledge.
Training Process for a Slider
Training a slider involves a creative use of Stable Diffusion. It’s about comparing and contrasting how a concept appears in its enhanced state ('positive' side) versus its diminished state ('negative' side). Here’s how it works.
The model starts by generating an image identified as a 'neutral' point — a state that’s neither enhanced nor diminished. Then, using this neutral point, it continues generating two images from prompts that represent both sides of the concept. It assesses the differences between the two new images generated by these two sides. Finally, it updates only the weights related to this concept, allowing you to shift the model’s output towards either the enhanced or diminished state.
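The comparison step can be sketched numerically. This is a simplification under assumptions: the small arrays stand in for the model's predictions for the neutral, positive, and negative prompts, and the target formula (neutral plus a scaled positive-minus-negative difference) follows the general idea described above rather than LECO's exact code. The names slider_target, slider_loss, and scale are hypothetical.

```python
import numpy as np


def slider_target(neutral, positive, negative, scale=1.0):
    """Shift the neutral prediction along the positive-minus-negative direction."""
    direction = positive - negative
    return neutral + scale * direction


def slider_loss(prediction, target):
    """Mean squared error between the current prediction and the target."""
    return float(np.mean((prediction - target) ** 2))


# Toy stand-ins for the three predictions (made-up values).
neutral = np.array([0.0, 1.0, 2.0])
positive = np.array([0.5, 1.5, 2.0])
negative = np.array([-0.5, 0.5, 2.0])

target = slider_target(neutral, positive, negative, scale=1.0)
loss = slider_loss(neutral, target)

print(target)
print(loss)
```

Note that the last component, where the positive and negative sides agree, leaves the target identical to the neutral point. Only the parts where the two sides differ produce a training signal, which is the intuition behind a slider affecting only the chosen concept.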
Tools for Training a Slider
LECO: Currently, only a few tools like LECO (available on GitHub, https://github.com/p1atdev/LECO) are used for training sliders.
Kohya: Plans are underway to integrate similar functionalities into Kohya's sd-scripts for broader access.
Reference: The concept of erasing or enhancing features in Stable Diffusion is inspired by research and tools like https://github.com/rohitgandikota/erasing/tree/main.
Crafting the Prompts
When training a slider, the choice of prompts is critical. They include:
Target: A simple prompt or tag representing the core concept (e.g., "girl" or "face" if modifying facial features). A common way to think about it is to put the "class" of the concept you are modifying. If you followed the Dreambooth section, you should already know what a class is. If not, a "class" refers to a specific category or type of subject that you are trying to modify.
Neutral: Often the same as the target, serving as the baseline for modification. This is the starting point for conditioning the target.
Negative/Positive: These prompts should vividly represent the concept's diminished and enhanced states. It is generally best to include the target class in this part.
When training a slider, your role will be to find the best prompts that positively or negatively affect only the concepts you are trying to train.
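To make the four roles concrete, here is a small sketch of how a prompt set for a "details" slider could be written down and sanity-checked. SliderPrompts is a hypothetical container for illustration only, not the configuration format of LECO or any other tool, and the prompt wording is just an example.

```python
from dataclasses import dataclass


@dataclass
class SliderPrompts:
    target: str    # the "class" of the concept being modified
    neutral: str   # baseline, often the same as the target
    positive: str  # enhanced state of the concept
    negative: str  # diminished state of the concept

    def missing_target(self):
        """Return which of the positive/negative prompts omit the target class."""
        sides = {"positive": self.positive, "negative": self.negative}
        return [
            name
            for name, prompt in sides.items()
            if self.target.lower() not in prompt.lower()
        ]


detail_slider = SliderPrompts(
    target="girl",
    neutral="girl",
    positive="girl, highly detailed, intricate details",
    negative="girl, simple, flat colors, low detail",
)

# An empty list means both sides mention the target class, as recommended above.
print(detail_slider.missing_target())
```

Keeping the target class in both sides is a cheap check to automate before spending time on a training run.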
[ADVANCED] The importance of the VAE when training
Remember when I discussed the VAE and its role in compressing image dimensions for efficiency? The truth is this process involves some loss of information when the image is decoded. To better grasp this, here is an illustration:
On the left is the original image (512x512). Once compressed into the latent space by the VAE (to 64x64), this image is decoded back to its original size (512x512), as it would be during a typical image generation with A1111. The difference between the original and the decoded image is significant: the lighting was lost, the eyes are barely recognizable, and the mouth shows a noticeable loss of detail.
During training, it's essential to understand that Stable Diffusion will attempt to replicate the right-side image (decoded version), not the original one on the left. If you're encountering issues where minor details in your dataset aren't accurately reproduced (like the eyes), this could be the underlying cause.
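A quick back-of-the-envelope calculation shows why some loss is unavoidable. The sketch below assumes the usual Stable Diffusion VAE layout, which downsamples each side by 8 (matching the 512 to 64 figure above) and uses 4 latent channels. The helper names are illustrative.

```python
def latent_shape(height, width, downscale=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size."""
    return (latent_channels, height // downscale, width // downscale)


def compression_ratio(height, width, image_channels=3,
                      downscale=8, latent_channels=4):
    """How many pixel values are squeezed into each latent value."""
    pixel_values = image_channels * height * width
    channels, lat_h, lat_w = latent_shape(height, width, downscale, latent_channels)
    return pixel_values / (channels * lat_h * lat_w)


shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)

print(shape)
print(ratio)
```

Under these assumptions, a 512x512 RGB image is stored in 48 times fewer values, so fine details such as eyes are the first to suffer when the image is decoded.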
To mitigate this, a solution is to train using a specific VAE.
Important: Training with the VAE embedded in the model will suffice for most of your use cases. This knowledge will mostly be useful if you train a checkpoint or a LoRA with a specific style, uncommon details, or images generated with a distinct VAE.
If you wish to experiment on your own, you can use this script to visualize the difference between the original and decoded image: https://gist.github.com/rockerBOO/f1b3c18f4b9fc161310b47d9cb39fcba. Usage is:
accelerate launch debug_vae_from_images.py --pretrained_model_name_or_path="$model_name_or_path" --input_file_or_dir="$input_file_or_dir" --output_dir="$output_dir" --vae="$vae" --device="$device" --batch_size=1
In this guide, we have journeyed through the multifaceted world of Stable Diffusion, exploring a wide array of concepts. The world of Stable Diffusion is rich with possibilities, and with the knowledge and tools provided here, you are well-equipped to embark on your own journey of discovery and creation. Whether you are a beginner just starting out or an advanced user looking to deepen your knowledge, the path to mastering Stable Diffusion is an exciting and rewarding adventure.
How to make a versatile LoRA
How to do block training
An FAQ (e.g. what problem, what solution)
Diagram showing how captions affect training