Understanding Prompting and Captioning for LoRAs

LoRAs are great for fine-tuning Stable Diffusion models by adding specialized knowledge without retraining from scratch. However, to truly leverage their potential, it's crucial to understand how and when LoRAs influence image generation. This guide explores key aspects of LoRA behavior, including their interaction with the latent noise space, why they may appear inactive in certain scenarios, and how training with captions affects their responsiveness to prompts.

I started exploring this because I've been training quite a few LoRAs, and I've noticed that while most of them work out, their successes and failures seem a bit unpredictable. I've since tried to dig into the inner workings of LoRAs to get a better idea of how they tick and what they can and can't (or at least tend not to) do. So now it's time to share another round of what I've learned from training LoRAs. But first, it's good to understand the basics of how they and diffusion models work together.

LoRA Influence in the Denoising Process

At some point in your life, you may have looked up at the sky on a cloudy day and thought to yourself, "Wow, that big cloud over there looks like a giant bunny rabbit," or you saw a cat or dog or Elvis. You get the idea. Then imagine the more you looked at that cloud, the more it started really looking like a giant bunny rabbit. Step by step, the image of the bunny became clearer and clearer until, instead of a cloud, it was a giant bunny in the sky. That's how diffusion models work—or at least a really basic, simplified view.

LoRAs help the base model "see" a bit more of what they were trained on in the latent noise (the clouds in our analogy) than what might generally be visible. When a LoRA is applied to a Stable Diffusion model, it modifies the model's weights through low-rank adaptations. These adaptations fine-tune the model to generate features learned during the LoRA's training. During the denoising process, the adjusted weights influence the generation of the image, guiding it toward elements similar to those in the training data. For example, a LoRA trained on images of cats modifies the model's behavior so that, when relevant prompts are provided, the generated images are more likely to include cat-like features. That "when relevant prompts are provided" part is important because it can help explain why some LoRAs look fantastic on one image and then don't want to work at all on the next one.
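To make the "low-rank adaptations" part concrete, here is a minimal NumPy sketch of how a LoRA's update sits on top of a frozen base weight matrix. The shapes, values, and names are illustrative, not taken from any particular trainer:

```python
import numpy as np

# Minimal sketch of a low-rank update, assuming a single attention
# weight matrix W of shape (d, d).
d, r = 768, 8             # hidden size, LoRA rank (r << d)
alpha = 8.0               # LoRA scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weight (untouched)
A = rng.standard_normal((r, d)) * 0.01   # trained down-projection
B = np.zeros((d, r))                     # trained up-projection (zero-init)

# The LoRA never replaces W; its low-rank delta is added on top,
# which is why unloading a LoRA restores the base model exactly.
W_eff = W + (alpha / r) * (B @ A)

# The delta stores 2*d*r numbers instead of d*d -- about 2% here.
full_params, lora_params = d * d, 2 * d * r
```

Because the delta is tiny relative to the base weights, a LoRA can only nudge the model toward its training data; it can't rewrite what the base model knows.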

This effect is most potent when the prompt aligns closely with what the LoRA was trained on. Botanica (a LoRA for botanical illustrations) works well for plants but starts to fail when asked to make people or cityscapes. See what happens below when an irrelevant prompt is provided.

When familiar elements are present, the LoRA actively steers the image generation, providing a subtle but significant influence. However, this influence is limited in scope and depends heavily on the LoRA's specific training.

When LoRAs Have Reduced Impact

LoRAs are specialized adaptations that enhance image generation within the domain of their training data. If a prompt deviates significantly from what the LoRA was trained on, the adapted weights have less influence, and the base model's broader capabilities take over. This is especially noticeable with prompts containing elements unrelated to the LoRA's training data. In such cases, the LoRA's modifications are less applicable, and the base model dominates the generation process.

This behavior is particularly evident with complex prompts that include unrelated or unique details. In these scenarios, the base model's broader conditioning mechanisms (which help interpret and respond to the prompt) and cross-attention layers dominate. For instance, models like SDXL and Flux employ multiple conditioning layers that excel at interpreting intricate prompts, further diminishing the likelihood of significant LoRA influence when prompts diverge from their learned patterns. The LoRA does not become inactive but has a reduced impact because its adaptations are less relevant to the prompt.
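A rough way to see why off-domain prompts dilute a LoRA's contribution: the delta B @ A can only alter activations that overlap the low-rank subspace captured by A. The toy below (a deliberate simplification of a single projection layer, not the full cross-attention stack) shows an on-domain activation receiving a large update while an activation orthogonal to A's rows receives essentially none:

```python
import numpy as np

d, r = 64, 4
rng = np.random.default_rng(1)
A = rng.standard_normal((r, d))   # trained down-projection
B = rng.standard_normal((d, r))   # trained up-projection

def lora_delta(x, alpha=4.0):
    """The LoRA's additive contribution for one activation vector x."""
    return (alpha / r) * B @ (A @ x)

# An "on-domain" activation lying in A's row space gets a large update...
x_on = A.T @ rng.standard_normal(r)

# ...while one with its row-space component projected out gets ~none.
x_off = rng.standard_normal(d)
x_off -= A.T @ np.linalg.lstsq(A @ A.T, A @ x_off, rcond=None)[0]

print(np.linalg.norm(lora_delta(x_on)))   # large
print(np.linalg.norm(lora_delta(x_off)))  # effectively zero
```

The same intuition scales up: a prompt that activates features the adaptation was trained on gets steered, while one that doesn't leaves the base weights in charge.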

Captioned vs. Captionless Training: Impact on LoRA Responsiveness to Prompts

The impact of a LoRA during image generation also depends on whether it was trained with captions. I've mentioned this in some of my previous model-focused articles: although Flux LoRAs train quite well on captionless datasets, there are some obvious differences in the abilities of captioned vs. captionless LoRAs.

LoRA Trained with Captions

Enhanced Prompt Responsiveness: A captioned LoRA learns to associate specific visual features with textual descriptions. This means it can respond actively to prompts that include keywords, phrases, or concepts seen during training. For example, if the training captions included terms like "orange tabby" or "kitten," the LoRA can recognize and emphasize these specific details when they appear in prompts, resulting in more targeted and refined output. An excellent example of this is the "book"-based LoRA, Codex Arcanum. I created an unpublished captionless version first, and it was very difficult to consistently create the book-like appearance of the v2.0 published version. All the published version needs are a few words in the prompt associated with paper or books, and it figures out the rest.
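For reference, "training with captions" usually just means pairing each image with a sibling plain-text file. A typical kohya_ss-style dataset folder might look like this (file names and caption text are made up for illustration):

```
dataset/
└── 10_mybook/            ← "<repeats>_<name>" folder convention
    ├── book_001.png
    ├── book_001.txt      ← "an open book with yellowed paper, ornate border"
    ├── book_002.png
    └── book_002.txt      ← "a leather-bound tome with handwritten script"
```

The trainer reads each .txt file during training, which is what builds the keyword-to-feature associations described above; captionless training simply omits the .txt files.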

Contextual Adaptation: Captions enable the LoRA to recognize and adapt to contextual variations within its subject matter. For instance, a LoRA trained on "cats sleeping on windowsills" will better influence prompts with similar contexts, altering the base model's generation to match those specific settings. This was mentioned in the Urban Decay article, as I trained one version of the LoRA to study the differences between captioned and captionless training. Below is another example created with the prompt "A man in a ruined garage, dusting off his car." There were no cars or garages in the training images. Notice that the captioned version of the LoRA has no issues adapting, while the captionless LoRA appears to be pretty clueless about the nature of both garages and cars.

Improved Prompt Adherence: Overall, captioned LoRAs offer more nuanced control, making it easier to generate images that align closely with the descriptive elements of a prompt, especially when those elements match the LoRA's training data. This was true in Digital Impressions, as Captionless v1 tended to have visual designs and features closer to its training data than user prompts. Training the same dataset with captions resulted in better prompt adherence.

Look at the examples below, made with the prompt:

"A woman wearing shorts and a t-shirt walking through a city park. She squats down to feed a dog a treat and talk to a man who is walking the dog. She has blond hair and the man is old with gray hair."

The captioned LoRA has better prompt adherence. In the image generated with the captionless LoRA, the woman doesn't squat down to treat the dog, and the model doesn't get "dog" right. Neither concept appeared in the training data, though a man squatting did.

LoRA Trained without Captions

Feature-Level Influence Only: A captionless LoRA focuses purely on visual patterns and structures without associating features with specific terms. Consequently, it influences low-level aspects like texture, color, or style, without understanding higher-level concepts from prompts. This isn't necessarily bad, and I've found specific use cases for it, like in the Ephemera Alchemica LoRA. It really is a feature, not a bug.

In the case of Ephemera Alchemica, I wanted the LoRA to replicate the style of authentic vintage labels I found in a few old printing catalogs. I discovered that a LoRA trained with captions tended to lose some of the stylistic elements that made the labels have their unique appearance (faded lines, original fonts, illustration styles, and borders). While it was easier to prompt variations outside of the training data, it was more challenging to replicate the exact style the LoRA was trained on.

Less Precise Prompt Matching: Without exposure to textual descriptions, a captionless LoRA lacks semantic associations, making it more challenging to activate in response to detailed prompts. Instead, it introduces and reacts to general visual features rather than responding to specific directions. This caused some frustration as I tried to figure out why some LoRAs seemed to trigger for visual cues rather than textual ones.

For example, Chaos Weave tends to add fractal patterns on and around circular, curved, or wavy objects. Images with hard, straight lines tend to have lesser or no extra effects when applying the LoRA. In the generated images below, notice that the doorway looks normal, but when "doorway" is replaced by "circular portal" in the prompt, the diffusion-limited aggregation of the fractal patterns starts on the edges of the circular shape and moves outward. Without captions to guide it, the model responds to the similar visual elements in its training dataset (which had no hard, straight borders or edges).

Dependence on Base Model Interpretation: When a captionless LoRA lacks specificity, the base model's conditioning layers take over, interpreting the prompt with minimal influence from the LoRA. This means captionless LoRAs are less effective when prompts require high-level specificity or complex scenes, as they contribute mostly general stylistic elements rather than detailed changes. You see this across a lot of LoRAs—the longer and more complex the prompt, the more difficult it is to get the LoRA to apply properly.

Practical Takeaways for Training and Using LoRAs Effectively

Leverage Captioned LoRAs for Flexible, Targeted Prompts: When generating images based on specific themes or characteristics (such as particular animals, styles, or objects), captioned LoRAs are generally more effective, as they can respond more precisely to relevant keywords and contexts. Captionless LoRAs are better at replicating components from their training dataset and, therefore, have trouble following prompts that deviate.

Use Captionless LoRAs for Broad Stylistic Influence: Captionless LoRAs are better suited for adding general stylistic influences, such as textures or colors, and are ideal when precise prompt adherence is not required. They also work well for creating consistent objects or concepts where variability from the training data isn't helpful or desirable.

Mind Prompt Complexity with LoRAs: The more complex and specific a prompt, the more likely the base model will dominate, especially in models like SDXL and Flux, whose advanced conditioning mechanisms can interpret intricate prompts comprehensively. For highly detailed scenes, consider using simpler LoRAs or models explicitly trained to handle varied elements within a single prompt.

By understanding these nuances, you'll be better equipped to train and apply LoRAs effectively and get the images you want.
