Prompt or LoRA? What should go where?

In this article, I give technical explanations for why I recommend keeping prompts short, using LoRAs rather than prompts (or embeddings) for quality improvements and style changes, and using embeddings rather than LoRAs for specifics.

What does Stable Diffusion actually do?

This article is the best explanation for how Stable Diffusion works in detail that I have found.

In short, if you train an AI to turn a picture into complete noise over a series of steps (forward diffusion), you can tell it to "restore the original image" from complete noise over a series of steps (reverse diffusion). Of course, there is no "original image", but the AI doesn't know that. It's a bit like filming a glass shattering and then playing back the video in reverse to form a complete glass out of shards on the floor. Except the shards are randomized.
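
As a minimal illustration of the forward (noising) half, here is a sketch using the diffusers library; the tensor is a stand-in for a real image or latent:

```python
# Forward diffusion: blend an image with Gaussian noise according to a
# fixed schedule. At late timesteps, almost nothing of the image remains.
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

image = torch.randn(1, 3, 64, 64)   # stand-in for a real (latent) image
noise = torch.randn_like(image)     # the randomized "shards"
timestep = torch.tensor([999])      # late timestep: almost pure noise

noisy = scheduler.add_noise(image, noise, timestep)
```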

If you leave the prompt empty and just click "generate", you get an "unguided diffusion"... but not an entirely "unconditioned diffusion" (at least the way I'll be using those terms). That is because the result is determined by the images on which the model was trained and the weights assigned to them. You can "condition" the result further by using LoRAs to add images and/or change weights.
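
In diffusers terms, an empty prompt looks roughly like this (the model ID is just an example; with guidance_scale at 1.0, classifier-free guidance is effectively off):

```python
# "Unguided" diffusion: no prompt, so the result is conditioned only by the
# model's training data and weights (plus any LoRAs loaded on top).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(prompt="", guidance_scale=1.0).images[0]
image.save("unguided.png")
```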

Positive and Negative Prompts

You can also "guide" the diffusion using positive and negative prompts. If you add words to the positive prompt, SD steers the result toward the images in its training data tagged with those words; if you add words to the negative prompt, SD steers the result away from the images tagged with those words.

Of course, it's not necessary (or effective) to try to specify anything and everything that you don't want to see in the negative prompt. Instead, negative prompts allow you to move the diffusion away from undesirable interpretations of the positive prompt.
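
Under the hood, this is classifier-free guidance: at each step, the sampler nudges its prediction away from the negative prompt's prediction and toward the positive one. A sketch of the combination step (the function name is mine):

```python
# Classifier-free guidance: the negative prompt takes the place of the
# "unconditional" branch, so the result moves toward the positive prompt
# and away from the negative one.
import torch

def guided_noise(noise_negative: torch.Tensor,
                 noise_positive: torch.Tensor,
                 guidance_scale: float = 7.5) -> torch.Tensor:
    return noise_negative + guidance_scale * (noise_positive - noise_negative)
```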

It helps to learn to "speak SD's language". SD has been trained on alt text as well as tags on image hosting sites. Depending on the model, Danbooru tags can work especially well. There is even an extension for Automatic1111 that suggests potentially better alternatives while prompting.

The length of positive and negative prompts is measured in "tokens". This roughly corresponds to the number of words used but not exactly (SD will break up some words into several tokens). Commas and other forms of punctuation are tokens as well.

Tokens form chunks of 75 each. The current number of tokens is displayed in the top right of the prompt box (e.g. "0/75"). If the token count goes beyond a multiple of 75, another chunk is added (e.g. adding two tokens will move you from "74/75" to "76/150").
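
You can reproduce the count with the CLIP tokenizer that SD 1.x is built on (a sketch; exact token counts vary by model family):

```python
# Count tokens the way SD 1.x does, via its CLIP tokenizer. Some words split
# into several tokens, and punctuation counts as tokens too.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for text in ["face", "photorealistic", "black hair, white shirt"]:
    ids = tokenizer(text, add_special_tokens=False).input_ids
    print(f"{text!r} -> {len(ids)} token(s)")
```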

The "BREAK" command fills up the current chunk with "padding characters". "Face" is 1/75 tokens. "Face BREAK," is 76/150 tokens. What's the point of this? SD processes chunks somewhat separately. If you want to prompt something like "black hair, white shirt" but only get results of black hair and black shirt, you can try using "black hair BREAK white shirt" instead.

The main reason to pay attention to prompt length is that the more tokens in a prompt, the less any individual token influences the result. The reverse is also true: The shorter the prompt, the more adding, removing or changing an individual keyword or embedding can change the result. In other words, the shorter the prompt, the greater the potential control over the result. So, somewhat paradoxically, while starting with a short prompt represents a relatively "unguided diffusion", it allows you to tweak the image more easily than modifying a long wall of text (see my article Short vs Long Prompts for examples).

The problem with prompting:

"masterpiece, forest, best quality, night, 8K"

is that this is equivalent to prompting:

"masterpiece:1, forest:1, best quality:1, night:1, 8K:1".

Here, "forest" has the same weight as every other element, and the longer the prompt gets, the more SD won't see the forest for the trees. You want to be able to just prompt "forest, night" and get a good result:

The same principle applies to negative prompts: there are diminishing returns, and there is a trade-off with targeted negative prompts. That is, if you use a large number of negative embeddings by default, the impact of additional keywords meant to fix one particular image may be reduced. In practice, however, this is rarely an issue for me (and I currently use 200 to 400 tokens' worth of negative embeddings). It seems like a good idea to give SD as comprehensive an idea as possible of what to avoid, and negative embeddings contain much more useful information than any combination of individual negative keywords.

It's possible to increase or decrease the weight of individual keywords relative to the others, using parentheses or numbers, e.g. "(black hair:0.8), (white shirt:1.2)". However, how this affects the result is still relative to prompt length. If you use too high a weight within too short a prompt, you risk "overcooking" the image, with SD giving you an exaggerated interpretation of that one concept. On the other hand, within too long a prompt, even high weights tend to achieve too little.
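
Conceptually, a weight like "(black hair:0.8)" scales the affected token embeddings and then restores the chunk's overall mean, which is why the effect is always relative to everything else in the prompt. A sketch (not Automatic1111's exact code):

```python
# Scale the embedding vectors of weighted tokens, then restore the chunk's
# original mean so the overall magnitude stays comparable. A token's weight
# therefore only matters relative to the rest of the chunk.
import torch

def apply_weights(embeddings: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # embeddings: (tokens, dim); weights: (tokens,), e.g. 0.8 or 1.2 per token
    original_mean = embeddings.mean()
    weighted = embeddings * weights.unsqueeze(-1)
    return weighted * (original_mean / weighted.mean())
```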

Note that the "padding characters" added by the BREAK command do change the result but don't crowd out other tokens as much as keywords do. Punctuation characters seem to work similarly.

Embeddings

What about embeddings? Embeddings are just keywords (or more precisely, SD turns all keywords into embeddings). The difference is that they take up a number of tokens set during training (5 being common). How useful that keyword is to your result depends on the associations SD makes, based on the images and their descriptions during training.
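
In diffusers, loading an embedding (a "textual inversion") and using its trigger word looks roughly like this; the file name and trigger word are examples:

```python
# Load a textual-inversion embedding and use its trigger word in the prompt.
# In Automatic1111 you would instead drop the file into the embeddings folder
# and type its name in the prompt.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("my_embedding.pt", token="my-style")  # example file

image = pipe("forest, night, my-style").images[0]
```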

A few embeddings are significantly larger (for example, Easy Photorealism v1.0 is 42 tokens). In my view, these need to improve the image significantly to offset the presumed reduction in the impact of other keywords.

LoRAs

LoRAs can be added to the positive prompt in Automatic1111. However, this is purely a UI convenience: LoRAs have nothing to do with the prompt. In Automatic1111, adding a LoRA to the positive prompt simply loads the LoRA and then effectively removes the command from the prompt; adding a LoRA to the negative prompt interprets the LoRA's name as a negative prompt but does not load the LoRA.

LoRAs are a way to change how a model behaves. You can think of this as a kind of photographic filter: You are not changing the camera, merely "putting something in front of it".

LoRAs, LyCORIS and Hypernetworks affect the "cross-attention module". That's basically how SD looks at the images it has been trained on when comparing the image it is creating to the prompt it has been given.
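
The LoRA idea itself is simple: instead of retraining a weight matrix W in those attention layers, a small low-rank update is trained and added at inference time, scaled by the strength you choose. A minimal sketch:

```python
# LoRA: the frozen base weight W is augmented with a trained low-rank
# update B @ A, scaled by a user-chosen strength ("putting something in
# front of the camera" rather than changing the camera).
import torch

def lora_forward(x: torch.Tensor, W: torch.Tensor,
                 A: torch.Tensor, B: torch.Tensor,
                 strength: float = 1.0) -> torch.Tensor:
    # W: (out, in); A: (rank, in); B: (out, rank); x: (batch, in)
    return x @ (W + strength * (B @ A)).T
```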

There is no real downside to using a large number of LoRAs simultaneously (apart from the fact that their individual effects become harder to trace as more are added). Together, they are a way to "create a model on the fly", and that's how I recommend using LoRAs: to create the best possible "version of the model" for the task at hand.
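
With diffusers (recent versions with the PEFT integration), stacking LoRAs at chosen strengths looks roughly like this; the file names, adapter names and weights are examples:

```python
# "Create a model on the fly" by stacking several LoRAs at chosen strengths.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_lora_weights("style_lora.safetensors", adapter_name="style")
pipe.load_lora_weights("detail_lora.safetensors", adapter_name="detail")
pipe.set_adapters(["style", "detail"], adapter_weights=[0.8, 0.5])
```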

While "LoRA space" is not limited, there are reasons to be careful about what goes into "prompt space" and what into "LoRA space". Some of the most useful LoRAs are very general in their effect (changing style, overall lighting, compositions and amount of detail generated), and trying to combine those with LoRAs that are meant to add something very specific tends to create conflicts on that general level. In my experience, using a LoRA meant to "teach" SD the likeness of a person tends to "teach" it far too much about lighting, composition etc.

These are the results of my experimentation so far. Let me know if you have experiences or technical insights that add to them or offer a different perspective.
