Practical Pony/IL prompting; cut the bloat, get better results

Who this article is for:

If you are a beginner in diffusion:

Welcome. You’ve entered a wonderful world full of mythical lore, incredible possibilities … and also, a lot of potential frustration. I am not going to claim I have The Truth™, but there are a lot of practical insights about working with diffusion here that might end up saving you from a lot of the headaches I have gone through.

If you’re already working with diffusion:

You might know most of the things in this article already, or be completely surprised based on your current methods. Either way, please take your time and give this a try. The results might surprise you.

That preface out of the way, what this article is about:

This advice pertains to CLIP BASED MODELS, which is generally anything in the SDXL Architecture (SDXL, Pony, Illustrious, etc).
That's what I work with, mostly understand and think I can help you with.

What even is CLIP?

CLIP is an acronym that stands for Contrastive Language-Image Pre-training; it's included in your (SDXL family) txt2img model to translate your text prompt into instructions/directions for the sampler to turn a random noise seed into an image. If you really want to know more about the sampling process I can recommend the samplers guide linked at the end of this article, but for the purposes of this guide, you can treat CLIP as the translation layer between you and the model.

Why does CLIP sometimes ignore parts of my prompt?

Because CLIP unfortunately has a very limited 'attention span'. Your prompt is broken up into "chunks" of ~75 tokens (notice, TOKENS -- not tags!) each, that are processed by CLIP.

One "tag" can absolutely be multiple tokens. If you work in Automatic1111/Forge, you have a token counter at the top right of your prompt box; pay attention to it! Once you get into the second chunk (or further), tag priority falls off dramatically.

You should strive to describe the essence of your image in that first CLIP chunk.
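
If you want to see how quickly a prompt spills into the second chunk, the bookkeeping can be sketched in a few lines of Python. This is a rough illustration, not how any particular UI does it: `assign_chunks` is a name I made up, the per-tag token counts in the example are invented (get real ones from your token counter), and it ignores the commas between tags as well as any smarter splitting your front end may do.

```python
CHUNK_SIZE = 75  # approximate usable token budget per CLIP chunk


def assign_chunks(tag_token_counts, chunk_size=CHUNK_SIZE):
    """Return (tag, chunk_index) pairs, with chunk_index starting at 1.

    tag_token_counts: list of (tag, token_count) in prompt order.
    A tag is assigned to the chunk where its first token lands.
    """
    placements = []
    used = 0  # tokens consumed by the tags seen so far
    for tag, count in tag_token_counts:
        chunk_index = used // chunk_size + 1
        placements.append((tag, chunk_index))
        used += count
    return placements


# Token counts below are made up for illustration only.
example = [
    ("1girl", 2),
    ("full body", 2),
    ("a very long descriptive block", 70),
    ("red dress", 2),
    ("office", 1),
]

for tag, chunk in assign_chunks(example):
    print(f"chunk {chunk}: {tag}")
```

Anything this reports as chunk 2 or later is a candidate for trimming or for moving earlier in the prompt.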

If you do not, there's a good probability that CLIP will:

* ignore it outright

* randomly pick one or more elements to (not) display.

CLIP also does 'best' when it only has limited "focus" points to resolve; complex expressions stacked with multiple limb actions can easily overrun both chunk and focus.

In general, when prompting:

* Don’t use three words where one will do. One good tag is better than multiple weak/conflicting ones.

* Check your vocabulary. You are not explaining to an artist; you are instructing a very literal pattern-matching system, and things like synonyms, typos and unrecognized words will all cause you problems.

* Resist over-describing. CLIP will parse your entire prompt in one pass, then hand the encoded result to the sampler for convergence (the process of turning a seed into an image); reiterating your intent only muddies the directions.

* Be sparing in negatives, too. A good checkpoint will lack a lot of the low quality data from older ones, and Illustrious merges especially don’t need to be pushed hard to reach good quality.
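
One cheap way to act on the "resist over-describing" advice is to check a prompt for tags you have typed more than once. The helper below is a minimal sketch under my own assumptions: `find_repeated_tags` is a hypothetical name, it assumes a plain comma-separated prompt, and it only catches exact repeats (case-insensitive). It will not spot synonyms, weighted variants like `(smile:1.2)`, or conflicting tags; those you still have to find by reading.

```python
def find_repeated_tags(prompt):
    """Return tags that appear more than once, in first-seen order."""
    seen = set()
    repeated = []
    for raw in prompt.split(","):
        tag = raw.strip().lower()
        if not tag:
            continue  # skip empty entries left by trailing commas
        if tag in seen and tag not in repeated:
            repeated.append(tag)
        seen.add(tag)
    return repeated


prompt = "1girl, smile, red dress, Smile, office, red dress"
print(find_repeated_tags(prompt))
```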

So how/why does my prompt work then ?

I start every prompt from what I call my "prompt scaffold":

[charactercount, subject],

[camera, pose],

[physical attributes],

[face, hair, expression],

[clothing],

[environment, location],

[lighting],

[rendering, style],

(feel free to copy and adopt!)
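
If you build prompts in a script or a notebook, the scaffold can be kept as an ordered list of blocks. The sketch below is only one possible way to do that: the block names and `build_prompt` are my own hypothetical labels for the slots above, and the example tags are illustrative, so check them against the tag wiki before relying on them.

```python
# Block names mirror the scaffold slots; the order is the part that matters.
SCAFFOLD_ORDER = [
    "subject",
    "camera_pose",
    "physical",
    "face_hair_expression",
    "clothing",
    "environment",
    "lighting",
    "style",
]


def build_prompt(blocks):
    """Join the non-empty blocks, in scaffold order, into one prompt string."""
    parts = []
    for name in SCAFFOLD_ORDER:
        tags = blocks.get(name, [])  # a missing block is simply skipped
        parts.extend(t.strip() for t in tags if t.strip())
    return ", ".join(parts)


blocks = {
    "subject": ["1girl", "solo"],
    "camera_pose": ["upper body", "from side", "sitting"],
    "face_hair_expression": ["green eyes", "short hair", "smile"],
    "clothing": ["business suit"],
    "environment": ["office", "window", "city background"],
}
print(build_prompt(blocks))
```

Leaving a block out (here: physical attributes, lighting and style) just means the model fills that part in on its own.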

(Edited/updated section: Many thanks to John_KSampler and ravemry9 for the callout/sanity check)
Notice I have NO space for 'quality expressions' - unless you are working with a really old or unreliable model merge I would urge you to please deprecate them and trust your model, because they are largely folklore holdovers from the SD1.5 era.
That said, if you find/feel your images gain undesirable low(er) quality artifacts/content, try adding one or two in until you see improvement; don't immediately re-add the entire stack at once.
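
The "one or two at a time" approach can be sketched as a small generator that produces prompt variants to compare on the same seed. Everything here is an assumption on my part: `quality_tag_variants` is a hypothetical helper, the candidate tags are just examples (use whatever your checkpoint's model card suggests), and prepending them to the prompt is simply the common convention, not something the scaffold prescribes.

```python
def quality_tag_variants(base_prompt, quality_tags, max_added=2):
    """Yield (label, prompt) pairs: the bare prompt, then 1..max_added tags prepended."""
    yield "baseline", base_prompt
    for n in range(1, min(max_added, len(quality_tags)) + 1):
        added = quality_tags[:n]  # first n candidate tags, in the given order
        yield f"+{n}", ", ".join(added + [base_prompt])


base = "1girl, upper body, smile, office"
candidates = ["masterpiece", "best quality", "absurdres"]  # example tags only

for label, prompt in quality_tag_variants(base, candidates):
    print(f"{label}: {prompt}")
```

Render each variant with an identical seed and settings, and stop adding tags as soon as the artifacts are gone.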

The reason I say this is because you are losing valuable first chunk real estate with them without getting (a lot of) return/gain from them.

Furthermore, especially when you work with Illustrious based checkpoints:

BOOKMARK THE DANBOORU TAG WIKI

It is your reference bible. If you are on PDXL I still highly recommend you do so, because it will help you remove ambiguity faster.

Let's go through each block:

[charactercount, subject],

Replace that with what you're prompting (your 'character'). If I want an 'odd' skin color I prompt it here as well.

[camera, pose],

Put your 'viewing angle' and body pose in here. Sanity check your camera and pose combination; for me, I try to imagine the picture while I’m building the prompt. If you have trouble with that, consider picking a game with a character creator (or even one that will let you move/pose a character) and move the camera around until you’re happy, then find the matching camera tags for what you want.
My general order for this is:
overall framing (full body, upper body, portrait etc) - view angle (front/side/rear) - camera tilt (high/low angle or eye level) - general body pose (standing, sitting, etc) - body/head movement (head turn, body twist) - hand gestures/interactions (holding cup of coffee, etc)

[physical attributes],
Here you define what you want the image subject to look like.
Think of things like body frame and proportions: athletic, slender, medium breasts -- things like that; any ‘odd’ body attachments (wings, tails, horns, etc) also go here.

[face, hair, expression],

Describe the face/head. My general structure is eye color, hair color, haircut, face makeup (if any) and 2-3 tags for expression/mouth.
You will probably roll over into the second CLIP chunk around here. Don't over-describe; be clear and succinct. One good tag is better than 10 weak/conflicting ones.

[clothing],

What your character is wearing (if anything ;-), yeah I'm guilty too :P )

You are probably in 2nd chunk here, so being concise and consistent matters.

[environment, location],

Describe your scene. Try to capture the essence in one tag, then use 1-2 more for the background (and define them as such).
Good: office, window, city background
Bad: accounting office, desk, computer, large pane window, buildings in background, office blocks

[lighting],

You can let CLIP (and your model) 'autosolve' this but if you have a specific lighting need (color, direction, type) specify it here.
Try to stick to no more than two tags.

[rendering, style],

This is optional; if you have a specific styling you want, add it here.

Think of your image and try to complement it; try NOT to contradict what you've written so far.

Final notes and things to keep in mind:

All models have training priors. These reflect dominant concepts in training images and can be extremely hard to work around.

You will recognize them when you prompt for something and the model gives you a specific interpretation over and over.

If you run into one, DO NOT try to brute-force your way around it by stacking weights and tags. It usually will not work, and you will likely end up with a worse image at the end of it.

You might be able to override a weak prior that way, but a strong blocking prior is nearly impossible to deal with. Save yourself the trouble: either use the prior to your advantage, or change your scene to avoid it.

CLIP is extremely semantic. Often infuriatingly so.

I cannot stress this enough: CLIP does not understand. CLIP semantically interprets EXACTLY what you prompted. No more. Probably less.

I have compared CLIP to a “maliciously compliant, semantic Karen crossed with a quirky Bob Ross” more than once -- it is a frighteningly accurate mental image.
CLIP has extremely limited understanding of context (if any), so it pays to be as precise/concise as you can get within CLIP's vocabulary limit. This will probably take you multiple attempts. Have patience, and when it doesn’t work, your solution should not be to bloat your prompt trying to 'close loopholes', because you may inadvertently create more of them or run past CLIP's 'attention span'.
Far better to stop for a moment, read your prompt again, and try to match your tags to what you (don’t) see.

Accept you will never have full control over the entire image.

Pick your battles. Getting an image seed that exactly converges on your image is nigh impossible. I assume a 75/25 rule: A good prompt will give me 75% of what I have in mind, with the model filling in the other 25% - sometimes with happy little accidents I didn't ask for. (think Bob Ross)

You cannot have 100% control. Let it go.

Allow your model room to fill in the image.

The more you constrain the model by adding more tags, the harder it will be to converge on an acceptable image.

Think of your prompt as a Venn diagram, where the area that overlaps with all circles is your 'image landing zone'. The more circles you draw, the greater the chance that you shrink that landing zone (conflicting/unhelpful tags do this faster), and thus the harder it will be for you to get the image you want. Then realize that getting an acceptable image (one without the usual diffusion anatomy problems et al.) is even harder, so your real landing zone is probably only a half to a third that big!
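
To put rough numbers on that shrinking landing zone, here is a deliberately crude toy model. It assumes every tag independently keeps a fixed fraction of the possible outcomes, which is not how a real model behaves, and the 0.5 figure is invented purely for illustration; `landing_zone` is a hypothetical helper name. The only point is how fast the overlap collapses as tags pile up.

```python
from math import prod


def landing_zone(keep_fractions):
    """Fraction of outcomes left after every tag has narrowed the space."""
    return prod(keep_fractions)


lean = [0.5] * 8      # 8 tags, each keeping half of the remaining outcomes
bloated = [0.5] * 20  # 20 tags of the same (made-up) strength

print(f"lean prompt:    {landing_zone(lean):.6f}")
print(f"bloated prompt: {landing_zone(bloated):.8f}")
```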

That all said, I hope the insights I've learned on my diffusion journey will help you in turn!

Useful links/further reading:

About samplers (technical): Stable Diffusion Samplers: A Comprehensive Guide - Stable Diffusion Art
A tokenizer website where you can sanity check your (approximate) token use (handy if you're not in A1111/Forge; take it with a grain of salt though): https://sd-tokenizer.rocker.boo/
The Danbooru Wiki: Tag Groups Wiki | Danbooru