Practical Pony/IL prompting; cut the bloat, get better results

Preamble/introduction

If you're reading this, you probably looked at some of my prompts and thought "why do they look like that, and why are they so different from what I generally see?"

A lot of what people 'know' about prompting comes not from understanding but from 'copying what I saw someone else do', and that's perfectly fine for the most part. That's how I started too, and how I kept working until fairly recently. You find a prompt on an image you like, you copy it, you change the details to what you want and you send it. Sometimes it works. Usually it doesn't.
Then you go trawl for prompt guides (there are decent ones out there; I've been down that path) and you follow those. You might get closer to what you want ... and I am not going to claim this advice will solve your prompt woes, but it might help you.

That preface out of the way, what this article is about:

This advice pertains to CLIP-BASED MODELS, which generally means anything on the SDXL architecture (SDXL, Pony, Illustrious, etc).
That's what I work with, mostly understand, and think I can help you with.


What even is CLIP?

CLIP stands for Contrastive Language-Image Pre-training. Its text encoder is included in your (SDXL family) txt2img model to translate your text prompt into the guidance the model follows while the sampler turns a random noise seed into an image. If you really want to know more about the sampling process I can recommend this article, but for the purposes of this guide, you can treat CLIP as the translation layer between you and the model.


Why does CLIP sometimes ignore parts of my prompt?

Because CLIP has a very limited 'attention span'. Your prompt is broken up into "chunks" of ~75 tokens each (note: TOKENS, not tags!), and CLIP processes one chunk at a time.

One "tag" can absolutely be multiple tokens (if you work in Automatic1111, there is a token counter at the top right of your prompt box; pay attention to it!), and once you get into the second chunk (or further), tag priority falls off dramatically.

You should strive to describe the essence of your image in that first CLIP chunk.

If you do not, there's a good chance that CLIP will:

  • ignore parts of your prompt outright

  • randomly pick which elements to display (or drop).

CLIP also does 'best' when it only has limited "focus" points to resolve; complex expressions stacked with multiple limb actions can easily overrun both chunk and focus.
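
To get a feel for where tags land, here is a minimal Python sketch of the chunk budget described above. It assumes you already know each block's token count (for example by watching the Automatic1111 counter as you type); the counts in the example are made up for illustration and are not real tokenizer output. The function name `assign_chunks` is hypothetical, and placing a tag in the chunk where its first token falls is a simplification of whatever your UI actually does at a chunk boundary.

```python
CHUNK_SIZE = 75  # approximate usable tokens per CLIP chunk, per the text above


def assign_chunks(tag_token_counts, chunk_size=CHUNK_SIZE):
    """Map each tag to the chunk (1-based) in which its first token falls.

    tag_token_counts: list of (tag, token_count) pairs in prompt order.
    The counts come from the caller; this sketch does not tokenize anything.
    """
    placement = []
    used = 0  # tokens consumed by the tags seen so far
    for tag, count in tag_token_counts:
        placement.append((tag, used // chunk_size + 1))
        used += count
    return placement


# Illustrative counts only (assumed, not measured):
example = [
    ("subject block", 30),
    ("pose block", 40),
    ("hair block", 10),
    ("clothing block", 20),
]
for tag, chunk in assign_chunks(example):
    print(f"chunk {chunk}: {tag}")
```

With these made-up numbers the first three blocks start inside chunk 1 and the clothing block starts in chunk 2. The hair block also straddles the boundary (tokens 70 to 80), which is exactly the kind of spot where it pays to trim a tag or two.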

So how/why do my prompts work, then?



I start every prompt from what I call my "prompt scaffold":

[charactercount, subject],

[camera, pose],

[physical attributes],

[face, hair, expression],

[clothing],

[environment, location],

[lighting],

[rendering, style],

(feel free to copy and adopt!)
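
If you build prompts from a script, the scaffold can be sketched as a small Python helper. This is only an illustration, not a tool I use: the block keys (`subject`, `camera_pose`, and so on) and the function name `build_prompt` are hypothetical, and "1girl" is just a placeholder subject tag. The helper keeps the blocks in scaffold order, skips empty ones, and emits one block per line.

```python
# Hypothetical keys, one per scaffold block, in scaffold order.
SCAFFOLD_ORDER = [
    "subject",               # [charactercount, subject]
    "camera_pose",           # [camera, pose]
    "physical",              # [physical attributes]
    "face_hair_expression",  # [face, hair, expression]
    "clothing",              # [clothing]
    "environment",           # [environment, location]
    "lighting",              # [lighting]
    "style",                 # [rendering, style]
]


def build_prompt(blocks):
    """Join the filled-in blocks in scaffold order, one block per line.

    blocks: dict mapping a block key to a list of tags.
    Empty or missing blocks are skipped; unknown keys are ignored.
    """
    lines = []
    for name in SCAFFOLD_ORDER:
        tags = [t.strip() for t in blocks.get(name, []) if t.strip()]
        if tags:
            lines.append(", ".join(tags) + ",")
    return "\n".join(lines)


print(build_prompt({
    "subject": ["1girl"],                       # placeholder subject
    "camera_pose": ["full body", "standing"],
    "physical": ["athletic", "slender"],
    "environment": ["office", "window", "city background"],
}))
```

Keeping the order in one list means a block can be left empty without disturbing the rest, which makes it easy to see at a glance which blocks are eating your first chunk.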

(Edited/updated section: Many thanks to John_KSampler and ravemry9 for the callout/sanity check)
Notice I have NO space for 'quality expressions' (quality tags). Unless you are working with a really old or unreliable model merge, I would urge you to drop them and trust your model, because they are largely folklore holdovers from the SD1.5 era.
That said, if you find/feel your images gain undesirable low(er) quality artifacts/content, try adding one or two back in until you see improvement; don't immediately re-add the entire stack at once.

The reason I say this is that they cost you valuable first-chunk real estate without giving you much in return.
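
The 'add one or two back, not the whole stack' approach can be sketched as a tiny Python helper that generates test variants of a prompt. Everything here is an assumption for illustration: the function name `quality_variants` is hypothetical, the tag names are placeholders (use whatever your model's card recommends), and putting the quality tags in front of the prompt is just one common convention.

```python
def quality_variants(base_prompt, quality_tags, max_added=2):
    """Return the bare prompt plus variants with 1..max_added quality tags.

    Tags are re-added one at a time, in the order given, and prepended
    to the prompt (an assumed convention, not a requirement).
    """
    variants = [base_prompt]
    for n in range(1, min(max_added, len(quality_tags)) + 1):
        variants.append(", ".join(quality_tags[:n]) + ", " + base_prompt)
    return variants


# Placeholder tags; substitute the ones your model actually responds to.
for prompt in quality_variants(
    "1girl, full body, standing, office",
    ["masterpiece", "best quality", "highres"],
):
    print(prompt)
```

Run each variant on the same seed and stop at the first one that clears up the artifacts; anything beyond that is first-chunk space you could spend on your actual image.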


Let's go through each block:


[charactercount, subject],

Replace this with what you're prompting (your 'character'). If I want an 'odd' skin color, I prompt it here as well.

[camera, pose],

Put your 'viewing angle' and body pose in here. Check for conflicts; the easier you make it for CLIP to resolve, the higher your chance of success!
My general order for this is:

  • overall framing (full body, upper body, portrait, etc.)

  • view angle (front/side/rear)

  • camera tilt (high/low angle or eye level)

  • general body pose (standing, sitting, etc.)

  • body/head movement (head turn, body twist)

  • hand gestures/interactions (holding a cup of coffee, etc.)

[physical attributes],

Define what your character looks like. Be succinct. Fewer = better.
Think of things like body frame and proportions: athletic, slender, medium breasts -- things like that.

[face, hair, expression],

Describe the face/head. My general structure is eye color, hair color, haircut, face makeup (if any) and 2-3 tags for expression/mouth.
You will probably roll over into the second CLIP chunk around here. Don't over-describe; be clear and succinct. One good tag is better than 10 weak/conflicting ones.

[clothing],

What your character is wearing (if anything ;-), yeah I'm guilty too :P )

You are probably in the second chunk here, so being concise and consistent matters.

[environment, location],

Describe your scene. Try to capture the essence in one tag, then use 1-2 more for the background (and define them as such).
Good: office, window, city background
Bad: accounting office, desk, computer, large pane window, buildings in background, office blocks

[lighting],

You can let CLIP (and your model) 'autosolve' this but if you have a specific lighting need (color, direction, type) specify it here.
Try to stick to no more than two tags.

[rendering, style],

This is optional. If you want a specific style, add it here.

Think of your image as a whole: try to complement it, and try NOT to contradict what you've written so far.

Final notes and things to keep in mind:

  • Models have training priors. These reflect dominant concepts in training images and can be extremely hard to work around.

You will recognize them when you prompt for something and the model gives you a specific interpretation over and over.

If you run into one, DO NOT try to brute-force your way around it by stacking weights and tags.
Overriding a weak prior is sometimes possible that way, but a strong prior can't be beaten like that. Either work with it, or change your scene/wording to avoid it.

  • CLIP is extremely semantic. Often infuriatingly so.

This is where my "Karen" comparison comes in. CLIP has extremely limited understanding of context (if any), so it pays to be as precise as you can within CLIP's vocabulary limit. This will probably take you multiple attempts. Have patience, and don't bloat your prompt trying to 'close loopholes', because you may inadvertently create more of them or run past CLIP's 'attention span'.

  • Accept you will never have full control over the entire image.

Pick your battles. Getting an image seed that exactly converges on your image is nigh impossible. I assume a 75/25 rule: A good prompt will give me 75% of what I have in mind, with the model filling in the other 25% - sometimes with happy little accidents I didn't ask for. (Bob Ross)

You cannot have 100% control. Let it go.

  • Allow your model room to fill in the image.

The more you constrain the model by adding more tags, the harder it will be to converge on an acceptable image.

Think of your prompt as a Venn diagram, where the area in which all circles overlap is your 'image landing zone'. The more circles you draw, the more you shrink that 'landing area' (conflicting/unhelpful tags do this faster), and thus the harder it will be to get the image you want. Then realize that getting an acceptable image (without diffusion anatomy errors and the like) is even harder, so your real 'landing area' is probably only half to a third that big!
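
To put rough numbers on the Venn picture, here is a toy Python sketch. It is NOT how diffusion actually works: it assumes every tag independently keeps some fraction of the possible outcomes, and all the fractions are invented. The function name `landing_zone` and the `acceptable_share` parameter (standing in for the 'half to a third' above) are hypothetical.

```python
def landing_zone(keep_fractions, acceptable_share=0.5):
    """Toy model: each tag keeps a fraction of the remaining outcomes.

    Returns (zone, acceptable_zone), where zone is the product of all
    keep fractions and acceptable_zone scales it by acceptable_share.
    Assumes tags act independently, which real prompts do not guarantee.
    """
    zone = 1.0
    for fraction in keep_fractions:
        zone *= fraction
    return zone, zone * acceptable_share


# Invented numbers: every tag keeps 80% of what is left.
for tag_count in (5, 15, 30):
    zone, acceptable = landing_zone([0.8] * tag_count)
    print(f"{tag_count} tags: zone {zone:.4f}, acceptable {acceptable:.4f}")
```

The exact values mean nothing; the point is the direction. Each extra circle multiplies the zone down, so a long tag stack shrinks it far faster than it feels like while you are typing.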

That all said, I hope the insights I've learned on my diffusion journey will help you in turn!
