Introduction
While working with prompting (and asking questions to better understand what I was actually doing), I remarked one day, rather sourly, that 'CLIP feels like a quirky, stubborn Bob Ross'. Not long after, while running into semantics over a prompt, I made the 'literal Karen' quip.
A while later, I connected the two and the cursed image of "Bob Ross with a Karen haircut" floated through my head, together with the realization that this is really how CLIP feels (even after you clean up your prompts!) -- exactingly semantic and literal, but still capable of "happy accidents" that can (pleasantly) surprise you.
And in a fit of inspiration, half as a joke, I wrote a prompt for an image.
Why "CLIP-chan" though?
Because for me it reinforces the mental model of what CLIP is and does:
A very literal interpreter of your prompt: you have to tell it very precisely what you want (semantic/literal Karen), but you still have to leave it enough creative freedom to fill in everything outside that (quirky Bob Ross).
Hence her name: Karen Ross.
This goes back to what I said in my previous article: you cannot have full control over diffusion, and you will fight with (and lose to) CLIP if you try anyway.
I also gave CLIP-chan "moods" (links go to images):
angry - you gave a confusing/bad prompt, in one of these ways:
Your prompt contradicted itself somewhere (for example by specifying two mutually exclusive things, like conflicting camera angles);
Your prompt wasn't clear enough to interpret. This is especially important when using Illustrious: check Danbooru for the proper tag (for example, you used 'front view' when Danbooru has 'straight-on' as the main tag). This applies to PDXL as well, just not as strongly;
You overconstrained your prompt by adding far more descriptive tags than necessary (for example "road, highway, road signs, lanes", where just "highway" would have sufficed).
happy - your prompt landed perfectly, CLIP had no problem reading it and giving you what you were looking for (well, mostly ....)
mocking - when you make one of those little user mistakes that make you go "why did I get that ... oh ... OH ... I screwed up!"; usually caused by not properly clearing confusing/old/misaligned tags out of your prompt when changing scene or view. (Yeah, I've been there.)
defeated - when nothing seems to be working: you keep running into blocking priors and model training gaps, you can't find the right LoRA, and every wording you tried failed, no matter how many seeds you go through.
This does happen. It sucks. Either find a prior you can work with or be prepared to change your idea.
creative and busy painting - busy generating your next image and creatively filling in all the areas you didn't specify (and some of the ones you did)
playful "oooh, I'm going to paint a happy little tree right there where you didn't expect one!" - the model filling in the image creatively (which is why it's so important that you don't overconstrain your prompt and that you give it that space)
exhausted - how CLIP would feel after you've gone through many, MANY ideas, prompt changes, seeds, evaluations, picks and discards; this one is not really technical and more of a 'character trait', just like
sleeping/dreaming - after you're done creating, waiting for the next session ;)
All of this helped to solidify an identity that goes beyond a systematic understanding of what CLIP is and what it does for you. The same goes for the outfit I gave her: a rainbow-colored shirt, practical cargo pants and (very stereotypical artist-y) sandals.
The takeaway
Giving CLIP that 'personality' helped me read my prompts critically and think "how would/could CLIP get this wrong?" before I run them. And if I don't get what I expected, I no longer blame the model or try to "fix the loophole" by contorting the prompt with more verbose descriptions or a bunch of extra tags (I admit it: I've been there, and you probably have too). Instead, I compare the prompt to what I got and try to figure out why it happened (and if you can point to your prompt and say "ah of course, that's why that happened", so much the better!!)
The model gives you its interpretation of what you told it to do (Karen) and will add a finish/flourish in the areas you left open for it (Bob Ross). It's up to you to "explain better" which elements you do want, and "better" generally means "different", not "in more words".
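If you like, that "how could CLIP get this wrong" read-through can be roughed out as a tiny script. This is only a sketch of the idea, not a real tool: the function names and both lookup tables below are made up for illustration (filled in with the examples from the mood list), and you would have to maintain your own lists for the model you use.

```python
# Hypothetical lookup tables, for illustration only; extend them for your own model.
# Groups of tags that are assumed to contradict each other (e.g. camera angles).
EXCLUSIVE_GROUPS = [
    {"straight-on", "from behind"},
    {"from above", "from below"},
]
# A broad tag mapped to narrower tags it is assumed to already imply.
IMPLIED_BY = {
    "highway": {"road", "road signs", "lanes"},
}


def split_tags(prompt):
    """Split a comma-separated prompt into lowercase, trimmed tags."""
    return [t.strip().lower() for t in prompt.split(",") if t.strip()]


def lint_prompt(prompt):
    """Return a list of warnings about tags that may confuse the model."""
    present = set(split_tags(prompt))
    warnings = []
    # "angry" case 1: two mutually exclusive things in one prompt.
    for group in EXCLUSIVE_GROUPS:
        clash = sorted(present & group)
        if len(clash) > 1:
            warnings.append("conflicting tags: " + ", ".join(clash))
    # "angry" case 3: overconstraining with tags a broader tag already covers.
    for broad, narrow in IMPLIED_BY.items():
        extra = sorted(present & narrow)
        if broad in present and extra:
            warnings.append(f"'{broad}' probably already covers: " + ", ".join(extra))
    return warnings


if __name__ == "__main__":
    for line in lint_prompt("straight-on, from behind, road, highway, road signs, lanes"):
        print(line)
```

A script like this only catches the mistakes you already know about; the actual habit is still reading the prompt yourself before you hit generate.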
Anyway, I hope my mental model makes sense to you all; I'll probably keep doing images of her based on my experiences in working with diffusion.
Happy & safe diffusioning! Stay on her "happy accident" side and avoid the "I demand to speak to the prompt writer!" half!!
