Prompting with nonsense

I was watching a YouTube video on prompting, and this caught my eye:

Source

Segment 3 of this video

It suggests just typing at random on the keyboard to get some effect.

I tried it. This is what I get with useless_anime_merge v12 (https://civitai.com/models/199126) and the prompt 'ceruleistic uberkenkouteki maidenfurkerenschaften absconded unbekrafted with an uprifted machinenkranken, wissenmorgen was its nawa, bekrold all preethee the erstwhen candour, background aflakimbo, colours undescallen, best quality, masterpiece'

Theory

What happens if you do that?

Well, there are many ways to represent words in deep learning. One is one-hot encoding: if the vocabulary is 1'000'000 words long, each word is a vector with 999'999 0.0's and a single 1.0. But 1'000'000 words is too large a search space, which is why we have a tokenizer. By breaking words into their constituent parts, it reduces the vocabulary from 1'000'000 words to about 50'000 tokens (more precisely, according to the CLIP paper, a BPE tokenizer with 49'152 tokens).
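To make the size argument concrete, here is a minimal sketch of one-hot encoding next to a vocabulary of word parts. Both vocabularies are tiny and hypothetical (the words are borrowed from the style names discussed later), and joining a stem to a suffix is only an illustration, not how BPE actually builds its vocabulary.

```python
def one_hot(index, size):
    """Return a vector of `size` zeros with a single 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# Hypothetical word-level vocabulary: every word needs its own slot.
word_vocab = ["cyberpunk", "steampunk", "cottagecore", "craftcore"]
# Hypothetical piece-level vocabulary: words are built from shared parts.
piece_vocab = ["cyber", "steam", "cottage", "craft", "punk", "core"]

steampunk_vector = one_hot(word_vocab.index("steampunk"), len(word_vocab))
print(steampunk_vector)  # [0.0, 1.0, 0.0, 0.0]

# 4 stems and 2 suffixes: 6 pieces can spell 4 * 2 = 8 combinations,
# including ones never listed as whole words, such as "cyber" + "core".
stems, suffixes = piece_vocab[:4], piece_vocab[4:]
combinations = [stem + suffix for stem in stems for suffix in suffixes]
print(len(piece_vocab), len(combinations))  # 6 8
```

The number of pieces grows much more slowly than the number of words they can spell, which is the point of tokenizing.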

These 49'152 tokens are then encoded using a lookup table (the initial embeddings in the transformer) before being analysed by the CLIP network. These initial embeddings and the CLIP network have several useful characteristics. They are usually additive (so that priestess = priest + daughter - son, Tokyo = London + Japan - England). They are usually contextual (so that 'not anti blue' is close to 'blue'). They cluster semantically (so that 'red' is close to 'crimson', 'burgundy', 'magenta'). And finally, because it is CLIP, the network has been optimised to best describe images.
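The additive property can be pictured with a toy lookup table. The three-dimensional vectors below are hypothetical and hand-picked so that the arithmetic works out exactly; real embeddings have hundreds of dimensions and only behave this way approximately.

```python
import math

# Hypothetical embeddings, hand-picked for illustration only.
# Dimensions: [is a capital city, relates to England, relates to Japan]
TOY_EMBEDDINGS = {
    "England": [0.0, 1.0, 0.0],
    "Japan": [0.0, 0.0, 1.0],
    "London": [1.0, 1.0, 0.0],
    "Tokyo": [1.0, 0.0, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(vector, embeddings):
    """Return the word whose embedding points closest to `vector`."""
    return max(embeddings, key=lambda word: cosine_similarity(vector, embeddings[word]))


# London + Japan - England, computed one dimension at a time.
query = [
    london + japan - england
    for london, japan, england in zip(
        TOY_EMBEDDINGS["London"], TOY_EMBEDDINGS["Japan"], TOY_EMBEDDINGS["England"]
    )
]
print(nearest_word(query, TOY_EMBEDDINGS))  # Tokyo
```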

You can use the mini tokenizer in the 'embedding inspector' auto1111 extension to see how a phrase is divided into tokens.
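If you don't have the extension at hand, the idea of splitting a word into known parts can be sketched in a few lines. This is a greedy longest-match split over a small hypothetical vocabulary, which is a simplification and not the BPE merge procedure that CLIP uses, so the real tokenizer may cut the same words differently.

```python
# Hypothetical toy vocabulary; the real CLIP vocabulary is far larger.
TOY_VOCAB = ["cerule", "an", "istic", "neko", "mittens", "blue"]
# Single letters act as a fallback so any lowercase word can be split.
TOY_VOCAB += list("abcdefghijklmnopqrstuvwxyz")


def split_into_pieces(word, vocab):
    """Split `word` greedily, always taking the longest known piece first."""
    pieces = []
    word = word.lower()
    position = 0
    while position < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), position, -1):
            candidate = word[position:end]
            if candidate in vocab:
                pieces.append(candidate)
                position = end
                break
        else:
            # Character not in the vocabulary: keep it as its own piece.
            pieces.append(word[position])
            position += 1
    return pieces


print(split_into_pieces("ceruleistic", TOY_VOCAB))  # ['cerule', 'istic']
print(split_into_pieces("cerulean", TOY_VOCAB))     # ['cerule', 'an']
print(split_into_pieces("nekomittens", TOY_VOCAB))  # ['neko', 'mittens']
```

In this toy setup the made-up 'ceruleistic' shares its first piece with 'cerulean', which gives an intuition for why an invented word can still carry some of the meaning of a real one.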

Nonsense

So this should actually work. You'll notice that my nonsense prompt starts with 'ceruleistic', a made-up word that sounds like 'cerulean', meaning 'blue like the sky'. And the image has large patches of blue, with the sky visible through the window.

If we try to remove this word, or replace it with a simple 'blue', we get:

If 'ceruleistic' is removed, the sky portion is removed, and if 'blue' is also removed, the dress becomes red.

More nonsense

Now if we try the prompt 'X, best quality, masterpiece' where X is one of

nekowiener krankenshaft agassi
inuwiener krankenshaft agassi
wiener dog krankenshaft agassi

Things get interesting. The first prompt generates a girl with showier clothes (I don't understand why), sometimes a choker, sometimes cats or a catgirl. The second one confuses the model completely and gives random images (I suppose because 'neko' is more famous in English than 'inu'). The last one always gives a correct wiener dog.
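When comparing variants like these, it helps to build the prompts from one template so that only X changes. A minimal sketch, where `TEMPLATE`, `SUBJECTS` and `build_prompts` are names made up for this example:

```python
# The fixed part of the prompt; only {subject} changes between runs.
TEMPLATE = "{subject}, best quality, masterpiece"

SUBJECTS = [
    "nekowiener krankenshaft agassi",
    "inuwiener krankenshaft agassi",
    "wiener dog krankenshaft agassi",
]


def build_prompts(subjects, template=TEMPLATE):
    """Fill the template once per subject, keeping the order of the input."""
    return [template.format(subject=subject) for subject in subjects]


for prompt in build_prompts(SUBJECTS):
    print(prompt)
```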

Nonsense style

The 'style' keyword is extremely important in a prompt, because the diffusion model knows that it drives the whole image. Let's try to put made-up words just before 'style'.

More images can be generated, in general:

'ceruleistic' gives windows and sky more importance, sometimes removing the girl in the image
'nekomittens' is correctly parsed as neko + mittens, always giving a catgirl
'anticounterclockwise retrocabulator' is difficult, but because of 'anti' and 'counter', the girls are feisty and aggressive, and because of 'retro' they seem to like vintage furniture
'farrakat progandicity' girls seem to like to work ('pro' is the only thing the model can understand); they also seem to like shiny skimpy bikinis when they are not working, not sure why ... (if you like this style, see here: https://civitai.com/posts/1571908 )

The same works with strings of digits, units and symbols. In general:

'432g 254V 12A' implies that the girl is cooking (432g for a recipe), or that she's an underwear model (12A could be a body measurement), or that she's a schoolgirl (doing her physics homework). She'll also offer you food (because she's a cooking schoolgirl, she'll cook for senpai)
'32GB NV2 SDHDD' is correctly interpreted as a girl with electronics
'54hj5+-/43-5432-': those girls like to wear black or dark-colored outfits, and often have a finger raised to their mouth as if asking a question. Maybe this is mapped to some unknown token.

My theory is that style names have a regular grammar (cyber -> cyberpunk, steam -> steampunk, cottage -> cottagecore, craft -> craftcore), so the tokenization system should somehow be able to make sense of 'bubblegum noire kittencorepunk style'.
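The "regular grammar" idea can be sketched as peeling known suffixes off the end of a word. This is only an illustration of the theory, using the two suffixes from the examples above; `split_style_name` is a hypothetical helper and says nothing about how the real tokenizer splits these words.

```python
# Suffixes taken from the style names mentioned above.
STYLE_SUFFIXES = ("punk", "core")


def split_style_name(name):
    """Peel known style suffixes off the end of a made-up style word."""
    suffixes = []
    stem = name.lower()
    peeled = True
    while peeled:
        peeled = False
        for suffix in STYLE_SUFFIXES:
            # Only peel if at least one character is left for the stem.
            if stem.endswith(suffix) and len(stem) > len(suffix):
                suffixes.insert(0, suffix)
                stem = stem[: -len(suffix)]
                peeled = True
                break
    return [stem] + suffixes


print(split_style_name("steampunk"))       # ['steam', 'punk']
print(split_style_name("cottagecore"))     # ['cottage', 'core']
print(split_style_name("kittencorepunk"))  # ['kitten', 'core', 'punk']
```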

More made-up styles: https://civitai.com/posts/1576117

Synonyms

If we replace the whole prompt with 'girl in bodysuit, X, best quality, masterpiece' and vary X, we get:

Notice that ratchet, screw, nuts, bolts are all terms used by mechanics, so now the model is quite sure she works in a garage, without having made the drawing significantly skimpier than the second one.

This is a cherry-picked seed, but you can see that the second example implies an android, a netrunner or the presence of a computer. I'm not exactly sure why 'of average build weight and personality' makes her chubbier and has her playing with her phone, compared to no prompt at all.

Let's give it a try with 'wanton', a word that means 'deliberate and unprovoked (violence)' or 'sexually unrestrained', but that is also the name of a Chinese dumpling (more often spelled 'wonton').

You'll notice the term is quite strong, sexually charged, but also a teapot appeared in the background in case you're hungry. So I guess that's good (or the model also speaks Cantonese).

Putting 'worst quality, best quality' in the prompt

It's easy to make a prompt '1girl, A, B', where A and B are quality tags (low quality, best quality, etc.). You can look at a generation here: https://civitai.com/posts/1664726 .
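A grid like this is easy to enumerate. Here is a minimal sketch using `itertools.product`; the tag list and the `quality_grid` helper are assumptions made for this example, and the linked post may well have used a different set of tags.

```python
from itertools import product

# Quality tags mentioned in this article; extend as needed.
QUALITY_TAGS = ["worst quality", "low quality", "best quality"]


def quality_grid(subject, tags):
    """Map every ordered pair (A, B) of tags to the prompt 'subject, A, B'."""
    return {(a, b): f"{subject}, {a}, {b}" for a, b in product(tags, repeat=2)}


grid = quality_grid("1girl", QUALITY_TAGS)
print(len(grid))  # 9
print(grid[("worst quality", "best quality")])  # 1girl, worst quality, best quality
```

Note that `product(tags, repeat=2)` also yields pairs where A and B are the same tag; filter those out if you don't want them.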

This behaves as expected: best quality looks good, while worst quality looks pixelated and badly drawn, with a discordant and disagreeable color scheme.

Now what happens if you put 'worst quality, best quality' in the prompt? The image should be drawn both beautifully and not too well, the colors should be beautiful but not matching, and it should induce a small feeling of unease in the viewer. Those familiar with music will remember the detuned piano (or devil's piano), usually used to play music in haunted houses in movies.

First prompt with positive quality only '1girl, witch, best quality'

Second prompt with good and bad quality '1girl, witch. worst quality and best quality'

The first witch is cute, while the second one gives off a sense of evil and unease: note the clashing colors, the splashes of green on the skin, and the ugly curves at the end of the hat.

Another example with 'best quality, worst quality':

Yes, no, yes, no, yes, no, aka Vicky Pollarding your model

Finally, we will look at putting the same word in the positive and negative prompt.

Let's start with putting 'mecha' in the positive and negative:

In this case, the mechas appear in the negative space of the picture.

Let's try again with a wizard and 'spell effect' in the positive and negative.

Because the spell effect can be drawn at any time step, spell effects are created and cancelled several times, resulting in many weaker spell effects drawn.

It seems that putting a word in positive and negative will make several instances appear weakly or as shadows/black, as the model tries to draw and undraw it at every step. The concept will unidentifiably be present in the drawing (as compared to having no mecha at all).
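One way to picture this is through classifier-free guidance, where the sampler combines the prediction for the positive prompt with the prediction for the negative prompt as negative + scale * (positive - negative). The sketch below is a heavy simplification: it pretends each concept contributes one independent number to the prediction, which real models do not do, and the names `guided_prediction` and `GUIDANCE_SCALE` are made up for this example.

```python
def guided_prediction(positive, negative, scale):
    """Combine two toy predictions as negative + scale * (positive - negative)."""
    concepts = sorted(set(positive) | set(negative))
    return {
        concept: negative.get(concept, 0.0)
        + scale * (positive.get(concept, 0.0) - negative.get(concept, 0.0))
        for concept in concepts
    }


GUIDANCE_SCALE = 7.0  # hypothetical value, chosen only for the example

# 'mecha' in both the positive and the negative prompt.
both = guided_prediction({"girl": 1.0, "mecha": 1.0}, {"mecha": 1.0}, GUIDANCE_SCALE)
# 'mecha' in the positive prompt only.
positive_only = guided_prediction({"girl": 1.0, "mecha": 1.0}, {}, GUIDANCE_SCALE)
# No 'mecha' anywhere.
absent = guided_prediction({"girl": 1.0}, {}, GUIDANCE_SCALE)

print(both)           # {'girl': 7.0, 'mecha': 1.0}
print(positive_only)  # {'girl': 7.0, 'mecha': 7.0}
print(absent)         # {'girl': 7.0}
```

In this toy picture the duplicated concept is not amplified by the guidance scale, but it does not drop to zero either, which would be consistent with a mecha that is weakly present rather than absent.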

Another example, this unidentifiable cat/not cat:

A last example, this unidentifiable body suit/not body suit, obviously black as discussed previously:

Conclusion

It is completely possible to prompt with nonsense. The usual rules apply: the text is tokenized, the tokens interact, and the model makes the best of it, including tone, synonyms and possible interpretations.