Preface

This article was written to explain to people outside diffusion what it is and isn't, to hopefully allay some of their fears and to give some factual foundation to any discussion. The only agreement implied or asked for is "please understand what you are talking about" - even if the other person ends up disagreeing or disliking the technology or the behavior around it.

Please keep the intended target audience in mind when reading this. The original article also included a glossary, which became the nucleus of my Diffusion Lexicon.

Understanding text to image diffusion

Its abilities and limitations: an attempt to dispel the myths, mystique and misunderstanding.

Right now, text to image is an enormously contentious and charged subject that invites a lot of (understandable) gut reactions rather than measured thought.

Given how suddenly it arrived in the public arena, and the amount of media attention lavished on it, it was all but inevitable that camps would form on either side of the argument and that a dual hype/doom loop would be prophesied.

It is the writer’s belief that both sides are somewhere between unaware and underinformed; the overriding hyperbole of either side cannot be backed up by existing or expected capabilities (to the writer’s best knowledge, at least).

With of course the caveat that the future is unknowable, and may be surprising.

Basic Principles

At the root of text to image are 3 things:

- the idea (what does the operator want)

- the prompt (how the idea translates into instruction)

- the model (which turns the prompt text into an image)

An artist can directly commit an idea to a medium (paper, pixels, sculpture, etc.) and immediately iterate on it to refine it toward their vision.
But the text to image operator has to go through several steps before they can iterate, and at each of these steps, uncertainty and limitations creep in.

Step 1: Going from idea to prompt.

There are two very different “schools” of prompt instruction.

Both are imperfect and limited, with different strengths and weaknesses.

The choice is largely dependent on operator preference, intent, ease of access, and willingness to understand the process.

Option A: Natural Language

These are the most publicly visible prompts, and probably do the most to evoke the illusion of a “magic incantation that turns into an image”. They typically look something like this:

a woman walks through a sunlit rose garden with fluffy little clouds in the sky

and they certainly look very evocative, especially when the resulting generated image is close enough to the description that the operator can gloss over/accept any inaccuracies, omissions, or substitutions.

Option B: Tagging

This is the method favored by operators who desire more semantic (literal) control; this prompting style tends to be less visible to the wider public:

female, full body, side view, walking, rose garden, warm lighting, sunlight, cloudy sky

which more closely resembles “programming”, and even there the operator must accept that he is communicating concepts rather than delivering literal instructions. The only difference from natural language is that tag-based prompts tend to be easier to steer and to “debug” (although not in a literal sense), because each tag can be added, removed or swapped on its own.
(The writer’s main experience is with tag-based prompting.)

Regardless of which one gets chosen, “nat lang” or tags, the operator has to compromise and compress the original idea to fit into either descriptive system.
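The “debugging” of a tag prompt can be pictured as leaving out one tag at a time and comparing the results. Below is a minimal Python sketch of that idea, using the tag prompt from Option B. The function name `ablations` is made up for illustration and is not part of any diffusion tool; the sketch only builds the prompt texts and does not generate images.

```python
def ablations(tags):
    """Return one prompt variant per tag, each with that single tag left out."""
    variants = {}
    for i, removed in enumerate(tags):
        kept = tags[:i] + tags[i + 1:]
        variants[removed] = ", ".join(kept)
    return variants


tags = ["female", "full body", "side view", "walking",
        "rose garden", "warm lighting", "sunlight", "cloudy sky"]

full_prompt = ", ".join(tags)
variants = ablations(tags)

print(full_prompt)
for removed, prompt in variants.items():
    print(f"without '{removed}': {prompt}")
```

Running each variant on the same seed and comparing the images can hint at which tag is responsible for an unwanted effect, although the result is still a hint rather than proof.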

Step 2: Going from prompt to model

Once the prompt is set, it gets split into “tokens” and processed by the model’s language processing layer (the “text encoder”) into what is basically a ‘road map’ for the model. There are various text encoders with cryptic names like CLIP, T5 and Qwen, and probably more.

Under the hood they all perform the same basic function: translating human language into directions for the model.

It is very important to realize that all of these:

- have very defined vocabulary limits; going outside those typically means that concept gets ignored

- are unable to “understand meaning”; instead they match/correlate patterns (token X correlates to training data Y)

- have very weak to no contextual ability; anything but basic relations between terms becomes very unreliable

- suffer from limited “focal attention”; complex descriptions end up getting collapsed/merged
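The limits listed above can be made tangible with a deliberately exaggerated toy. The sketch below is not how CLIP or T5 work internally (real text encoders are far more capable than this); the tiny `KNOWN_WORDS` vocabulary and the `match` function are invented purely for illustration. It shows a word outside the vocabulary being ignored, and two prompts with different meanings ending up as the same set of matched patterns.

```python
# Invented mini vocabulary; a real text encoder knows tens of thousands of tokens.
KNOWN_WORDS = {"red", "white", "rose", "dress"}


def match(prompt):
    """Look up each word on its own; word order and grouping are thrown away."""
    words = prompt.lower().replace(",", " ").split()
    matched = sorted({w for w in words if w in KNOWN_WORDS})
    ignored = [w for w in words if w not in KNOWN_WORDS]
    return matched, ignored


a = match("red rose, white dress")
b = match("white rose, red dress")
c = match("red rose, white farthingale")

print(a)
print(b)   # same as a: the toy cannot tell which colour belongs to which object
print(c)   # 'farthingale' is outside the vocabulary and simply drops out
```

In this toy the confusion is total; in a real model it is partial, which is why colours and details sometimes "leak" between objects rather than always doing so.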

When prompting, the operator has to realize he’s instructing a very semantic, very vocabulary-limited and biased pattern matching system that is:

- probabilistic

- random in at least one and optionally more ways (this will be explained in the next section)

- loaded with very strong priors (dominant training concepts; think of them as ‘strong opinions’) about how something should end up looking

What the operator most definitely is NOT doing is explaining himself to an artist.

An actual artist, even a novice one, has far more capability to “solve” the description of an image than a text to image diffusion model has. The model might produce a more appealing output, but it is not more versatile. It is very heavily limited by its existing training data and incapable of going outside that.

You can explain to an artist how to draw a dragon, but if the model has no concept of one and no supporting training data, no amount of describing will work.

Step 3: Going from model to output

Once the language processing layer passes the ‘road map’ to the actual model itself, a few things happen:

1. An image “seed” is set. This is either a random or operator-input number that generates a visual noise pattern. Think of TV static, except that the pattern is not truly random: it is fully determined by the seed number, so the same seed always produces the same pattern.

The best comparison is a procedurally generated game (like Minecraft), which generates its entire world from one number.
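The relationship between seed and noise can be shown with Python's standard `random` module. This is only a sketch: real diffusion tools generate their noise with a GPU library and at a much larger size, and the function names `noise_grid` and `show` are made up here.

```python
import random


def noise_grid(seed, width=8, height=4):
    """Build a small grid of Gaussian noise values that depends only on the seed."""
    rng = random.Random(seed)  # private generator, so results are repeatable
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def show(grid):
    """Print the grid as crude 'static': '#' for positive values, '.' for the rest."""
    for row in grid:
        print("".join("#" if value > 0 else "." for value in row))


first = noise_grid(seed=1234)
again = noise_grid(seed=1234)
other = noise_grid(seed=1235)

show(first)
print(first == again)   # same seed, same pattern
print(first == other)   # neighbouring seed, unrelated pattern
```

Note that seeds 1234 and 1235 are not "close" in any visual sense; a seed is a label for a pattern, not a dial.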

2. The sampler takes that noise pattern and the road map it got, and begins ‘following’ the map in steps based on the model data, turning that noise into an image.

Think of the process like a sculptor and a block of (imperfect) stone. The ‘road map’ is the design, the noise pattern is the block of stone (with its own little faults and imperfections) and the model is the overall style the sculptor works in.

Every step is one or more taps with the chisel to ‘reveal’ the end result, and at times the stone crumbles or fractures unexpectedly, forcing the sculptor to adapt to the emerging end result.

Different samplers have different properties, and the final result can vary depending on which is used.

Some samplers are more stable, some are more iterative (taking into account not just the next step but also the previous one), and some intentionally reintroduce noise for a more probabilistic result.

None of them are ‘better’ or ‘worse’; the choice largely comes down to operator intention and preference.
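The step-by-step nature of sampling can be imitated with a toy in Python. This is not a real sampler and contains no model: the `target` list stands in for wherever the model would steer the image, whereas a real sampler has no ready-made answer and has to ask the model for a direction at every step. The names `sample`, `rate` and `extra_noise` are invented for this sketch.

```python
import random


def sample(target, seed, steps=10, rate=0.3, extra_noise=0.0):
    """Walk from seeded noise toward `target` in a fixed number of small steps.

    extra_noise > 0 imitates samplers that re-inject noise along the way.
    """
    rng = random.Random(seed)
    current = [rng.gauss(0.0, 1.0) for _ in target]  # the starting 'block of stone'
    for step in range(steps):
        # One 'tap of the chisel': move each value part of the way to the target.
        current = [c + rate * (t - c) for c, t in zip(current, target)]
        fade = 1.0 - (step + 1) / steps  # shrinks to 0 at the final step
        if extra_noise > 0.0 and fade > 0.0:
            current = [c + rng.gauss(0.0, extra_noise * fade) for c in current]
    return current


target = [1.0, -1.0, 0.5, 0.0]
plain_a = sample(target, seed=1)
plain_b = sample(target, seed=1)
plain_c = sample(target, seed=2)
noisy = sample(target, seed=1, extra_noise=0.5)

print(plain_a == plain_b)  # same seed and settings repeat exactly
print(plain_a == plain_c)  # a different seed leaves a different trace in the result
print(plain_a == noisy)    # re-injected noise changes the outcome on the same seed
```

Even in this toy, the end result carries a small trace of the starting noise and of the sampler settings, which is the point of the sculptor comparison above.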

3. Once the sampler reaches the end of its step count, it has settled on a mathematical “map” (“latent”) of what the image should be (the jargon term for this settling is “convergence”). It passes that map to an output model (“VAE”), which then turns it into an actual image.
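To give a feel for how compact that “map” is, here is a small size calculation in Python. It assumes the figures used by Stable Diffusion 1.x and SDXL (a latent 8 times smaller than the image in each direction, with 4 values per cell); other model families use different numbers, so treat the constants as assumptions rather than a general rule.

```python
DOWNSCALE = 8         # assumption: factor used by Stable Diffusion 1.x and SDXL
LATENT_CHANNELS = 4   # assumption: same model family


def latent_shape(image_width, image_height):
    """Return (channels, height, width) of the latent for a given image size."""
    return (LATENT_CHANNELS, image_height // DOWNSCALE, image_width // DOWNSCALE)


def count(shape):
    """Number of values in a grid of the given shape."""
    total = 1
    for size in shape:
        total *= size
    return total


image = (3, 1024, 1024)             # RGB image, 1024 x 1024 pixels
latent = latent_shape(1024, 1024)   # what the sampler actually works on

print(latent)                        # (4, 128, 128)
print(count(image) / count(latent))  # 48.0
```

Under these assumptions the sampler works on a grid with 48 times fewer values than the final image, and the VAE is what fills in the pixel-level detail.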

Step 4: Output evaluation

This entire process can take anywhere from a few seconds to several minutes; on a capable home setup it typically takes between 10 and 45 seconds per image, depending on hardware and resolution.

Hosted/cloud generators can be either faster or slower: faster due to industrial scale computing power, slower due to popularity and queuing. They are by no means a requirement for diffusion.

The writer is happily working on a 16GB home GPU and extremely content with it.

Once the image presents itself, the human side of the process takes over and the operator has to answer a very important question: “Is this close enough to what I wanted?”

The “close enough” is extremely important here: because diffusion is probabilistic, and because it is impossible to describe every detail in the output, the operator HAS TO accept loss of control.

- Pose might drift.

- The image may have “collapsed to prior” (where the description fell too close to a model “strong opinion”).

- Details may be inconsistent with the original idea.

- Colors might deviate.

- Background might be a completely different interpretation.

- Lighting can be surprising.

- Styling can be unexpected.

Most of these can be generally controlled, but none of them can be completely or explicitly controlled.

This is the core difference between diffusion and an artist.

At this point the operator has to choose:

- Keep this (version of this) image

- “Reroll” (generate the same prompt on a different seed)

- Change prompt and reroll

- Change prompt and keep seed and hope composition remains stable
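The first two choices, keep or reroll, form a loop that can be sketched in Python. Everything here is a stand-in: `generate` does not produce an image but a made-up “closeness” score between 0 and 1, and `threshold` replaces the operator's own judgement, which in reality cannot be reduced to a number.

```python
import random


def generate(prompt, seed):
    """Stand-in for a full generation: returns an invented closeness score (0 to 1)."""
    rng = random.Random(f"{prompt}|{seed}")  # repeatable for the same prompt and seed
    return rng.random()


def reroll_until_close_enough(prompt, threshold, max_tries=500):
    """Try seeds 0, 1, 2, ... and stop at the first result the operator would keep."""
    for seed in range(max_tries):
        if generate(prompt, seed) >= threshold:
            return seed, seed + 1   # winning seed, number of images generated
    return None, max_tries          # concede: nothing was close enough


prompt = "female, walking, rose garden, sunlight"
easy = reroll_until_close_enough(prompt, threshold=0.0)
picky = reroll_until_close_enough(prompt, threshold=0.99)
impossible = reroll_until_close_enough(prompt, threshold=2.0)

print(easy)
print(picky)
print(impossible)
```

The stricter the operator's idea of “close enough”, the more images get generated and discarded before one is kept, and an impossible standard ends with no image at all.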

And depending on

- prompt complexity

- sampler choice

- model choice

- operator understanding

- operator opinion

going from “I have an idea” to “I have a great image” can take anywhere from 5 to 500 (or more) iterations of the entire process.

It is not even uncommon to never arrive at an image at all, and to have to quietly concede that the idea is entirely unworkable given the tools the operator has available.
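Combining the figures mentioned in this step (10 to 45 seconds per image on home hardware, 5 to 500 iterations) gives a rough idea of the machine time behind a single finished image. The Python below is plain arithmetic on those numbers; it ignores the time the operator spends looking, thinking and rewriting prompts, and the function names are made up.

```python
def total_time(iterations, seconds_per_image):
    """Total generation time in seconds for a number of iterations."""
    return iterations * seconds_per_image


def pretty(seconds):
    """Format a number of seconds as hours, minutes and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


best_case = total_time(iterations=5, seconds_per_image=10)
worst_case = total_time(iterations=500, seconds_per_image=45)

print(pretty(best_case))    # 0h 0m 50s
print(pretty(worst_case))   # 6h 15m 0s
```

So the same process that sometimes delivers a keeper in under a minute can also swallow more than six hours of generation time for one image.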

Conclusions

Is diffusion magic? No, definitely not.

But if you only see highly curated/cherry-picked output you might believe it is, especially if you don’t understand the process and effort needed to arrive at those results.

Can it replace artists? That’s a harder question.

In all likelihood: no, it cannot, because no matter how sophisticated a model gets, or how good the interpretation between user and model becomes, the model is incapable of doing (much) more than it is able to interpret and has data for.

Will it transform the perception of art? Yes, absolutely.

Just like Photoshop did ~20 years ago (and many of the arguments against diffusion are echoes of that), it will change the landscape; whether that will be for better or worse is up to both users and consumers.

But options like ControlNet (where you can input a sketch/concept) that allow for collaboration between model and artist present interesting opportunities.

Is all diffusion output “slop”? That depends on your perception.

If you are vehemently against it, then no amount of explanation or understanding of how the result was achieved will change your mind.

But there are people who work with it thoughtfully and deliberately, and who are willing to invest a lot of time, effort and knowledge into what they are doing; equating them with those who just want a quick indulgence or memey image might not be the fairest thing to do.

And finally, the author’s own opinion

Hi there. I wrote this partly out of frustration, partly out of a desire to be understood. There are so many myths and misconceptions about diffusion, and the tribal standpoints have gotten so far apart and extreme ... without anyone really understanding what they’re talking about.

I am not trying to convert anyone into a proponent. I fully understand the reservations people will have, even after reading this; even I have those. But I want you to understand.

I am uncomfortable with the entire data ownership debate; I want it to be resolved adequately and equitably for all sides involved, but I’m well aware that I’m neither wise nor knowledgeable enough, nor do I have enough overview, to be able to present a solution.

In the meantime I refuse to be labeled ‘artist’ because I do not consider myself that. I stand meekly on the shoulders of the people that did contribute to the data I use. At best, I’ll accept the label “diffusion crafter”.

Regardless, I hope you, the reader, have learned something from this article. That’s all I ask.
