In summary: CFG matters. Increasing the CFG scale can reasonably be expected to increase prompt fidelity (to some degree), at the expense of absurdly high saturation, almost to the level of a "colour burn" filter. The full generated series is available in an image post.
Changelog:
I drafted this article some time ago in the descriptive text of one of my posts. It turns out that no one reads the text of posts, and thus -- I will be updating the article here instead.
The impact of CFG within an SD XL model
While reading the documentation for an XL model (https://civitai.com/models/118756/duchaiten-sdxl-beta) I happened upon the following by @DucHaiten:
the number of cfg is greatly affected, can replace clip skip. if you want to create surreal images, cfg 3-4 is suitable, cartoon or 3d character images should be clear, cfg should be 5-6, and real people photos can sometimes be up to 7-8 but there are It can be lower, but only if you need high precision and detail to get cfg up to 9-10 or higher, I haven't tried it yet. ... [B]ut I don’t know how, just feel that the prompts are too cumbersome and not as good as before and instead of writing in tag style, You can try to choose a descriptive writing style like Midjourney
Thus, this study. I am exploring how changing only the "CFG Scale", while holding constant the seed (709479888), the resolution (taken from the aspect ratio list), and the sampling steps (60), impacts the output of an XL model.
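The setup described above, one variable swept while everything else is pinned, can be sketched as a small job list. This is only an illustration: `Txt2ImgJob` and `build_cfg_sweep` are hypothetical names rather than part of any real tool, the prompt is a placeholder, and the CFG values are simply the ones mentioned in this article. Resolution is left out because the exact value is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txt2ImgJob:
    """Hypothetical container for one generation request."""
    prompt: str
    seed: int
    steps: int
    cfg_scale: float


def build_cfg_sweep(prompt, cfg_scales, seed=709479888, steps=60):
    """Return one job per CFG value; seed and steps stay fixed."""
    return [Txt2ImgJob(prompt, seed, steps, float(cfg)) for cfg in cfg_scales]


# CFG values mentioned in this article (an assumption about the full series).
CFG_SCALES = (1, 4, 7, 10, 13, 15, 30)

jobs = build_cfg_sweep("<your prompt here>", CFG_SCALES)
for job in jobs:
    print(job.cfg_scale, job.seed, job.steps)
```

Because only `cfg_scale` differs between jobs, any change between two images in the series can be attributed to the CFG setting rather than to the seed or step count.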
Defining CFG Scale
"Classifier Free Guidance" or CFG seems to be discussed by Ho and Salimans 2022, as a mechanism by which the model itself, without a classifier, can inform its own output.
Classifier guidance instead mixes a diffusion model’s score estimate with the input gradient of the log probability of a classifier. By varying the strength of the classifier gradient, Dhariwal & Nichol can trade off Inception score (Salimans et al., 2016) and FID score (Heusel et al., 2017) (or precision and recall) in a manner similar to varying the truncation parameter of BigGAN. ... To resolve these questions, we present classifier-free guidance, our guidance method which avoids any classifier entirely. Rather than sampling in the direction of the gradient of an image classifier, classifier-free guidance instead mixes the score estimates of a conditional diffusion model and a jointly trained unconditional diffusion model. By sweeping over the mixing weight, we attain a FID/IS tradeoff similar to that attained by classifier guidance. Our classifier-free guidance results demonstrate that pure generative diffusion models are capable of synthesizing extremely high fidelity samples possible with other types of generative models.
Here, "temperature" is a way of describing the chance of predicting something other than the most likely result. Thus, if I'm reading the paper correctly (and looking at the paper's figure 1 correctly), the higher the guidance (classifier or classifier-free), the more likely the model is to hew to its training data for a given classified token. Of note, they say "Interestingly, strongly guided samples such as these display saturated colors." (ibid), and we can observe the same effect in the samples below, especially the one at CFG 30. (Automatic1111's sliders place an upper bound at 30, but I don't think a higher-level experiment is worth the compute.)
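To make the "mixing weight" from the quote concrete, here is a minimal sketch of the guidance rule in the form that Stable Diffusion tooling commonly uses: start from the unconditional prediction and move toward the conditional one by `scale`. The paper writes the same mix as (1 + w) * cond - w * uncond, which matches this form with scale = 1 + w. The function name `apply_cfg` and the three-element lists are made up for illustration; real predictions are large tensors, and I am assuming the front end used here follows this convention.

```python
def apply_cfg(uncond, cond, scale):
    """Mix unconditional and conditional noise predictions element-wise.

    guided = uncond + scale * (cond - uncond)
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy stand-ins for the two noise predictions (illustrative values only).
uncond = [0.0, 0.5, -1.0]
cond = [1.0, 0.5, -0.5]

for scale in (0, 1, 7, 30):
    print(scale, apply_cfg(uncond, cond, scale))
# 0 [0.0, 0.5, -1.0]
# 1 [1.0, 0.5, -0.5]
# 7 [7.0, 0.5, 2.5]
# 30 [30.0, 0.5, 14.0]
```

Two things stand out in the toy numbers. At a scale of 1 the mix collapses to the plain conditional prediction, so there is no extra push toward the prompt. At 30, every component where the two predictions disagree is pushed far outside the range of either input, which is at least consistent with the saturation the paper describes.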
Setting aside the extreme saturation, there is a more subtle effect in play. One that is less interesting in a paper asserting "husky" and "cat", and more interesting on a "literate" prompt. And thus, my experiments [1] and observations.
Experiment 1 -- Literate Prompting
This discussion by Chris McCormick summarises CFG as:
And the CFG “Scale” refers to the ability to increase or decrease the amount of influence the text description has on the image generation.
There is also some "common knowledge" (i.e. I can't figure out where I heard this and searching turns up nothing, so that and $5 gets you a flat white) that suggests that there should be a higher CFG for a longer prompt. Thus, inspired by the literate prompts from @AiGeisha, and a request for fantasy from @Afroman4peace, here is a fantasy coffee on a foggy morning.
I chose the forest and "foggy morning" quite intentionally, to default the image to desaturated and muted tones.
A young woman in an very dark red felted wool cloak boils coffee in a small copper pot in her campsite in the deep eastern european woods on a very foggy spring morning, She is a no-nonsense adventurer in the 17th century travelling up from Istanbul, Her large backpack is resting against a tree, The bark on the trees and the loam of the forest floor is highly detailed, the copper pot rests on a cast iron trivet above a bed of warm charcoal which is all that remains from the evening's fire, the cute girl's axe is sheathed and belted into her backpack resting against a tree, her campfire is ringed with flat stones found from the surrounding woods, warming her hands on an enamelled mug, hyperdetailed painting in watercolor graphite and ink, low + realistic fantasy, inspired by the deed of paksenarrion
The "dark red", "copper", and "warm charcoal" were chosen to allow for the predicted hyper-saturation of high-CFG to really pop. The Deed of Paks was lurking in my mind as a mood inspiration -- I strongly doubt that it has any actual impact on the image.
CFG 1 is a strange place
Before looking at this CFG scale in particular, it is worth raising the fascinating interpretation of CFG 1:
https://civitai.com/images/2116047?postId=518358
All of these images use the exact same seed. But only at 1 is the fundamental composition/layout of the image different. My expectation is that this image is very close to copyright infringement of some famous painting. It is not faithful to the prompt in any details -- but it does achieve the broad strokes.
Observations
Interestingly, coffee only showed up at CFG 30:
https://civitai.com/images/2116056?postId=518358
Though even at 30, many of the salient details of the prompt were omitted: trivet, backpack, axe. It also didn't improve the image composition in terms of that odd floating kettle. The mound of coffee beans and the stabby forest rat at CFG 15 are just surreal.
CFG 4 seems to be the most useful of the generations, hitting the desired saturation, copper kettle, and "warming of hands." There is no evidence, however, that faithfulness to the prompt increases at higher CFG.
Literate Prompting Conclusions
Literate prompting, on its own, is not sufficient for the generation of good images. Increasing the CFG on a long prompt causes an increased "fixation" rather than "faithfulness" -- the dominance of coffee beans, the lack of a well-defined campfire, and the surreal hyperreal colours of high CFG point to that mode being an intentional stylistic choice. Even at CFG 13, the image feels less authentic to the prompt than at 4.
I think the correct impression here is that my wordy prompt created too many opportunities for model fixation. If using literate prompting, an overly detailed prompt like the one above is a detriment.
Experiment 2 -- Avoiding SD 1.5 negative prompts
In their guide, sevenof9247 suggests:
please leave neg-prompt blank until you see something you don't want in the image.
I used: visible flames, flames inside a pot, fire, tent, snow as my negative prompt. I chose these negative terms based on prior test runs of the image. While there was some aberrant legginess at CFG 13 and 15, my expectations of 1.5 suggest that far more significant negative prompting would be necessary in that earlier foundation model.
Observations
On the other hand, "fire" is obviously "Just like, your opinion, man" in the case of this checkpoint, causing CFG 7 through 15 to have an odd lantern instead of a firepit. No coals were ever observed. "Snow" was correctly removed, as was "tent."
Conclusions
I would say that my experiment supports this novel style of negative prompting at all CFG levels, though it doesn't look like CFG impacts the "fixation" on the negative prompt to a useful degree. Further experimentation is warranted.
Experiment 3 -- CFG scale
The primary intention of this study was to explore the impacts of the CFG scale on an SD XL trained model.
Observations
At 4 and below, the output exhibited low saturation and low sharpness.
7 is, correctly, the default, though it's worth trying 4, 7, and 10 to see if the fixation or lack thereof is important to the specific prompt.
At 13, the model started developing a characteristic hyper-saturation, and at 15 it almost became cartoon-like.
Conclusions
@DucHaiten's intuitions about the consequences of various CFG levels don't seem to apply in this model. Punching the CFG above 12 or so will result in increased sharpness, hypersaturation, and a much more literal fixation on the trained objects within the prompt. The timbre (to borrow a term from music) of the prompt is less apparent with a high CFG, but the "technical" accuracy is significantly better. (See "coffee" above.)
Looking at some of the other literature on this topic (https://getimg.ai/guides/interactive-guide-to-stable-diffusion-guidance-scale-parameter), there does seem to be an expectation that CFG scales with prompt detail. The operational conclusion that I have here is that the CFG level should not be "left" at 7, but instead should be dialed in across the full range before batch generation commences. I think that 4, 7, 10, and 15 are useful touchstones.
1 is worth the occasional experiment to see what the model will do if left to its own devices. At 1, the prompt is a mere "vibe" or suggestion, and sometimes that sort of free-wheeling pseudo-creativity is desirable.
Conclusions
CFG has a huge impact on the model's evaluation and literalness, but high values come at the cost of exaggerated sharpness and saturation. High CFG on "photoreal" outputs does not seem like a high priority.
Low CFG is an interesting liminal space that feels like it explores what some of high-batch-count did for 1.5, and is worth exploring to see if any useful artistic accidents result.
Second experiment: The impact of CFG on "action" prompts across multiple models
X/Y/Z comparison across models https://civitai.com/images/2122352, 40 megs
Observation
Very few CFG 3 outputs were usable, though 3, 7, and 11 all suffered from problematic arms and legs to a degree unusual in SD XL images. Increasing CFG did not reliably increase "quality" though the models did fixate on "very happy" to a greater degree at 11.
Conclusions
When going for a tricky, detailed prompt, it's not worth generating at low CFG. Usually, the specificity of the prompt suggests a strong desire for specific details. It is, however, worth trying to figure out what a given checkpoint model will fixate upon and iterating from there. Sometimes the fixation matches the desired emphasis in the image.