A Study of the CFG Parameter
While reading the documentation for an XL model (https://civitai.com/models/118756/duchaiten-sdxl-beta) I happened upon the following by @DucHaiten:
the number of cfg is greatly affected, can replace clip skip. if you want to create surreal images, cfg 3-4 is suitable, cartoon or 3d character images should be clear, cfg should be 5-6, and real people photos can sometimes be up to 7-8 but there are It can be lower, but only if you need high precision and detail to get cfg up to 9-10 or higher, I haven't tried it yet. ... [B]ut I don’t know how, just feel that the prompts are too cumbersome and not as good as before and instead of writing in tag style, You can try to choose a descriptive writing style like Midjourney
Thus, this study. I am exploring how changing only the "CFG Scale", while holding the seed (326879243), the resolution (taken from the aspect ratio list), and the sampling steps (50) constant, affects the output of an XL model.
Defining CFG Scale
"Classifier Free Guidance" or CFG seems to be discussed by Ho and Salimans 2022, as a mechanism by which the model itself, without a classifier, can inform its own output.
Classifier guidance instead mixes a diffusion model’s score estimate with the input gradient of the log probability of a classifier. By varying the strength of the classifier gradient, Dhariwal & Nichol can trade off Inception score (Salimans et al., 2016) and FID score (Heusel et al., 2017) (or precision and recall) in a manner similar to varying the truncation parameter of BigGAN. ... To resolve these questions, we present classifier-free guidance, our guidance method which avoids any classifier entirely. Rather than sampling in the direction of the gradient of an image classifier, classifier-free guidance instead mixes the score estimates of a conditional diffusion model and a jointly trained unconditional diffusion model. By sweeping over the mixing weight, we attain a FID/IS tradeoff similar to that attained by classifier guidance. Our classifier-free guidance results demonstrate that pure generative diffusion models are capable of synthesizing extremely high fidelity samples possible with other types of generative models.
Here, "temperature" is a way of describing the chance of predicting something other than the most likely result. Thus, if I'm reading the paper correctly (and looking at the paper's figure 1 correctly), the higher the guidance (classifier or classifier-free), the more likely the model is to hew to its training data for a given classified token. Of note, they say "Interestingly, strongly guided samples such as these display saturated colors." (ibid) and we can observe the same effect in the samples below, especially the one at CFG 30. (Automatic1111's sliders place an upper bound at 30, but I don't think a higher-level experiment is worth the compute.)
Setting aside the extreme saturation, there is a more subtle effect in play, one that is less interesting in a paper asserting "husky" and "cat", and more interesting on a "literate" prompt. Thus, my experiments[^1] and observations.
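The "mixing" of conditional and unconditional estimates described in the quote can be sketched in a few lines. This is a minimal illustration, assuming the convention commonly used in Stable Diffusion-style samplers, where the guided estimate is the unconditional estimate plus the CFG scale times the difference between the two. (Ho and Salimans write the same thing with a weight w, which corresponds to a scale of 1 + w here.) The function name and the toy two-element vectors are my own stand-ins, not real model outputs.

```python
def guided_prediction(eps_uncond, eps_cond, cfg_scale):
    """Mix unconditional and conditional noise estimates.

    cfg_scale = 0 returns the unconditional estimate,
    cfg_scale = 1 returns the conditional estimate unchanged,
    and larger values extrapolate past the conditional estimate.
    """
    return [u + cfg_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins for the two model outputs (hypothetical values).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]

for scale in (1, 4, 7, 15, 30):
    print(scale, guided_prediction(eps_uncond, eps_cond, scale))
```

Wherever the prompt changes the estimate (the first element here), the change is multiplied by the scale; wherever it does not (the second element), nothing moves. That extrapolation is one plausible reading of why very high CFG values push images toward extreme, saturated results.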
Experiment 1 -- Literate Prompting
For SD 1.5, the image I used to make my profile picture here had the following positive prompt:
An oil painting of a cyborg librarian tending to Borges infinite library (hexagonal rooms, four walls of bookshelves, books packed on bookshelves, marble floors, ornate wood panelling, classy, repeating to infinity), cybernetic librarian (tweed jacket (tweed, academic, shoulder patches, dusty, holes), gears and wires showing in removed patches of skin, crazy moustache and beard, wide eyes, portrait, oil painting, expressionist painter, monet, cybernetic monet, human skin tones, cybernetic skull (balding, skin-plate removed, exposed wires and gears, brass), perspective into library over one shoulder (doorway, marble, tables, endless books), ornate dome overhead (stained glass, rays of light, brass and lead), close up on face and shoulders
Which, as SD 1.5 prompts go, isn't a great prompt.
My prompt for this experiment was:
An impressionist oil painting. Portrait of a cyborg librarian wearing a tweed jacket with elbow pads tending to Borges infinite library with gears and wires showing in removed patches of skin, The librarian has a crazy moustache and beard. The painting is in perspective into the infinite library over one shoulder, clearly showing the ornate neo-victorian dome overhead. The infinite library is a series of infinite hexagonal rooms, with bookshelves on four walls. The cybernetics of the librarian are steampunk, brass, gears, and clockwork.
Which is far more "literate" (no tagged parentheticals). Each sentence contains much the same token-information as the 1.5 prompt, but written for humans rather than computers.
Observations:
Only at CFG 30 was "hexagonal room" picked up
This model doesn't "know" how to deal with tweed and elbow pads.
Taking @Lykon's advice from that first post, I sent the CFG 30 txt2img through an img2img with the following prompt:
Monet's oil painting, with brush strokes and artistic contrast. The painter used large amounts of paint, and the impact of the brush strokes are visible. Impressionist and artistic rendering of a steampunk librarian. The oil paints provide a naturalistic color scheme. Tweeds and rich wooden browns dominate the palette.
This prompt was not successful. While the post-processing at CFG 4 did "soften" some of the absurd colour-burn saturations of the CFG 30, turning the img2img CFG up to 30 did not self-moderate the saturation, and gave the librarian a drunken-cheek look. None of the brush-stroke instructions were followed.
A second attempt at a prompt, "HDR photo with a naturalistic filter. Moderate saturation.", did moderate the hyper-saturation to a degree, but mostly by scrubbing many of the steampunk details.
Conclusions:
@DucHaiten's observation about literate prompting is worth using -- if only for the increased human comprehension while exploring prompt-space.
Experiment 2 -- CFG scale
The primary intention of this study was to explore the impacts of the CFG scale on an SD XL trained model. At 6 and below, the output exhibited low saturation and low sharpness. The change in beard definition and background definition between 3 and 6 is quite notable.
CFG 10 produced a solid, apposite image: faithful, without the saturation bomb of 15. The background is adequate, if not accurate, and the artistic instructions have some vague attention paid to them.
CFG 15 took the saturation impact and embraced it as part of the image -- the library is not faithful to the prompt, but had the jacket not been literal rainbow tweed, I think I would have preferred this rendition the most.
The one I keep coming back to is 30. It is the most technically accurate to the prompt in terms of detail work (sides of room, facial hair, dome), though the least faithful to the requested art style. It has none of the impressionistic haziness that 3 through 10 have. The img2img postprocessing with "treat this like an HDR image" was successful at removing most of the crazy saturation from the generated image.
Conclusions:
@DucHaiten's intuitions about the consequences of various CFG levels don't seem to apply in this model. Punching the CFG above 12 or so will result in increased sharpness, hypersaturation, and a much more literal fixation on the trained objects within the prompt. The timbre (to borrow a term from music) of the prompt is less apparent with a high CFG, but the technical accuracy is significantly better.
Looking at some of the other literature on this topic (https://getimg.ai/guides/interactive-guide-to-stable-diffusion-guidance-scale-parameter) there does seem to be an expectation that CFG scales with prompt detail. The operational conclusion that I have here is that CFG level should not be "left" at 7, but instead should be dialed in across the full range before batch-generation commences. I think that 4, 7, 15, and 30 are useful touchstones.
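The "dial it in first" workflow amounts to a small parameter sweep: hold everything fixed and vary only the CFG value. Here is a minimal sketch that builds one settings dictionary per touchstone. The seed and step count are the ones used in this study; the width and height are placeholders (the actual resolution is not given here), the prompt string is a stand-in, and the key names such as `cfg_scale` are assumptions modelled on common web UI request fields, so check them against whatever tool you actually drive.

```python
import json

SEED = 326879243              # seed used throughout this study
STEPS = 50                    # sampling steps used throughout this study
TOUCHSTONES = (4, 7, 15, 30)  # CFG values suggested as touchstones


def build_sweep(prompt, width, height, cfg_values=TOUCHSTONES):
    """Return one settings dict per CFG value, with all else held fixed."""
    return [
        {
            "prompt": prompt,
            "seed": SEED,
            "steps": STEPS,
            "width": width,
            "height": height,
            "cfg_scale": cfg,  # assumed key name; verify for your tool
        }
        for cfg in cfg_values
    ]


# Placeholder prompt and resolution, for illustration only.
payloads = build_sweep("An impressionist oil painting.", width=1024, height=1024)
for payload in payloads:
    print(json.dumps(payload))
```

Because the seed, steps, and resolution are identical in every entry, any difference between the resulting images can be attributed to the CFG value alone.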
Conclusions
Img2img postprocessing to a more "artistic" style doesn't work with the current model version. It reduced details from the CFG 30 to an "acceptable" HDR photo, but brush-stroke details likely require a LoRA.
CFG has a huge impact on how literally the model follows the prompt, but high values come at the cost of oversharpening and hypersaturation. High CFG on "photoreal" outputs does not seem like a high priority.
Low CFG is an interesting liminal space that feels like it explores some of what high batch counts did for 1.5, and is worth exploring to see if any useful artistic accidents result.
[^1] Here, I use experiment in the "hypothesis-generating" sense and not the "hypothesis-testing" sense. Right now, I'm just feeling things out and writing my observations of the world.