
Miscellaneous musing - CFG/Accelerator LoRA/Realistic models


Aug 15, 2025

(Updated: a day ago)


I wanted to write an article about a few subjects, but none of them deserved a full article, so... here is a big bag of randomness 😅

CFG - Classifier Free Guidance

In the direct line of my article about how SDXL works, I wanted to explain a few things about CFG, which is NOT an acronym for "Configuration".

The first generative models for pictures (let's not go back all the way to GANs, but the first diffusion models) were using a classifier model to guide the generation toward better pictures. Those are classic "scoring" models that helped drive the result, but they made training harder: a Classifier Guidance model had to be trained on top of the UNET.

Classifier Free Guidance was an interesting idea: what if the UNET itself could be trained to serve as this classifier model? Introducing "conditional" and "unconditional" generation.

First, let's clarify what the UNET does: it does NOT generate the picture associated with the conditions (the prompt, resolution, etc...) but PREDICTS what noise should be removed to move toward the picture.

There is a mathematical reason this is more efficient than trying to produce the image directly: noise is just noise, it does not need to have a meaning and can be manipulated with addition/subtraction, whereas you can't meaningfully subtract an image from an image.

So, the UNET is in fact run twice:

  • once in the conditional way (with the prompt) => this gives a noise prediction (N1)

  • once in the unconditional way (without the prompt) => this gives a noise prediction (N2)

To run a denoising step, the model takes the latent picture (initially just noise in txt2img, an image with added noise in img2img) and subtracts the predicted noise from it. Rinse and repeat for the number of steps to get the final picture.

This is where the CFG value comes in: the noise that actually gets removed is not N1 alone but a combination of the two predictions, N2 + CFG × (N1 − N2). The higher the CFG, the further the prediction is pushed away from the unconditional N2 and toward what the prompt asks for.

And then, what was later added was the notion of the "Negative Prompt". The unconditional run initially used an empty prompt, but the idea of "why not prompt for the opposite of what is wanted?" came along, and the negative prompt now lives in the unconditional run.
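Here is a minimal Python sketch of how the two noise predictions are combined at each step (function and variable names are illustrative, not taken from any particular library; the "noise" is a plain list of floats standing in for a latent tensor):

```python
def cfg_combine(n1, n2, cfg_scale):
    """Classifier Free Guidance: start from the unconditional prediction (N2)
    and push toward the conditional one (N1), proportionally to cfg_scale."""
    return [u + cfg_scale * (c - u) for c, u in zip(n1, n2)]

# Toy values standing in for the two noise predictions:
n1 = [2.0, -1.0]   # conditional run (with the prompt)
n2 = [1.0, 0.0]    # unconditional run (empty or negative prompt)

print(cfg_combine(n1, n2, 1.0))  # [2.0, -1.0] -> exactly N1
print(cfg_combine(n1, n2, 7.0))  # [8.0, -7.0] -> pushed far away from N2
```

Note that at CFG = 1 the result is exactly N1, so the unconditional run (and with it the negative prompt) contributes nothing at all.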

[TODO: insert here a picture explaining this]

Some older generation software runs the conditional and unconditional passes one after the other (so, two passes of the UNET), but most software can batch both into a single pass for faster generation.

Now, why talk about this? Because of the impact of CFG on image generation. It is known that a lower CFG helps produce more creative output => that's why: the negative prompt, aka the model's "integrated classifier", is less involved in the final picture. This is also why, at CFG = 1, the negative prompt is disabled => the unconditional run is skipped entirely => faster gen, more funky results... And often, at lower CFG, more "blurry" pictures. But then, let's introduce the accelerator LoRA.
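To make the CFG = 1 shortcut concrete, here is a hedged sketch of a sampler step that batches both passes into one call and skips the unconditional pass entirely at CFG 1. The `unet` argument is a stand-in callable, not a real library API; latents and embeddings are plain lists of floats for illustration:

```python
def sample_step(unet, latent, prompt_emb, negative_emb, cfg_scale):
    """One noise prediction. `unet` is any callable taking a batch of latents
    and a batch of embeddings, returning a batch of noise predictions."""
    if cfg_scale == 1.0:
        # The unconditional pass (and thus the negative prompt) has no effect
        # at CFG 1, so it is skipped: one UNET call instead of two.
        return unet([latent], [prompt_emb])[0]
    # Batch the conditional and unconditional passes into a single UNET call.
    n1, n2 = unet([latent, latent], [prompt_emb, negative_emb])
    return [u + cfg_scale * (c - u) for c, u in zip(n1, n2)]

# Tiny stand-in "UNET" for demonstration: adds the embedding to the latent.
def fake_unet(latents, embeddings):
    return [[l + e for l, e in zip(lat, emb)]
            for lat, emb in zip(latents, embeddings)]
```

With `fake_unet`, `sample_step(fake_unet, [1.0], [2.0], [0.0], 1.0)` does a single pass, while any other scale does the batched double pass.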

DPO/SPO/Lightning/DMD2... Accelerator LoRA

Several studies were done to try and achieve better images faster (so, in fewer steps). Several of those LoRA are readily available on GitHub alongside their studies and have been "published" on Civitai. I tried a number of them but was never really satisfied, until I recently tried the DMD2 LoRA. People using SDXL realistic models know this one very well, since several models offer a pre-merged version including the LoRA.

But I had never tried it with an Illustrious XL model, not until someone else (thanks to clueless_engineer!) used DMD2 with one of my models. And indeed, with only 8 steps versus 30-40 steps and a low CFG (even 1!), my models delivered great pictures. Not perfect, but good enough for most use cases.

It may not be very important for people with huge GPUs, but on my poor legendary laptop, this was a revolution for finding good seeds and results: less than 3 minutes for a result instead of 10-15.

Here is an example:

1girl, summer dress, straw hat, from side, looking at viewer, beach, blue sky, seagulls
CFG 1
Euler A + Auto
8 steps
Adetailer (8 steps too)

Without the LoRA:

00027-203498244.png

With the LoRA (at 0.7):

00028-203498244.png

Time taken to generate the picture: 55s!

If i remove the LoRA and do more steps (30), at CFG 4 (with no negative prompt):

00030-203498244.png

What if i kept the LoRA?

00029-203498244.png

In both cases, I got more details, more contrast, some more shading... but it took me three and a half minutes for the picture without the LoRA and four minutes with the LoRA!

If i am looking for a good seed or exploring prompts, this is tedious.

I feel like this could improve my local generation => make DMD2 versions of my usual models (AnBan/HoJ/UnNamedIXL, with the LoRA merged in to avoid its overhead at inference), search for a good composition/prompt, then switch to the normal model with more steps for the final result... This could also help when using multiple models in search of mixed "media" results... such as with realistic models (as explained by TurinBjorn).

Realistic models

Most realistic or semi-realistic models are more rigid than the pure anime models based on Illustrious. They don't like dynamic poses, certain shot angles, or overly fantasy/futuristic elements, and they often won't let the subject look away from the "camera". Here is an example:

xyz_grid-0001-100666001.png

Even some LoRA don't react well with realistic models... so, how to fix this? I know of three main methods (one of which I presented in a previous article) for using an image from an "anime" model to drive the generation of a realistic model:

  • ControlNet

  • Refiner/Hires model switch

  • Img2Img

Here is the result when using the previous image (made with AnBan Shin V1) with UnNamed IXL V3 as the target:

  • With ControlNet (30 steps, CN-anytest_v4-marged at 0.5 strength => 5 minutes):

00032-203498244.png

  • The result with refiner switch (at 0.5, 30 steps => 4 minutes 30):

00032-203498244.png

  • And when doing Img2Img (denoising at 0.5 => 2 minutes 30):

00000-203498244.png

  • Extra mile, Img2Img with ControlNet (config from both previous examples => 4 minutes 20):

00002-203498244.png

This increases the difficulty of generating lots of pictures => more steps, making sure to stick to the initial prompt, configuration mistakes can happen, and activating Adetailer is often a real need. And for a proper fine-tuning, at least 200 good-quality pictures are needed (so, probably 500 generations).

But the DMD2 LoRA could help me generate MORE pictures, faster. With a lot of pictures, I could either make a LoRA to fix my model or directly fine-tune it.

Here is an example with DMD2-merged versions of both AnBan Shin V1 and UnNamed IXL V3 (8 steps, CFG 1, refiner switch at 0.5, NO Adetailer):

00038-203498244.png

This was generated in 2 minutes!! And it feels more realistic than the previous one while keeping the composition. That's about 2-3 minutes gained per picture, 25 hours saved if I go for 500 of those!
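A quick back-of-the-envelope check of that estimate, using the numbers above:

```python
# Time savings estimate (numbers from the article, upper end of the range).
minutes_saved_per_picture = 3
planned_generations = 500  # rough count needed for a fine-tuning dataset
hours_saved = minutes_saved_per_picture * planned_generations / 60
print(hours_saved)  # 25.0
```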

For reference, here is what a pure UnNamedIXL V3 run would have made (30 steps, CFG 4, no Adetailer => 2 minutes 40):

00039-203498244.png

And with "dynamic pose, dutch angle"?

00040-203498244.png

With the refiner switch at 0.5, 8 steps, CFG 1 => 2 minutes sharp:

00041-203498244.png

All in all, this feels like a future V4 of UnNamed IXL in the making 😉

Thanks for reading! 💕

Bonus trick

The Refiner is somewhat unstable with Forge and makes it impossible to use Hires.Fix at the same time. But switching models during Hires.Fix is also an option for generating specific pictures (instead of mass generation). It is particularly useful since Hires steps work on a larger picture and are slower (I often go from 8s/it to 20s/it).

Here is the same picture as above with 8 steps with AnBan Shin V1 DMD2 and Hires.Fix (x1.5, 8 steps, denoise at 0.3 and switch to UnNamedIXL V3 DMD2):

00042-203498244.png

It took 4 minutes to generate instead of my usual 15 minutes for a single model with my classic CFG 5, 40+20 steps 🥰

Update: Following a comment, let's see if my method can handle shiny, skin-tight clothes:

1girl, (micro bikini, black latex bikini, shiny skin tight bikini), (facing viewer, looking away), beach, blue sky, dynamic pose, dutch angle
grid-0000.png
