
Tuning with FreeU

Fine-Tuning a Generated Image Using FreeU

Even though FreeU has been around for almost a year now, and has even found its way into the Forge WebUI, I found hardly any explanation of how to actually use it. The video tutorial below explains and SHOWS how the various factors influence the image, and I really appreciated how it helped me understand the basics of baseline/backbone and skip.


YET, it still does not show how to actually use FreeU for fine-tuning an image. But before I show you how, let me go over the basics.

Hamfisted Basics

FreeU is a bit like an equalizer (if you are into audio). It takes parts of the U-Net inside the models/checkpoints we use and turns the volume of those 'frequencies' up or down during image generation. But instead of acting on audio waves, it acts on the U-Net: on how the concepts and their representations in the guidance are weighed as they pass through the actual U-Net of the model we use. Some of them end up with more influence, some with less. In contrast to (weighting prompts:1.2), it affects them based on their position and action inside the U-Net rather than through the classifiers.
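For those who prefer code over audio metaphors, here is a rough sketch of the mechanism, loosely following the idea behind FreeU: in an affected decoder block, part of the backbone feature channels gets scaled by a b factor, and the low-frequency band of the incoming skip connection gets scaled by an s factor. The channel split and the frequency threshold below are simplifications for illustration, not the exact reference implementation.

```python
import torch
import torch.fft as fft

def fourier_filter(x, threshold, scale):
    # Scale the low-frequency band of a (B, C, H, W) skip feature map.
    x_freq = fft.fftshift(fft.fftn(x.float(), dim=(-2, -1)), dim=(-2, -1))
    B, C, H, W = x_freq.shape
    mask = torch.ones_like(x_freq)
    crow, ccol = H // 2, W // 2
    mask[..., crow - threshold:crow + threshold, ccol - threshold:ccol + threshold] = scale
    x_freq = fft.ifftshift(x_freq * mask, dim=(-2, -1))
    return fft.ifftn(x_freq, dim=(-2, -1)).real.to(x.dtype)

def freeu_block(backbone_feat, skip_feat, b, s):
    # Boost/dampen half of the backbone channels by b and the low
    # frequencies of the skip connection by s, before the decoder block
    # merges them the way it normally would.
    backbone_feat = backbone_feat.clone()
    half = backbone_feat.shape[1] // 2
    backbone_feat[:, :half] *= b
    skip_feat = fourier_filter(skip_feat, threshold=1, scale=s)
    return torch.cat([backbone_feat, skip_feat], dim=1)
```

With b and s at 1 this is a no-op, which is why the 1, 1, 1, 1 setting further down makes a useful baseline.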

B - Baseline or Backbone covers the interactions in the U-Net that more or less run the whole way. Being fully encoded and decoded back, they influence A LOT in the image generation. Imagine a main character in a story, for example. Usually the whole story is about the MC interacting with or being somehow involved in everything, and thus influencing everything. Depending on your model (or type of story genre) and the integrated concept weights, B usually influences characters and styles. In a movie we would say it affects how much screen time or influence the MC gets. But if you use a model trained for product placement, the changes will not be about a person but about shoes and perfume bottles, or about rivers and mountains in a landscape model.

S - Skip is about the prompts that do not necessarily have influence all the way through the generation. Their action skips from one side of the U (in the U-Net) to the other without necessarily influencing much of the generation. Sometimes a skip is about the (prompted) color of a single spot: prompts that only have a weak influence on their surroundings and likely no impact on the whole image. Therefore, it skips ahead once it has done its job and finds no further interactions. Sometimes it also acts as a partial influence of a more complex prompt like (goth girl). Those skips might influence tattoos, chokers and other contextual items, which do have a more pronounced impact on the image. It all depends on how the model you use has been trained.

The numbers on the Bs and Ss refer to their position in the process. B1 is in the front, where the AI brain cells have a lot of noise mush to create things from, but not that many existing classifications based on the prompt. B2, on the other hand, has fewer noise-to-shapes values to act on, but more context, coherence and consistency in its statistical values. One could say B1 is the creative or "mad" part of the brain, while B2 is the logical and "sensible" part of it. B1 says "oh lookie, a human shape in the fog, like the prompt said", while B2 acts based on the probabilities it learned and says "okay, so if this is its head, there must be the neck, the shoulders, arms, hands, fingers". More about that later.

The influence of S1 and S2 is somewhat similar, but they affect different concepts and how their details are created based on what has already been processed in the (latent) image. S1 is about the things that are actually 'obvious'. If B1 sees sky, S1 makes the obvious statement about a successive detail like "the sky is likely blue, that is based on the ... bla bla bla skip", ending its influence until the end of image generation. Analogous to B2, S2 has a more complex logic about details (aka things that don't need to be processed through the whole run, but can be skipped to the end). This means S2 will use B2's observation that a person is in a shower, and thus their skin must be shown wet. S2 will notice that, beyond being wet, "the skin is a complex construct with a lot of texture, which makes it likely to form small droplets instead of large sheets of water... *bla bla bla bla skip*". But instead of acting on concepts or classifications in the early "mushy" part of the generation, S2 focuses on the later part with its increased complexity.
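To tie the numbering to the sketch above: B1/S1 act on the earliest (deepest, most 'mushy') decoder stage and B2/S2 on the next one, while the later stages stay untouched. The stage indices here are illustrative and not tied to a specific U-Net implementation.

```python
def decode_with_freeu(decoder_stages, b1, b2, s1, s2):
    # decoder_stages: list of (backbone_feat, skip_feat) pairs, ordered from
    # the deepest/earliest decoder stage (the "noise mush") to the last one.
    for i, (backbone_feat, skip_feat) in enumerate(decoder_stages):
        if i == 0:    # earliest stage: the creative/"mad" end (B1, S1)
            yield freeu_block(backbone_feat, skip_feat, b=b1, s=s1)
        elif i == 1:  # next stage: more context and coherence (B2, S2)
            yield freeu_block(backbone_feat, skip_feat, b=b2, s=s2)
        else:         # remaining stages are left alone
            yield torch.cat([backbone_feat, skip_feat], dim=1)
```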

Using It for Fine-Tuning

Okay, as an example, I have a batch of images with various settings that all use the same seed and prompt, with TrifectaXL as the model. I chose it because I noticed that Trifecta works better when tuned with FreeU. The prompt is a word salad:

23yo woman in a linen dress working in the garden, (candid photography style:1.3), 8k uhd, dslr, (natural lighting, day light, illuminated:1.3), high quality, (film grain:1.3), Fujifilm XT3, (realistic skin, detailed skin, rich skin texture, raw, 8k, (skin pores:1.3), intricate details), Photorealistic, Realistic, High-definition, hires, ultra-high resolution, 8K, high quality, Ultra-detailed    

29 steps, a CFG of 4.5 at 1024x1024 and an SDXL VAE. The first images run with an SAG scale of 0.5 and a blur sigma of 5.
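If you want to reproduce a setup like this outside of the Forge WebUI, a minimal diffusers sketch could look like the following. The checkpoint path and the seed are placeholders, and the SAG part is left out here, since it is handled on the WebUI side rather than by the pipeline itself.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Placeholder path to the TrifectaXL checkpoint file.
pipe = StableDiffusionXLPipeline.from_single_file(
    "TrifectaXL.safetensors", torch_dtype=torch.float16
).to("cuda")

prompt = "23yo woman in a linen dress working in the garden, ..."  # the word salad from above

# Baseline: all four FreeU factors at 1.0, i.e. effectively no change.
pipe.enable_freeu(b1=1.0, b2=1.0, s1=1.0, s2=1.0)

image = pipe(
    prompt=prompt,
    num_inference_steps=29,
    guidance_scale=4.5,
    width=1024,
    height=1024,
    generator=torch.Generator("cuda").manual_seed(42),  # fix the seed; the value itself is arbitrary
).images[0]
image.save("baseline.png")
```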

This is the baseline with B1, B2, S1 and S2 all at 1, 1, 1, 1. Honestly, it's meh in just about every respect.

Now we lower B1 to 0.7. As you can see, this influenced the core concepts of the image. Mostly the character and the 'quality' of the picture are dampened in their conceptual impact as the B1 value is decreased. Not much "brain power" went into the cut of the dress, and as Trifecta is an NSFW model, we get a tendency towards a breast slipping out. Yet all the logical structures of the garden and the more detailed shapes and poses that are likely covered by B2 are very similar, as are the finer details covered by the skips.

B1 for this image is 1.18. As we can see, the expression and adherence to the core concept (based on the model) is more pronounced. Given a higher weight, the result tended towards the more "beautiful and refined" woman working in the garden, compared to the "casual and frumpy" direction it took with a dampened B1. This even extends to the cushioned seat and table entering the image now. Plus, because it is an NSFW model, we also have the areolas visible beneath the dress.

This image has a B1 of 1.18, a B2 of 1, an S1 of 0.7 and an S2 of 1.2. Here we have an interesting result, as the raised hand suddenly fondles her hair. The changes to the skips also reduced the number of plants in the flowerpot on the left and gave the fence in the back a more rugged look. They also removed the flowers on the bush in front of it and made the shape of the nipples beneath the dress more pronounced. A very nice effect of reducing S1 is the change from the more complex treadstones in the grass to the less complex extension of the scattered stones, and the simplification of the table on the right. It makes it very easy to see how S1 deals with the obvious details, while S2 reacts to the added weight by increasing the visibility of the logically visible nipples where there are already areolas. Which of the weights created the hair fondling is not obvious. It is likely a result of both, even though I would attribute it more to S2.

In this image I tried to remove the nipples by increasing B2 to 1.1. Obviously, this did not work, but it shows how B2 acted on everything that had been put there by the other parts. The hair has been defined into a single strand (increased focus on further developing an existing element). The scree patch in the background reverted to a more detailed patch of treadstones in the grass, and the wooden fence even got some intricate details on top, much like the potted plant becoming more complex with the increased B2 value. Even the facial expression of the model changed. B2 puts a higher focus on, and therefore builds complexity from, the less complex elements that B1 created earlier in the generation.

In this image, B2 is at 0.85, and suddenly the prominence of the logically resulting visible areolas is decreased a lot as well. As S2 is still at 1.2, though, the smaller details (those dictated less by the character prompt) are still there. The treadstone path is gone again. Lowering B2 removed the complexity in the later parts of the generation, while the still-increased S2 kept some of the complexity in details like the potted plant, the wooden fence or her facial expression.
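The walkthrough above is basically a manual grid search over the four values with a fixed seed. Continuing the diffusers sketch from earlier, it could be automated roughly like this; the value grid is reconstructed from the images above, and values not explicitly mentioned are assumed to carry over from the previous step.

```python
# (b1, b2, s1, s2) combinations from the walkthrough, rendered with the same seed.
settings = [
    (1.0,  1.0,  1.0, 1.0),   # baseline
    (0.7,  1.0,  1.0, 1.0),   # dampened B1
    (1.18, 1.0,  1.0, 1.0),   # boosted B1
    (1.18, 1.0,  0.7, 1.2),   # skips adjusted
    (1.18, 1.1,  0.7, 1.2),   # B2 raised
    (1.18, 0.85, 0.7, 1.2),   # B2 lowered
]

for b1, b2, s1, s2 in settings:
    pipe.enable_freeu(b1=b1, b2=b2, s1=s1, s2=s2)
    image = pipe(
        prompt=prompt,
        num_inference_steps=29,
        guidance_scale=4.5,
        width=1024,
        height=1024,
        generator=torch.Generator("cuda").manual_seed(42),  # same seed every time
    ).images[0]
    image.save(f"freeu_b1-{b1}_b2-{b2}_s1-{s1}_s2-{s2}.png")
```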


After some more fiddling with the values, I came to this image, which also includes a slight change in the SAG to increase the processing time for the hands (leading to a better result). These are the values I used:

freeu_b1: 1.21, freeu_b2: 0.81, freeu_s1: 0.95, freeu_s2: 1.2, sag_enabled: True, sag_scale: 0.9, sag_blur_sigma: 2
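In the diffusers sketch, those final FreeU values would simply be applied with one call; the SAG settings are handled on the WebUI side and have no direct equivalent in this snippet.

```python
pipe.enable_freeu(b1=1.21, b2=0.81, s1=0.95, s2=1.2)
```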

Summary

So, what is the point of it? Should you simply copy the values? I'd say: NO!

My intention was to show you that fixing a seed and then slowly adjusting the values is the key to using FreeU as a fine-tuning tool for an image. Not just a fine-tuning of things done via image post-processing, but a fine-tuning of the concepts, the complexity and the intricacies of the model being used. I hope you saw a way to use FreeU as a tool to open the hood of your model, boost it where it is weak and dampen it where it is strong or outright overtrained on certain concepts: concepts that cause same-looking faces, bad image quality or just a wild bunch of additional arms or legs.
