santa hat
deerdeer nosedeer glow
Sign In

Flux Basic Course 1: Using CLIP_L and T5 subject fixation

Flux Basic Course 1: Using CLIP_L and T5 subject fixation

Synopsis:

  • This course assumes you already understand the basics and have the system set up, but if you don't there are available links below.

  • T5 is a powerful utility made to be formatted and repurposed by design. It's an LLM that can do any number of trained tasks and was fed with a large array of generic information. The more it's specialized in a task, the more it'll respond. Simple enough.

  • This isn't meant to be a technical article, so lets just jump right into the hands on stuff.

  • Every image here is generated using BASE Flux1D fp8 with the t5xxl_fp8_e4. We're using the compact version for convenience. We aren't using fp16 because it takes too long to generate even on a 4090, and I'm not using quantified versions in this article for T5 generation purposes intended by the developers.

    • 768x768, 1024x768, 768x1024, 1024x1024, 1216x832, 832x1216

    • steps 12 to 50

    • euler > simple

    • distilled cfg between 1 and 10, 2.5 to 4.5 (3.5) ideal values for base models.

    • normal cfg always 1

    • seed 420

      • Everything is generated at seed 420 unless otherwise specified.

  • For convenience

  • If you require assistance setting this up, there are plenty of other resources available from a cursory search.

Concept 1: Subject Fixation

  • Every image has a single or combination of subject fixations, whether defined or not. We can focus on a taco, or a chicken sandwich, or a more complex subject like sliced apples in a taco on a plate on a table in a starbucks in the middle of a bustling city street populated by chickens driving mopeds with bacon wheels made of turbines.

Single Subject: Subjects on top of a table.

  • Lets see what happens when Flux1D generates a simple image of a fruit on a table.

  • "an apple on a table" 768x768 12 steps seed 420 dcfg 3.5

  • Lets break this first image down into elements. We have the primary subject fixation, which encompasses the entire image aka the apple. The T5 and CLIP_L training identified the apple as the primary subject fixation due to the weights, and then the table becomes the secondary subject fixation based on being a plane. What happens if we just... y'know turn off the T5's use of distilled config though?

  • Lets set it to 0 and see what happens to our apple. dcfg 0

  • The apple DID change, but in a seemingly unknown way. It's really done very little without having the CFG on. This is part in due to the CLIP_L having such a strong understanding of an apple to begin with.

  • What happens if we set the distilled cfg to 30 though? dcfg 30

  • It seems to have... rounded the numbers. Almost created a form of miniaturized entropy. You may not understand yet, but this is crucial in understanding the T5.

  • The T5, autonomously creates order from chaos in a deterministic fashion. It introduces a form of entropy that compartmentalizes everything, whether it was intended or not by the developers. This is just one of T5's many utilities.

  • "two bananas on a table" 768x768 12 steps seed 420 dcfg 3.5

  • "an apple an orange and a pear on a table" 768x768 12 steps seed 420 dcfg 3.5

  • "a gnome and an apple an orange and a pear on a table" 768x768, 12 steps, dcfg 3.5

  • "a gnome with an apple an orange and a pear on a table" 768x768, 20 steps, dcfg 3.5

Subject Angle: Viewing subjects on top of a table.

  • We can tell FLUX how we want view things at different angles, so lets try some out.

  • "an apple on a table viewed from above" 768x768, 12 steps, seed 420, dcfg 3.5

  • Lets try some more objects, because there's only so much we can do with an apple.

  • a sealed box of cereal on top of a table.

    "FLUX FLAKES"

  • view from above a sealed box of cereal on top of a table.

    "FLUX FLAKES"

  • Hmmm looks like view from above produces something else. If you check my other subject fixation article, you'll see me going into detail about how views affect image outcomes based on specific word usage, especially when using "view".

  • from above a sealed box of cereal on top of a table.

    "FLUX FLAKES"

  • BOX OF, see the issue here? BOX OF something, instead of "cereal box".

  • from above a sealed name brand cereal box on top of a table.

    "ABSTRACTIES" name brand,

    "FLUX FLAKES"

  • Associative integration, in other words the table's angle is superimposing the image, rather than the cereal. This is a T5 thing, so lets go ahead and fix that shall we?

  • above a sealed name brand cereal box on top a table viewed from above.

    "ABSTRACTIES" name brand,

    "FLUX FLAKES"

  • Now you understand the why, now how do we utilize this in every situation so it doesn't form a landscape of something?

  • Well, there's a few ways. Primarily you want the T5 to fixate on angles globally, that way it intentionally trickles the angle through the image in a uniform way; Like so.

  • a photo from above a table.

    there is a sealed name brand cereal box on the table.

    "ABSTRACTIES" name brand,

    "FLUX FLAKES"

  • As you can see, we've achieved the goal. The T5 can be trained and simplified in this realm if you wish, since the subject fixation is a bit tricky with wordplay, but be careful you don't train contradicting tags and information if you do, or you'll end up with chaos.

Subject Depth: Distant subjects on top of a table.

  • When making any image, we need to determine fixation, rotation, and depth. The entire vibe of an image can shift if you aren't careful with it.

  • Lets do some experiments before I get to what works, so you see what happens.

  • a distant depth of field view of a table with an apple on it. 768x768, 12 steps, seed 420, dcfg 3.5

  • Well it's doing what we expected, just not how we expected it. The fixation didn't shift, but it applied background elements. So we can assume depth of field and distant aren't exactly useful for fixation. Lets shift the fixation, to something closer than the apple and see if that works.

  • INSTEAD of starting with the apple, since we aren't technically fixating on it until later, lets start with the actual room itself, then fixate on that.

  • a very large and deep room.

  • Not what I had in mind, lets go a little more... empty warehousey.

  • a very large and deep empty warehouse.

  • a very large and deep empty warehouse.

    there is a table in the distance.

  • Getting closer, now lets put our apple in there.

  • a very large and deep empty warehouse.

    there is a table in the distance with an apple on top of it.

  • *squints* So, lets try cranking up the T5, see if it'll help.

  • a very large and deep empty warehouse.

    there is a table in the distance with an apple on top of it. dcfg 10

  • Still not working huh. I can tell you why, but it'll be easier to show you.

  • "An expansive, dimly lit warehouse with an apple positioned far in the background on a small table. The warehouse is mostly in shadow, creating a sense of vastness and isolation. The apple is illuminated by a soft, faint light, emphasizing its distance from the viewer. The foreground is dark and empty, with a clear line of sight leading toward the distant apple. Reflections on the slightly visible floor subtly guide the eye toward the back of the warehouse, where the apple is the only illuminated object in the scene." steps 50

  • Flux, at least the fp8 version, has a fairly fixated grasp on subject. It sees the apple as the fully dominant subject here, which means no matter how complex you make the fixation around it, unless there are more dominant subjects, it'll probably just fixate on the apple. This is a common thing with FLUX, and why I actually trained the model a bunch. My versions are meant to produce a more prominent fixation on human and humanoid form. There are many other trainings of flux that focus on environment, groupings, large arrays of objects, and so on.

Concept 2: Multi-Subject Fixation

Vertical Subjects: Table on an apple!?!?

  • What do you mean table ON AN apple, you're probably wondering. I'll just show you, since seeing is believing.

  • "a table on an apple" 768x768 12 steps seed 420 dcfg 3.5

  • Not exactly the intended results, I wanted the table to be ON the apple. In A WAY it is, but it's not what we wanted. We can say that the table IS "ON" top of the apple, due to the table breaching the foreground and the background, and it even introduced a new element. Lets just assume the T5 didn't understand what we wanted, so lets give it some additional simplified instructions.

  • "a small table on a large apple" steps 12, 768x768, dcfg 3.5

  • Why did this not work you ask? Simple, I picked something that wouldn't work. I'm doing this on purpose to showcase the importance of word and prompt choices. Now, lets make it actually work shall we?

  • "a small table on top of a large apple" steps 12, 768x768, dcfg 3.5

  • As you can see, the goal has been achieved. Simple in the end. The apple exists in void and the table is on top of it, rather than the apple being on top of a table. Pretty straightforward fun stuff to work with. EZ PZ SQUEEZY APPL- alright none of that now.

  • Now lets crank the T5 to 30 and see what happens shall we?

  • "a small table on top of a large apple" steps 12, 768x768, dcfg 30

  • The table seems to have spawned another table on top of the table, and the apple is now in nature. Lets reduce the dcfg just a smidge, lets go with 5.

  • "a small table on top of a large apple" steps 12, 768x768, dcfg 5

  • Alright that's all you need to know, when stacking two baseline objects.

Stacking Three Subjects: Table on an apple on a table!

  • Lets just get right to it. We have to figure out the words needed to make this happen, pretty simple stuff really if you just logically break down the goal.

  • Traditionally CLIP_L responds better to DIFFERENCES when it comes to different types of objects nearby other objects, however we have a NASCAR grade driver here the T5 so lets just assume it'll understand what we want out of the gate and get to it.

  • "a small table on top of a large apple on top of a giant table." 768x1024, steps 12, dcfg 3.5

  • *squints* seems we have a problem. We've breached the 3 row paradox as I call it, where the system begins to break down once you hit a layered system of 3. I had a lot of trouble breaching this value in my experiments using region prompting. Flux, does best when controlling a 2x2 grid, unless you trick it.

  • three subjects.

    small table on top of large apple with a larger table under apple. 1024x1024, steps 12, dcfg 3.5

  • So what happened here, is my prompt combined the table and the apple into a single subject. This image does in fact have 3 different subject types; the background, the table, and the apple. Each of those are technically different entities, but it combined the apple and the table together as a single binding agent. Ungluing these two objects after fusing them is unlikely. They are now one.

  • This is a flux problem. You have to solve it as the designer AND the engineer, so lets get into it. You have to think DIFFERENTLY to do this, less like prompting a machine, and more like requesting something from the LLM.

  • "there is a much larger apple than the table it's resting atop, and another table with chairs and an umbrella resting atop that large apple." 768x1024, steps 12, dcfg 3.5

  • So now we've successfully asked the T5 for help. This isn't always successful mind you, since the CLIP_L likes to get in the way. However, you can empower the distilled config to improve T5 response when you are getting worse outcomes than expected.

  • "there is a much larger apple than the table it's resting atop, and another table with chairs and an umbrella resting atop that large apple." 768x1024, steps 12, dcfg 10

  • Lets make the T5 have a whole garden party on top of our giant apple shall we.

  • "there is a much larger apple than the table it's resting atop, and another table with chairs and an umbrella resting atop that large apple. There is a group of people sitting at the table sipping drinks relaxing." 768x1024, steps 12, dcfg 10

  • Maybe we want them to be wizards instead.

  • "there is a much larger apple than the table it's resting atop, and another table with chairs and an umbrella resting atop that large apple. There is a group of wizard people sitting at the table sipping drinks relaxing." 768x1024, steps 12, dcfg 10

  • Now that we have an understanding of how to differentiate through stacking basics, we can move onto the next lesson.

Side By Side Subjects: Apple Macarena

  • Lets start by lining up a few apples shall we.

  • "three red apples side by side." 768x1024, steps 12, dcfg 10

  • Seems T5 may have gone a little overboard there. As you can see, leaving the distilled cfg at 10 isn't always a good idea, since it can produce blurry images when simplistic concepts are in order. This is where the CLIP_L excels.

  • "three red apples side by side." 768x1024, steps 12, dcfg 3.5

  • You can often spot a CLIP_L heavy image, expecting something symmetrical and instead getting overlap, gaps, spaces, offsets, and so on. CLIP_L is OKAY at everything, but it suffers a lot in terms of logical deduction. It has some, and more in there is useful for our current task than the the distilled CFG usage. This sort of thing is perfect for generating in Schnell, but we're using 1D currently so lets continue.

  • "three people wearing red apple mascot suits standing side by side facing towards the camera, floor." 1024x768, steps 12, dcfg 3.5

  • Apple apple AAPPLE!!! err.. yeah. If you've gotten this far you either have a thirst for knowledge, or a sense of humor and are following my slump into insanity. Either way, they're about to do the macarena.

  • "three people wearing red apple mascot suits standing side by side facing towards the camera, floor.

    They are dancing the macarena." 1024x768, steps 12, dcfg 3.5

  • So what happens when we juice up the distilled config on our dancing apple mascots to 10?

  • ENTROPY! We have successfully damaged our image using the T5. NOW, we can fix this, if we juice up the steps like so.

  • "three people wearing red apple mascot suits standing side by side facing towards the camera, floor.

    They are dancing the macarena." scheduler - simple, 1024x768, steps 50, dcfg 10

  • Why? Entropy. It collapsed the complexity into simplification and normalized it, then introduced the normalized values. It's introduced more elements that it assumes will be near our macarena dancing apples, which includes an audience of onlookers due to being mascots, and a location fitting to the "floor" tag, so it assumes a dance floor would be my guess. All of these are essentially associatively linked concepts, I call them bleed-overs from unintended or potentially badly tagged large scale trainings.

  • Maybe they were necessary, maybe they are byproducts. I have no idea! Lets use the Normal timestep instead now for our sampler, see if we can get some different results!

  • "three people wearing red apple mascot suits standing side by side facing towards the camera, floor.

    They are dancing the macarena." scheduler - normal, 1024x768, steps 50, dcfg 10

  • Nearly identical. We can just assume they are going to produce nearly identical results when you're juiced to this point of distilled cfg.

Coloring Subjects: Mixed Apple Macarena

  • "three people wearing red apple mascot suits standing side by side facing towards the camera, floor.

    the apple on the left is green.

    They are dancing the macarena." 1024x768, steps 12, dcfg 3.5

  • "three people wearing red apple mascot suits standing side by side facing towards the camera, floor.

    the apple on the left is green, the apple on the right is yellow.

    They are dancing the macarena." 1024x768, steps 12, dcfg 3.5

With that concludes the basic lesson of T5 subject control with flux. More lessons to come.

5

Comments