That is the 6 million dollar question now, isn't it? How is it REALLY expected to respond?
Well, it dawned on me. There has been an almost constant pattern to the system since I started. I believe I cracked the banana peel for concept-to-concept interaction in the process.
What is Flux TRYING to do?
If you read the Flux papers and follow the specific technical guidelines for T5, you'll get a massive headache. I know I did. So let me just break it down for you, based on a mixture of experimentation and rhetoric from the people who actually made the things.
T5 Hardboiled:
T5 is trying to identify patterns of numbers in conjunction with other numbers. Your prompt is turned into numbers (tokens) for it to have a good peek at; in Flux, T5 only ever sees the text, not the image. That's kind of its thing: it identifies patterns and then condenses them into different numbers based on those patterns. T5 is adaptable and flexible for different downstream tasks, and that's what it was built for, to be reusable. It was trained as a text-to-text language model, where it takes in numbers and produces an outcome. Simple enough in concept, highly complex in actual practice.
https://wandb.ai/mukilan/T5_transformer/reports/Exploring-Google-s-T5-Text-To-Text-Transformer-Model--VmlldzoyNjkzOTE2 This is a fair article on the subject and does a good job summarizing the important aspects.
I KNOW I did a subpar job describing this, but I was also half asleep while typing it. I'll fix it when I wake up.
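Alright, so it turns what I say into other things for the image model to process, alongside CLIP's own read of the same prompt. (Strictly speaking, Flux's image model is a transformer rather than a UNet.) Yeah, that makes sense. This also explains why directives work so well.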
WHY does this matter for what I'm describing?
FLUX compartmentalizes. It treats every tag or tag sequence as though it's another request, based on the T5 tokenization mixed with its own CLIP inference system. The more technical of you probably code things for it, and the more amateur of you have probably just played with it like I have.
The outcome is the important aspect. The practical results of the puzzle and the utility of the problems that are actually solved.
FLUX is a unique entity on this front.
"a woman"
"an apple"
"a woman with an apple"
WHY did I just post these images, you ask? Well, this next image will explain.
"an apple with a woman"
Peculiar. It's almost as though... what gets emphasized in the developed image is based entirely on the order.
Notice the fixation of the image itself. The top half of the image is almost entirely devoted to an apple, while the bottom portion is devoted to the woman's fixation on the apple.
Intriguing.
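If you want to repeat the ordering test with your own subjects, here's a minimal sketch for writing out the prompt pairs. The function name `order_variants` and the default list of relation words are my own assumptions for illustration; nothing here touches Flux itself, it only builds the strings you would paste in.

```python
def order_variants(first, second, relations=("with", "on")):
    """Return each relation phrased both ways round: 'A rel B' and 'B rel A'.

    Hypothetical helper: it only builds prompt strings, it does not call Flux.
    """
    prompts = []
    for rel in relations:
        prompts.append(f"{first} {rel} {second}")
        prompts.append(f"{second} {rel} {first}")
    return prompts


if __name__ == "__main__":
    for prompt in order_variants("a woman", "an apple"):
        print(prompt)
```

Running each pair with the same seed and settings should make it easier to see whether the difference really comes from the word order and not from random variation.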
Let's try some different potential interactions a subject could have with another subject and see what we can come up with.
"a woman on an apple"
Still treating her as though she's just a face, huh. Interesting nonetheless. This apple has leaves, and those count as part of the apple, and yet the leaves themselves are cut off at about half of the image.
Yep yep. She's still technically visibly on top of it.
Now let's see what happens when we put the apple on her.
"an apple on a woman"
Hmm... well it's definitely... on her. She seems to be a planetoid of her own in a lot of ways. She BECAME the apple in a sense, like I gave her apple powers.
Intriguing. Now let's try some more complex systems. I'm going to use a 2x2 grid and assume that it's going to display the screen as a 3-dimensional concept, where the four displayed objects are positioned according to the first 4 comma-separated, subject-specific tags.
"an apple, an orange, a banana, a pear"
HMMMM it seems that it didn't like that one bit. Using many tags together without solidifying agents tends to create... blurry messes. It's not quite sure what to focus on or where. It's likely focusing on an object completely off scale.
Let's try it using the conjoin system.
"an apple on an orange and a banana on a pear"
Interesting. We seem to have created a pear-orange bowl containing an apple and a banana, kind of mushed into the side of a mushed pear from the apple and the orange.
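The two prompt styles compared above, loose comma-separated tags versus subjects conjoined with a relation word, can be sketched as two small string builders. The names `comma_prompt` and `conjoined_prompt` are made up for this example, and pairing subjects two at a time is just my reading of the "conjoin system", not anything Flux defines.

```python
def comma_prompt(subjects):
    """Loose tags with nothing tying them together: 'A, B, C, D'."""
    return ", ".join(subjects)


def conjoined_prompt(subjects, relation="on", joiner="and"):
    """Pair subjects up and link each pair: 'A on B and C on D'.

    Hypothetical helper; it assumes subjects are paired in the order given.
    """
    if len(subjects) % 2 != 0:
        raise ValueError("need an even number of subjects to pair them up")
    pairs = [
        f"{subjects[i]} {relation} {subjects[i + 1]}"
        for i in range(0, len(subjects), 2)
    ]
    return f" {joiner} ".join(pairs)


if __name__ == "__main__":
    fruit = ["an apple", "an orange", "a banana", "a pear"]
    print(comma_prompt(fruit))
    print(conjoined_prompt(fruit))
```

Swapping `relation` for something like "beside" or "under" is a quick way to check whether a given word actually works as a solidifying agent.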
So we have found an interesting conundrum that requires additional information.
"a 2x2 grid, an apple on an orange and a banana on a pear"
*checks* Yep, they line up and are where they are supposed to be. Now let's make them 3D, shall we?
"a 2x2 grid overlay of subjects on a table in a 3d environment, an apple on an orange and a banana on a pear"
Let's intentionally try to confuse it with a 3x3 grid and a bunch of duplicated information.
"a 3x3 grid overlay of subjects on a table in a 3d environment, an apple on an orange and a banana on a pear, an apple on an orange and a banana on a pear, an apple on an orange and a banana on a pear, an apple on an orange and a banana on a pear, an apple on an orange and a banana on a pear, an apple on an orange and a banana on a pear,"
Did it get it right? Probably not. Did it try? It sure did. It tried its little Flux heart out, and there are the results.
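For anyone who wants to push the grid tests further, here's a rough sketch of how the grid prompts above can be assembled. `grid_prompt` and its parameters are hypothetical names of my own; it builds the same kind of string as the 2x2 and 3x3 prompts, minus the trailing comma on the 3x3 one.

```python
def grid_prompt(rows, cols, clause, repeats=1, setting=""):
    """Build 'a RxC grid [setting], clause, clause, ...' as one prompt string.

    Hypothetical helper: 'setting' is optional extra wording after the grid,
    and 'repeats' controls how many times the clause is duplicated.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    header = f"a {rows}x{cols} grid"
    if setting:
        header += f" {setting}"
    return ", ".join([header] + [clause] * repeats)


if __name__ == "__main__":
    clause = "an apple on an orange and a banana on a pear"
    scene = "overlay of subjects on a table in a 3d environment"
    print(grid_prompt(2, 2, clause))
    print(grid_prompt(2, 2, clause, setting=scene))
    print(grid_prompt(3, 3, clause, repeats=6, setting=scene))
```

Changing `repeats` one step at a time should show roughly where the duplicated information stops helping and starts confusing it.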