This is my new working theory on FLUX, after spending too many hours on the FLAN and V1.1 T5 models, as well as CLIP-L interactions.
Theory: FLUX was a "self-learning" model that used the full CLIP vision model and the full T5 (with decoders), and was trimmed after training.
With a full CLIP vision model you could use an untagged dataset and feed the information into the T5, allowing for a self-tagging model enhanced by T5 natural language (using the existing 62,000-word CLIP vocab, or a custom one).
After a few hundred epochs you could then use the T5's logic to generate natural language with logic transforms such as facial expressions, image reversal, logic about placement in the scene, etc. Then, using the CLIP vision model, check whether those logic transforms were actually generated by the diffusion model.
Example: An image of a dog on a beach
CLIP vision "sees": dog, beach, outdoors, etc.
T5 gets the tags dog, beach, outdoors and generates some elaborate 100-500 word AI BS, which is fed back in to be tokenized.
After X epochs on the source dataset (let's say 1B images), the T5 generates a sentence where the dog is in a forest. The diffusion model generates the image, which is then run past CLIP vision; if "forest" is not among the tags, the image is not used for the next training pass.
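The loop described above can be sketched in a few lines. This is a toy illustration only: every function here (clip_tags, t5_caption, diffuse) is a hypothetical stand-in I made up for the real models, not an actual API.

```python
def clip_tags(image):
    """Stand-in for the CLIP vision model: return the set of tags it 'sees'."""
    return image["tags"]

def t5_caption(tags):
    """Stand-in for T5: expand tags into a prompt, sometimes applying a
    logic transform (here: move the subject from the beach to a forest)."""
    if "beach" in tags:
        tags = (tags - {"beach"}) | {"forest"}  # the logic transform
    return tags, " ".join(sorted(tags))

def diffuse(prompt, target_tags):
    """Stand-in for the diffusion model: return a generated 'image'.
    Here it imperfectly realizes the prompt, failing the forest transform."""
    realized = set(target_tags)
    realized.discard("forest")  # simulate the transform not being generated
    return {"tags": realized}

def training_pass(dataset):
    kept = []
    for image in dataset:
        tags = clip_tags(image)                   # 1. self-tag the raw image
        target_tags, prompt = t5_caption(tags)    # 2. T5 elaborates / transforms
        generated = diffuse(prompt, target_tags)  # 3. diffusion generates
        if target_tags <= clip_tags(generated):   # 4. CLIP verifies the transform
            kept.append(generated)                #    keep only verified images
    return kept

dataset = [{"tags": {"dog", "beach", "outdoors"}}]
print(len(training_pass(dataset)))  # → 0: the failed "forest" image is dropped
```

The key step is 4: images whose generated tags don't contain the transformed tag never make it into the next training pass.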
One of many math problems that I have put to engineers, math PhDs, artists, etc.:
A 512x512 greyscale image has over a googol of possible data combinations.
SD 1.5's dimensions allow for an equally complex number of possible combinations: 256 to the power of 1024.
And yet we seem to get repeating patterns.
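A quick sanity check on the first count (a googol is 10^100): an 8-bit 512x512 greyscale image has 256^(512*512) possible states, and taking the base-10 log shows how far past a googol that is.

```python
import math

pixels = 512 * 512                    # 512x512 greyscale image
levels = 256                          # 8-bit grey levels per pixel
digits = pixels * math.log10(levels)  # log10 of 256**(512*512)

print(f"256^{pixels} has about {digits:,.0f} decimal digits")  # ~631,000 digits
print(digits > 100)  # a googol has only 100 digits, so: True
```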
One thought I had was the use of data compression, i.e. a Fourier transform, which in theory reduces the complexity by a power of 10.
But maybe it's as simple as this: every diffusion model is being "shaped" by one or two CLIP models, and no matter how complex it is, it has to fit in those molds. Even if you have a googol of connections, if those connections all have to fit through a square mold, a circle mold, and a triangle mold, they will all resemble each other.
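The "mold" intuition can be shown with a toy experiment (not FLUX or CLIP, just random vectors): take many distinct high-dimensional vectors, which barely resemble each other, and force them all through one shared low-dimensional linear map. Afterwards they point in far more similar directions, purely because of the shared bottleneck.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_abs_cos(vectors):
    """Average |cosine| over all pairs: how much the set 'resembles itself'."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(abs(cosine(vectors[i], vectors[j])) for i, j in pairs) / len(pairs)

random.seed(0)
# 40 random vectors in 256 dimensions: nearly orthogonal on average.
high_dim = [[random.gauss(0, 1) for _ in range(256)] for _ in range(40)]

# One shared "mold": a single fixed linear map from 256 dims down to 3.
mold = [[random.gauss(0, 1) for _ in range(256)] for _ in range(3)]
squeezed = [[sum(m * x for m, x in zip(row, vec)) for row in mold]
            for vec in high_dim]

print(f"mean |cos| in 256 dims:    {mean_abs_cos(high_dim):.2f}")   # near 0
print(f"mean |cos| after the mold: {mean_abs_cos(squeezed):.2f}")   # much larger
```

In 256 dimensions random vectors are nearly orthogonal; squeezed into 3 dimensions through one shared map, their average resemblance jumps, which is the "everything fits the same molds" effect in miniature.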


