
Mega models incoming?


Introduction

No matter how we look at it, the T5-XXL text encoder that FLUX pairs with is a serious language model in its own right: it's the encoder half of the roughly 11-billion-parameter T5-XXL, while FLUX's own transformer weighs in at 12 billion parameters. There's an absolute ton of information packed into T5-XXL and it's highly capable, albeit limited in comparison to bigger siblings like LLaMA and GPT-4o. But most people can't run the best models (100B+) on their PCs, so I think they picked this one in particular for its power-to-size ratio and reusability. Most people don't have 4090s, and a large portion of the best models can't even run in 24 GB of VRAM.

Preface

Based on an assessment of the various combination methods we've seen so far, and on statements from companies like OpenAI suggesting that bolted-together multi-model pipelines are likely not the end-game, I've come up with a hypothesis and a set of potential directions for exploration in the image generation field. I don't know how many of them have already been thought of, possibly all of them, but I want to share my findings and ideas nonetheless.

Musings

  • Systemic tagging has shown high capability so far, as I demonstrated in my recent article playing with simple fruit and human tagging order. I've been trying to create systemic tagging patterns in SD1.5, SDXL, and PDXL since I started playing with them, mostly to little effect: some were useful, but most weren't, and a few basically bludgeoned the image to death. T5 plus FLUX, by contrast, is insanely powerful at producing grids of information; it converts prompts into structured signals that the rest of the pipeline can actually use.

  • This is a thought and idea developed from an experiment I ran in Comfy on multiple occasions. Similar to a model filling in a lined sketch with style and pattern, the idea is for the results to draw their own lines before filling them in: the model lays down its own guideposts toward a goal, then iteratively expands the possibilities through loopback, with tag influence introduced, removed, and changed at each pass. There are some similar tools in Forge, but they're quite inflexible compared to what I'm proposing for FLUX and T5.

  • There are already systems that introduce sequenced tags based on step ranges, regions, positions, and so on, but that isn't what I'm talking about. This is a different process: a fully internal, high-dimensional, multi-model loopback, with multiple comparative and competing schedulers that re-noise and denoise across several models, and several model types, in sequence.

  • There are some ComfyUI workflows in this direction, but they don't quite reach the goal I have in mind. I see something greater: something fully combinational and competitive at the same time, similar in spirit to FLUX but on a larger scale, with many models.
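To make the loopback idea above concrete, here is a minimal structural sketch in pure Python. The "models" are stubs operating on a list of floats standing in for a latent; the names (`model_a`, `loopback`, the schedules) are all hypothetical, not any real ComfyUI or diffusers API. What it shows is only the loop shape: re-noise, then hand the latent to alternating models while the tag weight shifts each pass.

```python
import random

# Stub "models": each takes a latent (list of floats) and a tag weight
# and returns a refined latent. Real backends (FLUX, SDXL, ...) would
# slot in here; these stubs only illustrate the loop structure.
def model_a(latent, tag_weight):
    return [x * 0.9 + tag_weight * 0.1 for x in latent]

def model_b(latent, tag_weight):
    return [x * 0.8 + tag_weight * 0.2 for x in latent]

def add_noise(latent, amount, rng):
    # Re-introduce noise between loopback passes.
    return [x + rng.uniform(-amount, amount) for x in latent]

def loopback(latent, steps=4, noise_schedule=(0.4, 0.3, 0.2, 0.1),
             tag_schedule=(1.0, 0.8, 0.6, 0.4), seed=0):
    """Alternate two models over several loopback passes, re-noising
    between passes and shifting the tag weight each iteration."""
    rng = random.Random(seed)
    models = [model_a, model_b]
    for i in range(steps):
        latent = add_noise(latent, noise_schedule[i], rng)
        latent = models[i % len(models)](latent, tag_schedule[i])
    return latent

result = loopback([0.5, 0.5, 0.5])
print(len(result))  # 3
```

The two schedules are the point of flexibility: in a real pipeline each entry could swap not just the weight but the model, the sampler, or the active tag set per pass.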

The Mentor Model

I'd say there's high potential for an upcoming super-combinatory model: one that any number of core diffusion models can be attached to simultaneously, with the super model imposing its will over the others while making requests to the pre-trained blocks those models already contain, based on each model's own inference process. The VRAM requirement would be very high, so I doubt it would run on a PC at first, but the outcome I see is a potential path toward a generative hierarchy of models, where one model combination is simply one of many agents, all working collaboratively at once.

A model like T5 could be the glue that makes this entire process work: one hub among many, accessed in a collaborator style, with the attached models cooperating and competing simultaneously to be judged on their output by the super model.
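A toy sketch of that mentor-and-experts loop, with all names (`make_expert`, `mentor_score`, `generate`) invented for illustration: several attached generators each propose a candidate, and the mentor scores them and keeps the winner. Real models would replace the one-line stubs.

```python
# Hypothetical "mentor" loop: each attached expert proposes a candidate
# for the prompt; the mentor judges them and returns the best one.

def make_expert(bias):
    def propose(prompt):
        # Stand-in for a full diffusion pass conditioned on the prompt.
        return f"{prompt} [rendered with bias={bias}]", bias
    return propose

def mentor_score(candidate):
    # Stand-in for the mentor model judging a candidate's output.
    _, bias = candidate
    return -abs(bias - 0.5)  # this toy mentor prefers bias near 0.5

experts = [make_expert(b) for b in (0.1, 0.5, 0.9)]

def generate(prompt):
    candidates = [expert(prompt) for expert in experts]
    return max(candidates, key=mentor_score)

best = generate("a lighthouse at dusk")
print(best[1])  # 0.5
```

Cooperation rather than pure competition would mean the mentor blends several candidates instead of taking the argmax, but the dispatch-and-judge shape stays the same.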

This sort of thing doesn't work very well with text, since text needs to be absolutely perfect and precise. These, however, are images. There is no necessity to always be perfectly concise and precise; we can create anything. The outcome of our sentences doesn't need to be perfect, the colors don't need to match, and there are countless valid art styles, each resting on the interpretation of the artist.

The Potentials

  • Distilled versions of combined T5-plus-UNet models, similar in spirit to Mixtral, where the merged model is an insanely powerful combination welded together with metaphorical steel and can handle LoRAs for any of its component pieces. It would be a bit rigid, but the concept is a promising avenue for large-scale testing on PCs.

  • There are countless other potentials of this particular system I haven't thought of or researched, but I think output quality, and especially contextual and relational quality, is the biggest and most important aspect. Introducing aesthetics to low-detail, low-pixel-count regions naturally and heuristically is a natural focus as well as a byproduct of a concept like this.

  • Imagine a person's face in a window of a skyscraper in a 35,000x35,000 image, for example, while the image still retains the natural control and complexity of the world around it. All of this can be tagged and positionally identified using a system that recognizes smaller, less complex structures and then composes them with an LLM-style model finetuned for the task. The whole thing could be automated using this more intricate multi-modal computer-vision combination system to tag and segment arbitrarily complex images to a high degree.

  • Build a world at your fingertips, then populate it with chickens. Instantly segment and replace the chickens with geese; segment and replace the trees with giant bananas. All of this is possible if the request system sends the correct and necessary tag requests to the correct and necessary mentor, and T5 or something similar is a perfect catalyst for routing that matching data into distinct yet uniformly structured requests. You should be able to pick out one of the 500 visible chickens, the one by the tree in the lower left, and change its beak color just by clicking on the image and sending the resulting segmentation point to the LLM.

  • High-complexity, high-context image combination and integration is very possible. Highly complex tag sequences can be used to generate dense grids of information and data in sequence, and shapes that represent thousands of tags could serve as subjective prefaces and training guideposts.

  • Reverse inference would also be very possible, since the entire system currently uses T5 as its regulatory core. Reverse inference, for those unaware, is the practice in neuroscience of reasoning backward from observed activity, such as MRI scans of neurons and nerve centers, to the stimuli or processes that produced it. The same idea can be applied to AI in similar ways to map characteristic response pathways, and those mappings can then serve as guideposts for highly optimized routes back into the harder-to-access, high-data, high-traffic sections, like information shunts.
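The click-to-edit idea in the chickens-and-geese bullet can be sketched as a tiny pipeline. This is a toy under loud assumptions: segments are labeled boxes, a click picks the containing segment, and an edit request rewrites its tag. A real pipeline would use a point-prompted segmenter (SAM-style) plus an inpainting model; every name here (`Segment`, `segment_at`, `retag`) is hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Segment:
    label: str
    box: tuple  # (x0, y0, x1, y1) in image coordinates

def segment_at(segments, x, y):
    # Find the segment whose box contains the clicked point, if any.
    for seg in segments:
        x0, y0, x1, y1 = seg.box
        if x0 <= x <= x1 and y0 <= y <= y1:
            return seg
    return None

def retag(segments, x, y, new_label):
    # Rewrite only the clicked segment's tag; an inpainting model
    # would then re-render that region from the new tag.
    hit = segment_at(segments, x, y)
    if hit is None:
        return segments
    return [replace(s, label=new_label) if s is hit else s for s in segments]

scene = [Segment("chicken", (0, 0, 10, 10)), Segment("tree", (20, 20, 40, 40))]
edited = retag(scene, 5, 5, "goose")
print(edited[0].label)  # goose
```

The point is the routing: the click resolves to one segment out of hundreds, and only that segment's tag travels to the model, which is exactly the "one chicken of 500" interaction described above.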
