TLDR GPT-4o Summary:
Key Points:
Challenges: Flux's inherent complexity makes it difficult to control, particularly due to issues like cross-contamination of tags (where similar tags interfere with one another) and unpredictability in outcomes. High costs, both in computation and resources, make it challenging to prototype new concepts. Inferencing is slow and expensive, even on high-end hardware. Flux requires careful management of failures, such as "concept bleed" (where elements mix unintentionally) and style or color deviations.
Benefits: Despite these challenges, Flux allows for the creation of entirely new interactions and behaviors, such as unique object combinations, animations, and complex scenes like orchestras. With the right training, Flux can generate highly detailed and controllable outcomes, making it powerful for intricate tasks.
Recommendations: The article advises on when to train certain components (e.g., CLIP_L, T5), noting that T5 should be trained for highly layered and abstract concepts, such as complex styles or multifaceted human interactions, whereas simpler tasks can be handled by CLIP_L. Missteps in training need careful monitoring, especially regarding style, color, and positioning errors that can worsen over time.
Complexity of Flux:
Flux combines multiple layers of complexity, making single-tag training and tag association difficult, especially with saturated concepts.
Benefits of Flux:
Allows creation of new behaviors and complex interactions, enabling control over intricate scenarios like orchestras and animations.
Challenges with Cross-Contamination:
Tags often overlap, leading to unpredictable outcomes, and associating complex concepts can result in errors and cross-contamination.
Training Costs and Resource Demands:
Training and inferencing Flux is expensive, requiring high-end hardware like A100s to train; this makes it impractical for individual users to inference and costly for complex concept training.
Risks of Model Failures:
Flux can produce concept and style errors (e.g., color bleed or misplaced objects), which need careful monitoring during training.
Conclusion:
Flux offers potential for businesses with adequate resources but is too costly and risky for individual use without powerful hardware.
Introduction:
I've been told this system is smart, and it is, in a way. Working with T5 for any extended amount of time will show you exactly how smart it really is and isn't, and T5 without training and finetuning in the desired direction doesn't always know what to do. In fact, most of the time it doesn't have any idea what it's looking at. Those tokens in that big 11b-parameter model are often not trained anywhere near usefulness, so you sometimes have to give it a nudge in the correct direction.
Flux tries to combine things, especially when training with FLUX SHIFT time sampling. It inherently attempts to combine layer upon layer of complexity, which is one of the reasons it's so difficult to prompt a single word into something cohesive most of the time. That is at once its biggest strength, a huge glaring problem to solve, and its most difficult aspect to tame: single-tag training.
The woman tag is heavily saturated, which is why when we proc woman, we see a woman. Same with man, girl, and so on. All of them are heavily saturated, commonly used words in the dataset, with large amounts of core training.
Training on those tags is very difficult. They are essentially a no-go when training pure concepts... unless you're training the CLIP_L or CLIP_L + T5 (I wouldn't advise training just T5 or just CLIP_L unless you're trying to fix broken training or following a specific concept training guideline).
Training styles with no CLIP_L and no T5 is fully viable and oftentimes produces the best results, but my concept training outcomes show that training without CLIP_L or T5 often leaves the new concepts as utter failures and wastes of money.
The lack of CLIP_L and T5 LoRA support across a multitude of ComfyUI nodes, training systems, merger systems, and inference systems has compounded this already difficult problem with a list of unnecessarily harder steps on top of it. I'm in the terminal more than I want to be as it is, so having to run certain things entirely through the terminal has become quite annoying. I don't have time to create full interfaces to streamline this, and I'm sure there are others in the same boat.
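If you're not sure whether a given LoRA file even carries text-encoder weights before feeding it to one of these systems, a quick inspection helps. Here's a minimal sketch using safetensors; key prefix conventions vary by trainer (kohya, ai-toolkit, diffusers), and the filename is a placeholder:

```python
# Group a LoRA's tensor keys by prefix to see whether it contains
# text-encoder (CLIP_L / T5) weights that an inference system might
# silently drop. Prefix conventions differ between trainers.
from collections import Counter
from safetensors import safe_open

def summarize_lora(path):
    prefixes = Counter()
    with safe_open(path, framework="pt") as f:
        for key in f.keys():
            # The first couple of underscore-separated chunks usually
            # distinguish UNet/transformer weights from text-encoder ones.
            prefixes["_".join(key.split("_")[:2])] += 1
    for prefix, count in prefixes.most_common():
        print(f"{prefix}: {count} tensors")

summarize_lora("my_flux_lora.safetensors")  # placeholder filename
```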
The Benefits:
You can create new associative behaviors, such as eating a hat, which is what I had to do when I realized Flux1D2pro was actually very, very good for training.
You can associate currently existing tags with other tags, which is what happens when you create sex poses, combination items such as hat stacks, tattoos with different associative shapes, and symbols for the system to rationalize.
You can create entirely new landscapes of complex interactions with the correct words and careful training, from landscape mural paintings to fully fleshed-out orchestras where everyone is controllable and playing a unique instrument. Hell, with the right prompting and training, you can make them animated through interpolation. That's the power of this system.
The Negatives:
With any complex system such as this, linking associative tags together will generate cross-contamination and impurities in an already impure system. It's oftentimes more akin to trying to mix oil and water than it is to associative rationalization.
Say one of your orchestral players has a harp. There is already a certain degree of harp training in the system, so you have to assign everything to your harp player from top to bottom, hair to shoes: train the instrument, train the hand placements, and so on. Everything CAN be done, and then you end up with the next problem: cross-contamination.
Well, say you pick out 500 specific, unique tags for your harp player and then create an associative interpolation system using them as specific access points. Yeah, sure, you can make it animate your harp player. But what happens when the tokens you're using have cross-contamination elsewhere? Unpredictability.
The Dangers:
Training FLUX today isn't cheap. Making it inference something new isn't cheap. Training with a high learning rate isn't cheap. This isn't a cheap model to simply prototype outcomes on. The costs are steep, and the individual is hit pretty hard when trying to train these things.
Inferencing the model isn't something everyone can do on their home PC without running the further Q_8 or Q_4 quantizations, which damage the model in unpredictable ways. Your harp player may lose their shoes, or one hand motion turns the hand into a harp itself. I know that inferencing FluxD with 50 steps is very time consuming even on a 4090 in Forge with fp8, so I know everyone else is suffering with that. Flux Schnell is a bit lighter weight, so it's likely more meant for consumer grade, while Flux1D is more meant for commercial use.
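For reference, a minimal inference sketch with Hugging Face diffusers, assuming you have access to the FLUX.1-dev weights; bf16 plus CPU offload is one way to fit it on consumer VRAM without dropping to Q_8/Q_4:

```python
import torch
from diffusers import FluxPipeline

# Load FLUX.1-dev in bf16 and offload idle submodules to CPU; slower,
# but avoids the heavier quantizations and their unpredictable damage.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

image = pipe(
    "an orchestra, every player on a unique instrument",
    num_inference_steps=50,  # the step count discussed above
    guidance_scale=3.5,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("orchestra.png")
```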
Tag cross-contamination is a big factor. Say you're training <angle_5_73> as an embed, and your entire purpose in proccing this data is to use this unique special string. Well, say you've already trained <angle_5_7>. When selecting tags, it's important to check the token sizes using a form of T5's token checker: https://opendemo.ai/tokenizer There is ALREADY cross-contamination due to the similarity overlap and the system eagerly anticipating utility. There are ways to solve this, but it's definitely a costly process when you've already trained things. It racks up cost on top of cost, which is already a problem given how expensive it is to train with lots of data as-is.
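You can also do that check locally with the T5 tokenizer from transformers (the SentencePiece vocabulary is shared across T5 v1.1 sizes); the two angle triggers here are the examples from above:

```python
from transformers import AutoTokenizer

# Flux pairs with T5-XXL; any T5 v1.1 tokenizer gives the same splits.
tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

for trigger in ["<angle_5_7>", "<angle_5_73>"]:
    pieces = tok.tokenize(trigger)
    print(f"{trigger}: {len(pieces)} tokens -> {pieces}")
# If two triggers differ only by a trailing sub-token, whatever was
# learned for one will bleed into the other.
```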
Markers of Failure:
Look closely at the outcomes. Very closely. Compare all objects that appear with what is in your dataset. Compare everything based on utility and need.
FLUX can handle many high and low resolution tasks, so just because a potted plant doesn't match up doesn't mean it's the end of the world. However, when you start seeing potted plants showing up in sinks and bathtubs, you need to halt and reassess immediately from an earlier EPOCH to determine the damage.
Concept bleed:
Utility-based concept bleed is important. You WANT certain things to bleed together, such as clothing, hats, shoes, and so on. You want the person to be wearing the item, not the item floating above their head or their foot missing.
Take note of small deviances. Small failures. Small style problems and then test if they compound for the next epoch or clear up. Sometimes it can take multiple epochs for things to clear up, other times they only get worse. Sometimes they create entirely divergent objects and entirely new problems such as extra arms, missing hands, overlapping characters, and so on.
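One way to test whether deviances compound or clear up is to render the same prompt and seed against each epoch's checkpoint and compare the outputs side by side. A sketch using diffusers' Flux LoRA loading; the checkpoint names are hypothetical:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a harp player, full body, concert hall"
for epoch in range(1, 6):
    # Hypothetical per-epoch checkpoint names from a training run.
    pipe.load_lora_weights(
        ".", weight_name=f"harp_lora-epoch{epoch:02d}.safetensors"
    )
    image = pipe(
        prompt,
        num_inference_steps=28,
        generator=torch.Generator("cpu").manual_seed(1234),  # fixed seed
    ).images[0]
    image.save(f"epoch{epoch:02d}.png")
    pipe.unload_lora_weights()  # keep epochs from stacking
```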
Color failures:
Flux is a color beast. It knows colors inside and out, and it should be doing colors inherently well. If you want finetuned colors, that's fine; it'll understand what you want. If you finetune too much, however, the color bleeds will spread to other objects, other materials, and so on. You may want some object bleed, you may want some color bleed, and that's fine. However, take note if your colors start to permeate everything: if your red is part of eye sclera, if your red blots the sky, and so on. If it's what you want, so be it, but if it's not, you need to be very wary of what you're creating.
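A crude way to put a number on "red permeating everything" is to measure what fraction of pixels a generated image devotes to the trained color, epoch over epoch. The threshold and filename below are arbitrary placeholders:

```python
import numpy as np
from PIL import Image

def red_fraction(path):
    # int16 avoids uint8 wraparound when subtracting channels.
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.int16)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # "Red" = the red channel clearly dominating both others.
    red = (r > 120) & (r - g > 60) & (r - b > 60)
    return float(red.mean())

frac = red_fraction("epoch03.png")  # placeholder output image
print(f"red pixels: {frac:.1%}")
if frac > 0.35:  # arbitrary alarm line; tune to your concept
    print("red may be permeating everything; compare earlier epochs")
```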
Style failures:
Your style should definitely follow specific image guidelines if you wish to create the style. Complex captions are great for more abstract outcomes, and great for more powerful and overwhelming style concepts, unless you want to be able to turn the style off.
If you want your concept to be togglable, you cannot blot out the sun with it. You need something or other to strengthen or weaken the concept, which means T5 needs to be taught what you're actually doing with the world.
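One common way to teach that toggle is paired captioning: the style named by a trigger on style images, and plain captions on regularization images. The trigger string <mystyle> here is hypothetical:

```python
# Caption pairs for a togglable style: T5 sees the style both named
# and absent, so it learns when to expect it and when not to.
captions = [
    ("art_0041.png", "<mystyle>, a harp player in a concert hall"),
    ("art_0042.png", "<mystyle>, stained-glass light over an orchestra"),
    # Plain pairs keep the style from blotting out the sun.
    ("reg_0001.png", "a harp player in a concert hall"),
    ("reg_0002.png", "an orchestra under warm stage lighting"),
]
```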
When To Train T5:
Train the T5 when you have iterative and conceptualized overlapping topics that can reach a multitude of various facets within an image: comic panels, overlapping stencils, stacked blocks, bowls of fruit, organized lines of people, and so on.
Completely derivative and entirely divergent concepts don't necessarily need to be taught to the T5.
Entirely new complex multi-faceted human interactions should be given a simple prompt for T5 to learn.
Complex styles meant to be reusable and shiftable should be taught to T5 so it can differentiate when to expect the style and when to not.
CLIP_L can handle most simple and even a large amount of complex character interactions, but it cannot handle everything.
The T5 is its combative rival, after all, and it's good at colors, shapes, designs, objects, positioning, location, and associative integration with layered images for things like comics and so on.
Trying to train CLIP_L or the UNET for concepts as complicated as comics is a losing battle unless you teach the T5 what to expect from those comics as well, which likely means an entirely different training run just to teach the T5, since Kohya's system currently has few togglable features for it.
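For reference, a hedged sketch of what such a run can look like with kohya's sd-scripts. The flag names are from the FLUX branch at the time of writing and change quickly, so verify them against your checkout; all paths are placeholders:

```python
import subprocess

cmd = [
    "accelerate", "launch", "--mixed_precision", "bf16",
    "flux_train_network.py",
    "--pretrained_model_name_or_path", "flux1-dev.safetensors",
    "--clip_l", "clip_l.safetensors",
    "--t5xxl", "t5xxl_fp16.safetensors",
    "--ae", "ae.safetensors",
    "--network_module", "networks.lora_flux",
    "--network_dim", "16",
    # Omitting --network_train_unet_only enables CLIP_L training;
    # train_t5xxl=True is the current switch for also training T5.
    "--network_args", "train_t5xxl=True",
    "--dataset_config", "dataset.toml",
    "--output_dir", "output",
]
subprocess.run(cmd, check=True)
```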
Do the benefits outweigh the risks:
It's kind of hard to say today. The system is fairly new and it seems I'm travelling through a small section of jungle with a machete and a dream of uncensored El Dorado.
I'd say there are other alternatives with multi-shot that can achieve many similar goals, such as training detection software with YOLO to identify and infer whether your orchestral players are correctly generated, then using GANs to generate elements more efficiently instead of fully inferencing FLUX.
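A minimal sketch of that YOLO idea with ultralytics: run a detector over generated frames and flag any that are missing the classes you expect. This assumes custom-trained weights, since the stock COCO classes know neither harps nor cellos:

```python
from ultralytics import YOLO

model = YOLO("orchestra_yolo.pt")  # hypothetical custom-trained weights
expected = {"harp", "violin", "cello"}

for frame in ["epoch03.png", "epoch04.png"]:
    result = model(frame)[0]
    found = {result.names[int(c)] for c in result.boxes.cls}
    missing = expected - found
    if missing:
        print(f"{frame}: missing {sorted(missing)}")
```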
There are any number of other AI systems that can accomplish much of what this does, but not all in one neat little package, and the time cost for those other AI systems in manpower and computational power doesn't quite match.
I'd say experts in the other, more established fields would be better suited to handle specific tasks for specific needs, depending on the identified requirements of the task, rather than placing all the eggs in one basket for something like the FluxD experiment.
For now I say NOT FOR THE INDIVIDUAL, this is not worth the risk to mass train with pure raw captions. That's just like, my opinion man.
For the business, this is likely cheaper than hiring a bunch of experts and organizing a bunch of intermingling systems that take a long time to generate one frame of an orchestra and then interpolate it with 50 others that may or may not match closely enough to be the next frame.
Pretty much if you actually own A100s this is plenty fine to use. There's a ton of utility and potential here.
A40s are not good at all for latents; even with 8 of them, it estimated 7 hours to cache 15,000 latents. I tried to run Dolphin 70b on them as well, but the inference was taking upward of 2 minutes per 100 tokens. Additionally, bucketing images into latents meant it needed to cache about 5 times, I think. With A100s, the first 15,000 latents cached in less than 30 seconds, and the final batch took maybe an hour. Even when using CONSIDERABLY fewer latents, A40s take a large additional amount of time to train FLUX itself.
So... hit or miss. It's expensive for the individual, but there are businesses that can afford to throw money on the pyre for it, so yes and no.