It doesn't want to train, and you can't brute force it.
I'm personally having a lot of trouble training something that FLUX can understand and use. It has TOO MANY parameters, so when I feed it short, tightly scoped image sets, say for anime or danbooru tag training, I end up with a large series of overlapping concepts bleeding in from FLUX's pre-existing training. This has led me to something I call the rabbit hole paradox: the further I dig, the deeper the depths go.
Incorrect Assessments:
If you were to train something like this on PDXL with 2000 images, you would get 2000 images of causal response in a fairly predictable way, which is why FEWER IMAGES is better for PDXL in a lot of ways. When I train 2000 images into FLUX, I get an intensely difficult-to-control monster. A complete beast that can't be tamed no matter how I toggle the configs or tweak the settings.
Failed Experiments:
My first experiment was to train the perfectly formatted and organized Consistency tagset into the system.
Experiment 1: Hello World.
The first experiment was okay and I released it, but it definitely left a whole lot of loose hangnails and problems that needed addressing. Depending on the run, the consistency suffered, it didn't introduce enough elements into the system to make it look similar to NovelAI, or it completely gutted the context in one way or another due to overlapping tags.
Experiment 2: Bigger.
So I figured, okay, I'll just burn a shitload of images into it, and then finetune using Consistency. So that's what I did. I burned 2000 images into it, and then finetuned the Consistency set over top (a rough sketch of the two-stage run is below). The outcome was very strange.
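For reference, the two-stage run looked roughly like this. This is a minimal sketch assuming kohya's sd-scripts FLUX branch; the script name, every flag, both learning rates, and all paths are assumptions for illustration, not a copy of my actual config.

```python
# Hypothetical two-stage run, assuming kohya sd-scripts' FLUX branch.
# The script name and every flag below are assumptions from memory;
# check your installed version before copying any of this.
import subprocess

def train_lora(dataset_dir, lr, out_name, resume_from=None):
    cmd = [
        "accelerate", "launch", "flux_train_network.py",
        "--pretrained_model_name_or_path", "flux1-dev.safetensors",  # placeholder path
        "--network_module", "networks.lora_flux",
        "--train_data_dir", dataset_dir,
        "--learning_rate", str(lr),
        "--output_name", out_name,
    ]
    if resume_from:
        # Continue training an existing LoRA instead of starting fresh.
        cmd += ["--network_weights", resume_from]
    subprocess.run(cmd, check=True)

# Stage 1: burn the 2000-image set in.
train_lora("data/burn_2000", lr=1e-4, out_name="burn")
# Stage 2: finetune the Consistency set over top of the stage-1 weights,
# at a lower rate (both rates here are illustrative, not my real values).
train_lora("data/consistency", lr=4e-5, out_name="burn_plus_consistency",
           resume_from="output/burn.safetensors")
```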
It didn't line up with the tags, and yet at the same time it did.
The lower learning rate version would, about a third of the time, produce something from core FLUX that I couldn't simply remove via tags, so powerful that it cut through the entire training.
The higher learning rate version would completely burn away all of the core details, so much so that it nearly ruined the image quality itself.
Experiment 3: FIRE EVERYTHING!
With 10000 images of carefully formatted and organized training, this was a failure.
I've been reading paper after paper after paper, trying to come up with the answer here. The biggest and most obvious conclusion was just: experiment more.
The final training was a hail mary. I spent 4 days tagging about 10000 images, a mixture of synthetic and sourced images from danbooru, gelbooru, r32, e621, and more. I was attempting to introduce a plethora of new anime poses, interactions, concepts, clothes, and as many situations and new elements into the system as I could imagine. I used a carefully formatted and carefully planned tagging scheme: always assuming certain things when certain tags are present, plus aesthetic tagging, quality tagging, and explicit/questionable/safe NSFW tagging (a sketch of the scheme is below).
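To make that scheme concrete, here's a minimal illustrative sketch of how a single caption gets composed. The specific rating/quality/aesthetic values are example tags, not the exact vocabulary from my dataset:

```python
# Illustrative caption composer for the tagging scheme described above.
# The rating/quality/aesthetic values are example tags only.

def build_caption(rating, quality, aesthetic, tags):
    """Front-load the context tags so they always appear, then the content tags."""
    assert rating in {"safe", "questionable", "explicit"}
    return ", ".join([rating, quality, aesthetic] + tags)

print(build_caption(
    "safe", "high quality", "anime style",
    ["1girl", "cowboy shot", "school uniform", "outdoors"],
))
# -> safe, high quality, anime style, 1girl, cowboy shot, school uniform, outdoors
```

Front-loading the context tags is the point: the rating and quality always land in the same position, so the model can lean on them no matter what content tags follow.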
I had assumed the low learning rate would produce the needed output, but my thoughts, assessments, and everything I'd learned meant nothing. I know nothing. I'm back where I started, like a child, and I must relearn everything about FLUX from the ground up to be sure I stop wasting time and money.
Another failure. I'll try to retrain under multiple different situations. I'll be using multiple training programs, multiple methods, and multiple assessments over the coming weeks to try to figure out a way to make this work.
Not only was this one considerably worse than the last ones, it was also considerably more time consuming, no matter which step count, gradient setting, or epoch I tested. The only actual semi-functional versions needed strength 5+ when loading the LoRA (see the sketch below), and even that meant nothing in the end.
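For anyone wondering what "strength 5+" means in practice: in ComfyUI it's just the LoRA loader's strength slider. A minimal sketch assuming the diffusers FluxPipeline instead; the model ID is real, the LoRA path is a placeholder:

```python
# Sketch: loading a FLUX LoRA at an absurdly high strength with diffusers.
# A healthy LoRA runs around 1.0; needing 5+ means the weights barely took.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("my_failed_lora.safetensors")  # placeholder path

image = pipe(
    "safe, high quality, 1girl, cowboy shot",
    joint_attention_kwargs={"scale": 5.0},  # the "strength 5+" from above
).images[0]
image.save("test.png")
```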
My determination says I'll create it, but there's no telling how much work it will take. I'm assuming at least a few months until I get a properly finetuned NAI alternative built into FLUX.
There were SOME successes:
It produces a surprisingly large number of good images. Its potential is high, but it's not where I want it.
Core Burn Differentiation:
safe/questionable/explicit:
It DOES handle the SFW-to-NSFW gradient without much of an issue. The differentiation is a problem, though, as it causes a lot of strange quirks when using base FLUX prompting alongside the tagging.
It very rarely uses NSFW elements in SFW situations. This tells me it has a fundamental understanding of WHAT IS EXPECTED when situations occur, which means the entire system has to be trained with proper context tags along with the countless other tags required to steer it toward whatever image you desire in that situation.
CORE FLUX TAG ASSESSMENTS BEFORE TRAINING:
1girl:
This tag is putrid. It's supposed to count subjects and not introduce too many extra elements. Instead it pulls in young girls, children, old women, and so on, all tied directly into the word prompts. My hunch is that it doesn't understand what the tag means, or its tag pool is too wide for it to matter, so it grabs at everything in that huge list of girl tags, which makes it even worse. It sometimes produces 500 pound grandmothers wearing a latex bodysuit with the majority of the weight in the thighs; other times it produces the trained data, which is a gorgeous doll-looking woman with shapely thighs and a narrow waist. It's very hit or miss even when trained.
1boy:
Similar to 1girl, it's just supposed to count subjects and introduce a base-form doll, but it doesn't work. It's just as putrid, and it also doesn't understand gender very well. You're more likely to get a cartoony character, until you aren't. Then it's just weird.
age <number>:
This is a touchy one, and I really don't like that it exists below a certain number. It makes me nervous, so I want to burn every number below 18 out of anything NSFW, but that won't matter unless I release a FULL MODEL, since LoRA strength can just be toggled. This worries me, so I plan to completely burn out age 17 and below for anything NSFW (a dataset-side filtering sketch is below), but there's so damn much context that it'll take a lot of work to even see the elements present without a team.
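Since this is the one place I refuse to rely on the model's judgment, filtering the dataset itself is the obvious first step. A minimal sketch, assuming captions live as one comma-separated .txt per image (the tag format and paths are assumptions):

```python
# Sketch: drop any training pair whose caption combines an under-18 age tag
# with a questionable/explicit rating. The caption format is an assumption:
# one .txt file of comma-separated tags per image.
import re
from pathlib import Path

AGE_TAG = re.compile(r"^age (\d+)$")
NSFW_RATINGS = {"questionable", "explicit"}

def is_allowed(caption: str) -> bool:
    tags = {t.strip() for t in caption.split(",")}
    ages = [int(m.group(1)) for t in tags if (m := AGE_TAG.match(t))]
    return not (any(a < 18 for a in ages) and tags & NSFW_RATINGS)

for txt in Path("dataset").glob("*.txt"):  # placeholder dataset layout
    if not is_allowed(txt.read_text()):
        print("exclude:", txt.stem)  # pull the caption AND its image from the set
```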
bodyparts:
Legs, arms, hands, fingers, etc. Everything is hit or miss with the core model, and when burning tags into the model, these concepts rarely stick unless you list body parts in a very verbose manner. If you DO list them, the tag overlap causes a substantial problem with the images: it introduces hands on hands on hands on hands, feet on feet, and so on. It gets really gross really fast.
cowboy shot:
With no negative prompt, everything gets a cowboy hat unless this tag is trained with enough strength. Finetune it later and everything has a cowboy hat anyway. Even lamps, tables, etc. It gets really weird really fast.
portrait:
This one is definitely all over the place. It can spawn a picture, or a wall of pictures, or a single person's context in a situation, or a series of contexts on contexts on contexts in a situation.
colored <anything>:
It can color things, but not always. It succeeds more often than a lot of other AI systems I've seen, but definitely less than you'd expect from a system this size. In contextualized situations, most color tags simply do nothing, or the colored thing simply doesn't exist when prompted. It kind of confuses me. I need to automatically generate a thousand or so images (a batch sketch is below) and properly assess this.
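A minimal sketch of that batch assessment, again assuming diffusers; the tag list, counts, and output paths are placeholders:

```python
# Sketch: sweep a few color tags across many seeds so I can eyeball the
# failure rate instead of guessing from a handful of generations.
import torch
from diffusers import FluxPipeline
from pathlib import Path

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
Path("assess").mkdir(exist_ok=True)

color_tags = ["colored hair", "colored skin", "colored eyes"]  # placeholder subset
for tag in color_tags:
    for seed in range(333):  # ~1000 images total across the three tags
        gen = torch.Generator("cuda").manual_seed(seed)
        image = pipe(f"safe, high quality, 1girl, {tag}", generator=gen).images[0]
        image.save(f"assess/{tag.replace(' ', '_')}_{seed}.png")
```

Fixed seeds per tag make the runs comparable: the same seed with a different color tag should change only the color, and anywhere it doesn't is a data point.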
Conclusion: I know nothing about FLUX. Let's start over.
As a great man once told me: go into everything knowing nothing, and you'll leave knowing everything about nothing. The Dunning-Kruger effect sure stings when it costs money.
I think it's time I teach FLUX how to know nothing, so it can actually learn.