For those one or two people using the bigASP models, here's an update on my work.
New Tag String Generation
bigASP has always focused heavily on photoreal, but during bigASP 2.5 I experimented with adding an anime side to the dataset in the hopes of expanding the concepts that the model understands. The theory is that the model could take concepts from the anime domain that don't exist in the photoreal domain, and use them to generate new and interesting photoreal images. The experiment was a mild success. But at the time I spent no effort on the anime side; I just slapped it in and did the most basic of captioning and tag string generation.
I believe this lack of effort on the anime side resulted in two issues for bigASP 2.5. First, concepts that exist on the anime side but are underrepresented (less than 5k examples) did not get learned. Second, the model behaves a bit too creatively.
Personally I've always liked the creativity of bigASP, but one specific issue with 2.5 is that it will bring in all sorts of different concepts (that were unspecified in the prompt). Some of that is okay. During generation I usually start with a simple prompt and let the model inspire me before I slowly narrow the prompt down based on ideas it's given me. But if the model's creativity here is too extreme it can make the model too difficult to use.
I believe this is a result of using a purely random tag string generation algorithm, i.e. every tag has an equal probability of being dropped. Dropping tags is important to help the model learn to be robust and "fill in the blanks" in short user prompts. But if every tag drops equally then the model weights all of those tags equally in its creativity. As an example, take a rare concept like "doing a headstand". Generally speaking, if you're generating a character portrait you don't really want an image of someone standing on their head. Yet with uniform dropping the model thinks that's a possibility.
So we really want some concepts to be things that only show up when explicitly asked for. It's hard to know exactly which concepts we'd want this to apply to, so my shortcut was to use the frequency of a concept (as represented by a tag) as a proxy for how often people would want it to randomly show up in their generations. That is, the rate at which a tag is dropped from a prompt scales with its frequency in the dataset. If a tag is common, it'll get dropped more often, meaning the model will be more likely to put it into a generation when not told otherwise. Which makes sense for things that appear commonly in images. Rare tags, on the other hand, almost never get dropped, so the model will only generate them when asked.
As a side benefit this also helps the model learn rare concepts, since the tag is almost always present. I.e. the model gets exposed to that concept 5k times instead of 2.5k.
In practice I establish two thresholds. Below the lower threshold the tag is essentially never dropped. Between the two thresholds the drop rate scales with tag frequency. Above the upper threshold dropping happens at the usual uniform rate. I also set a minimum drop rate of 10%, so even rare tags still get dropped occasionally and the model doesn't get too overtrained on them. These thresholds let me specify what counts as a rare tag, what counts as a common tag, and what's in between.
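To make that concrete, here's a minimal sketch of the kind of frequency-aware drop schedule I'm describing. The 10% floor is real, the lower threshold uses the 5k "underrepresented" cutoff from above, and the ~50% uniform rate is implied by the 5k-vs-2.5k remark; the upper threshold and the linear ramp are illustrative choices, not necessarily my exact implementation:

```python
import random

def tag_drop_rate(freq: int, low: int = 5_000, high: int = 50_000,
                  min_rate: float = 0.10, uniform_rate: float = 0.50) -> float:
    """Map a tag's dataset frequency to a drop probability.

    Tags rarer than `low` sit at the 10% floor, so they're almost always in the
    caption and the model only generates them when asked. Tags more common than
    `high` get the usual uniform rate. In between, the rate ramps up with frequency.
    """
    if freq <= low:
        return min_rate
    if freq >= high:
        return uniform_rate
    t = (freq - low) / (high - low)   # 0..1 position between the two thresholds
    return min_rate + t * (uniform_rate - min_rate)

def build_tag_string(tags: list[str], tag_counts: dict[str, int]) -> str:
    """Randomly drop tags from an image's tag list, frequency-aware."""
    kept = [t for t in tags if random.random() >= tag_drop_rate(tag_counts[t])]
    return ", ".join(kept)
```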
New Quality Model
Throughout the bigASP series of models I've always built the quality model, the model that assesses the quality of images in the dataset, last. Which means I never put much effort into it. She was long overdue for some TLC.
Thus was born JoyQuality. This is an Image Quality Assessment model trained on a carefully balanced set of 100k preference pairs, at 512x512 (compared to the 224x224 of my previous quality models). On top of JoyQuality I then tuned a bigASP-specific version using my personal dataset of 6k human preference pairs.
The end result? A much better and more nuanced quality model for bigASP 2.6 and 3.0. This should hopefully improve the quality of generations, as well as help push the model toward generating more detail. (The 512x512 quality model is great at picking up softness even in 1024x1024 images.)
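For context on how a model like this gets trained: with preference pairs the loss usually boils down to a Bradley-Terry style pairwise objective, i.e. push the score of the preferred image above the rejected one. A minimal sketch (not the actual JoyQuality training code):

```python
import torch
import torch.nn.functional as F

def preference_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: maximize P(preferred beats rejected)."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def train_step(model, optimizer, preferred, rejected):
    """One step over a batch of 512x512 preference pairs.
    `model` (hypothetical) maps an image batch to one scalar quality score per image."""
    loss = preference_loss(model(preferred), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```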
Camera Signals
Information about a photo, like what camera was used, the lens, the focal length, etc., can all make great knobs during image generation. To that end I put effort into collecting lots of ground-truth data here and folding it into the image captions where possible.
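A lot of that ground truth can be scraped straight out of EXIF and turned into caption text. A rough sketch with Pillow (my real pipeline handles far more fields and fallbacks; the caption phrasing here is just an example):

```python
from PIL import Image, ExifTags

def camera_caption_fragment(path: str) -> str:
    """Turn a few EXIF fields into caption text, e.g. 'shot on a Canon EOS 5D Mark IV, 85mm, f/1.8'."""
    exif = Image.open(path).getexif()
    # Make/Model live in the main IFD; focal length and aperture live in the Exif sub-IFD.
    tags = {ExifTags.TAGS.get(k, k): v for k, v in exif.items()}
    tags.update({ExifTags.TAGS.get(k, k): v for k, v in exif.get_ifd(ExifTags.IFD.Exif).items()})

    parts = []
    if tags.get("Model"):
        parts.append(f"shot on a {tags['Model']}")
    if tags.get("FocalLength"):
        parts.append(f"{float(tags['FocalLength']):.0f}mm")
    if tags.get("FNumber"):
        parts.append(f"f/{float(tags['FNumber']):g}")
    return ", ".join(parts)
```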
Colorspace Correction
My data pipeline now correctly handles the different colorspaces of images. This means that, in contrast to the vast majority of models, future versions of bigASP will be trained on color-accurate images. In the past, about 10% of the images in my dataset had special colorspaces or color information, which means they were being fed to the model with incorrect colors.
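The core of the fix is just normalizing everything to sRGB via the embedded ICC profile before the pixels ever reach the VAE. A minimal sketch with Pillow (my pipeline handles more edge cases, e.g. CMYK and 16-bit images):

```python
from io import BytesIO
from PIL import Image, ImageCms

def to_srgb(img: Image.Image) -> Image.Image:
    """Convert an image with an embedded ICC profile to sRGB.
    Images without a profile are assumed to already be sRGB."""
    icc = img.info.get("icc_profile")
    if not icc:
        return img.convert("RGB")
    src_profile = ImageCms.ImageCmsProfile(BytesIO(icc))
    dst_profile = ImageCms.createProfile("sRGB")
    return ImageCms.profileToProfile(img, src_profile, dst_profile, outputMode="RGB")
```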
bigASP 2.6
To test a few more ideas before 3.0 I've trained a small revision on top of bigASP 2.5: basically just all of the above dataset tweaks, along with some higher resolution training (up to 1.5MP). Training is complete, and the model is now going through my first attempt at RL post-training a diffusion model. This should hopefully stabilize the model further. Once that's done I'll release the model publicly, as always. (As well as the model before post-training, for posterity.)
bigASP 3.0
With all the peace and love in the world, I'm tired of SDXL. I'm tired of the busted hands and faces, the bad text, the terrible textures, you know it all. The open source community still doesn't have a great replacement after all this time! Illustrious is really cool, but it's still SDXL-based (with all the problems that come with that), and from my experience it has a terrible case of not being very creative. That's not a slight to the model or the creator's efforts by any means; a strong but creative model is very difficult to make. Flux.1 is neat but heavy, censored, and limited. Chroma is doing good work but is also somewhat heavy, and from what I've seen is a bit temperamental. And it sounds like Pony V7 needs a rev? (I've been a bit out of the loop there lately.)
I really, really want bigASP 3.0 to be like my Pony V6. I had tons of fun playing with Pony V6 during its heyday. bigASP 2.5 is close for me, as close as I've ever gotten to that golden combo of robust+creative+broad, but not quite there. I want all of that, on a modern model, with strong photoreal capabilities. That's the goal.
Unfortunately, again after all this time, there still isn't an "easy choice" modern base model to train on top of. Who knew we had it so good back during SD1.5 and SDXL?
Here are the current options: Flux.1 (apache variation), Chroma, QwenImage, Wan2.2 14B, Wan2.2 5B, dozens of stray underdogs, and a custom model. I'll dive into my analysis of each, but first here is how I'm approaching things:
I want to make a model that's fun and easy to use. That means there needs to be wide UI support for it, and inference time needs to be fast. I believe the usefulness of a model depends on a combination of inference speed and generation quality. A model with low generation quality can make up for that by being fast. As an example, SDXL. You just generate 8 images at a time and spin a few times until it "hits". Which is cheap to do. But if it took forever to generate each image, no one is going to go through that effort for a rare "hit". All of this to say, if I pick a model with slow inference speeds then I have to ensure the model has superb generation quality. That's a high bar with higher risks, especially for a hobbyist like me. So I much prefer lighter, faster models that enable end users to spin lots of times.
Photoreal is important, which means the model needs to have a good VAE. Anything is better than SDXL's VAE, but the quality of each model's VAE varies widely.
The model needs to be a good learner. That's hard to quantify, but generally speaking it means the model takes well to being finetuned. For whatever reason, some models can come out of the oven as poor learners, even if they themselves are decent or even great models. It's hard to say how much this matters at larger scales (as Chroma proves), but it's worth at least considering.
Finally the text encoder is important to me. It is my belief that T5 is a huge mistake. It's a piss poor text embedding model for a model that's doing image generation, especially compared to CLIP. And I think it's leading to a variety of prompting difficulties in all modern models. It's also flat out insane to use a 5B parameter model for just the text encoder. Maybe if it was doing some planning work like CLIP does, but it doesn't. It's only embedding the text. A complete waste of space and compute.
Now for the model analysis:
Underdogs - I don't really have the time/resources to go through every diffusion model that exists. I swear a new one drops every week. It's possible there's a gem in there, and it might be worth me running a large scale evaluation, but if I can find something quicker then I can get to work quicker. Also, UI support is better for the "top models" than it is for the model du jour.
Chroma - Pros: cool model. Lots of potential. Smaller than Flux. Diverse, uncensored training. Cons: T5 only. Slow inference (though one of the better ones amongst modern models). Might not be a good learner. I suspect that, with the extreme training schedules it's gone through and the heavy surgery, the model might be "brittle" with respect to more surgery. Though that's entirely a hunch on my part.
Flux.1 - Pros: Flux is a great model. The best VAE out there. Its VAE's reconstructions are nearly indistinguishable from real images on all of my torture tests. Has CLIP inputs. Cons: Poor learner. Needs to be cracked. Slow inference. If T5 is dropped it would be limited to 75 CLIP tokens of prompt, since AFAIK no UIs do prompt extension tricks on the CLIP models for Flux.
QwenImage - Pros: Great model. Large models learn fast. Feels fairly raw, so probably a good learner. Has Long CLIP inputs. Cons: Absolutely huge, which makes it nearly useless for most people. I'd have to drop all but the CLIP text encoders, and I don't know if all UIs support that.
Wan2.2/2.1 14B - Pros: Great model. Many people already use it as-is for T2I. Good learner. Should have lots of nuanced knowledge gained from its video training. Cons: T5 only. Large model (larger than Flux). Poor VAE. (Wan2.2 A14B uses the Wan2.1 VAE, not the newer one).
Wan2.2 5B - Pros: Perfect size. Should have lots of nuanced knowledge gained from its video training. Training was large and diverse, AFAICT. Fast inference (perhaps faster than SDXL!). 2nd place VAE. Cons: T5 only. It's not a great model off the bat.
Custom model - Pros: I can pick the perfect architecture. Cons: I'd have to work to get support in different UIs, which is impossible in dead UIs like Forge and Auto. I'd have to train it from scratch. I suspect that's less of an issue than one might think. There's LightningDiT and similar techniques to make it cheaper, as well as distilling a big model like Qwen into it. But still, it would be more compute than a simple finetune.
Out of all these options my current plan is to pursue two: Flux and Wan2.2 5B.
Flux
The biggest reasons for picking Flux are that it's a well established model and it has the best VAE. All UIs support it. It's also known to be trainable, as proven by Chroma. Though I'm hesitant since Chroma's training is larger scale than what I do.
Basically this is my safer pick. Given enough resources, it will definitely work. It just has a few sharp edges like being a bit heavier than I would like, and potentially requiring more effort and compute.
Wan2.2 5B
This is my bet. It's the perfect size. Bigger than SDXL for a nice boost in model strength, while still being smaller than everything else. Based on parameter count alone it's a nice middle ground, which makes it attractive as a base for "The Working Man's Model." As if that wasn't enough, it might be even faster than SDXL! I'm not entirely sure yet (busy on the training side), but it runs at 32x32 versus SDXL's 128x128. It's hard to directly compare, because SDXL only uses attention a quarter of the time. But at least on the attention side of compute, SDXL has 16,609 tokens to attend to, versus 1,536 for Wan!
(The reason Wan's operating resolution is so low is that its VAE, despite coming in 2nd place overall, compresses 4x harder than Flux's in pixel count, taking a 1MP image down to a 64x64 latent. The model itself then does another 4x shrink (2x per side) to get to 32x32 for the bulk of compute.)
So Wan2.2 5B should be cheap to run, and cheap to train. All good, right? Well, that's a double-edged sword. Compute doesn't scale exactly with generation quality, but the two are strongly correlated in practice. Flux and friends aren't wasting all of the extra compute they spend on their larger models and larger contexts. So I'd have concerns about the peak quality Wan2.2 5B could reach. But it's worth finding out! And it's possible to work around the issue by scaling up the image inputs: at 4MP the model would run at a context length of 4,608, trading more compute for more quality. (And, of course, more training compute can always be used to drive small models to be equal to larger models.)
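To put numbers on that compression chain (assuming the 16x VAE downsample and 2x patchify described above; the 4MP example resolution below is just one aspect ratio that lands on 4,608 tokens):

```python
def wan_tokens(width: int, height: int, vae_down: int = 16, patch: int = 2) -> int:
    """Sequence length the DiT sees for an image: 16x spatial downsample from the
    Wan2.2 5B VAE, then a 2x2 patchify inside the model itself."""
    return (width // (vae_down * patch)) * (height // (vae_down * patch))

print(wan_tokens(1024, 1024))   # 1024  -> the 32x32 grid at ~1MP
print(wan_tokens(2048, 2304))   # 4608  -> roughly the 4MP case (~4.7MP here)
```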
After much effort I do have Wan2.2 5B training on a small scale run of 1M training samples. There have been a few bumps though. Even with the latents and text embeddings precomputed, thus removing the VAE and the 5B parameter T5 from the training run, the model will not fit on my 96GB GPU. The model itself is 20GB, plus 60GB for optimizer state, plus activations and such. I just couldn't get it to fit. I could use a different optimizer, but AdamW remains the gold standard, so switching introduces engineering time and risk. I could do stochastic bf16 training, but I had bad luck with that previously.
Which means I'm currently stuck with slapping a DoRA/LoRA/etc. on the model and training it that way. With two H100 NVLs and a DoRA on all linear layers (rank 128), it's running at 7.5 images/s. (I haven't fully maxed out the batch size, nor turned on compilation yet.)
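For the curious, the adapter setup boils down to something like this peft-based sketch (not my exact training code; `transformer` here stands in for the already-loaded Wan2.2 5B DiT, and alpha/dropout are guesses):

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Hypothetical: `transformer` is the Wan2.2 5B DiT, already loaded in bf16.
# Collect the leaf name of every nn.Linear so the adapter covers all linear layers.
linear_names = sorted({
    name.split(".")[-1]
    for name, module in transformer.named_modules()
    if isinstance(module, nn.Linear)
})

config = LoraConfig(
    r=128,                   # rank 128, as mentioned above
    lora_alpha=128,          # alpha is a guess; tune alongside the learning rate
    use_dora=True,           # DoRA: a learned magnitude on top of the LoRA direction
    target_modules=linear_names,
    lora_dropout=0.0,
)
transformer = get_peft_model(transformer, config)
transformer.print_trainable_parameters()
```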
So it's taking me around 38 hours to do a small run. Also, by switching to a LoRA the hyperparameters are different compared to a true finetune, and I have no clue what they would be for this scale of training... So it's 38 hours for each tweak...
I could alternatively rent a machine with more VRAM, like the H200s. But that's expensive for these initial "feeling things out" runs. I'd guesstimate it at $90 per small run if I did that.
Regardless, I'll do my best to dial things in. I haven't finished a training run yet, so I don't know how well the model is taking to things. One somewhat promising sign is that the loss starts out at 0.29, compared to bigASP 2.5's 0.5. That either means the model is significantly stronger right from the get-go, which would be great, or I fucked something up. (Or maybe its VAE is easier to predict? Though it's clearly getting more information through the latents than SDXL's, so I doubt that's the case.)
Current risks:
Will it train well?
Will it be able to use its excess capacity to improve things quickly now that it's being trained for T2I only?
Will T5 be a problem for prompt generalization, like it is for all other modern models?
How long will it take to learn the new resolutions? It was only trained at two resolutions by Wan, if I understand the paper correctly. Usually not an issue, but it can sometimes take models a while to get fully comfortable with new resolutions.
How much training overall will it need to adapt to my dataset? My previous scale of training is doable on Wan2.2 5B, but if it needs more that's going to be painful.
(Side note: As Chroma notes, none of these models pass an attention mask to the core transformer for the text embeddings, which means they attend to the padding tokens in the prompt. That can cause subtle issues where the model gets sensitive to the number of padding tokens. It's also more expensive for no reason. But since I'm aiming for UI compatibility as best I can, I might be stuck with this for now. Or maybe I can do something during training to make the model happy in either scenario.)
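To illustrate what "attending to padding" looks like in practice, here's a toy example with a small T5 standing in for the real text encoder (not Wan's actual pipeline): the prompt gets padded out to a fixed length, the tokenizer gives you a mask, and if that mask never reaches the DiT then the pad positions get attended to anyway. Zeroing the pad embeddings, as some pipelines do, makes their content inert but doesn't remove them from attention:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small")
encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-small")

tokens = tokenizer("a photo of a dog", padding="max_length", max_length=512, return_tensors="pt")
with torch.no_grad():
    emb = encoder(tokens.input_ids, attention_mask=tokens.attention_mask).last_hidden_state

print(emb.shape)                                        # (1, 512, d): mostly padding positions
emb_zeroed = emb * tokens.attention_mask.unsqueeze(-1)  # inert content, but still 512 attended positions
```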
Anyway, wish me luck?