Towards Pony Diffusion V7, going with the flow.

Hello everyone! Finally, it's time for some updates—I know it's been a while, and huge apologies for the wait, but technical work took priority.

There's a lot to cover, so check the TL;DR for each section if you're in a hurry.

Base Model

TL;DR: I'll be training on AuraFlow first, with FLUX as a secondary backup.

After evaluating several model options over the past few months, I've picked AuraFlow as the primary Pony Diffusion V7 base. It's a robust model architecture with excellent prompt understanding, and it's licensed under Apache 2, which aligns with our goals for monetization. I'm very impressed that a small group of talented engineers has pulled this off; Simo is doing amazing work leading it, and FAL's support is inspiring and deserves respect, so on a personal level I admire this effort and want the model to succeed. While AF's tooling and aesthetics could be improved and it currently lacks a 16-channel VAE, I'm confident these issues are not deal-breakers, and some can be mitigated with time.

FLUX is the latest hot topic, and it's great to see the original diffusion team back in action. My hesitation with FLUX lies in its licensing and training complexity. Only the FLUX.1-schnell version is Apache 2, meaning we'd need to train Pony Diffusion on a distilled model, which, while possible, is uncharted territory for fine-tunes at Pony's scale. It is great to see the tooling mature at a rapid pace, yet I remain cautious. FLUX is a great backup option if AF fails for some reason, and I will be running some early experiments on it.

Other considerations include SD3, which has slightly improved its license since the last time it was discussed, but the model itself remains underwhelming, especially compared to the competition. I don't see much hope for SAI's direction, though I'd be happy to be proven wrong.

I'm also seeking community feedback on the SDXL version. It was my primary candidate until AF and FLUX emerged. While I've heard some users still want an SDXL option, I'd prefer not to introduce a third version. If AF or FLUX can meet the demand for performance on medium-range hardware with solid tooling, an SDXL variant might be unnecessary—but I'm open to your input.

One more thing: the Open Model Initiative (OMI) is something to watch. It's a promising collaboration aiming to build fully open-source models, and despite all the permissively licensed models available now, a truly open end-to-end model remains an unsolved problem. I am happy to share my expertise with the group, and while nothing will come of it soon, I am excited about the possibilities.

Captioning

TL;DR: Pony now uses GPT-4o-level captions with state-of-the-art character recognition and NSFW support, though captioning such a large dataset takes time.

High-quality captions are crucial for model performance, as we've seen with PD V6 and many other newer models. Poor captions will undermine even the best models like AF or FLUX, so my goal is to generate dense, detailed captions that cover the entire content range, which is no small feat given that most current vision-language models (VLMs) are either censored or lack the domain-specific knowledge we need.

To improve captions, I've started by enhancing the tag-based prompting already used in V6 to better recognize and focus on special cases, like character names. We've also created and curated a set of over a thousand detailed and opinionated captions to guide the VLM output, avoiding common pitfalls like filler phrases ("The image depicts...").
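To make this concrete, here is a minimal sketch of what tag-conditioned prompting and filler-phrase cleanup could look like. The function names, prompt wording, and filler list are illustrative assumptions, not the actual V7 pipeline.

```python
# Illustrative sketch only: tag-conditioned prompt construction plus cleanup of
# filler openings. Names, wording, and the filler list are hypothetical.

FILLER_OPENINGS = (
    "The image depicts",
    "The image shows",
    "This image features",
)

def build_caption_prompt(tags: list[str], characters: list[str]) -> str:
    """Compose a VLM prompt that surfaces known tags and character names."""
    tag_hint = ", ".join(tags)
    char_hint = ", ".join(characters) if characters else "none identified"
    return (
        "Describe this image in dense, concrete detail. "
        f"Known tags: {tag_hint}. Known characters: {char_hint}. "
        "Do not start with filler phrases such as 'The image depicts'."
    )

def clean_caption(caption: str) -> str:
    """Strip filler openings the VLM may still produce."""
    text = caption.strip()
    for opening in FILLER_OPENINGS:
        if text.lower().startswith(opening.lower()):
            text = text[len(opening):].lstrip(" :,.-")
            break
    return text[:1].upper() + text[1:] if text else text

print(build_caption_prompt(["pony", "outdoors", "smiling"], ["Twilight Sparkle"]))
print(clean_caption("The image depicts a purple pony standing in a meadow."))
```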

The VLM evaluation process was quite time-consuming. The first main candidate was COG, which I generally had a positive experience with. It responded well to tag-based prompts, was only lightly censored, and was receptive to fine-tuning. However, the caption quality was just a bit lower than what I wanted, and securing the right license proved problematic as all my attempts to get in touch went unanswered.

Next, I explored Dolphin 72B, another excellent model with no censorship and even better prompt adherence and general knowledge. Its primary drawback was slightly inferior OCR capabilities compared to COG, and it tended to create "cute" hallucinations—adding sensible but absent details to images. While fine-tuning Dolphin was challenging, it wasn't impossible, and we successfully generated the first batch of captions using this model, though its large size slowed down the process.

Fortunately, I was introduced to a smaller alternative, InternVL2, specifically the 40B variant (there is also a 76B version of InternVL2, but it showed no noticeable improvement in my tests). This model proved to be even better, reaching GPT-4 levels of captioning with superior prompt understanding, better OCR, more domain knowledge, and no censorship. As a result of this evaluation, InternVL2 is currently the primary captioning model.
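For reference, here is a hedged sketch of captioning a single image with InternVL2 through Hugging Face transformers, following the pattern shown in the model's card. The preprocessing is a simplified single-tile stand-in for the card's dynamic tiling helper, and the prompt text and generation settings are assumptions.

```python
# Hedged sketch of captioning one image with InternVL2 via transformers.
# The preprocessing is a simplified single-tile version of the model card's
# dynamic-tiling helper; treat the prompt and generation settings as assumptions.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL2-40B"

model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

# One 448x448 tile with ImageNet normalization (simplified preprocessing).
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("sample.png").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nDescribe this image in dense, concrete detail."
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=512, do_sample=False))
print(response)
```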

Florence-2 also deserves a mention. From my experiments, it's an amazing and extremely compact model. However, due to its different architecture, it can't handle complex tag-based prompts the way other VLMs can. I might use outputs from the larger models to train Florence-2 for faster captioning, and I'm very excited about having such a small model available. Given that the larger models are well beyond even high-end consumer GPU capabilities, having a smaller version to help with captioning for LoRAs is critical.

The biggest challenge remains running the captioning on the entire training dataset. If you're a company (or a suspiciously wealthy furry) interested in being featured in the upcoming V7 release and have access to servers featuring 80GB+ VRAM (or are willing to rent some), please contact me on Civit or at [email protected] to discuss partnership opportunities.

Aesthetic Classifier

TL;DR: The V6 classifier works well for V7 but has been updated to reflect new data types.

I recommend checking out "What is score_9 and how to use it in Pony Diffusion" for context on what an aesthetic classifier is and why it's important for Pony Diffusion. When training V5/V6, I used a CLIP-based classifier, eventually settling on the ViT-L/14 version of CLIP, which is the largest and last model released by OpenAI. Although I was generally happy with its performance, I had concerns about potentially using the wrong tool for the task, or not using the best CLIP model available, since many versions have been released after the OAI models.
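As an illustration of the general recipe (not the actual V6/V7 classifier, whose head architecture and training data are not detailed here), a CLIP-based aesthetic scorer can be as simple as the OpenAI ViT-L/14 image embedding feeding a small, separately trained regression head:

```python
# Sketch of the generic "CLIP embedding + small scoring head" recipe. The head
# here is untrained and purely illustrative; in practice it would be fit on
# human ratings (e.g. buckets like score_1..score_9) before use.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CLIP_ID = "openai/clip-vit-large-patch14"
clip = CLIPModel.from_pretrained(CLIP_ID).eval()
processor = CLIPProcessor.from_pretrained(CLIP_ID)

# Small regression head mapping the 768-d ViT-L/14 image embedding to a score.
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

@torch.no_grad()
def aesthetic_score(path: str) -> float:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    emb = clip.get_image_features(**inputs)        # (1, 768)
    emb = emb / emb.norm(dim=-1, keepdim=True)     # unit-normalize the embedding
    return head(emb).item()

# Example: print(aesthetic_score("sample.png"))
```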

Before selecting images for V7, I conducted extensive testing with different CLIP models and Vision Transformers (ViTs). I discovered that ViT models, while demonstrating strong performance, lacked alignment with aesthetic understanding: they were not exposed to aesthetic samples on the scale of CLIP models and were more data-hungry. For instance, as soon as I added a few similar rankings to very different images that happened to share a pose, they would rank that pose disproportionately high regardless of other factors. Despite trying to adjust this manually by reviewing a large sample of differences between the old and new models and adding more human data, it turned into a game of Whac-A-Mole.

In contrast, multiple CLIP models, from the smallest up to the largest like EVA-02, showed better alignment with aesthetic understanding right from the start. However, their overall performance wasn't as precise as the ViTs or the old model. Out of desperation, I ported the old OpenAI ViT-L/14 CLIP model to the new pipeline and immediately saw the best results. My theory is that although it performs worse on benchmarks, OAI trained the model on a much more diverse dataset, which makes it hold up better on real-life tasks. Although it felt somewhat bittersweet to "waste" so much time, I am pleased to confirm that the approach I used for V6 was justified and still useful.

As a final step, I've added 10,000 more human ratings to better cover photorealistic images, and I have also started a separate Elo-based human feedback collection pipeline to obtain more precise ratings (by picking the better image from two similarly ranked ones), but it will take some time for this to make a significant impact.
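For context, the Elo-style pairwise update at the heart of such a feedback pipeline is tiny; the K-factor and starting rating below are illustrative choices, not the project's actual parameters.

```python
# Minimal Elo-style pairwise update: the image a human prefers gains rating,
# the other loses it. K and the starting rating are illustrative only.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (winner, loser) ratings after one comparison."""
    gain = k * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + gain, r_loser - gain

# Two images start at 1500; the preferred one gains, the other loses.
print(elo_update(1500.0, 1500.0))  # (1516.0, 1484.0)
```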

I will release the classifier after the release of V7 so you can add aesthetic data to your prompts when training LoRAs or merges.

Super Artists

TL;DR: V7 will offer generalized styles without direct artist style copying.

Pony has always carved its unique path, which I hope contributed to its success. A defining characteristic of the model is its avoidance of specific artist styles; however, the weak style control Pony offers has been clearly insufficient, as evidenced by the popularity of LoRAs that implement both general and specific artist styles. Enhancing style control has always been a core priority for V7.

As a first step, I developed a new model capable of distinguishing artist styles, employing techniques somewhat similar to those used in the aesthetic classifier. I evaluated multiple architectures based on ViTs and CLIP, different fine-tuning strategies, and the use of different types of embeddings. Unlike with the aesthetic classifier, for this task I had access to vastly more data, which proved crucial in unlocking the ViTs' performance.

An intriguing discovery was the diversity within some artists' work. I had always expected artists to have more than one distinct style, e.g., "sketch" versus "full color," but most artists with more than a few dozen works demonstrated more than two core style clusters plus a long tail of "experimental" ones.

Now, equipped with a network capable of producing artistic embeddings, I can cluster and tag images in the training dataset with more generic yet diverse styles, like 'anime_42'. There's still some work needed to ensure these clusters don't closely mimic existing artists, but overall, the results are promising, and I believe this area is largely de-risked. We will have to wait for the model to be trained to fully evaluate the impact, but I am pretty optimistic at this point.
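A minimal sketch of this clustering step, assuming per-image embeddings coming from the style network: the cluster count and the generic tag naming are placeholders (the real tags group by medium, e.g. 'anime_42').

```python
# Illustrative sketch: cluster per-image style embeddings with k-means and emit
# a generic tag per image. The embeddings are random stand-ins, and the tag
# naming is simplified (the real tags group by medium, e.g. 'anime_42').
import numpy as np
from sklearn.cluster import KMeans

def assign_style_tags(embeddings: np.ndarray, n_clusters: int = 256) -> list[str]:
    """Cluster style embeddings and return one generic style tag per image."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)
    return [f"style_{label}" for label in labels]

fake_embeddings = np.random.rand(1000, 768).astype(np.float32)  # stand-in data
print(assign_style_tags(fake_embeddings, n_clusters=32)[:5])
```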

I am also working on a backup plan in case this does not perform well: in addition to the content captions, I am running a style-specific captioning pass that focuses solely on describing the style and artistic properties of the images.

While I'm not fully committed yet, I am considering releasing tools that would allow users to discover similar styles from a specific input image, simplifying style discovery.
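If such a tool ships, its retrieval core would likely be a simple nearest-neighbor lookup over the style embedding space. The sketch below assumes precomputed cluster centroids and shows only the cosine-similarity ranking.

```python
# Hypothetical retrieval core for style discovery: rank precomputed style
# cluster centroids by cosine similarity to a query image's style embedding.
import numpy as np

def top_similar_styles(query_emb: np.ndarray, centroids: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k style clusters closest to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ q                        # cosine similarity against every centroid
    return np.argsort(-sims)[:k].tolist()

# Stand-in data: 32 centroids in a 768-d style space.
centroids = np.random.rand(32, 768).astype(np.float32)
query = np.random.rand(768).astype(np.float32)
print(top_similar_styles(query, centroids))
```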

Dataset

TL;DR: Better data selection means Pony can now handle realism too.

I've nearly finished selecting 10 million high-quality images from a 30M+ dataset, with 8M already finalized. The dataset now has a stronger anime base, refreshed pony/furry/cartoon content, and, for the first time, a substantial set of photographic additions. Overall, the dataset has been balanced to be slightly less NSFW. I've also added experimental features like scene color palette tags for better color control, and the artist blocklist has been updated to catch and remove more cases where character names were being detected as artists.
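As an example of how scene palette tags could be derived (the actual method isn't described here), dominant colors can be pulled from a downscaled image with PIL's adaptive quantization; the tag format is a hypothetical simplification.

```python
# Hypothetical sketch: derive rough scene palette tags by quantizing a
# downscaled image to a handful of dominant colors with PIL.
from PIL import Image

def palette_tags(path: str, n_colors: int = 5) -> list[str]:
    """Return hex-style tags for the dominant colors of an image."""
    image = Image.open(path).convert("RGB").resize((128, 128))
    quantized = image.quantize(colors=n_colors)          # adaptive palette
    palette = quantized.getpalette()[: n_colors * 3]     # first n_colors RGB triples
    colors = [tuple(palette[i:i + 3]) for i in range(0, len(palette), 3)]
    return [f"palette_#{r:02x}{g:02x}{b:02x}" for r, g, b in colors]

# Example: print(palette_tags("sample.png"))
```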

I will provide a more detailed breakdown when the selection is completed, but as of right now, the dataset consists of the following general components: 10% pony, 10% furry, 20% Western cartoons, 25% anime, 25% realism, and the remaining 10% miscellaneous data. You may be surprised that the share of pony content is lower than in V6 (especially given that we are Pony Diffusion), but these are relative numbers, and we actually have much more content of every kind. It's just that in some areas we are "done," i.e., there are not many high-quality images left to add.

Some additional work remains to confirm that all data is compliant with our safety framework, but at this point it is largely complete. We'll release safety classifiers and a character codex post-V7 as part of our safety commitment.

Next Steps and Going Forward

TL;DR: Training is imminent.

Small-scale fine-tuning will begin in a few days to ensure the training pipeline is ready. While aesthetic classifier tweaks, captioning, and VAE caching are still in progress, I am close to starting full-scale training. I appreciate your patience and hope we can capture lightning in a bottle once more.
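For readers curious about the VAE-caching step: the idea is to encode every image to latents once and store them, so the training loop never has to run the VAE forward pass. A hedged sketch using diffusers follows; the VAE checkpoint, resolution, and file layout are placeholder assumptions.

```python
# Hedged sketch of VAE latent caching with diffusers: encode each image once,
# save the latents, and let the training loop load them instead of running the
# VAE. The checkpoint, resolution, and file layout are placeholders.
import torch
from diffusers import AutoencoderKL
from PIL import Image
from torchvision import transforms

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval().cuda()

to_tensor = transforms.Compose([
    transforms.Resize(1024),
    transforms.CenterCrop(1024),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),   # scale pixels to [-1, 1]
])

@torch.no_grad()
def cache_latent(image_path: str, out_path: str) -> None:
    """Encode one image to VAE latents and save them to disk."""
    image = Image.open(image_path).convert("RGB")
    pixels = to_tensor(image).unsqueeze(0).cuda()
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    torch.save(latents.cpu(), out_path)

# Example: cache_latent("sample.png", "sample.latent.pt")
```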

And a closing note: I am very excited about the state of infrastructure and the datasets I am working with. Going from V6 to V7 required a lot of rethinking and reworking, but I am finally happy with the process and expect further versions to have a much shorter prep time. I was also able to collect a massive amount of video training data, so I am excited about T2V opportunities in the future.

If you enjoy using Pony Diffusion and want to support it, please consider joining our Discord (you can even subscribe to help the project) or keep using the Civit generator, as it now shares Buzz with creators (and you can even increase the creator's cut by raising the tip).
