TL;DR: Sorry for the wait! V7 is now on Civitai for inference, with checkpoints coming in days. V7 is exciting but challenging to tame—V7.1 will improve this soon. V8 editing model in progress with WIPs along the way. Please try FictionalAI—it makes Pony possible.
Hey all, I know it's been too long and we're way behind schedule on the V7 release, so please accept my deepest apologies! We're releasing V7 on Civitai for onsite generation and will be publishing checkpoints, GGUFs, Comfy workflows, and LoRA training guides in the coming days and weeks. In the meantime, I wanted to share how we finally got to V7, what we’ve learned, and what we're doing as a company. But first…
Apologies For The Wait!
After V6, we didn't have an obvious base model to build upon, forcing us to rethink how to approach the next-generation V7. We didn't feel confident enough at the time to train a model from scratch, so a good base was necessary. We evaluated a number of models and narrowed the choice down to AuraFlow and Flux, both of which brought tradeoffs to the table. For a full breakdown, please see this linked article. We also ran into a significant number of technical challenges during V7's creation, which I'll talk about more below.
The text-to-image ecosystem has also been shifting - from the controversies around Stability AI to a crackdown on legitimate actors trying to build businesses in the image generation space. While the actual cause and effect is still unclear, we observed a very significant vibe shift: a few large companies improved or developed their commercial closed offerings, while investors largely lost interest in new model-centric teams. Even proven players like BFL have struggled to keep momentum (where is that promised video model?). On the other hand, China caught up and started shipping exceptionally strong models - both closed source and, surprisingly, open source with Apache 2 licensing - and not just for images, but for videos too. This has impacted us as well: as scrappy as we've been, building models is expensive, especially when you account for data collection and experiments during training.
In all, we've been working hard to obtain the resources we need to build the things we love, which ultimately means helping people create the characters in their imagination. Speaking of which… we have another announcement to make!
Introducing Fictional

If you're an early Pony Diffusion adopter, you may know that the reason for its existence was generating character portraits for our early character platform, and my frustration with StyleGAN. That's right, the origins of Pony go all the way back to the first release of the GPT-2 models that gave birth to the first version of our interactive multimodal character platform. Ever since then, it's been our dream to enable people to create, see, and interact with whatever character they can possibly imagine. And now we've built a platform to help users achieve that!
Fictional is our new multimodal platform where AI characters come alive through text, images, voice, and (soon) video. Powered by Pony V7, V6, Chroma, Seedream 4, and other advanced models, Fictional lets you discover, create, and interact with characters who live their own lives and share their own stories.
Fictional is also what enables the development of Pony models like V7, so if you're excited about the future of multimodal AI characters, please download Fictional on iOS or Android and help shape our future!
iOS: https://apps.apple.com/us/app/fictional/id6739802573
Android: https://play.google.com/store/apps/details?id=ai.fictional.app
Now, let's talk about a few technical challenges we had to deal with during V7's creation.
Lessons learned while training and technical details
Datasets
We expanded our training dataset to over 30M images, with ~10M selected for actual training. We significantly expanded the types of datasets we consume and improved our detection of content we want to exclude from training. One interesting update: for the first time, we didn't completely exclude AI-generated content. Previously, we were afraid it would affect the model's style too much without better style control, but our research into style clusters helped alleviate this issue. We'll continue increasing the share of synthetic content, including our own generation loops, to improve character recognition and especially style blending.
Our captioning approach also evolved. For the original V7 dataset, we performed extensive captioning with our own fine-tuned InternVL model (releasing it soon), which worked decently but was hard to scale. As a result, we ended up with only one caption per image and prioritized extremely descriptive ones, which we believe contributed to unstable prompt adherence in V7. We've since moved to Gemini-based captions, which are exceptionally high quality, exceed our previous tech in OCR, let us generate multiple captions of various lengths per image, and are surprisingly relaxed about the type of content they can caption.
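As a simple illustration of how multi-length captions get used, each training step can sample one of the stored captions so the model sees short, medium, and long descriptions of the same image; this is a toy sketch rather than the exact loader.

```python
# Toy sketch: sample one of several captions per image per training step.
# The file name and caption strings below are made-up examples.
import random

sample = {
    "image": "img_000123.webp",
    "captions": [
        "a pony standing in a field",                                      # short
        "score_8, a cartoon pony standing in a grassy field at sunset",    # medium
        "score_8, style_cluster_41, a cheerful cartoon pony with a blue "
        "mane standing in a grassy field at sunset, warm rim lighting",    # long
    ],
}

def pick_caption(item: dict) -> str:
    """Uniformly sample one of the stored captions for this training step."""
    return random.choice(item["captions"])

print(pick_caption(sample))
```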
Aesthetic scoring
Pony models use score tags to introduce quality bias during generation. Historically, we used a simple MLP consuming ViT-L/14 embeddings and outputting a score in the range of 0 to 1 (which is then converted to score_0 through score_9 and used to caption images). In a way, this is a more complicated and opinionated way of asking a CLIP model to provide a definition of "masterpiece." There are many benefits to using such techniques. The OpenAI-trained CLIP has an extensive breadth of knowledge of both visual and text embeddings. Plus, as models like SDXL also use ViT-L to process prompts, this increases alignment between the training data and the model's ability to understand text.
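To make this concrete, here's a minimal sketch of that kind of scorer; the layer sizes and the "aesthetic_mlp.pt" checkpoint name are placeholders rather than the exact production setup.

```python
# Minimal sketch of a CLIP-embedding aesthetic scorer.
# Assumes torch and open_clip; "aesthetic_mlp.pt" is a hypothetical checkpoint.
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# ViT-L/14 image encoder (OpenAI weights), used only to produce embeddings.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
model.eval()

# Small MLP head: 768-dim CLIP embedding -> single score in [0, 1].
scorer = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
scorer.load_state_dict(torch.load("aesthetic_mlp.pt"))  # placeholder weights
scorer.eval()

def score_tag(image_path: str) -> str:
    """Return a score_0 ... score_9 caption tag for one image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize like CLIP does
        score = scorer(emb.float()).item()          # value in [0, 1]
    return f"score_{min(int(score * 10), 9)}"       # bucket into 10 tags

print(score_tag("example.png"))  # e.g. "score_8"
```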
While the OpenAI CLIP performs very well for both generation and dataset processing, we had concerns with it. This CLIP model has "reward hacking" failure cases where its definition of "looking good" doesn't align with human expectations. If you've ever seen the weird contrast and plastic-like shading so prevalent in early AI generations, CLIP is one of the biggest contributors to that look.
Another concern was that this CLIP version has been significantly outperformed by many newer models (not from OpenAI), or that CLIP should perhaps be replaced altogether with a vision transformer-based model. To validate this, we ran tests with different model architectures, which led to two discoveries:
The original OpenAI models still demonstrate superior performance on our real-world datasets compared to other, even larger CLIP models that outperform them on standard evaluation metrics; the breadth of OpenAI's training dataset shines through.
Vision Transformer models can outperform CLIP and demonstrate a more nuanced understanding of quality, but only with far more data than we have today.
In summary: we're excited about adopting a combined CLIP-ViT architecture (unsurprisingly, models like Seedream came to similar conclusions) as soon as we can collect sufficient quality data.
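For the curious, a combined head could be as simple as concatenating embeddings from both backbones; the dimensions, layer sizes, and the choice of the second backbone (e.g. a DINOv2-style ViT) below are illustrative rather than a finalized design.

```python
# Speculative sketch of a combined CLIP + ViT quality head: concatenate
# embeddings from both backbones and score them with a shared MLP.
import torch
import torch.nn as nn

class CombinedQualityHead(nn.Module):
    def __init__(self, clip_dim: int = 768, vit_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + vit_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, clip_emb: torch.Tensor, vit_emb: torch.Tensor) -> torch.Tensor:
        # L2-normalize each embedding so neither backbone dominates by scale.
        clip_emb = clip_emb / clip_emb.norm(dim=-1, keepdim=True)
        vit_emb = vit_emb / vit_emb.norm(dim=-1, keepdim=True)
        return self.mlp(torch.cat([clip_emb, vit_emb], dim=-1))

# Toy usage with random tensors standing in for real CLIP / ViT features.
head = CombinedQualityHead()
score = head(torch.randn(4, 768), torch.randn(4, 1024))
print(score.shape)  # torch.Size([4, 1])
```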
Text Rendering
While text rendering has never been a goal for V7, and V7 significantly outperforms V6 here, it still provides a degraded experience compared to stock AuraFlow and frontier models. We believe the main cause is our training dataset, which focuses on images without text; extensive training on such a dataset made the model lose its ability to output quality text. This isn't a simple problem to solve. For example, frontier models like Seedream, which excel at text rendering, dedicate half their training dataset to images with text, consisting of both organic and synthetically generated textual data. Creating such datasets is a massive project and a significant time and money investment, which is overkill for a model like Pony. Nevertheless, we'd like to improve text rendering in V8 through two strategies:
Slightly increasing the amount of organic text data and introducing synthetic datasets with text (see the sketch after this list)
Starting from a base model with stronger text rendering capabilities (see the QWEN section at the end)
We don't expect SOTA text rendering performance, but being able to render simple dialogue boxes or UI elements would be very beneficial.
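As a toy illustration of the synthetic route, a pipeline along these lines can render a known string into a dialogue box so the caption can spell out the exact text; the actual V8 data work will be considerably more elaborate.

```python
# Toy synthetic text-rendering sample: draw a dialogue box with known text
# onto an image and pair it with a caption that transcribes that text.
import random
from PIL import Image, ImageDraw, ImageFont

WORDS = ["pony", "hello", "level up", "quest", "shop", "menu"]  # example strings

def make_dialogue_sample(base: Image.Image, out_path: str) -> str:
    """Draw a simple dialogue box with known text; return the matching caption."""
    text = random.choice(WORDS)
    img = base.copy().convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    # White dialogue box along the bottom quarter of the image.
    box = (int(w * 0.05), int(h * 0.75), int(w * 0.95), int(h * 0.95))
    draw.rectangle(box, fill="white", outline="black", width=3)
    font = ImageFont.load_default()
    draw.text((box[0] + 20, box[1] + 20), text, fill="black", font=font)
    img.save(out_path)
    # Caption spells out the rendered string for OCR-style supervision.
    return f'a dialogue box at the bottom of the image reading "{text}"'

caption = make_dialogue_sample(Image.new("RGB", (1024, 1024), "gray"), "sample.png")
print(caption)
```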
Style clustering
One of V7's major changes was adopting style cluster tags in addition to score tags. We developed a precise style classifier (releasing soon along with a captioning Colab) that works across many different types of content - from photos to 3D, from sketches to digital illustrations - and used it to process the full dataset. You'll see it being used in some of the highlight samples, and we continue to believe that style clusters (aka superartists) are the right way to develop style support in Pony models. Unfortunately, the effect of these tags in V7 is still limited (see Limitations), so we're working on improving this in V7.1.
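As a simplified illustration of the idea, style tags can come from clustering image style embeddings and turning the cluster index into a caption tag; the released classifier is more involved than this sketch.

```python
# Simplified "style cluster" sketch: k-means over image style embeddings,
# with each cluster index becoming a style_cluster_<k> caption tag.
import numpy as np
from sklearn.cluster import KMeans

def assign_style_tags(embeddings: np.ndarray, n_clusters: int = 64) -> list[str]:
    """embeddings: (num_images, dim) array of image-style embeddings."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(embeddings)
    return [f"style_cluster_{int(label)}" for label in labels]

# Toy usage with random vectors standing in for real style embeddings.
fake_embeddings = np.random.randn(10_000, 768).astype(np.float32)
tags = assign_style_tags(fake_embeddings)
print(tags[:5])  # e.g. ['style_cluster_12', 'style_cluster_3', ...]
```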
T5 vs CLIP
One of V7's most interesting discoveries was the effect of switching from CLIP to T5. This decision has good reasoning behind it: CLIP is quite limited in its ability to encode many important parts of textual information, which limits the model's prompt understanding. While there were concerns about whether T5 could sufficiently represent the full range of V7 requirements, that turned out not to be a problem by itself. The T5 used in AuraFlow is a Pile T5 variation, but even stock T5 covers a wide range of content.
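To make the mechanical difference tangible, here's a small sketch encoding the same prompt with both encoder families, using stock Hugging Face checkpoints as stand-ins (AuraFlow's actual encoder is a Pile T5 variant):

```python
# Sketch comparing how CLIP and T5 encode the same prompt (illustrative only).
import torch
from transformers import CLIPTokenizer, CLIPTextModel, T5Tokenizer, T5EncoderModel

prompt = "score_9, character X in anime style, detailed portrait, soft lighting"

# CLIP: hard 77-token context window, 768-dim per-token embeddings.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
clip_ids = clip_tok(prompt, padding="max_length", max_length=77,
                    truncation=True, return_tensors="pt")
with torch.no_grad():
    clip_out = clip_enc(**clip_ids).last_hidden_state
print(clip_out.shape)  # torch.Size([1, 77, 768])

# T5: no fixed 77-token cap and a much stronger language-modeling backbone.
# flan-t5-base is a lightweight stand-in for the Pile T5 used in AuraFlow.
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base")
t5_ids = t5_tok(prompt, return_tensors="pt")
with torch.no_grad():
    t5_out = t5_enc(**t5_ids).last_hidden_state
print(t5_out.shape)  # [1, seq_len, 768]; seq_len follows the prompt length
```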
We did discover a different issue for which we don't yet have a definitive answer, but I wanted to provide context. During V7 training, we noticed that, compared to all previous Pony models (which used various CLIP encoders), V7 doesn't acquire the ability to mix style and content to the same degree. For example, a sufficiently trained CLIP-based model may have never seen a portrait of a specific character in anime style, but having seen that character and many anime images separately, it can still combine the two when the prompt asks for "character X in anime style." With T5 we encountered many examples where this doesn't work well: the model is either less capable of mixing style and content, or some parts of the content description force a specific style no matter how many additional instructions to change it are provided. Unfortunately, the same issue also seems to apply to score_X tags, which are unable to overpower the rest of the prompt and trigger the aesthetic bias.
We ran many experiments, checking whether T5 tokenization has any impact, whether caption variety plays a role, and many other hypotheses, but none of them significantly affected this issue. The working theory right now is that the model is not learning to distinguish between the content and style elements of the prompt well enough, and it is most likely not a single issue contributing to this. To improve this in V7.1, we're making a number of changes during training: even more diverse captioning, extended training time, and a brand-new experimental synthetic pipeline whose goal is to create many variations of existing data in different styles, helping the model grasp the idea of "style."
Full resolution training vs. limited resolution
If you're training a full model from scratch, you typically start with one smaller resolution (for example, 512px by 512px) and train your model for the largest part of its training at that resolution. Later stages usually introduce images of higher resolutions and different aspect ratios to teach the model about quality. This makes sense for full model training, as just going from 512px to 1024px resolution takes 4x longer to train, so it's a delicate balance between quality and resources needed.
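Here's the back-of-the-envelope version of that argument, assuming an 8x VAE and 2x2 patching as in typical DiT-style models: latent token count grows with pixel area, and per-step compute grows at least linearly (attention quadratically) with token count.

```python
# Rough latent-token count per image as resolution grows (simplified model:
# 8x-downsampling VAE plus 2x2 patching; actual architectures vary).
def latent_tokens(res_px: int, vae_downsample: int = 8, patch: int = 2) -> int:
    side = res_px // vae_downsample // patch
    return side * side

base = latent_tokens(512)
for res in (512, 1024, 1536):
    tokens = latent_tokens(res)
    print(f"{res}px -> {tokens} tokens ({tokens / base:.1f}x the 512px cost)")
# 512px -> 1024 tokens (1.0x), 1024px -> 4096 tokens (4.0x), 1536px -> 9216 tokens (9.0x)
```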
On the other hand, fine-tunes usually go for the highest possible resolution. For example, V6 was trained on resolutions up to 1280px, which allowed the final model to be more consistent at resolutions above SDXL's 1024px. For V7, we followed the same pattern of sticking to the highest resolution we could manage. I believe this was a suboptimal decision that ultimately cost too much training time. V6 was already pushing what could be considered a "fine-tune" due to the amount of training data, and I suspect in V7 we went beyond what's reasonable to train on full resolution images—perhaps due to a larger gap than expected between our dataset (which is very diverse in represented styles) and AuraFlow's focus on realism.
There's no good answer here, but for our V7.1 run, we're starting with a smaller resolution dataset (i.e., 512px) to see the impact.
Chroma

For those who've always wondered "what if Pony, but Flux?", I'd encourage you to try Chroma, developed by Lodestone. We love this model and have been a sponsor of the project for a long time. Here's how the two compare:
Both have similar prompt adherence
V7 prompting is more opinionated, but both use special quality tags
Chroma should perform better on small details and especially text
V7 has a slightly more diverse training dataset
Chroma is optimized for 1024px resolution, and AuraFlow can go up to 1546px
For me, Chroma answers some important questions and confirms a few observations:
Training on top of distilled models is hard and requires significant engineering talent in the model editing space that we didn't possess when starting V7. Despite the tradeoffs and limitations of AuraFlow, it was the right call for us.
The impact of VAE selection, while real, is limited. The VAEs we have access to degrade quality to some extent, and even with the Flux VAE we would've hit similar issues (hence pixel-space Chroma).
"Prompt locking" (I don't have a better name for this) happens not only in V7 but seems to be a side effect of moving away from CLIP encoders.
Limitations we discovered in V7
There are a number of areas where V7 doesn't reach the bar we anticipated - specifically, the ability to distinguish between content and style, resulting in "prompt locking," where specific prompt elements force a style that other parts can't override. For example, the presence of "portrait" forces a photo style no matter how strong the style tags are. This is a complex problem stemming from a mix of T5, insufficient training, and data issues. We tried a number of V7-specific hacks, like adding additional encoders for style and quality, but they proved ineffective unless done very early in training. We expect to at least partially mitigate this in V7.1 by increasing training and using special synthetic data aimed at this specific issue, but it may also be a fundamental limitation of the architecture we use (a similar issue exists in Chroma).
V7.1
We're working on training an updated version of V7 (along with some style LoRAs that should reinforce style cluster selection). I promise this won't be another 18-month wait! We'd like to realize the full potential of the V7 line, as it's a very capable model with reasonable GPU requirements that should be accessible to a large number of users.
V7.1 will also be our last model under the Pony license, and we'll switch to Apache 2 licensing for the next generation of models.
V8 and editing model
For those who've been following Pony model development closely, it's no surprise that I don't like LoRAs, nor am I a big fan of ControlNets. Such tech, while useful, has always felt like a hack to me, so I've been very happy to see the rise of editing models. Want to use pose control? Just provide an image of the pose. Looking for a particular style? Why not use a few sample images to instruct the model how to draw things?
We've planned an editing model for a long time and originally called it PomniGen, as we expected to build on OmniGen. I like that name too much to drop it, so we're keeping it, even though it's actually a QWEN/QWEN Editing alternative now. We're cleaning up our own extensive Pony-flavored editing dataset and are excited to see how well it performs on various character-focused tasks.
I also promise we'll be sharing ongoing checkpoints instead of waiting for a fully trained model this time!
Anyway, time to get back to Pony—and don't forget to check out Fictional!


