Towards Pony Diffusion V7

Hello everyone,

I'm excited to share updates on the progress of our upcoming V7, along with a retrospective analysis of V6.

The recognition V6 has received is heartening, and I am grateful to all the users and creators of derivative models, even though some uses diverge from my initial vision. However, replicating such success isn't straightforward, and expectations for V7 are understandably high. Let's explore some of the improvements you can expect in V7.

Call for support

But before we dive into the technical details, let's pause for a moment!

Creating models on the scale of Pony Diffusion is a complex and costly endeavor, which is why there are so few of them. I'm eager for PDV7 to deliver even more joyful experiences and to further help the community of model creators. However, I need your assistance to make this happen.

If you're a company interested in being featured in the upcoming V7 release and would like to support the development—either financially or by providing computing resources—this is your opportunity. Please contact me at [email protected] to discuss partnership possibilities.

If you are an individual enthusiast, consider joining the PurpleSmartAI Discord. We offer various subscription and sponsor options that contribute to our development efforts.

Or, at the very least, please consider following me on Twitter at https://twitter.com/AstraliteHeart. I enjoy seeing those numbers climb!

Anyway, back to the fun stuff.

Style consistency and selection

One notable aspect of Pony Diffusion is its absence of artist tags, a decision that, as I am well aware, may have disappointed many. The choice to exclude artist tags stems from a core principle: Pony Diffusion is designed to foster creativity, not to replicate the styles of others. However, artist tags are undeniably potent; they not only introduce a strong quality bias but also guide users towards consistent thematic use (e.g., employing a renowned pony artist when drawing more pony-themed images). This puts models like PD at a considerable disadvantage and underscores the need for better tools to manage style and ensure quality.

In V6, we introduced features such as adding rich style descriptions to prompts, which worked adequately but had limitations. Regrettably, more advanced techniques intended to enhance style management in V6 did not perform as well as anticipated.

The community has clearly demonstrated the need for improved style control (see the exceptionally popular collection of style LoRAs by prgfrg23). In response, for V7, I am developing a concept called style grouping, or "super artists", as part of the base model. The aim is to use human feedback on style differences to automatically cluster images by style. I plan to expand on this in a separate article, but the general approach involves using artists as a ground truth for initial training, followed by refining the process through human queries asking whether two images share a similar style. The outcome will introduce special tags like "anime_1", "smooth_shading_48", and "sketch_42", which can be used during training and in model prompts to enhance style fidelity.
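To make the idea a bit more concrete, here is a minimal sketch of how artist labels could bootstrap a style-embedding space that is later clustered into "super artist" tags. This is my own illustration, not the actual PD implementation: the embedding dimensions, cluster count, and tag prefix are all assumptions, and it presumes some precomputed image embeddings (e.g., from a CLIP vision encoder) plus an image-to-artist mapping.

```python
# Illustrative sketch only: one way to bootstrap "super artist" style clusters.
# Assumes precomputed image embeddings and an image -> artist mapping;
# none of this is the actual Pony Diffusion pipeline.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class StyleHead(nn.Module):
    """Small projection head mapping generic image embeddings into a style space."""
    def __init__(self, in_dim=768, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

def triplet_step(head, anchors, positives, negatives, margin=0.2):
    # Anchor/positive share an artist (the ground-truth proxy for "same style");
    # the negative comes from a different artist. Human "same style or not?"
    # answers could later supply pairs in exactly the same triplet form.
    a, p, n = head(anchors), head(positives), head(negatives)
    return nn.functional.triplet_margin_loss(a, p, n, margin=margin)

def assign_style_tags(head, embeddings, n_clusters=256, prefix="style"):
    """Cluster the learned style vectors and emit tags like 'style_42' for prompts."""
    with torch.no_grad():
        style_vecs = head(torch.as_tensor(embeddings, dtype=torch.float32)).numpy()
    labels = KMeans(n_clusters=n_clusters).fit_predict(style_vecs)
    return [f"{prefix}_{int(c)}" for c in labels]
```

In this sketch a single flat clustering produces one tag family; the tags mentioned above ("anime_1", "smooth_shading_48", "sketch_42") suggest the real grouping distinguishes several families, which would simply mean clustering within broader style categories rather than across the whole dataset at once.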

Expanded Datasets

For V6, I narrowed down a dataset of approximately 10 million images to 2.6 million top selections. For V7, I have expanded the full dataset to about 30 million images, from which I aim to select around 10 million for training. This expansion will enhance the model's capability to support more content types and improve character recognition across various fandoms as I update old data and integrate new sources.

Enhanced SFW Data Coverage

While over 50% of the data used for training PD, particularly in V6, was safe for work (SFW), it became evident that this data did not offer sufficient diversity. My ongoing efforts to enhance SFW generation capabilities are focused on maintaining the high quality of outputs, with special attention to achieving the right balance in the dataset.

Cosplay dataset

Although my primary focus remains on non-photorealistic styles, the substantial number of 3D images in the dataset, spanning various levels of realism, justifies an extension towards human subjects. At the very least, this should help realism-focused derivatives achieve better quality.

Anime dataset

V6 featured a considerable amount of anime-specific data, but you can expect significant enhancements in character recognition and overall support for anime styles, as I am incorporating multiple diverse anime-based datasets.

Video-Based Dataset

As I prepare our infrastructure to handle text-to-video tasks, now is an opportune time to expand our data acquisition pipeline to extract still images from video data. This approach presents new challenges in captioning and selecting the best samples, but initial experiments have been successful, and I am confident this can be implemented effectively in V7 and future versions.
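As an illustration of the frame-extraction step (a rough sketch under my own assumptions, not the actual acquisition pipeline), one could sample stills with OpenCV, drop blurry frames via Laplacian variance, and skip near-duplicate frames; the sampling rate and thresholds below are placeholders.

```python
# Rough sketch (not the actual PD pipeline): sample stills from a video,
# keeping only reasonably sharp frames and skipping near-duplicates.
import cv2
import numpy as np

def extract_stills(video_path, out_dir, every_n=24, blur_thresh=100.0, diff_thresh=10.0):
    cap = cv2.VideoCapture(video_path)
    prev_gray, idx, kept = None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # low variance ~ blurry frame
            changed = prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_thresh
            if sharpness > blur_thresh and changed:
                cv2.imwrite(f"{out_dir}/frame_{idx:08d}.png", frame)  # PNG avoids extra JPEG loss
                prev_gray, kept = gray, kept + 1
        idx += 1
    cap.release()
    return kept
```

Captioning and ranking the surviving frames would then go through the same pipeline as any other image source.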

Video Game, 3D, Artbook, and Concept Art Dataset

I am also incorporating a variety of miscellaneous sources to address gaps in the model's understanding of subject matter beyond characters. This should enhance our SFW capabilities and introduce more unique styles as a result.

Enhanced Captions

The inclusion of natural language captions was undoubtedly a significant breakthrough that contributed greatly to the effectiveness of V6, despite their limited coverage: only about half of V6's images were fully captioned. In V7, I am focused on enhancing both the quality and the coverage of these captions. The quality of training data is crucial; no matter how adept the model is at understanding prompts, it needs robust data to support it.

Currently, as I continue to refine the captioning model, I'm observing results that surpass any publicly available dataset I've encountered thus far. Below, you can find some examples from our work-in-progress captioning models.

https://derpibooru.org/images/3345861

Female feral alicorn Princess Luna from My Little Pony stands confidently against a backdrop of telephone poles and clouds in a bright, sunny day. She wears sunglasses with reflective orange lenses and a black glove on her right hand, which houses her cybernetic arm. Princess Luna's horn is adorned with a crown, and her mane flows freely behind her. She is wearing a brown bikini top that exposes her midriff. 

https://derpibooru.org/images/3340263

A serious female alicorn unicorn, Princess Celestia from My Little Pony, is depicted wearing a dark hoodie and sunglasses with tinted lenses that glow with the sunlight. Celestia's mane flows in a wavy pattern of pastel colors, with shades of blue, green, and pink blending into each other. She is looking at something out of the frame with a displeased expression. The setting is an urban environment with multi-story buildings, snow on the ground, and a clear winter day. In her outstretched magical aura, there is a smartphone with a snowflake logo on its back.

https://derpibooru.org/images/3337672

Apple Bloom, the feral pony with a bright red mane and a big pink bow, and Applejack, the feral pony wearing a cowboy hat, are sitting against a tree, holding an apple. Apple Bloom is looking at Applejack with a smile and Applejack is looking at her with affectionate eyes. They are on a picturesque apple orchard set against a backdrop of a red barn and clouds in the sky.
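The captioner behind these examples is a custom, work-in-progress model, so the snippet below only illustrates the general shape of batch natural-language captioning using an off-the-shelf VLM (BLIP via Hugging Face transformers); the checkpoint and token limit are assumptions and not part of the PD pipeline.

```python
# Not the PD captioner (a custom in-house model); this sketch just shows the
# general shape of batch natural-language captioning with an off-the-shelf VLM.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def caption_images(paths, max_new_tokens=75):
    captions = {}
    for path in paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        captions[path] = processor.decode(out[0], skip_special_tokens=True)
    return captions
```

The gap between what a generic captioner produces and the fandom-aware examples above (character names, species, show-specific terms) is exactly why the in-house model matters.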

Improved Aesthetic Scores

For an in-depth understanding of "score_9" and its related metrics, please refer to my previous article here. With V7, I am implementing two major enhancements. Firstly, the issues V6 faced with long prompts will be resolved, allowing for straightforward use of "score_9" and other scoring tags. Secondly, as we transition to larger CLIP models and implement arena-like image ratings, I aim to more accurately capture the quality of images within the tags.
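As a hedged sketch of what "arena-like image ratings" could look like in practice (my illustration, not the actual scoring code), pairwise human preferences can be folded into per-image ratings with a simple Elo update and then bucketed by percentile into score_1 through score_9 tags; the constants and the bucketing scheme are assumptions.

```python
# Illustrative only: turn pairwise "which image is better?" votes into ratings,
# then bucket ratings into score_1..score_9 tags. Not the actual PD scoring code.
import numpy as np

def elo_update(ratings, winner, loser, k=32.0):
    ra, rb = ratings.get(winner, 1000.0), ratings.get(loser, 1000.0)
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected_a)
    ratings[loser] = rb - k * (1.0 - expected_a)

def to_score_tags(ratings, n_buckets=9):
    """Map ratings to score_1 (lowest) .. score_9 (highest) by percentile rank."""
    ids = list(ratings)
    values = np.array([ratings[i] for i in ids])
    ranks = values.argsort().argsort()  # 0 = worst, len-1 = best
    buckets = np.clip((ranks * n_buckets) // len(values), 0, n_buckets - 1)
    return {i: f"score_{int(b) + 1}" for i, b in zip(ids, buckets)}
```

A Bradley-Terry fit or a learned head on top of CLIP embeddings would be natural alternatives; the point is simply that relative judgments, not absolute labels, drive the tags.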

However, with the addition of more content types to the model, some further data ranking will be necessary. I anticipate spending a few days labeling more images to refine these processes.

The broader application of such scoring tags remains an open question, but any significant updates will be postponed to the V8 development timeline. While options like DPO and quality "sliders" are attractive, I prefer to explore these after establishing a strong baseline with simpler mechanisms in V7.

JPEG Artifacts

An issue I hadn't initially noticed in V6, which was brought to my attention by several users, is the presence of JPEG artifacts. Although this problem is only evident in certain styles, I am committed to addressing it. The issue appears to stem from two main sources: some of the source material already contains artifacts, and my pipeline, which involves saving images at 95% quality twice, likely exacerbates the problem.

To resolve this, I am making adjustments to the pipeline to ensure images are directly transferred from the source to VAE encoding without intermediate quality reductions. Additionally, I am developing methods to detect and either automatically correct or exclude images with noticeable artifacts. This should significantly reduce the presence of JPEG artifacts in the output of V7.
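A minimal sketch of the "no intermediate re-encode" idea, assuming a diffusers-style VAE (the checkpoint below is a placeholder, not necessarily what PD uses): decode the source file once and hand the pixel tensor straight to the VAE encoder, never re-saving it as JPEG along the way.

```python
# Sketch of avoiding repeated lossy saves: one decode of the original file,
# then straight to VAE latents. The VAE checkpoint here is illustrative.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

def encode_without_reencoding(path):
    image = Image.open(path).convert("RGB")                              # single decode of the source
    pixels = torch.from_numpy(np.asarray(image)).float() / 127.5 - 1.0   # scale to [-1, 1]
    pixels = pixels.permute(2, 0, 1).unsqueeze(0)                        # (1, 3, H, W)
    with torch.no_grad():
        latents = vae.encode(pixels).latent_dist.sample()
    return latents
```

Any resizing or cropping would happen on the decoded tensor in memory, so no additional JPEG quantization is introduced before encoding.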

Base Model and Timeline

I am keen on training V7 using SD3, although it's currently uncertain whether we will have access to the model weights. I remain hopeful and would be delighted if someone from SAI could discuss this possibility with me. Despite my efforts to reach out, there has been no response yet—perhaps there's a bit of apprehension about being outshined by PD (just a light-hearted thought).

Looking ahead, the next month is dedicated to captioning, a task that requires as much time and resources as model training itself. This will be followed by wrapping up human data collection and completing research work like style grouping. I anticipate beginning the training phase afterward, and I will provide more specific timelines as we approach these stages.

Onward to new frontiers in AI creativity.

Astra, founder of PurpleSmartAI
