TL;DR: This article breaks down the principles behind Pony V7 captioning and how to craft prompts effectively. Pony V7 relies on long descriptive prompts to maximize quality; check the primer below for an example. We have also released a Colab and supporting models for it. V7 is overly focused on detailed prompting; we are working on better support for short prompts in V7.1.
>> V7 model here <<
The Big Picture
Text-to-image models like Stable Diffusion, Flux, or AuraFlow take text inputs and produce images (duh!). These image generation models can be conditioned on many different input types, but the major breakthrough of recent years is that we can efficiently do this through an interface that's easy for humans to understand: text.
While text is an accessible and intuitive interface, it comes with its own set of challenges.
First, text is a lossy "compression format"—it doesn't precisely describe an image. Given a single prompt, you can generate multiple different images unless you endlessly specify each detail to reach the level of resolution necessary to "lock in" the image.
The second issue with text prompting is what's not included in the prompt. When describing the image they want to generate, users often omit many critical elements that affect the quality or visual style of the generated image.
When a user says "I want a picture of a girl" what's omitted from the prompt is "...and make it anime, but not just any anime—I want it to be nice-looking anime, and even better if it's in the style of that Pixiv artist I know, and she should be in a good action pose."
Therefore, we're dealing with two conflicting goals when designing models. On one hand, the model needs to be as configurable as possible to increase the specificity of what's generated. On the other hand, the model must generate images aligned with users' implicit preferences.
There are a few ways to achieve this:
Style-locked models – For example, an anime-focused model sets the expectation that all images will be anime-styled.
Style replication of specific artists – This is a very powerful technique that aligns what the model creates with what the user wants across multiple dimensions: overall quality, specific visual style, and concepts displayed (i.e., you typically prompt for an artist who draws cool furries when generating your own furries, rather than, say, airplanes).
Building (aesthetic) bias into the model – It's possible to (post-)train models to better align with human preferences (through techniques like RLHF and similar methods) by building associations between specific inputs and other properties. For example, "portrait" might also imply "35mm, studio light" even if that text isn't present in the prompt.
Quality modifiers – These modifiers bias the model toward certain quality characteristics, such as score_9 or masterpiece, that were originally introduced in the training data (captions).
Pony models have become much more general-purpose over time in terms of represented styles and by design don't allow artist style replication, so two of the common approaches are off the table. The diversity of styles also makes getting the aesthetic bias right a complicated task, so as a best-effort attempt Pony models have long relied mostly on quality modifiers—i.e., score_X tags.
Enter V7
The general approach behind V7 is: "Make text a detailed interface that's still manageable by experts, and provide tools to non-experts to create cool-looking stuff."
This means the model accepts detailed captions as input, and having a T5 text encoder greatly helps expand the model's understanding of the prompt (though it introduces other issues; please see our V7 release summary).
Aligned with this approach, the model has been trained to accept detailed captions in a specific format outlined below. The complexity of prompts has been capped so they don't become extremely verbose (and remain human-editable), but they're still pretty long. Understanding this format will help you get the best out of the model.
Getting Good Captions the Simple Way
The easiest way to bootstrap the captioning process is to find an image you like and run it through the captioning pipeline available in this Colab.
An Image Prompt Primer
Let's use this image to study what a good image caption/prompt may look like.

score_9, rating_sensitive, style_cluster_430, A smiling anthro female Pinkie Pie is dressed in a bridal outfit, complete with a white veil and a blue choker with a blue bow. She is wearing a white wedding dress with a sweetheart neckline, and her medium breasts are visible. Her pink skin contrasts with her vibrant blue eyes. She is holding a bouquet of flowers with orange and yellow roses. The background is a soft, blurred blue, putting the focus on the character. Medium close-up with a slight low angle perspective. Soft lighting from the top left. Digital illustration with semi-realistic style. Vibrant color palette utilizing a complementary color scheme of pinks and blues. Subtle highlights and specular reflections enhance the textures. The image has a contemporary aesthetic with a playful and whimsical feel. 1girl, solo, long hair, breasts, looking at viewer, smile, blue eyes, gloves, dress, holding, animal ears, cleavage, bare shoulders, medium breasts, pink hair, flower, choker, artist name, white gloves, white dress, blue background, horse ears, white flower, furry, veil, furry female, bouquet, wedding dress, bridal veil, holding bouquet, bride, pink skin
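Before breaking the caption down, here is a minimal Python sketch of how the example above decomposes into the components discussed in the following sections. The text is taken directly from the caption; only the variable names and the truncation are ours.

```python
# The example caption above, split into its components. The ordering and joining
# simply reconstruct the example; only the variable names are ours.
special_tags = "score_9, rating_sensitive, style_cluster_430"

content_caption = (
    "A smiling anthro female Pinkie Pie is dressed in a bridal outfit, "
    "complete with a white veil and a blue choker with a blue bow. ... "
    "Medium close-up with a slight low angle perspective."
)

style_caption = (
    "Soft lighting from the top left. Digital illustration with semi-realistic style. ... "
    "The image has a contemporary aesthetic with a playful and whimsical feel."
)

tags = "1girl, solo, long hair, breasts, looking at viewer, smile, ..."

prompt = f"{special_tags}, {content_caption} {style_caption} {tags}"
```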
IMPORTANT: As mentioned in our V7 release summary, the effect of the special tags below may be overpowered by the natural language portion of the prompt; we are working on V7.1 to address this issue and improve short-caption support.
Aesthetic Score (score_9)
Each image in the model's training dataset has an associated aesthetic score that captures generalized "quality" properties of an image. While beauty is in the eye of the beholder, this scoring system allows us to generally distinguish good images from bad ones. Aesthetic scores were the primary tool in V6 and earlier models for achieving desired output quality, so they remain useful in V7—even though we now have a few better tricks at our disposal.
When used alongside other elements like style clusters and natural language style descriptions, aesthetic scores are no longer very impactful. However, they still provide a bias toward generating nice-looking images.
The V7 version of the aesthetic classifier has been released here.
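For intuition, here is a tiny illustrative sketch of how a raw aesthetic-classifier score could be bucketed into a score_X tag. The score range and thresholds below are placeholder assumptions, not the actual values used for V7.

```python
def score_tag(aesthetic_score: float) -> str:
    """Map a normalized aesthetic score in [0, 1] to a score_1..score_9 bucket.

    Purely illustrative: the real V7 score range and bucket boundaries differ.
    """
    bucket = min(9, max(1, int(aesthetic_score * 9) + 1))
    return f"score_{bucket}"

print(score_tag(0.95))  # -> score_9
```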
Safety Rating (rating_sensitive)
Due to the wide range of supported concepts, Pony models can sometimes generate inappropriate images for ambiguous prompts, so rating tags allow you to better align generated images with your content preferences. Use one of rating_general, rating_sensitive, or rating_explicit to specify your desired content level.
Style Clusters (style_cluster_430)
Pony models historically don't allow artist style replication, which puts the model at a disadvantage: useful stylistic bias can no longer be easily introduced, leaving users to rely on general aesthetic tags (score_X) and image style descriptions—which were severely lacking in V6.
In V7, we introduce much stronger style description control via natural language, but also a way to more specifically target "superartists" via style clusters.
A superartist is a way to group artists with similar styles while ensuring that no single artist's style can be replicated—essentially describing the "vibe" of a group of artists. V7 takes this one step further by recognizing that individual artists can have multiple styles, then separating these styles and reassembling them into style clusters.
Here's how it works: First, a visual transformer network was trained to classify artist styles. By learning to identify which artist created each image, the model developed the ability to produce style embeddings—special sets of numbers that naturally position similar styles close to each other in mathematical space. Then, these embeddings were calculated for all images in the training set, and images were grouped into one of 2,048 clusters. Clusters without enough diversity were pruned, leaving us with approximately 1,800 superartist clusters. We then included the closest cluster tag in each caption in the training set, which allows you to target that specific superartist during inference (i.e., image generation).
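As a rough illustration (not the actual V7 pipeline), the clustering step could look something like the sketch below, with random vectors standing in for the ViT style embeddings and a simple size threshold standing in for the real diversity-based pruning.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder embeddings standing in for the style embeddings produced by the
# artist-classification vision transformer (one 768-d vector per image).
rng = np.random.default_rng(0)
style_embeddings = rng.normal(size=(50_000, 768))

# Group all images into 2,048 style clusters.
kmeans = MiniBatchKMeans(n_clusters=2048, batch_size=4096, random_state=0)
labels = kmeans.fit_predict(style_embeddings)

# Prune clusters without enough images; the real criterion also considered
# diversity, so this size threshold is only an illustrative stand-in.
counts = np.bincount(labels, minlength=2048)
kept = {c for c in range(2048) if counts[c] >= 20}

def cluster_tag(image_index: int):
    """Return the style_cluster_N tag for an image, or None if its cluster was pruned."""
    c = int(labels[image_index])
    return f"style_cluster_{c}" if c in kept else None
```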
If none of that made sense, the important takeaway is this: you now have funny-looking new tags that, depending on the number, mean something like "give me '80s anime" or "give me black-and-white high-contrast photo."
Note: As mentioned above, artists commonly have different styles that evolve over time. Due to the volume of images in the training dataset, the model learns that these styles are both connected (same artist) but not exactly the same (different substyles). Therefore, one artist doesn't equal one style, and the model can distinguish between them.
The V7 version of the style classifier has been released here.
Important: We will document style clusters in more detail in the future; please stand by.
Content caption
The goal of the content part of the caption is to capture the contents of the image without touching the style portion of the prompt. While a clean separation isn't always possible (e.g., shot types, double exposure), the content caption focuses only on factual image information. These captions try to pull the most critical information about the image, including introducing the characters, into the first sentence and then expand on it.
A smiling anthro female Pinkie Pie is dressed in a bridal outfit, complete with a white veil and a blue choker with a blue bow. She is wearing a white wedding dress with a sweetheart neckline, and her medium breasts are visible. Her pink skin contrasts with her vibrant blue eyes. She is holding a bouquet of flowers with orange and yellow roses. The background is a soft, blurred blue, putting the focus on the character. Medium close-up with a slight low angle perspective.
Below is a simplified version of the instruction given to the content captioning pipeline (check the full prompt in the captioning Colab).
Begin with a comprehensive summary of the image, detailing the primary subject(s), their appearance, facial expressions, emotions, actions, and the environment.
The caption must meticulously describe every visible aspect of the image, capturing all colors, sizes, textures, materials, and locations of objects. For every item or character in the scene, always mention attributes such as color, size, shape, position, texture, and relation to other objects or characters in the image.
For characters, refer to them by name if known. If the character has a more commonly known name, use that. Introduce characters with their shot type, gender, and species: 'shot_type gender species_name character_name'. Mention any well-known media associations after the character’s name or species. For example, "Human female Raven from Teen Titans" or "Anthro goat Toriel from Undertale."
When multiple characters are present, introduce the primary character first and clearly ground the location of all other characters in relation to the primary one. Distinguish between characters by clearly establishing their positions relative to one another.
Background elements must be described thoroughly, with explicit references to their location in relation to the characters or objects. Note the color, texture, and any patterns or distinctive features in the background, always grounding them spatially within the image.
Objects in the scene must be described with attention to every visual feature. Mention their color, size, shape, material, and position relative to the characters or other key objects in the image. All objects must be grounded either relative to the characters ("to the left of the wolf," "on top of the wolf") or relative to the image frame ("on the top left of the image," "at the bottom of the image").
Important: Check how characters are described; V7 works best when you use the full character preamble, i.e., Anthro female pony Pinkie Pie from My Little Pony.
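A tiny helper sketching that preamble format; the field names and the optional series suffix are our own illustrative framing.

```python
def character_preamble(form: str, gender: str, species: str, name: str, series: str = "") -> str:
    """Build a character preamble like 'Anthro female pony Pinkie Pie from My Little Pony'."""
    base = f"{form} {gender} {species} {name}"
    return f"{base} from {series}" if series else base

print(character_preamble("Anthro", "female", "pony", "Pinkie Pie", "My Little Pony"))
# -> Anthro female pony Pinkie Pie from My Little Pony
```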
The V7 version of the content captioner has been released here.
Style caption
The goal of the style part of the caption is to capture how the elements of the image are presented without mentioning the contents.
Soft lighting from the top left. Digital illustration with semi-realistic style. Vibrant color palette utilizing a complementary color scheme of pinks and blues. Subtle highlights and specular reflections enhance the textures. The image has a contemporary aesthetic with a playful and whimsical feel.
Below is a simplified version of the instruction given to the style captioning pipeline (check the full prompt in the captioning Colab).
Start by identifying the type of shot used in the image, categorizing it as one of the following: Extreme Long Shot (wide view showing a large scene or landscape), Long Shot or Full Shot (showing the entire body of a character or object)...
Describe any noteworthy compositional properties of the image, if any. Mention if the image uses double exposure (overlaying two images), dutch angle (tilted frame),...
Describe the perspective and depth of the image, if applicable. Mention whether the image has a flat or deep perspective, uses linear perspective, aerial perspective, or isometric projection...
Then, classify the lighting used in the image, selecting from the following terms: Flat lighting, Stagelight, Direct sunlight... Use flat lighting for digital illustrations with simplified lighting that does not try to look realistic, i.e. vector images, anime, etc…
For lighting types that can be localized, note the position of the light source if clearly discernible, such as "from the top left of the character" "directly above the scene"...
Identify the medium of the image: photograph, digital illustration, traditional painting (specify type if clear, e.g., oil, acrylic, watercolor)...
If the image is a photo, mention this and ignore the coloring/shading style instructions below. If the image is clearly not a photo, describe the coloring or shading style of the image choosing from: Cell shading (flat look with few solid tones), soft shading, pixel art, speedpaint...
Identify the color scheme best describing the image's palette, selecting from: Monochromatic color scheme, Grayscale color scheme, Analogous color scheme, ...
Choose any applicable effects present in the image (if any), such as: Film grain, dust specs, motion blur, speed lines, depth of field...
If the image clearly belongs to a specific art historical style or period, mention it. This could include but is not limited to: Renaissance, Baroque, Rococo...
Finally, if the image strongly exhibits a particular aesthetic, describe it using terms like: Synthwave, Outrun...
The style caption is a less precise and more subjective description that captures the overall vibe of the image plus a few important style-specific elements.
The key elements to focus on here are listed below; a small sketch showing how they combine into a style caption follows the list.
Shot Type – From extreme long shot to extreme close-up. (You may notice the shot type placement isn't entirely consistent—we experimented with including it in both the caption and style sections over time, so it should work well in either location.)
Composition Techniques – Special techniques like dutch angle or fisheye lens. (TBD: discuss lenses and cameras specifically.)
Depth Properties – Depth-related characteristics of the image.
Lighting – Type and position of light sources.
Image Medium – Such as 3D render, photo, or digital illustration. Include coloring and shading styles where relevant for the medium (though these tend to be less precise, and style clusters provide better control).
Color Scheme – The overall color palette of the image.
Special Effects – Such as motion blur or other visual effects.
Classical Art Styles – Specific movements like Art Nouveau.
Strong Aesthetics – Distinctive aesthetic themes like Synthwave.
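Here is the sketch mentioned above: assembling a style caption by filling in the applicable elements. The values are copied from the example style caption earlier in this primer; leave out anything that doesn't apply to your image.

```python
# Values copied from the example style caption; the dict keys mirror the list of
# key elements above. Omit fields that don't apply.
style_fields = {
    "shot type": "Medium close-up with a slight low angle perspective.",
    "lighting": "Soft lighting from the top left.",
    "medium": "Digital illustration with semi-realistic style.",
    "color scheme": "Vibrant color palette utilizing a complementary color scheme of pinks and blues.",
    "effects": "Subtle highlights and specular reflections enhance the textures.",
    "aesthetic": "The image has a contemporary aesthetic with a playful and whimsical feel.",
}
style_caption = " ".join(value for value in style_fields.values() if value)
```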
The V7 version of the style captioner has been released here.
Tags
1girl, solo, long hair, breasts, looking at viewer, smile, blue eyes, gloves, dress, holding, animal ears, cleavage, bare shoulders, medium breasts, pink hair, flower, choker, artist name, white gloves, white dress, blue background, horse ears, white flower, furry, veil, furry female, bouquet, wedding dress, bridal veil, holding bouquet, bride, pink skin
While natural language is the preferred way of prompting, tags are useful for ensuring that specific concepts you want in the image aren't omitted due to the inherent fuzziness of natural language.
One important aspect of tags: while they appear last in the prompt, during image captioning they're actually inferred first. These tags are then passed to the captioner to help create natural language descriptions, assisting with tasks that are tricky for natural language captioners—such as identifying specific characters, locations, series, and other precise details. Check out the Colab for details.
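Putting it together, the captioning-time ordering looks roughly like the sketch below; tagger, content_captioner, and style_captioner are hypothetical stand-ins for the models wired up in the Colab, not its actual API.

```python
def build_caption(image, special_tags, tagger, content_captioner, style_captioner):
    """Sketch of the ordering described above: tags are inferred first, used as
    hints for the natural-language captioners, and appended at the end."""
    tags = tagger(image)                                # e.g. "1girl, solo, long hair, ..."
    content = content_captioner(image, hint_tags=tags)  # natural-language content caption
    style = style_captioner(image, hint_tags=tags)      # natural-language style caption
    return f"{special_tags}, {content} {style} {tags}"
```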
Next steps
One of the main challenges we faced while preparing the V7 dataset was the lack of caption length diversity. Although we employed various dropout techniques—removing special tags or isolating just the style or content portions—these approaches had limited impact because the core content and style descriptions always remained long and detailed.
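For illustration, the kind of dropout we applied looks roughly like this sketch; the exact permutations and probabilities used for V7 are not reproduced here.

```python
import random

def caption_permutation(special_tags: str, content: str, style: str, tags: str) -> str:
    """Illustrative caption dropout: sometimes drop the special tags or the tag
    block, sometimes keep only the content or only the style portion."""
    parts = []
    if random.random() < 0.9:  # occasionally drop the special tags
        parts.append(special_tags + ",")
    mode = random.choice(["both", "content_only", "style_only"])
    if mode != "style_only":
        parts.append(content)
    if mode != "content_only":
        parts.append(style)
    if random.random() < 0.9:  # occasionally drop the tag block
        parts.append(tags)
    return " ".join(parts)
```

Whichever permutation was picked, the long content and style descriptions themselves never got shorter.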
This limitation wasn't an oversight; it was a practical constraint we faced while captioning millions of images at scale. The VLMs available at the time struggled to generate structured outputs with multiple caption variations. Creating diverse caption lengths would have extended our already months-long captioning process by several multiples, making it impractical. Compounding this issue, T5's heavy VRAM requirements forced us to encode captions during dataset processing rather than on-the-fly, locking us into a limited set of caption permutations that were randomly selected during each training epoch.
We believe this lack of caption diversity is also behind V7's "sparsity" issue—where some prompts produce high-quality images, but changing them slightly sometimes causes quality to degrade significantly. The model likely didn't learn to generalize well across different caption lengths and variations, making it overly sensitive to specific prompt formulations.
Fortunately, the landscape has changed dramatically. Today's open-source VLMs have improved significantly, and commercial models perform exceptionally well. This has allowed us to introduce substantial caption diversity for V7.1 and V8, addressing one of V7's core limitations. Additionally, while we haven't done any post-training on previous Pony models, we believe this is the right time to introduce it into the Pony lineup. We're currently working on RLHF/DPO versions of the model—which is why we called the first V7 release "base."


