Pony Diffusion V6 XL Prompting Resources and Info

https://docs.google.com/spreadsheets/d/1m2W-pZEvHuEpfHcNHrxCSr-Aw1mgtUUYho6sz9LChEA/edit#gid=0

First off, the link above is the resource itself: a slapped-together Excel spreadsheet I made (initially for personal use) from other places to help me learn what Pony Diffusion can do. It grew big enough that I wanted to share it. I originally had this as a comment under Pony Diffusion, but because it got drowned out in the comments I decided to post it here as a more permanent solution. Below is an excerpt from AstraliteHeart themself, from PurpleSmart's Discord, explaining the "score_9" tags: what they are, how they came to be, why they're needed, and what they mean.

I've also added my own information in the tips section below, where I try to explain what you can do with the "special" prompts like source_furry or rating_safe. These tips and prompting styles will work with any model directly based on Pony Diffusion V6 XL, like autismix pony for example.

Links and info on use: https://rentry.co/ponyxl_loras_n_stuff was used as a source, as well as the purplesmart.ai Discord.

The basic prompt template: score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up, just describe what you want, tag1, tag2

You may've seen score_9 and similar tags used in prompts or automatically added by the bot. Here I will explain what this tag is, how it came to be, and how to use it correctly to generate better images in the bot and in local clients.

Discord copy from AstraliteHeart: https://discord.com/channels/670866322619498507/1199191922401615943

Why

The (simplified) lifecycle of an AI model consists of two stages - Training and Inference. During Training, a model that either doesn't know how to do anything (training from scratch) or doesn't know the specific things we care about, like ponies (finetuning), is repeatedly shown images with corresponding captions to "educate" it on things that make sense from the perspective of humans. This is a long process, and for V6 it took about 2-3 months on a very expensive "big boys' computer". When the model is finished, we start using it in the Discord bot or locally to generate nice pictures; this is called Inference.

There are a number of challenges we need to overcome to be able to actually generate something nice. Computers do not have a concept of "nice", and images generated during Inference will generally match (the quality of) the ones observed during Training (clever people call this GIGO - garbage in, garbage out). An obvious (and unfortunately naïve) option is to train models only on good data. First of all, a lot of concepts (i.e. characters, objects, actions) may not have enough good data to help the model learn them. Second, we still don't know how to separate good data from bad data. So, if we want a diverse model (in other words, a model that knows your obscure character or OC), we need to grab as much data as we can. But also not too much, as with more data comes more training time, and more $$$ burned. So we still need a way to find the subset of good data in all the data available to us.

Teaching machines to know what is good

Luckily for us, we have ways of educating machines on what humans consider good looking. There are many ways to do so, but PSAI is using something called "CLIP based aesthetic ranking". CLIP, or Contrastive Language-Image Pre-training, is a way to pair images with descriptions that match them. To put it simply, it's another AI model capable of accepting both images and texts and measuring how much they correlate with each other. In addition to learning about specific objects like "dog" or "cat", CLIP learns about other concepts like "masterpiece", "best quality" or "hd", as these words are common in the image captions provided during training, and these models are trained on large datasets of captions created by humans who tend to use such language. If you have used Stable Diffusion with other models, you may've used such keywords/tags to improve the quality of your generations. So why not just grab CLIP and use it everywhere, if it's so good at measuring which images correspond to "masterpiece" and other similar tags? Well, we again have challenges to overcome. First of all, CLIP has been trained on absolutely everything, and second of all, the quality of the captions used for CLIP is only ok-ish, far from perfect. This means that CLIP is not very good at non-photorealistic and somewhat less popular content, like ponies or cartoon furry characters (it works much better on anime). But we can still use CLIP to help us, as within its internals there are plenty of signals that may not necessarily have a good name attached to them, but if we can surface them, we can separate good images from bad ones.
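As a rough illustration of that image/text correlation, here is a minimal sketch using the Hugging Face transformers CLIP implementation. The model checkpoint, image path, and quality words are my own placeholders, not what PSAI actually uses:

```python
# Sketch: ask CLIP how strongly an image correlates with some quality words.
# Checkpoint, image path, and prompts are illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("pony.png")  # hypothetical local image
texts = ["masterpiece", "best quality", "low quality"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image[0][j] is the similarity between the image and texts[j];
# softmax turns the similarities into a rough "which word fits best" score.
print(outputs.logits_per_image.softmax(dim=-1))
```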

Enter data labeling hell

In order to implement our plan, we still need a lot of good images (but also many not-so-good ones, and some very bad ones). How can we get some? Well, for one, we can look at the various scores/ranks assigned to images on popular boorus. At this point you may say - "Hey, wait a minute. You already have the scores! Just use them to pick good images!" - and you would be partially right. Some models (including early Pony Diffusion ones) used the score metadata. Unfortunately, using scores introduces two issues - users rate images based on both style and content, and while the two are generally correlated, there are biases like NSFW content being more popular, or specific characters getting preferential treatment independently of the style; these scores are also affected by the age of the image and do not match between different sources of metadata.

So, at least we can use the scores to pick a decent distribution of images; now let's go over them and rank them in terms of quality. I personally decided to do a 1 to 5 score. Still, two questions remain - how many images do we need, and who will rank them? We do need a lot of images, as we want a decent number of images of each "type": some 3d, some sketches, some semi-realistic, etc... Miss a style and the model will either not learn how to judge it or will learn it wrong. In the case of V6 this number was ~20k images. Now, we need someone who can look at the images and use their art critique skills to judge each one on the scale we invented. And who is that impartial person, unbiased and neutral, able to make decisions or judgments based on objective criteria rather than personal feelings, interests, or prejudices? It's me, obviously.

So, after spending weeks in the data processing cave methodically ranking each image, I was able to generate our aesthetic dataset. We can now train a new model that takes CLIP's image representation (which we call an embedding) and a human rating, and learns from them how to rank new images. We can then run this model on the embedding of each and every picture we encounter and get a 1 to 5 rank (which is actually now a 0 to 1 rank, as computers like this range more). This solves two big issues: we can use the new model both to select the set of images to train the new model on and to annotate the captions with special tags. So the best images get a caption like score_9, a cute pony, and slightly less good images score_8, maybe not so cute pony.
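To make that concrete, here is a minimal sketch of what such an aesthetic ranker could look like: a tiny network that maps a CLIP image embedding plus human 1-5 ratings to a 0-1 score. The embedding size, layer sizes, and rescaling are my own assumptions, not PSAI's actual model:

```python
# Sketch of an aesthetic head: CLIP image embedding in, 0-1 quality score out.
# All sizes and details are assumptions for illustration.
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    def __init__(self, embed_dim: int = 512):  # 512 matches CLIP ViT-B/32
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # squashes output into the 0-1 range mentioned above
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(clip_embedding)

def rating_to_target(rating: int) -> float:
    """Rescale a human 1-5 rating to a 0-1 training target."""
    return (rating - 1) / 4.0

head = AestheticHead()
fake_embedding = torch.randn(1, 512)  # stand-in for a real CLIP embedding
print(head(fake_embedding))           # predicted quality in (0, 1)
```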

It's Training time

We now have annotated data and can finally train the actual Pony Diffusion. We keep showing the model images along with captions containing the score tags, so it also learns which score tags correspond to which images, giving us better versions of "masterpiece". But wait, it turned out I messed up a bit! What I described above is how PD V5.X used to do things; in V6 I also wanted to be able to say "hey, give me anything 80% good and up". But the score_8 tag alone would only give us images in the 80% to 90% range. Perhaps using both score_8 and score_9 would work, but I wanted to verify that, so I changed the labels from a simple score_9 to something more verbose like score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up, and score_8 to score_8, score_7_up, score_6_up, score_5_up, score_4_up. In reality, I exposed myself to a variation of the Clever Hans effect, where the model learned that the whole long string correlates with the "good looking" images, instead of the separate parts of it. Unfortunately, by the time I realized it we were way past the midpoint of training, so we just rolled with it (I did try to use shorter tags after the discovery, but due to the way we train it didn't have as strong an effect). So, to summarize - we used a model trained on human preferences to label all the data with special tags, and then trained a text-to-image model on these labels, allowing us to ask the model for "good" images via these tags.
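The cumulative labeling scheme is easy to express in code. A sketch, with the bucket thresholds being my own assumption (tenths of the 0-1 aesthetic rank):

```python
# Sketch of the verbose cumulative tags: an image in the score_N bucket gets
# score_N plus every score_M_up tag for 4 <= M < N. Thresholds are assumed.
def score_tags(rank: float) -> str:
    bucket = min(9, max(4, int(rank * 10)))  # e.g. 0.83 -> bucket 8
    tags = [f"score_{bucket}"]
    tags += [f"score_{m}_up" for m in range(bucket - 1, 3, -1)]
    return ", ".join(tags)

print(score_tags(0.95))
# score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up
print(score_tags(0.83))
# score_8, score_7_up, score_6_up, score_5_up, score_4_up
```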

Do I need to care

Maybe, in some cases. If you use the Discord bot, you will see the score_9 tag added automatically. This is done so users don't need to care and still get a nice-looking image as soon as they start using the bot. You may not want this in some rare cases, and you can always add the expert parameter with a value of True to stop the bot from messing with your input. Both the bot and the website are aware that V6 needs the long string, but as many users of this Discord are already used to the single tag from the V5 era, the bot does the conversion under the hood. But if you want to take an input from the site or bot and plug it into an application like Automatic1111 or ComfyUI, you need to do the translation yourself: just replace score_9 with score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up and you should get the same output (given that you also matched the seed and other parameters).
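If you do the swap programmatically, it is a one-liner. A trivial sketch:

```python
# Replace the bot's score_9 shorthand with the full string V6 was trained on
# before pasting the prompt into Automatic1111 or ComfyUI.
V6_SCORE_STRING = ("score_9, score_8_up, score_7_up, "
                   "score_6_up, score_5_up, score_4_up")

def translate_prompt(prompt: str) -> str:
    return prompt.replace("score_9", V6_SCORE_STRING, 1)

print(translate_prompt("score_9, a cute pony"))
# score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up, a cute pony
```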

Some useful prompting tools and tips

The tags below will draw (or rather, filter) your prompt and image toward those specific slices of the dataset, OR you can put them in the negative. For instance, if prompting "pink hair" gives a pony or Pinkie Pie, or "bloom" gives Apple Bloom when you don't want it, put "source_pony" in the negative. Likewise, if you want Loona from Helluva Boss but she comes out as a human, put "source_furry" in the positive to force it out.

source_pony

source_furry

source_anime

source_cartoon

rating_safe

rating_questionable

rating_explicit

Score prompts: you can use them all, or select which groups of the scored dataset to use. Fewer tags means a tighter dataset, which could yield better or worse results.

score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up

score_9, score_8_up, score_7_up, score_6_up

score_9,

If 2girls/3girls/4girls isn't working, add a space: "4 girls"

team_rocket_uniform (puts an extremely accurate "R" on the chest, plus has many characteristics of Team Rocket)

source, rating, and score tags will constrain the images used to those sources or ratings, and may help in drawing out a character, such as when you know it's from an anime or a cartoon.

You can think of the score tags as selecting the part of the dataset that includes images with that rating and up, i.e. if you choose a lower score you will only pull from those lower-rated images. You can also put some of the score tags in the negative! An example: "score_6, score_5, score_4, chromatic aberration, artifacts, ugly, bad image,"
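Putting these together, an illustrative positive/negative pair (my own example, the subject tags are placeholders) could look like:

positive: score_9, score_8_up, score_7_up, score_6_up, source_anime, rating_safe, 1girl, silver hair, smiling

negative: score_6, score_5, score_4, source_pony, source_cartoon, chromatic aberration, artifacts, ugly, bad image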

Tricks with the model for more anime style (excerpt from https://rentry.co/ponyxl_loras_n_stuff)

There's some evidence that the model has a bit of a bias towards western-style art, particularly when using the score tags (it is made by furries, after all). It might be beneficial to tag your image sets with the score tags; the laziest way would be to tag every image with score_9, source_anime. This may affect the overall "quality" of the images the LoRA generates, since some of the model's knowledge about "high quality" might be overwritten, but the output will look more like the artist's style.

You can also nudge generations more towards an anime style by using "source_cartoon, source_furry, source_pony, sketch, painting, monochrome" in the negative prompt when generating images. In my experience, artists with more subtle art styles tend to have more success with this. The bias is a bit less evident at the lower scores, so if a LoRA has its images tagged with score_9, a prompt like "source_anime, score_9, score_6_up, score_5_up, score_4_up" might get better results. Unfortunately, this also makes your LoRA harder to use; you'd have to tell people to use it this way.

Other tricks tried were including all the score tags on every image in the LoRA's training set, which didn't have much success, and using only the source_anime tag, which also didn't seem to influence the LoRA's effectiveness much. I haven't tried actually using the score tags as intended, since that takes more effort than I'm willing to put in for a LoRA for the time being, but that may net the best results.
