
Evaluating Anime Models Systematically - Basics

Oct 17, 2023

hf mirror for full-sized images and history

I was trying to refine my character models when I realized that my model-making process is really inefficient. It typically goes like this: tweak some configs or data, try some random prompts, see if they look okay. It would be helpful to establish a well-defined procedure. And to evaluate fine-tuned models, it's essential to first know and quantify how the base models perform as a baseline. So here I am, evaluating base models.

I collected 1000 random prompts from Danbooru posts from 2021-2022 with the query chartags:0 -is:child -rating:e,q order:random score:>=10 filetype:jpg,png,webp ratio:0.45..2.1 and generated 1000 640x640 images with them for each of 3 widely-used anime models: animefull-latest, Counterfeit-V3.0, MeinaMix_V11.

A model can be evaluated over a number of aspects: fidelity, text-image alignment, aesthetics, diversity. Let's go through them one by one.


Fidelity

Generated images should be indistinguishable from real ones. They should make sense and not contain obvious errors such as extra limbs, mutated fingers, glitches or random blobs. In the literature, it's common to use metrics based on distribution distance, such as FID and IS. I calculated the KID score of the 3 sets of images against the 1000 real images.

model      	        KID (lower better)
animefull-latest	0.01192
Counterfeit-V3.0	0.01807
MeinaMix_V11    	0.01345

It seems that KID does not align with human evaluation, which would generally rate animefull-latest as the worst of the three. This is somewhat expected: models with a strong style have an image-feature distribution that differs from that of random real images.
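For reference, KID is the unbiased MMD² estimate between Inception features of the generated and real image sets, under the polynomial kernel k(x, y) = (x·y/d + 1)³. Below is a minimal sketch of the estimator, assuming feature extraction (Inception pooled activations) has already been done; names and shapes are illustrative, and implementations such as torchmetrics additionally average this over random subsets:

```python
import numpy as np

def polynomial_kernel(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # the standard KID kernel: (x.y / d + 1)^3
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(feats_fake: np.ndarray, feats_real: np.ndarray) -> float:
    """Unbiased MMD^2 estimate between two (n, d) feature matrices."""
    k_xx = polynomial_kernel(feats_fake, feats_fake)
    k_yy = polynomial_kernel(feats_real, feats_real)
    k_xy = polynomial_kernel(feats_fake, feats_real)
    m, n = len(feats_fake), len(feats_real)
    # exclude the diagonal in the within-set terms (unbiased estimator)
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_xy = 2.0 * k_xy.mean()
    return float(term_xx + term_yy - term_xy)
```

Two feature sets drawn from the same distribution should give a KID near zero, while a shifted distribution should give a clearly positive value.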

I also tried multimodal LLMs, including GPT-4V and LLaVA, and unfortunately found them quite useless. GPT-4V is supposedly SOTA, but it is clearly poor at spotting generation errors.



So currently I can't find a process that computes a fidelity score for anime models. I'll have to wait for someone to train a specialized model for now.

Text-Image Alignment

Generated images should not contradict the text prompts. A popular metric is the CLIP score, which is the cosine similarity of the projected CLIP embeddings. There's also PickScore_v1, which is fine-tuned on human preference data. Neither is well-suited for anime models, because Booru tagging is very different from natural-language captions.
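For completeness, the CLIP score reduces to a cosine similarity once the embeddings are in hand. A sketch, assuming `image_emb` and `text_emb` are the projected CLIP embeddings; the scaling convention (multiply by 100, floor at 0) follows common implementations such as torchmetrics:

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Scaled cosine similarity of projected CLIP embeddings."""
    cos = image_emb @ text_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    # common convention: scale by 100 and clip negative similarities to 0
    return max(100.0 * float(cos), 0.0)
```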

Models using booru-tag prompts can instead be evaluated with a tagger. Specifically, I used wd-v1-4-moat-tagger-v2 with a threshold of 0.35. A tag accuracy score can be defined as #{prompted tags correctly reproduced}/#{prompted tags}, macro-averaged over all images. Here are the scores:

model            	tag accuracy (higher better)
animefull-latest	0.464328
Counterfeit-V3.0	0.434574
MeinaMix_V11	        0.375389

It can be seen that fine-tunes or merges may produce nicer images but at the cost of controllability.
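The metric above is simple enough to sketch directly. This assumes the tagger's probabilities have already been thresholded (at 0.35) into a set of predicted tags per image; the function names are illustrative:

```python
def tag_accuracy(prompted: set[str], predicted: set[str]) -> float:
    """Fraction of prompted tags the tagger recovers from the image."""
    if not prompted:
        return 0.0
    return len(prompted & predicted) / len(prompted)

def macro_tag_accuracy(samples: list[tuple[set[str], set[str]]]) -> float:
    """Macro-average: score each image first, then average over images."""
    return sum(tag_accuracy(p, q) for p, q in samples) / len(samples)
```

For example, an image prompted with {"1girl", "smile"} where the tagger only finds "1girl" scores 0.5.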


Aesthetics

Images should be pretty. While this is generally subjective, there are models that give an aesthetic score, either averaged from many people's preferences or personalized. There are CLIP-based models (aesthetic-predictor, improved-aesthetic-predictor) and some custom models (anime-aesthetic, cafe_aesthetic).

I tested averaged improved-aesthetic-predictor and anime-aesthetic:

model            	improved-aesthetic-predictor (higher better)	anime-aesthetic (higher better)
animefull-latest	6.124954	0.639767
Counterfeit-V3.0	6.359464	0.789190
MeinaMix_V11	        6.474662	0.829989

The two scores appear to agree.

Interestingly, GPT-4V does a reasonable job at this.


Diversity

Even with the same prompt, generated images should not be repetitive across different random seeds. The Dreambooth paper defines a DIV score for this, which calculates image similarity with LPIPS. Since each prompt in this set was generated only once, the metric is not applicable here, and I will leave it to a future update.
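For reference, the DIV computation itself is just an average over pairs of same-prompt generations. A sketch with the LPIPS network left as a pluggable distance function (e.g. from the `lpips` package), since that is the heavyweight part:

```python
from itertools import combinations

def div_score(images, distance) -> float:
    """Average pairwise distance among one prompt's generations.

    `images` is a list of generations from the same prompt with
    different seeds; `distance` would be an LPIPS callable in practice.
    """
    pairs = list(combinations(images, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)
```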


Conclusion

It's possible to programmatically generate some numbers given a base model, and we can use them as a proxy for the model's overall performance.

Miscellaneous notes

I used diffusers, and 13 images from animefull-latest came out solid black for unknown reasons, even with the safety checker disabled and a single-precision VAE. These images and their counterparts were excluded from the metric calculations.
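Filtering out such failed generations can be done with a trivial check. A sketch, assuming images are loaded as uint8 HWC arrays; the tolerance value is an arbitrary choice of mine:

```python
import numpy as np

def is_black(img: np.ndarray, tol: int = 3) -> bool:
    """True when every pixel value is near zero (a solid black image)."""
    return int(img.max()) <= tol
```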

The images and prompts can be found here.

It's possible that some models perform better with special configs, but for simplicity I kept them the same.

The code for image generation and metrics is quite messy, so I will not upload it right now, but feel free to ask questions or give suggestions.

I will probably create a fidelity model eventually if no one else does, but it will take a while.

Prompts with more tags have lower tag accuracy.

The effect of tag position is measurable, albeit less pronounced. The trend at positions 20-25 may be due to the 77-token limit wraparound.

The next post will be about evaluating character models.