(This article is the English version of https://civitai.com/articles/1536/ghostreviewaickptenglish-version-is-translating , translated with ChatGPT, so it may contain some mistakes, but the majority is accurate. If you can read Chinese, please read the original instead. Thank you.)
Hello everyone, I am GhostInShell, the author of GhostMix, which ranks second in the All Time Highest Rated category on Civitai, the global AI painting model website. In my previous article, I mainly discussed my views on the development direction of ckpt (checkpoint) models. In short, ckpt should address the "doable" problems such as compatibility, while solutions like LoRA and ControlNet should address the "right" problems. The correct development direction for ckpt is fewer ckpts and more LoRAs, meaning that Stable Diffusion should move toward a unified large model.
After the article was published, many users agreed with my views and hoped for a tool to evaluate Stable Diffusion ckpts and guide the direction of model development. After some research, I found no scoring platform on the market comparable to LMSYS for LLMs. So, despite being a finance student, I sat down to write code and, within a month, developed GhostReview, the world's first evaluation framework for AI painting ckpts, consisting of 1,128 lines of code. (PS: This is purely my own evaluation framework. I am not a CS major, so my knowledge and understanding are limited. If you don't like my approach, feel free to dismiss it and create your own; I do not wish to engage in disputes. Let the code speak, thank you.)
Because the model structure itself cannot be analyzed directly, the evaluation works by analyzing the images the model generates. GhostReview's core idea is similar to a factor model in quantitative finance: both are about finding commonalities in noisy data. Take the simplest single-factor model, CAPM, as an example: a stock's return is decomposed into Beta, the systematic component, and Alpha, the idiosyncratic component. By analogy, an image produced by a ckpt can also be split into two parts: a systematic effect (the model's influence) and an idiosyncratic effect (the influence of the random seed). Since our goal is to evaluate the model, i.e., its systematic effect, the core processing approach is the same as in a factor model: take averages so that the idiosyncratic effects cancel out.
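The averaging idea above can be sketched in a few lines. This is a toy simulation, not GhostReview's actual data: the additive score model and all numbers are illustrative assumptions, but it shows how averaging over many seeds cancels the per-seed noise and leaves the model's systematic contribution.

```python
import random
import statistics

def simulate_scores(model_effect, n_seeds=1000, noise=1.0, rng=None):
    """Toy score model: score = model effect + seed-level noise.

    Analogous to CAPM's decomposition into Beta (systematic) and
    Alpha (idiosyncratic): averaging over many random seeds cancels
    the per-seed noise and estimates the systematic part.
    """
    rng = rng or random.Random(0)
    return [model_effect + rng.gauss(0.0, noise) for _ in range(n_seeds)]

# Two hypothetical checkpoints with different "true" systematic quality.
scores_a = simulate_scores(6.0)
scores_b = simulate_scores(5.5)

# Individual scores overlap heavily, but the means separate the models.
print(statistics.mean(scores_a))  # close to 6.0
print(statistics.mean(scores_b))  # close to 5.5
```

The larger the number of seeds per prompt, the tighter the estimate of the systematic effect, which is why GhostReview generates dozens of images per prompt rather than one.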
Regarding the evaluation indicators, as I mentioned in my previous article on the ckpt evaluation system, I believe the standards for evaluating a ckpt are: 1. model compatibility (styles, LoRA, Prompts, etc.); 2. the quality of the generated images; 3. the rate of good images. GhostReview V1.0 is developed on top of this evaluation system and consists of three parts: 1. analysis of ckpt image quality and generalizability (Prompt compatibility, image quality, good image rate); 2. analysis of ckpt style compatibility; 3. analysis of ckpt compatibility with LoRA.
1. Analysis of Image Quality and Generalizability of Checkpoints (Prompt Compatibility, Picture Quality, Good Image Rate)
The first part analyzes the image quality and generalizability of a checkpoint (Prompt compatibility, picture quality, good image rate). As mentioned above, the ckpt itself cannot be tested directly. The core of the test is to score all generated images with an aesthetics scorer while keeping the random seed and every other setting fixed. The mean and standard deviation of those scores then quantify the model's picture quality and good image rate under the given Prompts.
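Concretely, once every generated image has an aesthetic score, the mean captures average picture quality and the standard deviation captures consistency, i.e., a proxy for the good-image rate. A minimal sketch with made-up scores:

```python
import statistics

# Hypothetical aesthetic scores for one checkpoint under one prompt
# (GhostReview scores 32 images per prompt in practice).
scores = [6.1, 5.8, 6.3, 4.9, 6.0, 5.7, 6.2, 5.5]

quality = statistics.mean(scores)   # higher = better average quality
spread = statistics.pstdev(scores)  # lower = more consistently good images

print(f"mean aesthetic score: {quality:.2f}")
print(f"std (consistency):    {spread:.2f}")
```

Two checkpoints can share the same mean while one has a much larger spread; the one with the smaller spread wastes fewer generations on bad images.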
For the aesthetics scorer, we used https://github.com/christophschuhmann/improved-aesthetic-predictor. The improved-aesthetic-predictor (LAION-Aesthetics V2) was developed by Christoph Schuhmann, the author of LAION-5B, based on LAION-Aesthetics V1. It was trained on 176,000 SAC (Simulacra Aesthetic Captions) image-rating pairs, 15,000 LAION-Logos image-rating pairs, and most of the 250,000 AVA (Aesthetic Visual Analysis) image-rating pairs. Since Stability AI's Stable Diffusion models are also trained based on LAION-Aesthetics, the improved-aesthetic-predictor has relatively high reliability and accuracy, and was therefore chosen as the aesthetics scorer. With aesthetic scores in hand, the model's good image rate can be measured numerically through the standard deviation of a large number of aesthetic scores.
As for Prompt compatibility, it measures whether the images generated by the model faithfully reflect the input Prompts. For this, CLIPScore is computed using OpenAI's CLIP (https://github.com/openai/CLIP).
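At its core, CLIPScore is a rescaled, clamped cosine similarity between the CLIP embedding of the image and the CLIP embedding of the prompt. The sketch below assumes the embeddings have already been computed by a CLIP model; the tiny 4-dimensional vectors are hypothetical stand-ins (real CLIP ViT-L/14 embeddings are 768-dimensional), and the 2.5 rescaling factor follows the original CLIPScore paper:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style metric: w * max(cos(image, text), 0)."""
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)

# Hypothetical embeddings for one image and two candidate prompts.
image_emb = [0.2, 0.1, 0.7, 0.3]
faithful_prompt = [0.25, 0.05, 0.65, 0.35]  # close to the image
unrelated_prompt = [-0.6, 0.8, -0.1, 0.0]   # far from the image

print(clip_score(image_emb, faithful_prompt))   # high: prompt is reflected
print(clip_score(image_emb, unrelated_prompt))  # 0.0: similarity clamped
```

A checkpoint that drifts away from its Prompts (for example, because heavy LoRA merging has narrowed its concept coverage) will show a lower average CLIPScore across the test Prompts.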
For the Prompts themselves, GhostReview used the 25 most common non-political, non-meme, non-pornographic (including softcore) Prompts from image reactions on Civitai. To ensure coverage of realistic, anime, and artistic styles, 5 stylized Prompts were added, for a total of 30 Prompts (none using LoRA). Each checkpoint generates 32 images per Prompt (batch size 4, 8 iterations), so a single checkpoint produces 960 Highres-fix images in this first part.
Figure: distribution of LAION aesthetic scores for the 7 models (6,720 images).
2. Analysis of Checkpoint Style Compatibility (Artistic Style Compatibility)
The second part analyzes the checkpoint's compatibility with different artistic styles. The test inputs style-related Prompts to generate a large number of stylized images, which are then compared with a large set of existing style images to yield a numerical result for style compatibility. The method follows the StyleLoss calculation from the paper "A Neural Algorithm of Artistic Style" (Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). arXiv preprint arXiv:1508.06576): extract the feature maps of the generated and target images through VGG19, compute the Gram matrix of each layer's feature map, and finally compute the StyleLoss.
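The Gram-matrix computation at the heart of StyleLoss can be sketched without a deep-learning framework. Here a "feature map" is simply a C x N list (C channels, each flattened to N values); in the real pipeline these come from VGG19 layers, the tiny arrays below are hypothetical, and normalization conventions vary between implementations:

```python
def gram_matrix(feature_map):
    """Gram matrix G[i][j] = dot product of channels i and j.

    The Gram matrix discards spatial layout and keeps only channel
    correlations, which is what makes it a descriptor of style
    rather than content.
    """
    c = len(feature_map)
    n = len(feature_map[0])
    return [[sum(fi * fj for fi, fj in zip(feature_map[i], feature_map[j])) / (c * n)
             for j in range(c)] for i in range(c)]

def style_loss(fmap_a, fmap_b):
    """Mean squared difference between the Gram matrices of two feature maps."""
    ga, gb = gram_matrix(fmap_a), gram_matrix(fmap_b)
    c = len(ga)
    return sum((ga[i][j] - gb[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)

# Two tiny hypothetical 2-channel feature maps.
generated = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
target    = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
different = [[0.0, 3.0, 1.0], [2.0, 0.0, 2.0]]

print(style_loss(generated, target))     # 0.0: identical style statistics
print(style_loss(generated, different))  # > 0: styles differ
```

Summing this loss over several VGG19 layers, as in Gatys et al., gives a single number: the lower it is, the closer the generated images are to the target style.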
For the stylized Prompts, we referenced the style presets of SDXL and removed those that existing checkpoints cannot directly realize, such as PaperCut. In the end, nine styles were chosen: Anime, Manga, Photographic, Isometric, Low_Poly, Line_Art, 3D_Model, Pixel_Art, and Watercolor.
Figure: sample images of the nine art styles generated with DreamShaper.
3. Checkpoint Compatibility Analysis with LoRA
The third part tests the checkpoint's compatibility with LoRA. As in the second part, images are generated (20 images per LoRA, batch size 4, 5 iterations) and StyleLoss is computed against target images to quantify the checkpoint's LoRA compatibility.
In selecting Prompts and LoRAs: because a character LoRA renders its character differently in every checkpoint, stylized LoRAs were chosen for the compatibility test instead. The selection criterion is the top 16 stylized LoRAs in Civitai's All Time Highest Rated ranking, with each LoRA's header image and its Prompts used as the target image and Prompts. In detail: when a header image uses multiple LoRAs, all of the corresponding LoRAs are included (for example, the MoXin header image); when a Prompt lacks a LoRA tag, one is added with a default weight of 0.8 (for example, the 3D-rendering-style header image); and when a header image references an outdated version of a LoRA, it is replaced with the new version (for example, the Gacha splash header image). Some LoRA header images were created with checkpoints that are themselves under test, such as REV and Majic Realistic, so a GhostLoRALoss_NoTM score variant excludes those LoRAs when scoring these models.
In terms of data processing, the first, second, and third parts each yield a score for every image of every Prompt for every model. To minimize score differences between Prompts, the scores of each Prompt are standardized. After this step, every image has a standardized score, and the average is taken per model. The first phase of testing covered 7 models, yielding between 140 and 224 images per Prompt; as more models are tested, the database grows and the scores should stabilize. The data already shows some interesting phenomena. For example, as I have said before, the more LoRAs a model merges in, the worse its LoRA compatibility and the more its Prompt accuracy (CLIPScore) suffers. The charts also explain, with data, why DreamShaper, GhostMix, and REV are the top three in Civitai's All Time Highest Rated...
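The per-prompt standardization step can be sketched as a z-score within each prompt's score pool, after which per-model averages become comparable across prompts even when one prompt is intrinsically harder than another. All scores below are made up for illustration:

```python
import statistics
from collections import defaultdict

# Hypothetical raw scores: (model, prompt) -> per-image scores.
raw = {
    ("ModelA", "p1"): [6.0, 6.2, 5.8],
    ("ModelB", "p1"): [5.0, 5.2, 4.8],
    ("ModelA", "p2"): [2.6, 2.5, 2.7],  # p2 simply scores lower for everyone
    ("ModelB", "p2"): [2.0, 2.1, 1.9],
}

# 1) Pool all scores of each prompt and compute that prompt's mean and std.
pools = defaultdict(list)
for (_, prompt), scores in raw.items():
    pools[prompt].extend(scores)
stats = {p: (statistics.mean(s), statistics.pstdev(s)) for p, s in pools.items()}

# 2) Standardize every image score within its prompt, then average per model.
model_scores = defaultdict(list)
for (model, prompt), scores in raw.items():
    mu, sigma = stats[prompt]
    model_scores[model].extend((x - mu) / sigma for x in scores)
ranking = {m: statistics.mean(z) for m, z in model_scores.items()}
print(ranking)  # ModelA clearly above ModelB on the standardized scale
```

Without standardization, a model's average would be dominated by whichever prompts happen to produce high raw scores; after it, each prompt contributes on an equal footing.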
Table: the first 7 models evaluated by GhostReview. Both StyleLoss and LoRALoss are "lower is better".
The full GhostReview V1.0 pipeline produces 1,568 images per tested model. Because every image uses Highres fix and must then be scored with LAION-Aesthetics and CLIP, a single model takes hours to run even on an RTX 4090, an unacceptable amount of time for individual users. So the project will not be open source for now.
Lastly: LLMs now have LMSYS, but for SD, and even MJ, almost no one is working on a model scoring system... GhostReview is just scratching the surface, and I hope more capable people will join in scoring AI painting models. If I am fortunate enough to become part of a future "LMSYS for SD," I would feel truly fulfilled. So if any reliable team or platform wants to build something similar to LMSYS, I am very open to collaboration. Only by joining hands can we accomplish this.