Hi Everyone,
I'm still playing with aesthetic scores and beauty sliders. I tried something interesting (but it's a proof of concept, you need to write some Python).
Leco score v1: this article
Leco score v1 sample: https://civitai.com/models/317942
Leco score v2: https://civitai.com/articles/5416
Leco score v2 sample: https://civitai.com/models/471936
Leco score v2 training guide: https://civitai.com/articles/5422
Current and new training approaches
If you look at the main training loops of Lora and LECO, they look something like this:
For Lora:
For each epoch
For each image batch
Add some noise to the image (between 1 and 26 steps backward)
Run the stable diffusion checkpoint of choice + Lora
The result should be the image with less noise (with 0 to 25 steps backward noise)
Adjust Lora weights so this is the case
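In rough PyTorch pseudocode, a minimal sketch of that step (not the real training code: a toy conv stands in for the SD UNet + Lora, and the noise schedule is a generic DDPM one):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 26                                    # number of diffusion steps in the loop above
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

unet = nn.Conv2d(4, 4, 3, padding=1)      # toy stand-in for the SD UNet + Lora
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)

def lora_training_step(latents):          # latents: (B, 4, 64, 64) encoded images
    t = torch.randint(0, T, (latents.shape[0],))         # 1 to 26 steps backward
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise  # add some noise
    # predicting the added noise is equivalent to predicting the less-noisy image
    loss = F.mse_loss(unet(noisy), noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()  # adjust Lora weights
    return loss.item()

lora_training_step(torch.randn(2, 4, 64, 64))
```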
For Sliders:
For each epoch
For each pair of positive_img, negative_img
Add some noise to both images (between 1 and 26 steps backward)
Run the stable diffusion checkpoint of choice + Lora
The result with Lora weight+1 should be the positive_img
The result with Lora weight-1 should be the negative_img
(with 0 to 25 steps backward noise)
Adjust Lora weights so this is the case
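A similar sketch for the slider objective (again toy stand-ins: `base` plays the frozen checkpoint, `lora` the trainable delta, and the noising step is deliberately crude):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

base = nn.Conv2d(4, 4, 3, padding=1).requires_grad_(False)  # frozen checkpoint
lora = nn.Conv2d(4, 4, 3, padding=1)                        # trainable Lora delta
optimizer = torch.optim.AdamW(lora.parameters(), lr=1e-4)

def unet(x, lora_weight):
    return base(x) + lora_weight * lora(x)

def slider_step(positive_img, negative_img):
    noise = torch.randn_like(positive_img)
    noisy_pos = 0.7 * positive_img + 0.3 * noise  # crude stand-in for proper noising
    noisy_neg = 0.7 * negative_img + 0.3 * noise
    # at weight +1 the model should denoise towards positive_img,
    # at weight -1 towards negative_img
    loss = (F.mse_loss(unet(noisy_pos, +1.0), noise)
            + F.mse_loss(unet(noisy_neg, -1.0), noise))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

slider_step(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
```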
For LECO:
For each batch:
Partially create an image by diffusion, between 0 and 25 steps out of 26
For the next step, apply the model with target prompt and normal prompt
The Lora needs to make the normal prompt more like the target prompt
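Sketched the same way (the prompt conditioning is reduced to a plain embedding vector, which is not how the real UNet is conditioned, but it shows the objective):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.img = nn.Conv2d(4, 4, 3, padding=1)
        self.txt = nn.Linear(8, 4)
    def forward(self, x, cond):
        return self.img(x) + self.txt(cond).view(-1, 4, 1, 1)

frozen = ToyUNet().requires_grad_(False)   # original checkpoint
student = ToyUNet()                        # checkpoint + Lora (trainable)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def leco_step(c_normal, c_target):
    # partially create an image by diffusion (random latents stand in here
    # for the partially denoised state after some of the 26 steps)
    latents = torch.randn(1, 4, 64, 64)
    with torch.no_grad():
        target_pred = frozen(latents, c_target)  # what the target prompt would do
    student_pred = student(latents, c_normal)    # Lora'd model on the normal prompt
    loss = F.mse_loss(student_pred, target_pred) # pull normal towards target
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

leco_step(torch.randn(1, 8), torch.randn(1, 8))
```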
New method, LECO_score:
For each batch:
Partially create an image by diffusion, between 0 and 25 steps out of 26
For the next step, apply the model with target prompt
*** make sure the next step of image generation maximises the image score
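The starred line is the whole change. A sketch of it, with a toy score model standing in for the latent-space network described below:

```python
import torch
import torch.nn as nn

unet = nn.Conv2d(4, 4, 3, padding=1)       # toy stand-in for SD UNet + Lora
score_model = nn.Sequential(nn.Flatten(),  # latent -> single float score
                            nn.Linear(4 * 64 * 64, 1)).requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)

def leco_score_step():
    latents = torch.randn(1, 4, 64, 64)          # partially generated image (stand-in)
    next_latents = latents - 0.1 * unet(latents) # one crude denoising step
    score = score_model(next_latents).mean()
    loss = -score                                # maximise score = minimise -score
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return score.item()

leco_score_step()
```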
Making an image score
Making an image score is nothing new; there are already a few commonly used image scores:
aesthetic scores for an image
NSFW scores for an image
A common approach is to take an existing, already-trained image network (ResNet, InceptionNet, etc.) that already knows a lot about images, and fine-tune it on a dataset of (image, aesthetic score) or (image, NSFW score) pairs.
Making this work for LECO_score is slightly trickier because everything is in latent space, not in real image space. A few solutions:
pre-train a base model in latent space (Resnet, Inceptionnet, ...)
pre-train a new generation base model (vision transformer type) in latent space
re-use the stable diffusion model itself as a base model in latent space, by removing its CLIP/text parts
fine-tune an existing model in the usual image space, and in the last step run the latents through the VAE decoder and then the score model in real image space, which could be costly in memory (sketched below)
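For that last option, the decode-then-score step would look something like this (assuming the diffusers library; the VAE checkpoint name is illustrative and `score_model` is a placeholder):

```python
import torch
import torch.nn as nn
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").requires_grad_(False)
score_model = nn.Sequential(nn.Flatten(),        # image -> single float score
                            nn.Linear(3 * 512 * 512, 1))

latents = torch.randn(1, 4, 64, 64, requires_grad=True)
image = vae.decode(latents / vae.config.scaling_factor).sample  # (1, 3, 512, 512)
score = score_model(image).mean()
# gradients flow back through the whole decoder to the latents,
# which is exactly why this option is costly in memory
score.backward()
```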
I only tried the first two:
pre-training a vision transformer in latent space was a failure. The reduced 64*64*4 latent space gave a very small vision transformer (7M parameters), and raising the parameter count by adding attention elements significantly raised the memory requirements (the usual quadratic attention cost). Fine-tuning a modified stable diffusion model might work better.
pre-training a ResNet (with the usual target of recognising image classes) actually worked. The image classification results are not very good (42% top-5 on AVA instead of the usual ~98% for a ResNet on its standard dataset) because of the reduced latent space and the fact that the latents are roughly normally distributed. I'm actually surprised this worked at all
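The adaptation needed for a latent-space ResNet is small: the stem has to accept 4 channels at 64*64 instead of 3 channels at 224*224. Something like this (an assumption about the surgery; the original code may differ):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=1000)
# latents have 4 channels and are small, so use a 3x3 stride-1 stem
model.conv1 = nn.Conv2d(4, 64, kernel_size=3, stride=1, padding=1, bias=False)
model.maxpool = nn.Identity()  # keep spatial resolution for 64x64 inputs

logits = model(torch.randn(2, 4, 64, 64))  # pre-train on class labels as usual
print(logits.shape)                        # torch.Size([2, 1000])
```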
With a pre-trained resnet, fine-tune on aesthetic score:
objective function is MSE(predicted_aesthetic_score, actual_aesthetic_score)
Then put the fine-tuned resnet in the modified LECO loop:
objective function is: train a Lora that maximises predicted_aesthetic_score (equivalently, minimises -predicted_aesthetic_score)
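Concretely, the fine-tune step could look like this (a sketch; the latent ResNet is the adapted torchvision model from above, and the data names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

latent_resnet = resnet18()
latent_resnet.conv1 = nn.Conv2d(4, 64, 3, stride=1, padding=1, bias=False)
latent_resnet.maxpool = nn.Identity()
latent_resnet.fc = nn.Linear(latent_resnet.fc.in_features, 1)  # regression head

optimizer = torch.optim.AdamW(latent_resnet.parameters(), lr=1e-4)

def finetune_step(latents, actual_aesthetic_score):
    predicted_aesthetic_score = latent_resnet(latents).squeeze(1)
    # MSE(predicted_aesthetic_score, actual_aesthetic_score)
    loss = F.mse_loss(predicted_aesthetic_score, actual_aesthetic_score)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

finetune_step(torch.randn(2, 4, 64, 64), torch.tensor([5.2, 7.1]))
```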
Weaknesses of the study, remarks:
No augmentations (e.g. horizontal flipping), non-standard regularization of the weights, non-square images aren't properly handled, and the code isn't super clean. This is all held together with tape and school glue; I'm surprised it works at all.
'score' here means a function from an image to a float, not the reverse-diffusion squared expectation that you see everywhere in diffusion model papers.
The input images for pre-training are not in the target set: I used AVA images, which are photos. Pre-training the ResNet should probably have been done on a mixed dataset (33% AVA, 33% ResNet classification dataset, 33% Danbooru dataset with multi-label classification on top tags)
ResNet is no longer state of the art for image classification, so a vision transformer would probably be better (except that standard vision transformers come out very small in the 64*64*4 space)
the aesthetic score dataset (AVA) is for photos, but the resulting Lora was trained on AnyLora, which is an anime training checkpoint. This gives an interesting result (anime with elegant poses for the characters, soft shadows, vibrant colors, or arty black and white), but it is not for general usage (making anything look good). There are more general aesthetic score datasets, and an aesthetic score dataset for anime should be created. (I think the final result is fun though.)
since we're training a prompt-to-image model, you're probably wondering where the prompt is. Well, it's in the LECO training files. A new image is partially generated from scratch from the list of prompts given to LECO, and at some step we ask the Lora to maximise the score (the predicted AVA aesthetic score in our case).
this somehow breaks the conditionality of the Lora on the prompt: since no prompt is used in scoring, you cannot get 'a Lora that makes images more beautiful when I write beautiful and uglier when I write ugly', only 'higher score with higher Lora weight'. I'm not sure whether this is a bug or a feature.
the prompts used to train the final Lora through LECO_score are the standard PartiPrompts, whereas Danbooru-style prompts are what's usually used with an anime model. This works because anime models understand standard prompts, but I'm not sure whether it is better or worse than using an anime prompt list.
while LLMs such as ChatGPT started with a score (a reward model) and evolved towards directly optimising on good text (direct preference optimisation), this goes the reverse route: image models started with direct optimisation (DreamBooth) and this is a score model
another way to train a Lora without prompts would be to use the CLIP trick. In the CLIP model, the CLIP text encoder transforms a text into a vector and the CLIP image encoder transforms an image into a vector; if the image matches the text, the two vectors align (in the cosine distance/dot product sense). This makes it possible to train a stable diffusion model without text, by replacing the CLIP text encoding of the prompt with the CLIP image encoding of the (similar) image, as in this paper: https://arxiv.org/abs/2403.14944
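For reference, the alignment the trick relies on, using the standard transformers CLIP API (the random tensor is just a placeholder for a real preprocessed image):

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a photo of a cat"], return_tensors="pt", padding=True)
text_vec = model.get_text_features(**tokens)                # text -> vector
image_vec = model.get_image_features(
    pixel_values=torch.randn(1, 3, 224, 224))               # image -> vector

# if the image matched the text, these two vectors would align:
cos = torch.nn.functional.cosine_similarity(text_vec, image_vec)
# the trick: feed image_vec to the diffusion model where text_vec would go
```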
Main contributions/takeaway points:
training a resnet in latent space somehow works. This isn't a complete surprise (stable diffusion itself runs in latent space).
it is possible, in some cases, to train a Lora from just a score on each image (no need for image tagging). This probably won't work for something very specific ('I want this character in this outfit drawn by this artist'), and it is computationally costly (fine-tune a ResNet on image scores, then run LECO).