(EDIT: Midjourney has now implemented personalized style, but based on choices/rankings instead of individual scores for each image)
Leco score v1: https://civitai.com/articles/4216
Leco score v1 sample: https://civitai.com/models/317942
Leco score v2: this
Leco score v2 sample: https://civitai.com/models/471936
Leco score v2 training guide: https://civitai.com/articles/5422
New features in v2:
the code is cleaner and uses the standard resnet implementation from timm
the training set is composed of images from AVA (photo), Danbooru (anime), Imagenet (realistic) and Wikiart (paintings), with approximately 250k images each
the criteria for the initial training of the resnet in latent space are 1D-EMD and classification on AVA, nsfw classification and tag classification on Danbooru, style classification and year regression on wikiart, and of course classification on imagenet
to mix all the criteria, the obvious solution would have been to sum them, but that raises the question of how to scale the different criteria so they all converge at the same time
instead, a Lion optimiser (scale-free, since it only takes the sign of the gradient/momentum) is put on each criterion, and each optimiser is stepped independently (see the sketch after this list)
at 'convergence', progress in one validation metric makes another validation metric worse
this method seems to be explainable in terms of meta-learning/MAML, as it amounts to finding a resnet that is equidistant (in number of fine-tuning steps needed) from solving AVA classification, Danbooru tag classification, wikiart style classification and imagenet classification
I can't find the paper, but information that is not present in a network when you start training it is difficult to teach to the network later (it's the paper where they train on half-images and the missing side never catches up during training), so hopefully this resnet will fine-tune easily
this was a pain to train: the latent space is normally distributed, resnets are hard to train, and meta-learning is tricky; doing all three on my 2060 was probably not a bright idea
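For illustration, here is a minimal sketch of the one-optimiser-per-criterion setup (only a few of the criteria are shown; the model name, head sizes and the lion-pytorch package are assumptions, not the actual training code):

import torch
import timm
from lion_pytorch import Lion  # pip install lion-pytorch

# shared trunk over 4-channel SD latents, plus one head per criterion
backbone = timm.create_model("resnet34", in_chans=4, num_classes=0)
heads = torch.nn.ModuleDict({
    "ava_bins": torch.nn.Linear(backbone.num_features, 10),      # 1D-EMD over the 10 AVA rating bins
    "booru_tags": torch.nn.Linear(backbone.num_features, 1000),  # tag classification
    "wikiart_style": torch.nn.Linear(backbone.num_features, 27), # style classification
    "imagenet_cls": torch.nn.Linear(backbone.num_features, 1000),
})

# one scale-free Lion optimiser per criterion; all of them share the trunk parameters
trunk_params = list(backbone.parameters())
optims = {name: Lion(trunk_params + list(head.parameters()), lr=1e-4)
          for name, head in heads.items()}

def step_criterion(name, latents, target, loss_fn):
    # compute one criterion's loss and step only that criterion's optimiser
    loss = loss_fn(heads[name](backbone(latents)), target)
    optims[name].zero_grad()
    loss.backward()
    optims[name].step()
    return loss.item()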
Rating images
10k highly rated images were downloaded from civitai
a very small flask website/app lets you rate those images on a scale from 1 to 5 stars (see the training guide)
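For reference, a minimal sketch of what such a rating endpoint can look like (file names and layout are illustrative, not the actual rate_images code):

import json
from pathlib import Path
from flask import Flask, request, send_from_directory

app = Flask(__name__)
IMAGE_DIR = Path("images")
RATINGS_FILE = Path("ratings.json")
ratings = json.loads(RATINGS_FILE.read_text()) if RATINGS_FILE.exists() else {}

@app.route("/image/<name>")
def image(name):
    # serve one of the downloaded civitai images
    return send_from_directory(IMAGE_DIR, name)

@app.route("/rate", methods=["POST"])
def rate():
    # store a 1-5 star rating for one image
    ratings[request.form["name"]] = int(request.form["stars"])
    RATINGS_FILE.write_text(json.dumps(ratings))
    return "ok"

if __name__ == "__main__":
    app.run(port=5000)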
Code files to do the initial training:
make_all_data.py takes the raw AVA/Imagenet/Wikiart/Booru files and turns them into latent tensors (see the sketch after this list)
divide_safetensors.py takes those tensor files, does some additional postprocessing, and divides them into smaller chunks so they fit into memory
train_resnet_lightning.py does the initial training
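As a rough idea of the image-to-latent step, a minimal sketch using the SD 1.x VAE from diffusers (resolution, model id and preprocessing are assumptions; the real make_all_data.py may differ):

import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # map pixels to [-1, 1] as the VAE expects
])

@torch.no_grad()
def image_to_latent(path):
    pixels = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    # 512x512 RGB -> 4x64x64 latent, scaled with the usual SD 1.x factor
    return vae.encode(pixels).latent_dist.sample() * 0.18215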
Code files for fine-tuning
the rate_images folder contains the flask mini-site
ratings_to_tensor.py takes the data from the rating mini-site and converts it into tensors
finetune_resnet.py finetunes the resnet to match the given ratings (can use the tensor output above or any file that has 'latents' and 'float score' data)
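The fine-tuning objective itself is a plain regression of the score onto the ratings; a minimal sketch (the 'latents'/'float score' keys follow the description above, everything else is illustrative):

import torch
import timm
from safetensors.torch import load_file

# stand-in for the pretrained latent-space resnet; in practice load its checkpoint here
resnet = timm.create_model("resnet34", in_chans=4, num_classes=1)

data = load_file("ratings.safetensors")  # e.g. the output of ratings_to_tensor.py
latents, scores = data["latents"], data["float score"]

optim = torch.optim.AdamW(resnet.parameters(), lr=1e-5)
for epoch in range(10):
    for i in range(0, latents.shape[0], 32):
        loss = torch.nn.functional.mse_loss(resnet(latents[i:i+32]).squeeze(-1), scores[i:i+32])
        optim.zero_grad()
        loss.backward()
        optim.step()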
Making a Lora
using the Leco codebase, the resnet is instantiated before the main loop and the loss function is replaced with '-score(image.sample)' to maximise the score (see the sketch after this list)
there is a training file with PartiPrompts (the one I usually use for training Leco), I'm still waiting for Google Research to make the Gecko evaluation prompts available (a set of prompts class-balanced in terms of required skills) https://arxiv.org/abs/2404.16820
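A minimal sketch of that loss swap (variable and function names are illustrative, not the actual Leco code):

import torch

def score_loss(score_model: torch.nn.Module, denoised_latents: torch.Tensor) -> torch.Tensor:
    # denoised_latents: (B, 4, 64, 64) output of the denoising step, still attached to the graph
    for p in score_model.parameters():
        p.requires_grad_(False)  # only the Lora weights should receive gradients
    # negated score: gradient descent on this loss pushes the Lora towards higher-scoring latents
    return -score_model(denoised_latents).mean()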
Other training methods/networks
it should be possible to train using the sd-scripts Lora loop by replacing 'loss = MSE(predicted_noise_t_t+1, actual_noise_t_t+1)' with 'loss = -score(latents)', but you will have to reconstruct the latents from the noisy latents and the noise prediction (see the sketch after this list). This should introduce some bias (the score is only improved along the given images) but make things a lot faster.
the sliders codebase is almost identical to the Leco codebase, so it can be modified in the same manner (add a resnet instantiation, replace 'loss = prompt_loss()' with 'loss = -score(latent.sample)')
I haven't tried adding the following inside the main diffusion loop; I'd be surprised if something so simple worked, but since fine-tuning the resnet only takes a minute it would amount to few-few-shot style learning (a Lora being few-shot), and there is already research on zero-shot style copying by injecting a reference image into some parts of the unet ( https://github.com/naver-ai/Visual-Style-Prompting ):
x.requires_grad_(True)
temp = score(x)
temp.backward()
d_x = x.grad
with torch.no_grad():
    x += 0.1 * d_x  # small gradient-ascent step towards a higher score
(EDIT: this actually works outside of the diffusion loop; fully diffused images can be marginally moved towards a higher-scoring image)
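A minimal sketch of the sd-scripts-style modification described earlier in this list, assuming the usual epsilon-prediction parameterisation (names are illustrative):

import torch

def score_loss_from_noise_pred(score_model, noisy_latents, noise_pred, timesteps, alphas_cumprod):
    # noisy_latents, noise_pred: (B, 4, 64, 64); timesteps: (B,); alphas_cumprod: (T,) from the noise scheduler
    a = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    # invert the epsilon prediction: x0_hat = (x_t - sqrt(1 - a) * eps_hat) / sqrt(a)
    pred_latents = (noisy_latents - (1 - a).sqrt() * noise_pred) / a.sqrt()
    # replace the usual MSE-on-noise loss with the negated score of the reconstructed latents
    return -score_model(pred_latents).mean()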
in terms of Lora training code, v1 and v2 are very similar: both require instantiating a pytorch module and the path to a checkpoint, and the module always has the same score signature (latent 4x64x64 -> 1)
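For reference, a stand-in for that module interface (architecture and checkpoint format here are illustrative, only the signature matters):

import torch
import timm

class LatentScorer(torch.nn.Module):
    # maps a batch of latents (B, 4, 64, 64) to one score per image (B,)
    def __init__(self, checkpoint_path: str):
        super().__init__()
        self.net = timm.create_model("resnet34", in_chans=4, num_classes=1)
        self.net.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))  # assumes a plain state_dict

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.net(latents).squeeze(-1)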