Description
The original Lumina 2 doesn't understand the Ghibli style. When I prompt for it, the base model produces amazing images (often with broken anatomy), but they aren't Ghibli.
I love the Ghibli art style and have been trying to teach each new model this style. This is the result of my initial efforts with Lumina Image 2. It could be better, but it also could be worse 🤷
Usage
Each image in the gallery has an embedded workflow, so just drop it into ComfyUI.
Almost all parameters are the same as in the default workflow, except for:
Steps: 40
Scheduler: sgm_uniform
(Note: These aren't necessarily optimal; I just happened to test LoRA output with these settings.)
I haven't extensively tested other samplers yet, but I've heard that the gradient_estimation sampler may offer some improvement, and that lowering CFG increases the chances of proper anatomy.
Training
I used fragments of screencaps from Ghibli movies, 184 images in total (1024x1024).
I captioned them using JoyCaption Alpha Two (locally) in "descriptive/long" mode and prefixed each caption with the phrase "You are an assistant designed to generate high-quality images based on user prompts. <Prompt Start> Studio Ghibli style."
(I don’t think the LLM prefix was necessary, but I added it anyway.)
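For anyone reproducing the captioning step, prepending that phrase to existing caption files is trivial. Here is a minimal sketch, assuming the captions are plain .txt files sitting next to the images (the folder name is hypothetical):

from pathlib import Path

# Prefix applied to every caption (same phrase as described above).
PREFIX = ("You are an assistant designed to generate high-quality images "
          "based on user prompts. <Prompt Start> Studio Ghibli style. ")

dataset_dir = Path("dataset/ghibli")  # hypothetical dataset folder

for caption_file in dataset_dir.glob("*.txt"):
    text = caption_file.read_text(encoding="utf-8").strip()
    if not text.startswith("You are an assistant"):  # skip already-prefixed files
        caption_file.write_text(PREFIX + text, encoding="utf-8")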
For training I used ai-toolkit, which recently merged its Lumina-2 training branch into the main repository. FYI, another trainer that now supports Lumina-2 training (LoRA and full fine-tuning) is diffusion-pipe. I tested it as well, but for me, it ran slower than ai-toolkit.
Initially I tried the default configuration but decided the default learning rate (1e-04) was too high: I experimented with various optimizers, and after 2000-3000 steps the LoRA did not look good. Then I switched to 5e-05, which yielded better results, and decided to train for 20000 steps. The training felt very fast (~1.7 s/it on an RTX 3090), probably because for the last two months I had only been training HunyuanVideo 😆
After the training finished, I chose the checkpoints that had either good sample quality during training or the lowest loss (according to the TensorBoard logs), tested them manually in ComfyUI, and decided to go with the checkpoint at 17200 steps.
(I am completely sure the total number of steps to get a good LoRA can be dramatically reduced. This is just the first try, and besides, I wanted to test how long Lumina-2 can be trained at all.)
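If you want to shortlist checkpoints by loss the same way, the TensorBoard event files can be read programmatically. A rough sketch (the log directory and the scalar tag name are assumptions; check what ai-toolkit actually logs):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Hypothetical path to the run's TensorBoard logs.
acc = EventAccumulator("output/ghibli_lora/logs")
acc.Reload()

print(acc.Tags()["scalars"])   # list the available scalar tags
events = acc.Scalars("loss")   # the tag name here is an assumption

# Ten logged steps with the lowest loss, as checkpoint candidates.
for e in sorted(events, key=lambda e: e.value)[:10]:
    print(f"step {e.step}: loss {e.value:.4f}")

Keep in mind the per-step loss is noisy, so this is only a first filter before eyeballing the samples.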
To sum up, I ended up using these parameters:
lr: 5e-5
optimizer: "adamw8bit"
optimizer_params.betas: [0.95, 0.98]
optimizer_params.weight_decay: 0.01
noise_offset: 0.1
lr_scheduler: "cosine"
Other hyperparameters were left as default. The dataset and configuration file are included in the attachment with this LoRA.
(I think leaving the rank at 16 was a mistake; Lumina-2 LoRAs need to be at least rank 32 to fully grasp style detail. But I am still learning.)
One thing to mention: ai-toolkit saves LoRA safetensors files in a format that is not compatible with ComfyUI. To address this, I included (alongside the training data) a script that converts ai-toolkit LoRA checkpoints into a ComfyUI-compatible format. The script is called lumina2comfy.py. To use it, run it with the first argument set to the path of a safetensors LoRA file created with ai-toolkit; it will save the converted LoRA alongside the original file, like
python lumina2comfy.py "path/to/my/lora.safetensors"
(You can also pass a folder path as an argument, and it will convert all safetensors files within it.)
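For the curious: this is not the bundled script, but such a conversion usually boils down to renaming state-dict keys and re-saving the file. A rough sketch of the general idea (the key prefix mapping below is an assumption; refer to lumina2comfy.py for the actual mapping):

import sys
from pathlib import Path
from safetensors.torch import load_file, save_file

def convert(src: Path) -> None:
    state = load_file(str(src))
    renamed = {}
    for key, tensor in state.items():
        # Assumed mapping: ai-toolkit/diffusers-style prefix -> the
        # "diffusion_model." prefix ComfyUI loaders expect. The real
        # script may rename more than just this.
        renamed[key.replace("transformer.", "diffusion_model.")] = tensor
    dst = src.with_name(src.stem + "_comfy.safetensors")
    save_file(renamed, str(dst))
    print(f"wrote {dst}")

if __name__ == "__main__":
    target = Path(sys.argv[1])
    files = target.glob("*.safetensors") if target.is_dir() else [target]
    for f in files:
        convert(f)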
Maybe someone will find it handy. 😊
(By the way, diffusion-pipe does not require LoRA conversion - it already outputs files in a ComfyUI-compatible format.)
My first impression of the training is rather favorable. I don’t know how well it can be fully fine-tuned (I don’t have the time or proper datasets to test this) and haven’t tried character or concept LoRA training, but teaching new styles to this model seems promising to me.
Thoughts
(First, I think I made a mistake training a rank 16 LoRA; it probably should not be below 64.)
My amateur opinion on Lumina 2 is that, while it's obviously not as good as Flux (2B vs. 12B), it could be a solid base model for anime/illustrations. It has issues with anatomy, but any model of this class struggles with anatomy (and fails at text rendering). As for NSFW content, I did not test it, but people who did say it's not good.
Its strongest points are the 16-channel VAE and amazing prompt adherence (better than anything I’ve seen before, sometimes even matching Flux). And its license is the best possible among all T2I models (though they seriously need to switch to Schnell's VAE).
The real question is: could the anatomy be fixed with fine-tunes? I think yes, but I cannot be sure. A lot of the images I got while testing contained various body deformations, though maybe that was partially the LoRA's fault.
With community support, it could become another NAI-XL. However, without proper anime fine-tunes, it risks fading into oblivion, much like SD3.5M is currently at risk of doing. And does anyone today remember Kwai Kolors? PixArt Sigma? Hunyuan-DiT? ☹️