Sign In

Style of Studio Ghibli ๐ŸŒŠ Flux.1-D

95
899
1k
53
Verified:
SafeTensor
Type
LoRA
Stats
899
1,049
3.1k
Reviews
Published
Oct 6, 2024
Base Model
Flux.1 D
Training
Steps: 16,250
Usage Tips
Strength: 1
Trigger Words
In style of Studio Ghibli
Training Images
Download
Hash
AutoV2
2C7048A245
seruva19's Avatar
seruva19
The FLUX.1 [dev] Model is licensed by Black Forest Labs. Inc. under the FLUX.1 [dev] Non-Commercial License. Copyright Black Forest Labs. Inc.
IN NO EVENT SHALL BLACK FOREST LABS, INC. BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH USE OF THIS MODEL.

Overview

There's no need to introduce Studio Ghibli and its world-famous art style. There are already some great Flux models that reproduce this style (I especially love and recommend this one). Here is my attempt to create a Ghibli-style LoRA. Although I didn't succeeded in making the best Ghibli LoRA ever as I planned originally ๐Ÿ™‚, the result still isn't too bad (sometimes). However, there is definitely room for improvement. I'm not satisfied with the current anatomy error rate, and I'm working on the next version, and though I can't provide a precise timeline, it will definitely be updated and improved.

Usage

The trigger phrase is "In style of Studio Ghibli" But it also works without any trigger words, although I didn't extensively test the effects in this case. Using "anime" or "Miyazaki" also triggers style changes.

Recommended inference settings are as follows:

Model: flux1-dev (fp8e4m3fn)
Text Encoder: t5pxxl_fp16
Sampler: euler
Scheduler: 24 steps (normal)
Flux Guidance: 4
LoRA Strength: 1

All images that I am publishing in the gallery are created using simple text-to-image (without inpainting, ControlNet, upscaling, etc.), for the purpose of showcasing the raw capabilities (as well as limitations and weak points) of the model.

Training

The LoRA was fine-tuned on a single RTX 3090 with 954 high-quality images (1080p resolution) sourced from the official Ghibli site. The images were captioned using Joy Caption Pre Alpha (later versions didn't exist at the time), hosted locally. The Joy Caption prompt used was: "A descriptive caption for this image:\n". All captions were prefixed with the phrase "In style of Studio Ghibli." Then captions were manually reviewed: I corrected errors (there were a lot...), added some missing details, etc. Characters or locations were not tagged.

Images from specific films were given an additional caption: "Scene from '...' film" (see below for details). Out of the 954 images, there were the following:

50 images from "Nausicaรค of the Valley of the Wind" - additionally prefixed with "Scene from 'Nausicaa' film."
50 images from "Castle in the Sky" - additionally prefixed with "Scene from 'Laputa' film."
50 images from "My Neighbor Totoro" - additionally prefixed with "Scene from 'Totoro' film."
50 images from "Kiki's Delivery Service" - additionally prefixed with "Scene from 'Kiki's Delivery Service' film."
50 images from "Only Yesterday" - additionally prefixed with "Scene from 'Only Yesterday' film."
50 images from "Porco Rosso" - additionally prefixed with "Scene from 'Porco Rosso' film."
50 images from "Ocean Waves" - additionally prefixed with "Scene from 'Ocean Waves' film."
50 images from "Pom Poko" - additionally prefixed with "Scene from 'Pom Poko' film."
28 images from "On Your Mark" - additionally prefixed with "Scene from 'On Your Mark' film."
50 images from "Whisper of the Heart" - additionally prefixed with "Scene from 'Whisper Of The Heart' film."
50 images from "Princess Mononoke" - additionally prefixed with "Scene from 'Mononoke' film."
50 images from "Spirited Away" - additionally prefixed with "Scene from 'Spirited Away' film."
50 images from "Howl's Moving Castle" - additionally prefixed with "Scene from 'Howl's Moving Castle' film."
50 images from "Tales from Earthsea" - additionally prefixed with "Scene from 'Earthsea' film."
50 images from "Ponyo on the Cliff by the Sea" - additionally prefixed with "Scene from 'Ponyo' film."
50 images from "Arrietty" - additionally prefixed with "Scene from 'Arrietty' film."
50 images from "From Up on Poppy Hill" - additionally prefixed with "Scene from 'Poppy Hill' film."
50 images from "The Wind Rises" - additionally prefixed with "Scene from 'Wind Rises' film."
50 images from "When Marnie Was There" - additionally prefixed with "Scene from 'Marnie' film."
26 images from "The Boy and the Heron" - additionally prefixed with "Scene from 'The Boy And The Heron' film."

I plan to reconsider the structure of the dataset and recapture it from scratch for version 0.2.

LoRA training ran for 26000 steps (weights were saved every 250 steps). At this point, the model stopped improving, and anatomy errors (like phantom limbs) became noticeable. Then I spent several days selecting the best LoRA version. My goal was to find the perfect balance between style, variety, and minimizing errors. I mostly tested using long, complex prompts with multiple characters and detailed interactions - ones that were likely to fail - and observed which LoRA failed less ๐Ÿค”.

I deliberately (and wrongly ๐Ÿ˜…) didn't automate my testing and relied on a "click-wait-n-hate" pipeline.

Eventually, I ended up on the model at 16250 steps. The 6000 and 9000-step LoRAs were not bad too, but the 16250-step one felt more "mature", "vintage" and "diverse" (I didn't want to get "too cozy" Ghibli LoRA).

Just for reference, here is comparison (https://ibb.co/TKkgx2D) of how LoRAs at different steps performed (at the same seed). The prompt was:

"In style of Studio Ghibli. Scene from 'Totoro' film. This image is a digitally created scene from a Japanese animated film. The scene features three characters: two young girls and an elderly woman, sitting on a woven mat under a large tree with dense foliage. The background is lush with greenery, including tall trees and vibrant flowers, creating a serene, natural setting. One girl, who appears to be about four years old, wears a yellow dress with white accents and has pigtails tied with red ribbons. She holding a corn cob and smiling happily. Another girl, slightly older, in a white shirt and blue shorts, sits beside her to the left. She has dark hair and a calm expression. The elderly woman, seated to the right, wears a traditional Japanese kimono with a lavender pattern. She has white hair and a gentle smile, holding a bunch of leafy greens. In front of them, on the woven mat, are various vegetables like carrots, tomatoes, and cucumbers, arranged in a basket. The scene exudes a sense of peaceful coexistence with nature, emphasizing simplicity and harmony."

After testing, I realized I made a lot of mistakes ๐Ÿ˜ถ, including:

- I left too much unnecessary clutter in the captions generated by Joy Caption (like "This image is a digitally created scene...", "The scene exudes a sense of peaceful..." etc.) I think CogVLM2 or Qwen2 might be better suited for captioning images for a style LoRA, but more testing is needed. (But I still believe captioning with complex natural prompts is better for style LoRAs.)

- I wanted to make the LoRA a bit unpredictable and varied, perhaps even 'weird,' so it can generate images with high artistic variety. I kind of succeeded, but I feel that this weirdness sometimes may negatively affect coherence and anatomy (mangled hands, etc.). I thought this might be due to overfitting, but even lower-step LoRAs (6000-9000 steps) had these errors. Maybe I shouldn't have included "Pom Poko" screencaps into the dataset (even though I carefully checked captions to prevent feature blending of human and supernatural creatures).

- I should and will explore other trainers besides AI-Toolkit. While it "just works" and produces great models, relying solely on it may be due to status quo bias.

- And even if it is not, sticking to default settings might not have been the best choice anyway.

The LoRA was trained in Windows 11 using AI-Toolkit with the following hyperparameters (actually, all defaults except for the resolution):

Rank: 32
Alpha: 32
Batch Size: 1
Steps: 16250
Learning Rate: 1e-4
Save every: 250
Resolution: 1024, 768
Optimizer: adamw8bit

Thanks for using this LoRA or for reading this text to the end! As mentioned in the beginning, I hope to improve this model as I gain more experience with fine-tuning Flux.