Published | Mar 28, 2025
Training | Steps: 22,638, Epochs: 21
Usage Tips | Strength: 1
Trigger Words | Studio Ghibli style
Hash | AutoV2 DD2FE1258D
Description
I am very happy to share my magnum opus LoRA, which I've been working on for the past month, ever since Wan came out. It is without a doubt the best LoRA I have ever trained and published on Civitai, and I have to say once again: WanVideo is an amazing model.
It was trained for ~90 hours on an RTX 3090 with musubi-tuner using a mixed dataset of 240 clips and 120 images. This could have been done faster, but I was obsessed with pushing the limits to create a state-of-the-art style model. It’s up to you to judge if I succeeded.
Usage
The trigger phrase is Studio Ghibli style - all captions for training data were prefixed with these words.
All clips I publish in the gallery are raw model outputs using a single LoRA, without post-processing, upscaling, or interpolation.
Compatibility with other LoRAs and with Wan-I2V models has not been tested.
Workflows are embedded in each clip. You can download an example JSON workflow here. I use Kijai's wrapper and enable a lot of optimizations in the workflow (more information here), including fp8_e5m2 checkpoints + torch.compile, SageAttention, TeaCache, Enhance-A-Video, Fp16_fast, SLG, and (sometimes) Zero-Star. Rendering a 640x480x81 clip takes about 5 minutes (RTX 3090).
The WanVideo Sampler parameters I use are the following:
Sampler: unipc
Steps: 20
Cfg: 6
Shift: 7
I believe that without optimizations and with an increase in steps, it is possible to achieve higher-quality clips, but I don't have the time or hardware resources to verify this.
To generate most prompts, I usually apply the following meta-prompt in ChatGPT (or Claude, or any other capable LLM), which helps enhance "raw" descriptions. This meta-prompt is based on the official prompt extension code by the Wan developers and looks like this:
You are a prompt engineer, specializing in refining user inputs into high-quality prompts for video generation in the distinct Studio Ghibli style. You ensure that the output aligns with the original intent while enriching details for visual and motion clarity.
Task Requirements:
- If the user input is too brief, expand it with reasonable details to create a more vivid and complete scene without altering the core meaning.
- Emphasize key features such as characters' appearances, expressions, clothing, postures, and spatial relationships.
- Always maintain the Studio Ghibli visual aesthetic - soft watercolor-like backgrounds, expressive yet simple character designs, and a warm, nostalgic atmosphere.
- Enhance descriptions of motion and camera movements for natural animation flow. Include gentle, organic movements that match Ghibli's storytelling style.
- Preserve original text in quotes or titles while ensuring the prompt is clear, immersive, and 80-100 words long.
- All prompts must begin with "Studio Ghibli style." No other art styles should be used.
Example Revised Prompts:
"Studio Ghibli style. A young girl with short brown hair and curious eyes stands on a sunlit grassy hill, wind gently rustling her simple white dress. She watches a group of birds soar across the golden sky, her bare feet sinking slightly into the soft earth. The scene is bathed in warm, nostalgic light, with lush trees swaying in the distance. A gentle breeze carries the sounds of nature. Medium shot, slightly low angle, with a slow cinematic pan capturing the serene movement."
"Studio Ghibli style. A small village at sunset, lanterns glowing softly under the eaves of wooden houses. A young boy in a blue yukata runs down a narrow stone path, his sandals tapping against the ground as he chases a firefly. His excited expression reflects in the shimmering river beside him. The atmosphere is rich with warm oranges and cool blues, evoking a peaceful summer evening. Medium shot with a smooth tracking movement following the boy's energetic steps."
"Studio Ghibli style. A mystical forest bathed in morning mist, where towering trees arch over a moss-covered path. A girl in a simple green cloak gently places her hand on the back of a massive, gentle-eyed creature resembling an ancient deer. Its fur shimmers faintly as sunlight pierces through the thick canopy, illuminating drifting pollen. The camera slowly zooms in, emphasizing their quiet connection. A soft gust of wind stirs the leaves, and tiny glowing spirits peek from behind the roots."
Instructions:
I will now provide a prompt for you to rewrite. Please expand and refine it in English while ensuring it adheres to the Studio Ghibli aesthetic. Even if the input is an instruction rather than a description, rewrite it into a complete, visually rich prompt without additional responses or quotation marks.
The prompt is: "YOUR PROMPT HERE".
Replace YOUR PROMPT HERE with something like "Young blonde girl stands on a mountain near a seashore beach in the rain" or whatever.
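If you'd rather script this than paste into a chat window, here's a rough idea of how the meta-prompt could be applied programmatically. This is just a minimal sketch assuming the OpenAI Python client - the model name, file name, and helper function are placeholders, so adapt them to whatever LLM you actually use:

from openai import OpenAI

# Hypothetical helper: feeds the meta-prompt above to an LLM and returns the expanded prompt.
META_PROMPT = open("ghibli_meta_prompt.txt", encoding="utf-8").read()  # the meta-prompt text above

def expand_prompt(raw_idea: str, model: str = "gpt-4o") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": META_PROMPT.replace("YOUR PROMPT HERE", raw_idea)}],
    )
    return response.choices[0].message.content.strip()

print(expand_prompt("Young blonde girl stands on a mountain near a seashore beach in the rain"))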
The negative prompt is always the same - Wan's default Chinese negative prompt (roughly: garish colors, overexposed, static, blurry details, subtitles, worst/low quality, JPEG artifacts, ugly, deformed, extra or fused fingers, poorly drawn hands and faces, motionless frame, cluttered background, three legs, crowded background, walking backwards) with a few extra 3D/CGI tags appended:
色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走, 3D, MMD, MikuMikuDance, SFM, Source Filmmaker, Blender, Unity, Unreal, CGI, bad quality
Dataset
In this and the following sections, I'll be doing a bit of yapping :) Feel free to skip ahead and just read the Conclusion, but maybe someone will find some useful bits of information in this wall of text. So...
The dataset selection stage was the "easiest" part: I already have all the Ghibli films in the highest possible quality, split into scenes - over 30,000 clips at 1920x1040 resolution and high bitrate. They're patiently waiting for the day I finally decide to fine-tune some video model with them.
And I had already prepped around 300 clips for training v0.7 of my HV LoRA (in fact, I was just about to start the training when Wan came out). These clips were in the range of 65-129 frames, which I consider optimal for training HV on videos, and they were all 24 fps. For Wan, though, I wanted them in a different frame range (not exceeding 81 frames - see the explanation in the "Training" section below). I also needed them at 16 fps. I'm still not entirely sure whether strict 16 fps is necessary, but I had some issues with HV when clips were at 30 fps instead of HV's native 24 fps, so I decided to stick with 16 fps.
I should mention that for dataset processing I usually write a lot of small "one-time" scripts (with the help of Claude, ChatGPT, and DeepSeek) - mini-GUIs for manual selection of videos, one-liners for splitting frames, scripts for outputting various helper stats, dissecting clips by ranges, creating buckets in advance, etc. I don't publish these scripts because they're messy, full of hardcoded values, and designed for one-time use anyway. And nowadays anyone can easily create similar scripts by making requests to the aforementioned LLMs; a sketch of one is shown below.
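For example, a typical throwaway script looks roughly like this - a minimal sketch (not my actual code) that re-encodes every clip in a folder to 16 fps with ffmpeg; the paths and encoder settings are placeholders:

import subprocess
from pathlib import Path

SRC = Path("H:/datasets/ghibli_clips_24fps")   # placeholder paths
DST = Path("H:/datasets/ghibli_clips_16fps")
DST.mkdir(parents=True, exist_ok=True)

for clip in sorted(SRC.glob("*.mp4")):
    # Resample to 16 fps; CRF/preset are arbitrary "good enough" values for training data.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip), "-vf", "fps=16",
         "-c:v", "libx264", "-crf", "16", "-preset", "slow", "-an", str(DST / clip.name)],
        check=True,
    )
    print("converted:", clip.name)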
Converting all clips to 16 fps narrowed the range of frames in each video from 65-129 to around 45-88 frames, which messed up my meticulously planned, frame-perfect ranges for the frame buckets I had set up for training. Thankfully, it wasn't a big deal because I had some rules in place when selecting videos for training, specifically to handle situations like this.
First of all, a scene shouldn't have rapid transitions over its duration. I needed this because I couldn't predict the exact duration (in frames) of the target frame buckets the trainer would establish for training - model size, VRAM, and other factors all affect this. Example: I might want to use a single 81-frame clip for training, but I won't be able to, because I will get an OOM on the RTX 3090. So I will have to choose some frame extraction strategy, and depending on it the clip might be split into several shorter parts (here is an excellent breakdown of the various strategies). Its semantic coherence might then be broken (say, in the first fragment of the clip a girl might open her mouth, but from that clipped first fragment it becomes ambiguous whether she is going to cry or laugh), and that kind of context incoherence may make Wan's UMT5 encoder feel sad.
Another thing to consider is that I wanted to reuse captions for any fragment of the original clip without dealing with recaptioning and recaching embeddings via the text encoder. Captioning videos takes quite a long time, but if a scene changes drastically throughout its range, the original caption might not fit all fragments, reducing training quality. By following the rules "a clip should not contain rapid context transitions" and "a clip should be self-contained, i.e. it should not feature events that cannot be understood from within the clip itself", even if a scene is split into subfragments, the captions will (with an acceptable margin of error) still apply to each fragment.
After the conversion I looked through all the clips and reduced their total number to 240 (I just took out some clips that contained too many transitions or, vice versa, were too static), which formed the first part of the dataset.
I decided to use a mixed dataset of videos and images. So the second part of the dataset was formed by 120 images (at 768x768 resolution) taken from screencaps of various Ghibli movies.
There's an alternative approach where you train on images first and then fine-tune on videos (it was successfully applied by the creator of this LoRA), but I personally think it's not as good as mixing everything in a single batch (though I don't have hard numbers to back this up). To back up my assumption, here is a very good LoRA that uses the same mixed approach to training (and, by the way, it was also trained on an RTX 4090, if I am not mistaken).
To enable effective video training on a mixed dataset on consumer-level GPUs, I had to find the right balance between resolution, duration, and training time. I decided to do this by mixing low-res, long-duration videos with high-res images - I will give more details about this in the Training section.
Regarding captioning: the images for the dataset were actually just reused from some of my HV datasets, and they had been captioned earlier using my "swiss army knife" VLM for (SFW-only) dataset captioning, also known as Qwen2-VL-7B-Instruct. I used the following captioning prompt:
Create a very detailed description of this scene. Do not use numbered lists or line breaks. IMPORTANT: The output description MUST ALWAYS start with the unaltered phrase 'Studio Ghibli style. ', followed by your detailed description. The description should 1) describe the main content of the scene, 2) describe the environment and lighting details, 3) identify the type of shot (e.g., aerial shot, close-up, medium shot, long shot), and 4) include the atmosphere of the scene (e.g., cozy, tense, mysterious). Here's a template you MUST use: 'Studio Ghibli style. {Primary Subject Action/Description}. {Environment and Lighting Details}. {Style and Technical Specifications}'.
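For reference, captioning an image with it looks roughly like this - a minimal sketch assuming the Hugging Face transformers Qwen2-VL integration and the qwen_vl_utils helper package (paths are placeholders, and this is not my exact script):

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

CAPTION_PROMPT = "Create a very detailed description of this scene. ..."  # full prompt above

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def caption_image(image_path: str) -> str:
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": CAPTION_PROMPT},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=300)
    # Decode only the newly generated tokens (the caption), not the prompt.
    trimmed = out[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0].strip()

print(caption_image("H:/datasets/studio_ghibli_wan_video_v01/images/768x768/sample_0001.png"))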
I had some doubts about whether I should recaption them since the target caption structure was specifically designed for HunyuanVideo, and I worried that Wan might need a completely different approach. I left them as-is, and have no idea if this was the right decision, but, broadly speaking, modern text encoders are powerful enough to ignore such limitations. As we know, models like Flux and some others can even be trained without captions at all (although I believe training with captions is always better than without - but only if captions are relevant to the content).
For captioning videos I tested a bunch of local models that can natively caption video content:
CogVLM2-Video-Llama3-Chat (usually this is my go-to option for clip captioning)
Ovis2-16B (this one seems really good! But I had already captioned the dataset when I found it, so I will use it in future LoRAs)
There are more models out there, but these are the ones I tested. For this LoRA, I ended up using Apollo-7B. I used this simple VLM prompt:
Create a very detailed description of this video. IMPORTANT: The output description MUST ALWAYS start with the unaltered phrase 'Studio Ghibli style. ', followed by your detailed description.
I’m attaching the full dataset I used as an addendum to the model. While it does kinda contain copyrighted material, I think this falls under fair use.
Training
If anyone is interested, here is the list of trainers that I considered for training WanVideo:
diffusion-pipe - The OG of HV training, but it also allows memory-efficient Wan training; config-driven, has a third-party GUI and runpod templates (read more here and here). For HV I used it exclusively. Requires WSL to run on Windows.
Musubi Tuner - Maintained by a responsible and friendly developer. Config-driven, has a cozy community, tons of options. Currently my choice for Wan training.
AI Toolkit - My favorite trainer for Flux recently got support for Wan. It's fast, easy to use, config-driven, and also has a first-party UI (which I do not use 🤷), but it currently supports training the 14B model only without captions, which is the main reason I do not use it.
DiffSynth Studio - I haven't had the time to test it yet and am unsure if it can train Wan models with 24 GB VRAM. However, it’s maintained by ModelScope, making it worth a closer look. I plan to test it soon.
finetrainers - Has support for Wan training, but doesn't seem to work with 24 GB GPUs (yet)
SimpleTuner - Gained support for Wan last week, so I haven't had a chance to try it yet. It definitely deserves attention since the main developer is a truly passionate and knowledgeable person.
Zero-to-Wan - Supports training only for 1.3B models.
WanTraining - I have to mention this project, as it's supported by a developer who’s done impressive work with it, including guidance-distilled LoRA and control LoRA.
So, I used Musubi Tuner. For reference, here are my hardware specs: i5-12600KF, RTX 3090, Windows 11, 64 GB RAM. The commands and config files I used were the following.
For caching VAE latents (nothing specific here, just the default command):
python wan_cache_latents.py --dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml --vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors
For caching text encoder embeddings (default):
python wan_cache_text_encoder_outputs.py --dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml --t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth --batch_size 16
For launching training:
accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 wan_train_network.py ^
--task t2v-14B ^
--dit G:/samples/musubi-tuner/wan14b/dit/wan2.1_t2v_14B_bf16.safetensors ^
--vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors ^
--t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth ^
--sdpa ^
--blocks_to_swap 10 ^
--mixed_precision bf16 ^
--fp8_base ^
--fp8_scaled ^
--fp8_t5 ^
--dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml ^
--optimizer_type adamw8bit ^
--learning_rate 5e-5 ^
--gradient_checkpointing ^
--max_data_loader_n_workers 2 ^
--persistent_data_loader_workers ^
--network_module networks.lora_wan ^
--network_dim 32 ^
--network_alpha 32 ^
--timestep_sampling shift ^
--discrete_flow_shift 3.0 ^
--save_every_n_epochs 1 ^
--seed 2025 ^
--output_dir G:/samples/musubi-tuner/output ^
--output_name studio_ghibli_wan14b_v01 ^
--log_config ^
--log_with tensorboard ^
--logging_dir G:/samples/musubi-tuner/logs ^
--sample_prompts G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_sampling.txt ^
--save_state ^
--max_train_epochs 50 ^
--sample_every_n_epochs 1
Again, nothing to see here, really. I had to use the blocks_to_swap parameter because otherwise, with my dataset config (see below), I ran into the 24 GB VRAM limit. Hyperparameters were mostly left at their defaults. I didn't want to risk anything after a bad experience - 60 hours of HV training lost due to getting too ambitious with flow shift values and adaptive optimizers instead of good old adamw.
Prompt file for sampling during training:
# prompt 1
Studio Ghibli style. Woman with blonde hair is walking on the beach, camera zoom out. --w 384 --h 384 --f 45 --d 7 --s 20
# prompt 2
Studio Ghibli style. Woman dancing in the bar. --w 384 --h 384 --f 45 --d 7 --s 20
Dataset configuration (the most important part; I'll explain the thoughts that led me to it afterward):
[general]
caption_extension = ".txt"
enable_bucket = true
bucket_no_upscale = true
[[datasets]]
image_directory = "H:/datasets/studio_ghibli_wan_video_v01/images/768x768"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/images/768x768/cache"
resolution = [768, 768]
batch_size = 1
num_repeats = 1
[[datasets]]
video_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040/cache_1"
resolution = [768, 416]
batch_size = 1
num_repeats = 1
frame_extraction = "head"
target_frames = [1, 21]
[[datasets]]
video_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040/cache_2"
resolution = [384, 208]
batch_size = 1
num_repeats = 1
frame_extraction = "uniform"
target_frames = [45]
frame_sample = 2
My dataset setup consists of three parts.
I'll start with the last one, which contains the main data array - 240 clips at 1920x1040 resolution with durations varying from 45 to 88 frames.
Obviously, training on full-resolution 1920x1040, full-duration clips on an RTX 3090 was out of the question. I needed to find the minimum resolution and frame duration that would avoid OOM errors while keeping the bucket fragments as long as possible. Longer fragments help the model learn motion, timing, and spatial patterns (like hair twitching, fabric swaying, liquid dynamics etc.) of the Ghibli style - something you can't achieve with still frames.
From training HV, I remembered that a good starting point for estimating the feasible resolution range on a 24 GB GPU is 512x512x33. I decided on the "uniform" frame extraction pattern, ensuring all extracted fragments were no fewer than 45 frames. Since, as I wrote before, the clips maxed out at 88 frames after conversion to 16 fps, this approach kept them from being divided into more than two spans, which would've made epochs too long. At the same time, a timespan of 45 frames (~3 s) should be enough for the model to learn the spatiotemporal flow of the style. The sketch below shows how this plays out for an 88-frame clip.
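To illustrate (this is just my mental model of the frame extraction strategies, not musubi-tuner's actual code - see the linked breakdown for the real behavior), here is roughly how a single 88-frame clip gets carved up under the "head" and "uniform" settings from my config:

def head_extraction(num_frames: int, targets: list[int]) -> list[tuple[int, int]]:
    # Take one fragment from the start of the clip for each target length that fits.
    return [(0, t) for t in targets if t <= num_frames]

def uniform_extraction(num_frames: int, target: int, samples: int) -> list[tuple[int, int]]:
    # Take `samples` fragments of length `target`, with start offsets spread evenly
    # across the clip (first starting at frame 0, last ending at the final frame).
    if target > num_frames:
        return []
    if samples == 1:
        return [(0, target)]
    step = (num_frames - target) / (samples - 1)
    return [(round(i * step), target) for i in range(samples)]

print(head_extraction(88, [1, 21]))    # [(0, 1), (0, 21)] - first frame + first 21 frames
print(uniform_extraction(88, 45, 2))   # [(0, 45), (43, 45)] - two overlapping 45-frame spans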
With the target fixed at 45 frames, I started testing different resolutions. I used a script to analyze all the clips in a folder and suggest valid width-height combinations that maintained the original aspect ratio (1920/1040 ≈ 1.85) and were divisible by 16 (a model requirement); a sketch of that idea follows.
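Something along these lines (a simplified sketch, not the actual script - it just enumerates candidate bucket sizes and a rough cost proxy):

ASPECT = 1920 / 1040  # ≈ 1.846

# Candidate buckets: width and height divisible by 16, aspect ratio preserved as closely
# as possible. "cost" is a crude VRAM/time proxy: pixels per frame x 45 frames.
candidates = []
for w in range(192, 769, 16):
    h = round(w / ASPECT / 16) * 16
    if h >= 128:
        candidates.append((w, h, w * h * 45))

for w, h, cost in sorted(candidates, key=lambda c: c[2]):
    print(f"{w}x{h}x45  aspect={w / h:.3f}  relative cost={cost / 1e6:.1f}M")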
Eventually, I found that using [384, 208] for the bucket size and setting --blocks_to_swap 10 prevented OOM errors and spilling into shared memory (which, when it happened, slowed things to about 160 s/it). The downside was that training speed dropped to around 11-12 s/it. In hindsight, lowering the resolution to [368, 192] could have bumped the speed up to ~8 s/it, which would've been great (close to what I get when training Flux at 1024p in AI Toolkit). That would've saved me around 20 hours of training over the full 90-hour run (~28,000 steps), although I didn't expect to go past 20K steps back then.
It should also be noted that I trained on Windows with my monitor connected to the GPU (and used the PC for coding at the same time 😼). On Linux (for example, with diffusion-pipe) and with the monitor driven by the integrated GPU, it might be possible to use slightly higher spatiotemporal resolutions without hitting OOM or the shared-memory fallback (something I think is Windows-specific).
Now about the first part (120 images at 768x768 resolution). Initially, I wanted to train on 1024p images, but I decided it'd be overkill and would slow things down. My plan was to train on HD images and low-res videos simultaneously to ensure better generalization: the idea was that high-resolution images would compensate for the lower resolution of the clips. Joint video + image pretraining is how Wan itself was trained anyway, so I figured this approach would favor "upstream" style learning as well.
Finally, the second part, which is also important for generalization (again, not a very "scientific" assumption, but it seems reasonable). The idea was to reuse the same clips from the third section but train only on the first frame and on the first 21 frames. This approach, I hoped, would facilitate learning temporal style motion features. At the same time, it let me bump up the resolution for the second section to [768, 416].
As the result, I hoped to achieve "cross-generalization" between:
Section 1's high-res images (768x768)
Section 2's medium-res single frames and 21-frame clips (768x416)
Section 3's low-res 45-frame clips (384x208)
Additionally, both the second section and a large part of the third section share the same starting frame, which I believed would benefit LoRA usage in I2V scenarios. All this seemed like the best way to fully utilize my dataset without hitting hardware limits.
Fun fact: I expected one epoch to consist of 1080 samples: 120 images (1st dataset section) + 240 single frames (2nd dataset section, "head" frame bucket = 1) + 240 clips of 21 frames each (2nd dataset section, "head" frame bucket = 21) + 480 clips of 45 frames each (3rd dataset section, "uniform" frame bucket = 45, sampled 2 times). However, after I started training, I discovered it was actually 1078 samples. When I dug into it, I found that two of the clips reported by my scripts (which use the ffprobe command from ffmpeg to count the number of frames) were actually shorter than 45 frames - there was a rounding issue. This wasn't a big deal, so I just continued training without those two clips, but that's why the step count of the final LoRA looks so odd :) The snippet below shows where such a discrepancy can come from.
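The gotcha, as far as I can tell, is that ffprobe's quick metadata query can disagree with the number of frames you actually get after decoding. A small sketch of the check (paths are placeholders):

import subprocess

def probe(path: str, args: list[str]) -> str:
    return subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0", *args,
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def estimated_frames(path: str) -> int:
    # Fast: nb_frames from container metadata (may be absent or rounded for some files).
    return int(probe(path, ["-show_entries", "stream=nb_frames"]))

def exact_frames(path: str) -> int:
    # Slow but exact: decode the stream and count the frames that actually come out.
    return int(probe(path, ["-count_frames", "-show_entries", "stream=nb_read_frames"]))

clip = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040/clip_0042.mp4"  # placeholder
print("metadata says:", estimated_frames(clip), "| decoded:", exact_frames(clip))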
The training itself went smoothly. I won't reveal the loss graphs since ~~I am too shy~~ I don't think they mean much. I mostly use them to check if the loss distribution starts looking too similar across epochs - that's my cue for potential overfitting.
I trained up to 28000 steps, then spent several days selecting the best checkpoint. Another thing I think I could have done better is taking checkpoints not just at the end of each epoch, but also in between. Since each epoch is 1078 steps long, it's possible that a checkpoint with even better results than the one I ended up with was lost somewhere in between.
I'm considering integrating validation loss estimation into my training pipeline (more on this here), but I haven't done it yet.
Could this be simplified? Probably yes. In my next LoRA, I'll test whether the extra image dataset in section 1 was redundant. I could've just set up a separate dataset section and reused the clips' first frames at a higher resolution. On the other hand, I wanted the dataset to be as varied as possible, so I used screencaps from different scenes than the clips - in that sense they were not redundant.
I'm not even sure if the second section was necessary. Since WAN itself (according to its technical report) was pretrained on 192px clips, training at around 352x192x45 should be effective and make the most of my hardware. Ideally, I'd use 5-second clips (16 fps * 5s + 1 = 81 frames), but that’s just not feasible on the RTX 3090 without aggressive block swapping.
Conclusion
Aside from the fun and the hundreds of insanely good clips, here are some insights I've gained from training this LoRA. I should mention that these practices are based on my personal experience and observations; I don't have any strictly analytical evidence of their effectiveness, and I have only tried style training so far. I plan to explore concept training very soon to test some of my other assumptions and see if they apply as well.
You can train Wan-14B on consumer-level GPUs using videos. 368x192x45 seems like a solid starting point.
Compensate for motion-targeted style learning on low-res videos by using high-res images to ensure better generalization.
Combine various frame extraction methods on the same datasets to maximize effectiveness and hardware usage.
A lot, if not all, of what I've learned while making this LoRA comes from reading countless r/StableDiffusion posts, lurking 24/7 on the awesome Banodoco Discord, reading the comments and opening every NSFW clip on every single WanVideo model here on Civitai, and diving into every issue I could find in the musubi-tuner, diffusion-pipe, Wan2.1, and other repositories. 😽