| Field | Value |
| --- | --- |
| Type | LoRA |
| Stats | 453 |
| Reviews | (74) |
| Published | Dec 30, 2024 |
| Base Model | HunyuanVideo |
| Training | Steps: 8,778; Epochs: 19 |
| Usage Tips | Strength: 1 |
| Trigger Words | A scene from a Studio Ghibli animated film |
| Training Images | Download |
| Hash | AutoV2 C4CC487F14 |
Disclaimer
Although HunyuanVideo has some knowledge of Studio Ghibli's art style (at least it switches to a "retro mode" anime look when prompted with Ghibli), it is not very consistent and sometimes falls back to realistic renders. The shading, palette and linework are also quite different. So with this LoRA I wanted to reinforce the Ghibli art style for HunyuanVideo.
This is the third version of the LoRA; the first two were unsuccessful and I did not publish them. This is not the final version either, and I will keep working on improving it.
upd. 05/01/2025: Trained v0.4 with musubi-tuner, but that LoRA turned out worse than v0.3, so I won't publish it (and I'm returning to good old diffusion-pipe). Meanwhile I have finished preparing the mixed dataset for v0.5 (the same 185 images, but at higher resolution, plus 765 short video clips). Hopefully the new version of the LoRA will be ready in 4-5 (upd. 08/01: 7-8) days; the training is about 2-3 times slower this time.
I am still figuring out how to train HV and do not yet know the best way to prompt it, so please take that into consideration.
Usage
For inference I use the default ComfyUI pipeline with just an additional LoRA loader node. Kijai's wrapper should work too (at least it worked a week ago, but I have since switched to the native workflow). All parameters are default except:
guidance: 7.0
steps: 30
That does not mean these values are optimal; they are just what I mostly used to generate clips, and other combinations might deliver better results.
The prompt template I am currently using is like this:
A scene from a Studio Ghibli animated film, featuring [CHARACTER DESCRIPTION], as they [ACTION] at [ENVIRONMENT], under [LIGHTING], with [ADDITIONAL SETTING DETAILS], while the camera [CAMERA WORK], emphasizing [MOOD AND AMBIANCE].
I usually feed a set of tags to an LLM, like "blonde woman, bare feet, ocean seashore, fine weather, etc.", and ask it to output a cohesive natural-language prompt following this template.
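For illustration, here is a minimal Python sketch of that tag-to-prompt step. The slot names and the build_prompt helper are hypothetical, just showing how the template pieces fit together; in practice I let the LLM write the natural language.

```python
# Hypothetical helper: fill the prompt template from a handful of slot values.
# The slot names below are assumptions for illustration, not a fixed schema.
TEMPLATE = (
    "A scene from a Studio Ghibli animated film, featuring {character}, "
    "as they {action} at {environment}, under {lighting}, "
    "with {details}, while the camera {camera}, emphasizing {mood}."
)

def build_prompt(character, action, environment, lighting, details, camera, mood):
    return TEMPLATE.format(
        character=character, action=action, environment=environment,
        lighting=lighting, details=details, camera=camera, mood=mood,
    )

# Example built from a tag list like "blonde woman, bare feet, ocean seashore, fine weather"
print(build_prompt(
    character="a blonde woman with bare feet",
    action="strolls along the waterline",
    environment="a sunlit ocean seashore",
    lighting="soft morning light",
    details="gentle waves and drifting clouds",
    camera="slowly tracks alongside her",
    mood="a calm, nostalgic mood",
))
```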
Training
Please keep in mind that my training routine is not optimal; I am just testing and experimenting, so it is possible the LoRA works not because the routine is good, but despite it being bad.
The current version of the LoRA is trained on 185 fragments (512x512) of screencaps from various Ghibli movies, captioned with CogVLM2. The captioning prompt was:
Create a very detailed description of this image as if it was a frame from Studio Ghibli movie. The description should necessarily 1) describe the main content of the scene, detail the scene's content, which notably includes scene transitions and camera movements that are integrated with the visual content, such as camera follows some subject 2) describe the environment in which the subject is situated 3) identify the type of video shot that highlights or emphasizes specific visual content, such as aerial shot, close-up shot, medium shot, or long shot 4) include description of the atmosphere of the video, such as cozy, tense, or mysterious. Do not use numbered lists or line breaks. IMPORTANT: output description MUST ALWAYS start with unaltered phrase 'A scene from Studio Ghibli animated film, featuring...', and then insert your detailed description.
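Since VLM captioners sometimes ignore the instruction to keep the fixed opening phrase, it can be worth sanity-checking the captions before training. Below is a minimal Python sketch (not part of my actual pipeline) that assumes each caption is stored as a per-image .txt file and verifies it starts with the required prefix:

```python
# Minimal sanity check: report captions that do not start with the fixed prefix
# requested in the CogVLM2 captioning prompt.
# Assumption: captions live as *.txt files in one directory, one per image.
from pathlib import Path

REQUIRED_PREFIX = "A scene from Studio Ghibli animated film, featuring"

def check_captions(caption_dir: str) -> None:
    bad = []
    for txt in sorted(Path(caption_dir).glob("*.txt")):
        text = txt.read_text(encoding="utf-8").strip()
        if not text.startswith(REQUIRED_PREFIX):
            bad.append(txt.name)
    if bad:
        print(f"{len(bad)} caption(s) missing the prefix:")
        for name in bad:
            print("  -", name)
    else:
        print("All captions start with the required prefix.")

if __name__ == "__main__":
    check_captions("dataset/captions")  # hypothetical path
```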
For training I used diffusion-pipe. Other possible choices are finetrainers (but it currently requires more than 24 GB of VRAM to train HV) and musubi-tuner (I have yet to get good results with it, although that is not the software's fault).
Training was done on Windows 11 Home (WSL2) with 64 GB RAM, on a single RTX 3090. Training parameters were the defaults (main, dataset), except:
rank = 16
lr = 6e-5
I saved a checkpoint at each epoch and trained for 20 epochs of 462 steps each, 9,240 steps in total. The speed on the RTX 3090 was approximately 7 s/it, so each epoch took slightly less than an hour to train. After testing epochs 13 through 20, I chose epoch 19 as the most consistent and least error-prone.
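For anyone estimating a similar run, the arithmetic behind those numbers works out like this (just a quick sketch using the figures above):

```python
# Rough time/step arithmetic for the run described above.
steps_per_epoch = 462
epochs = 20
sec_per_step = 7  # approximate speed on a single RTX 3090

total_steps = steps_per_epoch * epochs                  # 9240 steps
epoch_hours = steps_per_epoch * sec_per_step / 3600     # ~0.9 h, just under an hour
total_hours = total_steps * sec_per_step / 3600         # ~18 h for the whole run

print(total_steps, round(epoch_hours, 2), round(total_hours, 1))
```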
The result is still far from perfect, but I hope to deliver upgraded versions. The next version will probably be trained on clips instead of images, but I need time to prepare the dataset.
It is also quite possible that the upcoming I2V model will render style LoRAs useless.
P.S. Just a lyrical moment: I'm still amazed we got such an outstanding local video model. I feel like this is really a Stable Diffusion moment for local video generation. No doubt we will get more models in the future that will surpass it, but HunyuanVideo will always be the first one, at least for me ❤️