Studio Ghibli Style 🎞️ HunyuanVideo

Type
LoRA
Published
Dec 30, 2024
Base Model
Hunyuan Video
Training
Steps: 8,778
Epochs: 19
Usage Tips
Strength: 1
Trigger Words
A scene from a Studio Ghibli animated film
Hash
AutoV2
C4CC487F14
seruva19
Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright © 2024 Tencent. All Rights Reserved. The trademark rights of “Tencent Hunyuan” are owned by Tencent or its affiliate.
Powered by Tencent Hunyuan

Disclaimer

Although base HunyuanVideo knows the generic anime style well and has some knowledge of Studio Ghibli's art style, it is inconsistent, very prompt-dependent, and can sometimes fall back to a realistic style. The shading, palette, and linework can also be quite different from Ghibli's. With this LoRA I wanted to reinforce the Ghibli art style for HunyuanVideo.

This is the third version of the LoRA. The first two versions were not successful, so I did not publish them. This is not the final version either; I am working on improving it.

upd. 05/01/2025 Finished training v0.4 with musubi-tuner, but it was worse than v0.3, so I won't publish it (and will use diffusion-pipe for v0.5).

upd. 08/01/2025 Training v0.5 is progressing, but slower than I’d hoped. Perhaps I was too optimistic about the speed of training 750 clips at a resolution of 512x* (max frames 33) on an RTX 3090. And at epoch 11, the results are still not good enough. Meanwhile, I’m also training another anime LoRA with musubi-tuner 😅.

I am still figuring out how to train HV and do not yet know the best way to prompt it, so please take that into consideration.

Usage

For inference I use the default ComfyUI pipeline with just an additional LoRA loader node. Kijai's wrapper should work too (at least it worked a week ago, but after that I switched to the native workflow). All parameters are default except:

guidance: 7.0
steps: 30

That does not mean they are optimal. These are just the values I used for most of my clips, and other combinations might deliver better results.

The prompt template I am currently using is like this:

A scene from a Studio Ghibli animated film, featuring [CHARACTER DESCRIPTION], as they [ACTION] at [ENVIRONMENT], under [LIGHTING], with [ADDITIONAL SETTING DETAILS], while the camera [CAMERA WORK], emphasizing [MOOD AND AMBIANCE].

I usually give an LLM a set of tags, like "blonde woman, barefoot, ocean seashore, fine weather, etc.", and ask it to output a cohesive natural-language prompt that follows this template.
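The template can also be filled programmatically, for example to check what a finished prompt looks like before handing the job to an LLM. This is only a minimal Python sketch: the slot names (character, action, and so on) and the example values are hypothetical, and only the sentence skeleton comes from the template above.

```python
# Sentence skeleton taken from the template above; slot names are hypothetical.
TEMPLATE = (
    "A scene from a Studio Ghibli animated film, featuring {character}, "
    "as they {action} at {environment}, under {lighting}, "
    "with {details}, while the camera {camera}, emphasizing {mood}."
)


def build_prompt(**slots: str) -> str:
    """Fill every slot of the template; raises KeyError if a slot is missing."""
    return TEMPLATE.format(**slots)


# Example values are made up for illustration.
prompt = build_prompt(
    character="a blonde barefoot woman",
    action="walk slowly",
    environment="an ocean seashore",
    lighting="clear midday light",
    details="gentle waves and scattered seashells",
    camera="pans slowly to follow her",
    mood="a calm, carefree atmosphere",
)
print(prompt)
```

Keeping the opening phrase fixed in the skeleton guarantees that every generated prompt starts with the trigger words.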

Training

Please keep in mind that my training routine is not optimal. I am just testing and experimenting, so it's possible it worked not because it is good, but despite being bad.

The current version of the LoRA is trained on 185 fragments (512x512) of screencaps from various Ghibli movies. They were captioned with CogVLM2. The captioning prompt was:

Create a very detailed description of this image as if it was a frame from Studio Ghibli movie. The description should necessarily 1) describe the main content of the scene, detail the scene's content, which notably includes scene transitions and camera movements that are integrated with the visual content, such as camera follows some subject 2) describe the environment in which the subject is situated 3) identify the type of video shot that highlights or emphasizes specific visual content, such as aerial shot, close-up shot, medium shot, or long shot 4) include description of the atmosphere of the video, such as cozy, tense, or mysterious. Do not use numbered lists or line breaks. IMPORTANT: output description MUST ALWAYS start with unaltered phrase 'A scene from Studio Ghibli animated film, featuring...', and then insert your detailed description.
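Since a captioning model does not always obey a "MUST ALWAYS start with" instruction, it can help to verify the captions before training. The following Python sketch is an assumption about how such a check might look and is not part of the original workflow; the file names and sample captions are hypothetical, and the prefix is copied verbatim from the captioning prompt above.

```python
# Prefix copied verbatim from the captioning prompt above.
REQUIRED_PREFIX = "A scene from Studio Ghibli animated film, featuring"


def find_bad_captions(captions: dict[str, str]) -> list[str]:
    """Return the names of captions that do not start with the required prefix."""
    return [
        name
        for name, text in captions.items()
        if not text.strip().startswith(REQUIRED_PREFIX)
    ]


# Hypothetical sample data, for illustration only.
sample = {
    "frame_001.txt": "A scene from Studio Ghibli animated film, featuring a girl on a hill.",
    "frame_002.txt": "The image shows a castle floating among clouds.",
}
bad = find_bad_captions(sample)
print(bad)
```

Captions flagged this way can be regenerated or fixed by hand.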

For training I used diffusion-pipe. Other possible choices are finetrainers (but it currently requires more than 24 GB of VRAM to train HV) and musubi-tuner (I have not yet managed to get good results with it, although that's not the software's fault).

Training was done on Windows 11 Home (WSL2) with 64 GB RAM, on a single RTX 3090. Training parameters were default (main, dataset), except:

rank = 16
lr = 6e-5

I saved a checkpoint after every epoch and trained for 20 epochs of 462 steps each, so 9,240 steps in total. The speed on the RTX 3090 was approximately 7 s/it (so each epoch took slightly less than 1 hour to train). After testing epochs 13 to 20, I chose epoch 19 as it was the most consistent and gave the fewest errors.
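The numbers above can be cross-checked with a few lines of arithmetic. This sketch uses only the figures given in the text; the 7 s/it speed is approximate, so the time estimates are approximate too.

```python
STEPS_PER_EPOCH = 462
EPOCHS_TRAINED = 20
CHOSEN_EPOCH = 19
SECONDS_PER_STEP = 7  # approximate speed on the RTX 3090

total_steps = STEPS_PER_EPOCH * EPOCHS_TRAINED        # all 20 epochs
chosen_steps = STEPS_PER_EPOCH * CHOSEN_EPOCH         # the published checkpoint
minutes_per_epoch = STEPS_PER_EPOCH * SECONDS_PER_STEP / 60
total_hours = total_steps * SECONDS_PER_STEP / 3600

print(f"total steps: {total_steps}")
print(f"steps at epoch {CHOSEN_EPOCH}: {chosen_steps}")
print(f"minutes per epoch: {minutes_per_epoch:.1f}")
print(f"hours for the full run: {total_hours:.1f}")
```

The step count at epoch 19 matches the 8,778 steps listed in the model details, and the full 20-epoch run works out to roughly 18 hours of training.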

The result is still far from perfect, but I hope to deliver upgraded versions. The next version will probably be trained on clips instead of images, but I need time to prepare the dataset.

Also, it is quite possible that the upcoming I2V model will render style LoRAs useless.

P.S. Just a lyrical moment: I'm still amazed we got such an outstanding local video model. I feel like this really is a Stable Diffusion moment for local video generation. No doubt we will get more models in the future that will surpass it, but HunyuanVideo will always be the first one, at least for me ❤️