| Field | Value |
| --- | --- |
| Type | LoRA |
| Stats | 453 |
| Reviews | (74) |
| Published | Dec 30, 2024 |
| Base Model | HunyuanVideo |
| Training | Steps: 8,778; Epochs: 19 |
| Usage Tips | Strength: 1 |
| Trigger Words | A scene from a Studio Ghibli animated film |
| Training Images | Download |
| Hash | AutoV2 C4CC487F14 |
Disclaimer
Although HunyuanVideo has some knowledge of Studio Ghibli's art style (at least it switches to a "retro mode" anime look when prompted with Ghibli), it is not very consistent and sometimes falls back to realistic renders. The shading, palette and linework are also quite different. So with this LoRA I wanted to reinforce the Ghibli art style for HunyuanVideo.
This is the third version of the LoRA; the first two were unsuccessful and I did not publish them. This is not the final version either, and I will keep working on improving it.
upd. 05/01/2025: Trained v0.4 with musubi-tuner, but that LoRA turned out worse than v0.3, so I won't publish it (and I'm returning to good old diffusion-pipe). Meanwhile I have finished preparing the mixed dataset for v0.5 (the same 185 images, but at higher resolution, plus 765 short video clips). Hopefully the new version of the LoRA will be ready in 4-5 (upd. 08/01: 7-8) days; the training is about 2-3 times slower this time.
I am still figuring out how to train HV and do not yet know the best way to prompt it, so please take that into consideration.
Usage
For inference I use the default ComfyUI pipeline with just an additional LoRA loader node. Kijai's wrapper should work too (at least it worked a week ago, but I have since switched to the native workflow). All parameters are default except:
guidance: 7.0
steps: 30
That does not mean these values are optimal; they are just what I mostly used to generate clips, and other combinations might deliver better results.
The prompt template I am currently using is like this:
A scene from a Studio Ghibli animated film, featuring [CHARACTER DESCRIPTION], as they [ACTION] at [ENVIRONMENT], under [LIGHTING], with [ADDITIONAL SETTING DETAILS], while the camera [CAMERA WORK], emphasizing [MOOD AND AMBIANCE].
I usually feed a set of tags to an LLM, like "blonde woman, bare feet, ocean seashore, fine weather, etc.", and ask it to output a cohesive natural-language prompt following this template.
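For illustration, here is a minimal Python sketch of that tag-to-prompt step. The slot names and the build_prompt helper are hypothetical, just showing how the template pieces fit together; in practice I let the LLM write the natural language.

```python
# Hypothetical helper: fill the prompt template from a handful of slot values.
# The slot names below are assumptions for illustration, not a fixed schema.
TEMPLATE = (
    "A scene from a Studio Ghibli animated film, featuring {character}, "
    "as they {action} at {environment}, under {lighting}, "
    "with {details}, while the camera {camera}, emphasizing {mood}."
)

def build_prompt(character, action, environment, lighting, details, camera, mood):
    return TEMPLATE.format(
        character=character, action=action, environment=environment,
        lighting=lighting, details=details, camera=camera, mood=mood,
    )

# Example built from a tag list like "blonde woman, bare feet, ocean seashore, fine weather"
print(build_prompt(
    character="a blonde woman with bare feet",
    action="strolls along the waterline",
    environment="a sunlit ocean seashore",
    lighting="soft morning light",
    details="gentle waves and drifting clouds",
    camera="slowly tracks alongside her",
    mood="a calm, nostalgic mood",
))
```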
Training
Please keep in mind that my training routine is not optimal; I am just testing and experimenting, so it is possible the LoRA works not because the routine is good, but despite it being bad.
The current version of the LoRA is trained on 185 fragments (512x512) of screencaps from various Ghibli movies, captioned with CogVLM2. The captioning prompt was:
Create a very detailed description of this image as if it was a frame from Studio Ghibli movie. The description should necessarily 1) describe the main content of the scene, detail the scene's content, which notably includes scene transitions and camera movements that are integrated with the visual content, such as camera follows some subject 2) describe the environment in which the subject is situated 3) identify the type of video shot that highlights or emphasizes specific visual content, such as aerial shot, close-up shot, medium shot, or long shot 4) include description of the atmosphere of the video, such as cozy, tense, or mysterious. Do not use numbered lists or line breaks. IMPORTANT: output description MUST ALWAYS start with unaltered phrase 'A scene from Studio Ghibli animated film, featuring...', and then insert your detailed description.
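Since VLM captioners sometimes ignore the instruction to keep the fixed opening phrase, it can be worth sanity-checking the captions before training. Below is a minimal Python sketch (not part of my actual pipeline) that assumes each caption is stored as a per-image .txt file and verifies it starts with the required prefix:

```python
# Minimal sanity check: report captions that do not start with the fixed prefix
# requested in the CogVLM2 captioning prompt.
# Assumption: captions live as *.txt files in one directory, one per image.
from pathlib import Path

REQUIRED_PREFIX = "A scene from Studio Ghibli animated film, featuring"

def check_captions(caption_dir: str) -> None:
    bad = []
    for txt in sorted(Path(caption_dir).glob("*.txt")):
        text = txt.read_text(encoding="utf-8").strip()
        if not text.startswith(REQUIRED_PREFIX):
            bad.append(txt.name)
    if bad:
        print(f"{len(bad)} caption(s) missing the prefix:")
        for name in bad:
            print("  -", name)
    else:
        print("All captions start with the required prefix.")

if __name__ == "__main__":
    check_captions("dataset/captions")  # hypothetical path
```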
For training I used diffusion-pipe. Other possible choices are finetrainers (but it currently requires more than 24 GB of VRAM to train HV) and musubi-tuner (I have yet to get good results with it, although that is not the software's fault).
Training was done on Windows 11 Home (WSL2) with 64 GB RAM, on a single RTX 3090. Training parameters were the defaults (main, dataset), except:
rank = 16
lr = 6e-5
I saved a checkpoint at each epoch and trained for 20 epochs of 462 steps each, 9,240 steps in total. The speed on the RTX 3090 was approximately 7 s/it, so each epoch took slightly less than an hour to train. After testing epochs 13 through 20, I chose epoch 19 as the most consistent and least error-prone.
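For anyone estimating a similar run, the arithmetic behind those numbers works out like this (just a quick sketch using the figures above):

```python
# Rough time/step arithmetic for the run described above.
steps_per_epoch = 462
epochs = 20
sec_per_step = 7  # approximate speed on a single RTX 3090

total_steps = steps_per_epoch * epochs                  # 9240 steps
epoch_hours = steps_per_epoch * sec_per_step / 3600     # ~0.9 h, just under an hour
total_hours = total_steps * sec_per_step / 3600         # ~18 h for the whole run

print(total_steps, round(epoch_hours, 2), round(total_hours, 1))
```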
The result is still far from perfect, but I hope to deliver upgraded versions. The next version will probably be trained on clips instead of images, but I need time to prepare the dataset.
It is also quite possible that the upcoming I2V model will render style LoRAs useless.
P.S. Just a lyrical moment: I'm still amazed we got such an outstanding local video model. I feel like this is really a Stable Diffusion moment for local video generation. No doubt we will get more models in the future that will surpass it, but HunyuanVideo will always be the first one, at least for me ❤️