The Tale of the Princess Kaguya 🎥 Wan2.1-T2V-14B


Type: LoRA (SafeTensor)
Base Model: Wan Video 14B t2v
Published: Jun 8, 2025
Training Steps: 34,000
Usage Tips: Strength 1
Trigger Words: Kaguya-hime style, Takenoko, Menowarawa, Sutemaru
Training Images: available for download
Hash (AutoV2): 98565BBCF6
Author: seruva19

About

The Tale of the Princess Kaguya (2013) is Isao Takahata's beautiful take on Japan's oldest folktale, The Tale of the Bamboo Cutter. The story centers on a celestial being found as a baby inside a bamboo stalk by a humble bamboo cutter and his wife. As she grows into a stunning young woman, she struggles against the rigid expectations of noble society while longing for the freedom of her simple rural upbringing. The film explores questions of identity, freedom, and life's fleeting nature, creating a thoughtful meditation on what we truly want from life and whether we can ever really find happiness.

What makes this film truly special is its extraordinary visual style, which draws heavily from traditional Japanese sumi-e ink paintings and woodblock prints. Takahata deliberately chose a minimalist, hand-drawn approach that feels almost unfinished at times, but this roughness actually draws you deeper into the story. The watercolor backgrounds are soft and dreamlike, while the character animation uses calligraphic lines that change with the mood, flowing and gentle during peaceful moments, then sharp and erratic when emotions run high. This artistic choice does more than just look beautiful; it connects the film to centuries of Japanese artistic tradition while also breaking away from the polished, computer-perfect animation we're used to seeing. There's something powerful about the way Takahata embraces imperfection and lets your imagination fill in the gaps. The result is a film that feels both ancient and completely fresh, earning widespread critical praise and establishing itself as one of the finest animated films ever made.

Description

This LoRA, like Redline, continues a series of animation styles that are impossible to fully express in simple static frames. It also tries to push further (though not entirely successfully) in breaking away from Wan's default 3D render bias, aiming to fully replace it with a 2D hand-drawn animation style. The goal was to transform not just the visual aesthetic, but also the movement, pacing, composition, and overall energy. Of course, one small LoRA alone isn't enough to achieve all of this.

🙂 Nevertheless, the model learned the style decently, better than I expected, capturing traits of The Tale of the Princess Kaguya's style such as simplified landscapes, deliberate use of blank space, the ink-wash drawing manner, the effect of an unfinished painting, the folds of fabric, and its handling of emotion, movement, stillness, and minimalism. It also generalized well, depicting not only period-authentic landscapes and people but also futuristic scenes, medieval scenes, modern settings, and so on.

🙁 One pesky flaw I found only during testing and could not fix: on some videos, strange glares appear on certain objects. Earlier checkpoints (e.g., at 15K-19K steps) show the same effect, so this is not overfitting (the LoRA shows no bias toward specific structures appearing in frames, which would indicate overfitting, and captions from the dataset itself, used as prompts, may render completely different videos than those in the original dataset). Disabling optimizations (Sage Attention, etc.) and playing with sampling parameters does not help much. Lowering the LoRA strength mitigates the effect, but drifts further away from the target style. I also tried disabling some LoRA blocks, with no luck.

The worst thing about this effect is that it's totally unpredictable, and I just don't like that I don't understand the reason behind it. My best guess is that there is some implicit pattern of the style (say, soft, diffused lighting with almost no shadows) that the model tries to interpolate and reproduce on entities the dataset did not contain, but it does so improperly. The training data lacks examples of strong directional lighting, such as sun glare, bloom, or specular highlights, because this is specific to the style of The Tale of the Princess Kaguya. So, when prompted with scenes that imply a lighting source, the model faces a contradiction between the learned and prior representations, and it "hallucinates" visual elements such as soft glows, edge blooms, or faint glares.

⌛ If I manage to pin down the definite cause of this effect, I will retrain the LoRA (or do a calibration post-tune). But work on this model turned out long and exhausting, and right now I cannot afford another 2-3 weeks of possibly pointless experiments.

Usage

Add "Kaguya-hime style" to the prompt. It probably can work without trigger words, I just usually add it anyway and never tested it without, so I don't know how it behaves without them.

I use Kijai's wrapper, but it should work with the native workflow as well.

All videos were created using the base WanVideo2.1-14B-T2V model; each video contains an embedded ComfyUI workflow.

An example of the workflow in JSON is here.
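
For completeness, here is a minimal sketch of how the LoRA could be used outside ComfyUI through the diffusers Wan 2.1 pipeline. I have only tested the LoRA with the ComfyUI workflows above, so treat this as an untested assumption: the checkpoint id, local LoRA path, and sampling parameters are placeholders, and I have not verified that diffusers loads this particular LoRA file directly.

# Minimal, untested sketch: base Wan2.1-T2V-14B in diffusers plus this LoRA.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # the 14B model does not fit comfortably on a 24 GB card

# Placeholder local directory/filename for the LoRA; default strength (1.0) matches the usage tip above.
pipe.load_lora_weights("./loras", weight_name="kaguya_wan14b.safetensors")

prompt = ("Kaguya-hime style, a young woman in a white kimono walks through a "
          "moonlit bamboo grove. Camera pans slowly as fireflies drift around her.")
frames = pipe(prompt=prompt, height=480, width=832,
              num_frames=81, guidance_scale=5.0).frames[0]
export_to_video(frames, "kaguya_sample.mp4", fps=16)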

Dataset

The dataset was sourced from The Tale of the Princess Kaguya film. The film was split into clips using PySceneDetect, then I converted the clips to 16 fps and manually chose 295 of them. A better decision would have been to cut this down (to 100-150 clips); such a large number is only justified for a diverse dataset. In my case, I couldn't bring myself to select fewer: each clip is a piece of art, and it was hard to decide which to leave out and which to keep. I also extracted about 1000 frames from these videos (using ffmpeg), from which I manually selected 240 images to form a high-res image dataset.
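
The preparation pipeline (scene splitting, 16 fps conversion, frame extraction) can be scripted roughly as below. This is a simplified sketch using PySceneDetect's Python API and plain ffmpeg calls; the file names, detector defaults, and frame-sampling rate are placeholders rather than my exact settings, and the manual selection steps obviously stay manual.

import subprocess
from pathlib import Path
from scenedetect import detect, ContentDetector, split_video_ffmpeg

source = "kaguya.mkv"                        # placeholder path to the source film
scenes = detect(source, ContentDetector())   # content-based shot boundary detection
split_video_ffmpeg(source, scenes)           # writes one "...-Scene-NNN.mp4" clip per detected scene

Path("clips_16fps").mkdir(exist_ok=True)
Path("frames").mkdir(exist_ok=True)

for clip in sorted(Path(".").glob("*-Scene-*.mp4")):
    clip16 = Path("clips_16fps") / clip.name
    # Re-encode each clip to 16 fps (the 295 training clips were picked manually after this step).
    subprocess.run(["ffmpeg", "-y", "-i", str(clip), "-vf", "fps=16", str(clip16)], check=True)
    # Dump candidate stills for the high-res image subset (also filtered manually afterwards).
    subprocess.run(["ffmpeg", "-y", "-i", str(clip16), "-vf", "fps=1",
                    str(Path("frames") / f"{clip.stem}_%04d.png")], check=True)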

The dataset structure was intended to roughly mimic the bucket structure of the original data Wan was trained on (according to the official WanVideo report), which contained videos and images at 720p, 480p, and 192p. Accordingly:

1️⃣ The first part of the dataset was 255 images with source resolution 1920x1040, training resolution 1328x720px (720p).

Images were captioned with Qwen2.5-VL-7B-Instruct using the following prompt:

You are an expert visual-scene describer.
For the following animated frame (a still extracted from a video scene), write a detailed, highly descriptive caption that:
- Is around 80-100 words long.
- Begins with the exact phrase: "Kaguya-hime style".
- Uses present-tense, simple, concise and concrete language that describes only what is visible in the frame.
- Follows the order "Subject → Scene → Implied Motion/Atmosphere" (e.g., "Kaguya-hime style, a small boy in a gray tunic stands beside a wooden gate at dawn, morning mist wafts around the thatched roofs behind him.").
- Includes precise details (age, gender, clothing color, major objects, environment, weather, time of day).
- Contains no emotional adjectives, no abstract narrative, and no style words except the required prefix.
- For scenes with multiple subjects, focuses on the primary figure(s) in the action.
Use the following template: "Kaguya-hime style, [optional shot type (close-up, medium shot, wide shot)] of a [subject with visual details] [pose/static position or gentle implied action]. [Detailed setting]. [Subtle dynamic element or atmospheric cue (wind, drifting petals, rippling water, lantern glow) to avoid static feel]. [Additional visual context or background detail]." (example: "Kaguya-hime style, an elderly craftsman in a worn gray kimono carves a wooden figure in a humble workshop. Light dust particles floating in the sunbeams. Tools and wood shavings scatter across the low table. Shadows lengthen across the tatami floor.")
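
Below is a minimal sketch of how such captioning can be driven from Python with transformers and qwen-vl-utils; the prompt above goes in as the text part of the message. The loading options, generation length, and file paths are placeholder assumptions, not my exact setup. Clips can be captioned the same way by replacing the "image" entry with a {"type": "video", "video": ...} entry.

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

CAPTION_PROMPT = "You are an expert visual-scene describer. ..."  # the full prompt quoted above

def caption_image(path: str) -> str:
    messages = [{"role": "user", "content": [
        {"type": "image", "image": path},
        {"type": "text", "text": CAPTION_PROMPT},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=160)
    out = out[:, inputs.input_ids.shape[1]:]          # keep only the newly generated tokens
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

# Example: write one ".highres" caption per selected 720p frame (path is a placeholder).
# print(caption_image("frames/kaguya-Scene-001_0001.png"))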

2️⃣ The second part of the dataset was 295 clips with source resolution 1920x1040, training resolution 880x480px (480p). In the dataset config they were assigned:

  • frame_extraction = "head" and target_frames = [13]

    (the max number of frames I could afford on an RTX 3090 at this resolution without sacrificing training speed too much).

This part of the dataset was also captioned with Qwen2.5-VL-7B-Instruct, using the following prompt (enforcing detailed captions and a focus on details):

You are an expert visual-scene describer.
For the following animated video clip, write a detailed, highly descriptive caption that:
- Is around 80-100 words long.
- Begins with the exact phrase: "Kaguya-hime style".
- Uses present-tense, simple, concise and concrete language that describes only what is visible on-screen.
- Follows the order "Subject → Scene → Motion/Camera" (e.g., "Kaguya-hime style, a young woman in a white kimono. She walks through a moonlit bamboo grove. Camera pans slowly as fireflies drift around her.")
- Includes camera movement when visible (pans, zooms, tilts). If no obvious camera movement, focus on subject and environmental motion
- Includes precise details (age, gender, clothing color, major objects, environment, weather, time of day, camera movement).
- Contains no emotional adjectives, no abstract narrative, and no style words except the required prefix.
- For scenes with multiple subjects, focuses on the primary figure(s) in the action.
- Emphasizes any visible motion, including subtle movements like fabric swaying, particle effects, or environmental changes.
Use the following template: "Kaguya-hime style, [optional shot type, if clear] of a [subject description with visual details] [action/motion]. [Detailed setting description]. [Camera movement]. [Additional background elements or atmospheric details]." (example: "Kaguya-hime style, medium shot of a young woman with long black hair in a simple white kimono walking slowly through a bamboo forest at dusk. Camera panning alongside her. Golden light filters through the swaying bamboo stalks. Fallen leaves scatter in her path."

3️⃣ For the third part, the same 295 videos were used, training resolution 352x192px (192p). The maximum number of frames I could use was 49. These videos were divided into three groups (see the sketch after this list for how these settings sample frames):

  • 34 to 49 frames: target_frames = [33], frame_extraction = "uniform", frame_sample = 2

  • 50 to 100 frames: target_frames = [49], frame_extraction = "uniform", frame_sample = 2

  • 101 to 160 frames: target_frames = [49], frame_extraction = "uniform", frame_sample = 3
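
To make these parameters a bit more concrete, below is a simplified illustration of how I understand musubi-tuner's frame extraction: "head" takes a single window of the first target_frames frames, while "uniform" takes frame_sample windows of target_frames frames whose start positions are spread evenly across the clip. This is only my reading of the behavior, not the tuner's actual code.

def extract_head(num_frames: int, target_frames: int) -> list[list[int]]:
    # "head": one window made of the first target_frames frames of the clip.
    return [list(range(min(target_frames, num_frames)))]

def extract_uniform(num_frames: int, target_frames: int, frame_sample: int) -> list[list[int]]:
    # "uniform": frame_sample windows of target_frames consecutive frames,
    # with start positions spread evenly from the beginning to the end of the clip.
    last_start = max(num_frames - target_frames, 0)
    starts = [round(i * last_start / max(frame_sample - 1, 1)) for i in range(frame_sample)]
    return [list(range(s, min(s + target_frames, num_frames))) for s in starts]

# A 120-frame clip falls into the 101-160 group (target_frames=49, frame_sample=3):
# the three windows start at frames 0, 36, and 71.
print([w[0] for w in extract_uniform(120, 49, 3)])  # [0, 36, 71]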

This part of the dataset was likewise captioned with Qwen2.5-VL-7B-Instruct, using the following prompt (enforcing brief captions that do not focus on details):

You are an expert visual-scene describer for animated video clips in the elegant, ink-wash-inspired aesthetic. Write a one-sentence caption, 15-30 words, following this template:
"Kaguya-hime style, [main subject] [action/movement or state] in/on [specific location with background elements], with [vivid details: color, lighting, weather, atmosphere], during [time of day or context-appropriate temporal description]."
Example: "Kaguya-hime style, a lone samurai wanders through a misty bamboo grove, with moonlight casting soft shadows, during a tranquil midnight, evoking timeless ink-wash serenity."
Requirements:
- Identify the primary subject (e.g., character, animal, object).
- Describe its main action, movement, or state (use vivid verbs or adjectives).
- Specify the setting, including background elements (e.g., forests, rivers, architecture, or abstract motifs if minimal).
- Include rich visual details (e.g., shimmering moonlight, vibrant hues, misty air).
- Indicate time of day (e.g., dawn, twilight) or a fitting temporal context (e.g., "eternal night" for fantastical scenes).
- If the scene lacks a clear subject or time, prioritize vivid setting and atmosphere.
- Ensure captions are concise, evocative, and flow naturally.

All captions were manually checked (and I had to make plenty of corrections). Some characters were tagged explicitly ("Takenoko", "Menowarawa", and "Sutemaru"), although that was done just for fun; using these tags does not guarantee the exact character will be reproduced. Actually, this is not the ideal way to caption characters for a style LoRA; a better approach would be to tag characters only when they are the sole subjects in the frame. But that was fine for me this time, since I mostly planned to reproduce the "averaged" style, not specific characters.

🗃️ Here is the full toml file for the dataset config (note: the 2nd part of the dataset, the 480p videos, also gets three sections in the config even though the parameters are identical across them; the clips had already been sorted into three duration-based folders - 34-49, 50-100, and 101-160 frames - for the 192p version, and I simply reused the same folder structure for the 480p version for consistency and easier dataset management):

[general]
enable_bucket = true
bucket_no_upscale = true

[[datasets]]
image_directory = "H:/datasets/princess_kaguya/images/1920x1040/1"
cache_directory = "H:/datasets/princess_kaguya/images/1920x1040/1/cache_highres"
caption_extension = ".highres"
resolution = [1328, 720]
batch_size = 1
num_repeats = 1

[[datasets]]
video_directory = "H:/datasets/princess_kaguya/videos/1920x1040/34-49"
cache_directory = "H:/datasets/princess_kaguya/videos/1920x1040/34-49/cache_mediumres"
caption_extension = ".mediumres"
resolution = [880, 480]
batch_size = 1
num_repeats = 1
frame_extraction = "head"
target_frames = [13]

[[datasets]]
video_directory = "H:/datasets/princess_kaguya/videos/1920x1040/50-100"
cache_directory = "H:/datasets/princess_kaguya/videos/1920x1040/50-100/cache_mediumres"
caption_extension = ".mediumres"
resolution = [880, 480]
batch_size = 1
num_repeats = 1
frame_extraction = "head"
target_frames = [13]

[[datasets]]
video_directory = "H:/datasets/princess_kaguya/videos/1920x1040/101-160"
cache_directory = "H:/datasets/princess_kaguya/videos/1920x1040/101-160/cache_mediumres"
caption_extension = ".mediumres"
resolution = [880, 480]
batch_size = 1
num_repeats = 1
frame_extraction = "head"
target_frames = [13]

[[datasets]]
video_directory = "H:/datasets/princess_kaguya/videos/1920x1040/34-49"
cache_directory = "H:/datasets/princess_kaguya/videos/1920x1040/34-49/cache_lowres"
caption_extension = ".lowres"
resolution = [352, 192]
batch_size = 1
num_repeats = 1
frame_extraction = "uniform"
target_frames = [33]
frame_sample = 2

[[datasets]]
video_directory = "H:/datasets/princess_kaguya/videos/1920x1040/50-100"
cache_directory = "H:/datasets/princess_kaguya/videos/1920x1040/50-100/cache_lowres"
caption_extension = ".lowres"
resolution = [352, 192]
batch_size = 1
num_repeats = 1
frame_extraction = "uniform"
target_frames = [49]
frame_sample = 2

[[datasets]]
video_directory = "H:/datasets/princess_kaguya/videos/1920x1040/101-160"
cache_directory = "H:/datasets/princess_kaguya/videos/1920x1040/101-160/cache_lowres"
caption_extension = ".lowres"
resolution = [352, 192]
batch_size = 1
num_repeats = 1
frame_extraction = "uniform"
target_frames = [49]
frame_sample = 3

(I've uploaded the whole source dataset as well.)

Training

I used musubi-tuner for training (Windows 11, 64 GB RAM, RTX 3090).

There is nothing particularly interesting in the training parameters themselves; they were mostly taken from the Studio Ghibli LoRA.

🗃️ Here is an example of the batch script I used to launch training:

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision fp16 wan_train_network.py ^
    --task t2v-14B ^
	--vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors ^
	--t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth ^
	--dit E:/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/wan/wan2.1_t2v_14B_fp16.safetensors ^
	--blocks_to_swap 15 ^
	--flash_attn ^
	--mixed_precision fp16 ^
	--fp8_base ^
	--fp8_scaled ^
	--dataset_config G:/samples/musubi-tuner/_kaguya_wan14b_dataset.toml ^
	--gradient_checkpointing ^
    --max_data_loader_n_workers 2 ^
	--persistent_data_loader_workers ^
	--learning_rate 6e-5 ^
	--lr_scheduler constant_with_warmup ^
    --lr_warmup_steps 100 ^
    --optimizer_type adamw8bit ^
	--optimizer_args weight_decay=0.01 ^
    --network_module networks.lora_wan ^
	--network_dim 32 ^
	--network_alpha 32 ^
    --timestep_sampling shift ^
	--discrete_flow_shift 3.0 ^
    --output_dir G:/samples/musubi-tuner/output ^
	--output_name kaguya_wan14b ^
    --log_config ^
	--log_with all ^
	--wandb_api_key MY_WANDB_API_KEY ^
	--wandb_run_name kaguya ^
	--logging_dir G:/samples/musubi-tuner/logs ^
	--sample_prompts G:/samples/musubi-tuner/_kaguya_wan14b_sampling.txt ^
	--save_state ^
	--sample_every_n_steps 500 ^
	--save_every_n_steps 500  ^
	--max_train_epochs 50

Training ran until 42,000 steps; I then experimented with various checkpoints and selected the best one at 34,000 steps. Although I didn't pay much attention to the loss (the training samples showed positive dynamics, confirming the model was learning effectively without memorizing the data, which mattered more than loss values), I can mention that it steadily decreased from around 0.1 to 0.09 over the training run, showing consistent convergence without signs of overfitting.

P.S. I also attempted to post-finetune the final LoRA weights on a small synthetic dataset, trying to mitigate the effect mentioned above, but it did not prove effective and even made the model worse.