UPDATE FROM 24 JAN 2025
This is a post update on how I trained my 2nd LoRA, "Porn Movie Director":
https://civitai.com/models/1177810
Commands for the first day of training:
python cache_latents.py --dataset_config configs/config11.toml --vae ckpts/hunyuan-video-t2v-720p/vae/pytorch_model.pt --vae_chunk_size 32 --vae_tiling
python cache_text_encoder_outputs.py --dataset_config configs/config11.toml --text_encoder1 ckpts/text_encoder/llava_llama3_fp16.safetensors --text_encoder2 ckpts/text_encoder_2/clip_l.safetensors --batch_size 16
accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py --dit ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.safetensors --dataset_config configs/config11.toml --sdpa --split_attn --blocks_to_swap 18 --mixed_precision bf16 --fp8_base --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing --max_data_loader_n_workers 2 --persistent_data_loader_workers --network_module=networks.lora --network_dim=32 --network_args "loraplus_lr_ratio=4" --timestep_sampling shift --discrete_flow_shift 7.0 --max_train_epochs 500 --save_every_n_epochs=5 --seed 42 --log_config --log_with tensorboard --logging_dir vlogs --output_dir outputs --output_name mul1t
python convert_lora.py --input outputs/mul1t-000050.safetensors --output converted/mul1t-000050-converted.safetensors --target other
Second day of training - I continued training from the previous LoRA file.
accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py --dit ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.safetensors --dataset_config configs/config11.toml --sdpa --split_attn --blocks_to_swap 18 --mixed_precision bf16 --fp8_base --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing --max_data_loader_n_workers 2 --persistent_data_loader_workers --network_module=networks.lora --network_dim=32 --network_args "loraplus_lr_ratio=4" --timestep_sampling shift --discrete_flow_shift 7.0 --max_train_epochs 500 --save_every_n_epochs=5 --seed 42 --log_config --log_with tensorboard --logging_dir vlogs2 --output_dir outputs --output_name mul2t --network_weights outputs/mul1t-000050.safetensors
python convert_lora.py --input outputs/mul2t-000035.safetensors --output converted/mul2t-000035-conv.safetensors --target other
These are the final console commands that I used to train the 3rd iteration of the LoRA and then convert it to a ComfyUI-compatible format:
accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py --dit ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.safetensors --dataset_config configs/config11.toml --sdpa --split_attn --blocks_to_swap 18 --mixed_precision bf16 --fp8_base --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing --max_data_loader_n_workers 2 --persistent_data_loader_workers --network_module=networks.lora --network_dim=32 --network_args "loraplus_lr_ratio=4" --timestep_sampling shift --discrete_flow_shift 7.0 --max_train_epochs 500 --save_every_n_epochs=5 --seed 42 --log_config --log_with tensorboard --logging_dir vlogs3 --output_dir outputs --output_name mul3t --network_weights outputs/mul2t-000035.safetensors
python convert_lora.py --input outputs/mul3t-000035.safetensors --output converted/mul3t-000035-conv.safetensors --target other
config11.toml:
# resolution, caption_extension, batch_size, enable_bucket, bucket_no_upscale must be set in either general or datasets
# general configurations
[general]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false
# 6 seconds (face + suck + fuck)
[[datasets]]
resolution = [480, 272]
video_directory = "train11/train_data6_6s_full"
cache_directory = "cache6_01"
target_frames = [145]
frame_extraction = "head"
# 4 seconds (face + suck)
[[datasets]]
resolution = [480, 272]
video_directory = "train11/train_data6_4s_face_suck"
cache_directory = "cache6_02"
target_frames = [97]
frame_extraction = "head"
# 4 seconds (face + fuck)
[[datasets]]
resolution = [480, 272]
video_directory = "train11/train_data6_4s_face_fuck"
cache_directory = "cache6_03"
target_frames = [97]
frame_extraction = "head"
# 2 seconds (face)
[[datasets]]
resolution = [640, 362]
video_directory = "train11/train_data6_2s_face"
cache_directory = "cache6_04"
target_frames = [49]
frame_extraction = "head"
# 2 seconds (suck)
[[datasets]]
resolution = [640, 362]
video_directory = "train11/train_data6_2s_suck"
cache_directory = "cache6_05"
target_frames = [49]
frame_extraction = "head"
# 2 seconds (fuck)
[[datasets]]
resolution = [640, 362]
video_directory = "train11/train_data6_2s_fuck"
cache_directory = "cache6_06"
target_frames = [49]
frame_extraction = "head"
# other datasets can be added here. each dataset can have different configurations
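A quick sanity check you can run over the clips before caching - this is just a minimal sketch of my own (it assumes OpenCV is installed, pip install opencv-python), with the directories and frame counts taken from config11.toml above. It only flags clips that are shorter than their dataset's target_frames value; the resolution is printed for reference only, since bucketing takes care of resizing:
# Minimal sanity-check sketch, not part of Musubi Tuner.
# Flags clips shorter than the target_frames value of their dataset in config11.toml.
import os
import cv2

datasets = [
    ("train11/train_data6_6s_full",      145),
    ("train11/train_data6_4s_face_suck",  97),
    ("train11/train_data6_4s_face_fuck",  97),
    ("train11/train_data6_2s_face",       49),
    ("train11/train_data6_2s_suck",       49),
    ("train11/train_data6_2s_fuck",       49),
]

for directory, min_frames in datasets:
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".mp4"):
            continue
        cap = cv2.VideoCapture(os.path.join(directory, name))
        frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        cap.release()
        if frames < min_frames:
            print(f"TOO SHORT: {directory}/{name} has {frames} frames ({width}x{height}), needs {min_frames}")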
You need to rename the downloaded dataset files to:
train11.zip.001
train11.zip.002
train11.zip.003
train11.zip.004
because it is a multi-part zip archive, but Civitai does not allow attaching files with .001/.002 extensions.
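If you do not want to join the renamed parts by hand, here is a minimal sketch (my assumption: the .001/.002 parts are a plain byte-split of one train11.zip, which is the usual case for this naming scheme):
# Minimal sketch: concatenate the renamed parts back into one zip and extract it.
import glob
import shutil
import zipfile

parts = sorted(glob.glob("train11.zip.0*"))   # train11.zip.001 ... train11.zip.004
with open("train11.zip", "wb") as joined:
    for part in parts:
        with open(part, "rb") as chunk:
            shutil.copyfileobj(chunk, joined)

with zipfile.ZipFile("train11.zip") as archive:
    archive.extractall("train11")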
Notes to myself:
Note 1:
I use "Movie Director" with my another private LoRa trained on images only to change people look to make them from more real (street view real photographs) and to combine two LoRas I use "HunyuanVideo Lora Block Edit" nodes and currently I am not very satisfied with results of mixing two Loras because either Hunyuan forgets about video lora movements, scene arrangement (if I activate all blocks in image lora) or forgets about the look if I activate only double blocks in that node. I will try to train next image LoRa with double blocks only - there is a way to do it in musubi tuner by passing some "network" exclude param.
Note 2:
I want to do another experiment - create a "Sex Video Container" LoRA trained on only 3 unique tags, each associated with a visual marker overlay (red triangles, blue circles, green squares). Each video will be 2 seconds (48 frames) and will feature a different sex pose from a porn video with the same actress. This way I hope the model will learn that each trigger word is associated with a random sex scene or pose, but that the woman's appearance must be preserved. If the experiment succeeds, I will be able to use my "Container" LoRA with 3 other video LoRAs, composing a unique movie in whatever sequence I want. For example, I will download these LoRAs:
1) breast massage (LoRA with trigger word "m@ssage")
2) fucking in doggystyle (trigger word "doggy1x")
3) tentacle fuck (trigger word "tent@cle$")
Then I will format the prompt as:
cn81t: A black man doing m@ssage to a woman named Julie with blonde hair.
Then z15xj: doggy1x An old man is fucking Julie in ass
Then y9s8z: Julie is fucked by tent@cle$ in her pussy
I hope that my LoRA, trained on only three trigger words (cn81t, z15xj, y9s8z), will act as glue to compose the three scenes together and preserve the woman's unique appearance across the three shots.
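To prepare the captions for that experiment, a hypothetical sketch (the folder names are made up and the caption text is a placeholder - only the three trigger words come from the plan above):
# Hypothetical sketch: one folder of 2-second clips per trigger word,
# every clip gets a .txt caption starting with that folder's trigger word.
import glob
import os

triggers = {
    "container_scene1": "cn81t",   # folder names are made up
    "container_scene2": "z15xj",
    "container_scene3": "y9s8z",
}

for folder, trigger in triggers.items():
    for video in glob.glob(os.path.join(folder, "*.mp4")):
        caption_path = os.path.splitext(video)[0] + ".txt"
        with open(caption_path, "w") as f:
            f.write(trigger + "\n")   # placeholder caption, just the trigger word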
Note 3: I should probably train the 3-scenes LoRA not on 145 frames but on 129 frames only, because in the official HunyuanVideo paper they wrote that they trained the model on 129 frames, and my additional frames possibly made the LoRA worse.
====================================================
OLD POST:
Follow installation instructions on Musubi Tuner here:
https://github.com/kohya-ss/musubi-tuner
After everything is installed (I used a miniconda virtual environment) and working, here are the commands I used to generate this LoRA:
https://civitai.com/models/1130125
1)
python cache_latents.py --dataset_config configs/config5.toml --vae ckpts/hunyuan-video-t2v-720p/vae/pytorch_model.pt --vae_chunk_size 32 --vae_tiling
2)
python cache_text_encoder_outputs.py --dataset_config configs/config5.toml --text_encoder1 ckpts/text_encoder/llava_llama3_fp16.safetensors --text_encoder2 ckpts/text_encoder_2/clip_l.safetensors --batch_size 16
3)
accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py --dit ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt --dataset_config configs/config5.toml --sdpa --mixed_precision bf16 --fp8_base --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing --max_data_loader_n_workers 2 --persistent_data_loader_workers --network_module=networks.lora --network_dim=32 --timestep_sampling sigmoid --discrete_flow_shift 1.0 --max_train_epochs 100 --save_every_n_epochs=3 --seed 42 --output_dir outputs --output_name pr0neb0ne
4)
python convert_lora.py --input outputs/pr0neb0ne.safetensors --output converted/pr0neb0ne-converted.safetensors --target other
It took me 18 hours on an RTX 3090 to train this LoRA. If you find more optimized settings, please share them in the comments section below.
The content of the train_data5 folder is listed below; the cache5 folder was empty.
Each video is 480x270px, 24fps, 100 frames.
To resize the videos to this resolution (the originals are 1920x1080, i.e. four times larger in each dimension) I asked ChatGPT for help - it wrote me Python scripts that resized the videos with ffmpeg, changed the fps, and trimmed them to 100 frames. The .txt files in the listing below are just the text prompts that you can see on my LoRA page.
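For reference, a minimal sketch of what such a preprocessing script can look like (this is not the exact script ChatGPT gave me; it assumes ffmpeg is on PATH and that the made-up folder source_videos holds the 1920x1080 originals):
# Minimal sketch: downscale to 480x270, force 24 fps, keep only the first 100 frames.
import glob
import os
import subprocess

os.makedirs("train_data5", exist_ok=True)
for src in glob.glob("source_videos/*.mp4"):          # "source_videos" is a made-up folder name
    dst = os.path.join("train_data5", os.path.basename(src))
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", "scale=480:270,fps=24",   # downscale and force 24 fps
        "-frames:v", "100",              # keep only the first 100 frames
        "-an",                           # drop audio, the trainer does not use it
        dst,
    ], check=True)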
train_data5 folder:
(Files share a number with a letter suffix - e.g. 04a, 04b, 04c - when they are clips cut from the same source video.)
In total there were only 6 different videos.
01 - side view
02 - side view
03 - closeup
04 - front view
05 - back view
06 - side view
01a.mp4
01a.txt
01b.mp4
01b.txt
02a.mp4
02a.txt
02b.mp4
02b.txt
03a.mp4
03a.txt
04a.mp4
04a.txt
04b.mp4
04b.txt
04c.mp4
04c.txt
05a.mp4
05a.txt
06a.mp4
06a.txt
06b.mp4
06b.txt
06c.mp4
06c.txt
06d.mp4
06d.txt
config5.toml:
# resolution, caption_extension, batch_size, enable_bucket, bucket_no_upscale must be set in either general or datasets
# general configurations
[general]
resolution = [480, 270]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false
[[datasets]]
video_directory = "train_data5"
cache_directory = "cache5"
target_frames = [1, 25, 45, 73]
frame_extraction = "head"
# other datasets can be added here. each dataset can have different configurations