OIIA (Spinning Cat) [LTX-2.3]

https://civitai.com/models/2018677/oiia-spinning-cat

Trigger Word: OIIA_cat
Model: LTX-2.3 22B dev
For inference, two different workflows were used: a custom one with IC-LoRA-Detailer, and gguf_Q8.
A fork of musubi-tuner was used for training.

Musubi-tuner

My first exposure to the world of training LoRAs for video was through musubi-tuner. At the time I had an outdated Pascal-generation graphics card, and it was one of the few tools that ran on it. After a while I got access to a card with 24 GB of VRAM. I felt like I was riding high: a big toy to play with the big boys. I trained everything on diffusion-pipe. But then ltx-2 came out and everything was fucking broken, since the official training code wouldn't even start on 24 GB... I thought the end had come and my hobby of training nice LoRAs was over, but then it appeared. The musubi-tuner fork appeared! God bless whoever made this amazing fork for training ltx; they're just awesome. It allowed me to train the full ltx-2.3 model on 24 GB of VRAM at 640x640!

LTX 2.3 vs LTX 2

Although the training dataset was exactly the same for both versions, I got different results. In terms of video generation the models are close and share similar artifacts, but it seems to me the sound has improved significantly. With the ltx-2 version I couldn't get the oiia song without breaking the video, but the ltx-2.3 version handled it: the oiia song is clearly audible in the generations. There was one video of a piano in the training data, and it baked into the LoRA, so when generating cats the piano is clearly audible too. Perhaps I should have changed the prompt to explicitly state that a piano was playing there, but the effect turned out to be funny.

Training

The commands I used to prepare the data:

dataset_oiia.toml

[general]
num_repeats = 1
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false
cache_directory = "oiia_with_audio/cache"

[[datasets]]
resolution = [640, 640]
video_directory = "oiia_with_audio"
target_frames = [41, 49, 57, 65, 73]
source_fps = 24.0
target_fps = 24.0
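
As an illustrative aside (not part of the training pipeline): each entry in target_frames above corresponds to a clip duration of roughly frames / target_fps seconds, so these buckets cover clips of about 1.7 to 3 seconds.

```python
# Illustrative sketch: clip durations implied by the bucket settings above.
# Assumes duration ~= frames / fps (exact frame conventions may differ by one).
target_frames = [41, 49, 57, 65, 73]
target_fps = 24.0

for frames in target_frames:
    print(f"{frames} frames -> {frames / target_fps:.2f} s")
```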

python ltx2_cache_latents.py --dataset_config dataset_oiia.toml --ltx2_checkpoint ltx-2.3-22b-dev.safetensors --device cuda --vae_dtype bf16 --ltx2_mode av --ltx2_audio_source video
python ltx2_cache_text_encoder_outputs.py --dataset_config dataset_oiia.toml --ltx2_checkpoint ltx-2.3-22b-dev.safetensors --gemma_root gemma-3-12b-it-qat-q4_0-unquantized --gemma_load_in_4bit --device cuda --mixed_precision bf16 --batch_size 1 --ltx2_mode av

And the training command itself:

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 ltx2_train_network.py --mixed_precision bf16 --dataset_config dataset_oiia.toml --gemma_load_in_8bit --gemma_root gemma-3-12b-it-qat-q4_0-unquantized --separate_audio_buckets --ltx2_checkpoint ltx-2.3-22b-dev.safetensors --ltx_version 2.3 --ltx_version_check_mode error --ltx2_mode av --fp8_base --fp8_scaled --blocks_to_swap 10 --sdpa --gradient_checkpointing --learning_rate 1e-4 --optimizer_type AdamW8bit --network_module networks.lora_ltx2 --network_dim 32 --network_alpha 32 --timestep_sampling shifted_logit_normal --max_train_steps 6000 --save_every_n_steps 200 --output_dir oiia_ltx23_v03 --output_name ltx23_lora_v03
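
For reference (generic LoRA arithmetic, not code from the fork): with network_dim and network_alpha both set to 32, the usual LoRA scaling factor alpha/rank works out to 1.0, so the adapter delta is applied at full strength.

```python
# Generic LoRA math sketch (illustrative, not musubi-tuner's actual code).
# Effective weight: W_eff = W + (alpha / rank) * B @ A.
rank, alpha = 32, 32          # network_dim and network_alpha from the command above
scale = alpha / rank
print(scale)  # 1.0

# Toy 1-D illustration: B is zero-initialized, so at the start of training
# the LoRA-augmented output equals the base output exactly.
w, a, b = 0.5, 0.01, 0.0
x = 2.0
y_base = w * x
y_lora = w * x + scale * (b * (a * x))
print(y_base == y_lora)  # True
```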

I tried different optimizers and learning rates, but I liked the results from these parameters the most. And yes, it trained on a graphics card with 24 GB of VRAM :)
