Cartoonishly Eaten - Wan2.1 14b T2V (use VACE for I2V/FLF) - LTX-2 Showcase
LTX-2 Eat
LoRA that gobbles everyone! (Now with sound. Really, turn it on in the examples.)
Basically, you start the video with a subject. Then, suddenly, the camera zooms out, revealing that the subject is now miniaturized, and another character steps up and eats them cartoonishly and non-graphically.
This is my fifth LTX-2 LoRA (published globally). It also marks the start of porting my legacy Civitai LoRAs from Wan to LTX-2.
This LoRA works best with first-last frame (FLF), though a start frame alone may be sufficient if you describe the other subject well. Beware: FLF inherits all of LTX-2's flaws and can do slideshow-like things from time to time (idk why it spawns the first and the end frame at the end; the best fix is to simply cut that part off). Easter egg: characters can devour themselves in a loop if you use the same picture for the first and the last frame.
In contrast to all my previous LTX-2 LoRAs, this one was super hard to train. Even with CREPA, TREAD, FFN unfreeze, higher rank, and Prodigy, the loss didn't go down much and even showed signs of divergence (an initially stable loss curve eventually turning into insanely frequent oscillations without any decline). Needless to say, all I could see was pure body horror: tongues, hands themselves being eaten, distorted limbs, etc.
Then I remembered that when the loss oscillates heavily, Huber loss rather than MSE is needed for sharper results. I used scheduled Huber loss (exponential) and it stabilized the loss curve a lot, finally producing the much-needed downturn. Interestingly, this loss choice made the CREPA semantic regularization loss curve more than just a monotonic sigmoid: it even has smooth hills.
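For the curious, here is a rough idea of what a scheduled Huber loss with an exponential schedule can look like. This is only a minimal sketch in plain Python, not the actual SimpleTuner code: the function names, the pseudo-Huber form, the direction of the schedule, and the values `num_timesteps=1000` and `delta_min=0.1` are all my assumptions for illustration.

```python
import math


def huber_delta(timestep, num_timesteps=1000, delta_min=0.1):
    """Hypothetical exponential schedule for the Huber threshold.

    delta is 1.0 at timestep 0 and decays to delta_min at the last timestep.
    """
    alpha = -math.log(delta_min) / num_timesteps
    return math.exp(-alpha * timestep)


def pseudo_huber(pred, target, delta):
    """Smooth Huber variant, averaged over all elements.

    Behaves like squared error for errors much smaller than delta and
    grows only linearly (about 2 * delta * |error|) for large errors.
    """
    total = 0.0
    for p, t in zip(pred, target):
        err = p - t
        total += 2.0 * delta * (math.sqrt(err * err + delta * delta) - delta)
    return total / len(pred)


def scheduled_huber_loss(pred, target, timestep):
    """Pick delta from the schedule, then apply the pseudo-Huber loss."""
    return pseudo_huber(pred, target, huber_delta(timestep))
```

The point is that big outlier errors get penalized linearly instead of quadratically, so a few wild samples can't kick the gradients around as hard as they do with MSE, and how tolerant the loss is depends on the sampled timestep.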
Warning: because deep-feature methods (CREPA or TREAD) were used, some of the videos might have a slightly washed-out feel. If you run into this, try adding "vivid colors" to the positive prompt and things like "washed out, gray" to the negative prompt. It also goes much better if the start images themselves are vivid.
The runtime for this experiment totaled 5 hours (and five failed attempts, ranging up to 8 hours). The hardware used for training was 1x5090, with zero blocks swapped, ~4 s/it.
The dataset consists of 6 organic video fragments (repeated 2 times), which the original LoRA was trained on, plus 47 hand-picked Wan2.2 generations made with that LoRA applied. The final checkpoint was picked at 4000 optimization steps. I recommend checking out the Huber loss curve over the steps: its steady decline looks awesome and is unusual for diffusion models.
The SimpleTuner training and dataset configs are available as config.json and ltx2-multiresolution-eat-t2v-v2.json, respectively, on Hugging Face (kabachuha/ltx2-eat).
The ComfyUI workflows are embedded in the .mp4 video files and are also available in the same repo.
You can use "eat style" as the trigger phrase for generation.
However, you actually shouldn't right now, because LTX-2 will add a spoken "eat style" at the beginning of the video. Just describe the action similarly to the prompts from the examples and it will do the job! Describing the slurping sounds is also recommended for a better experience.
