Yesterday LTX-2 was released with full-throated support from Nvidia and ComfyUI. Unfortunately, getting it up and running is a bit more finicky than they planned. I've put together a Runpod template so you can test it out if you're curious. I don't recommend potentially messing up a local ComfyUI install for this.
This is bleeding edge, so be prepared for some weirdness in the process. Some of this may change, but it's accurate as of this writing on 2026/1/7.
VRAM Warning
This model is super memory hungry. While the model itself isn't very large, it needs more than 32GB of VRAM to produce more than 4 seconds of 720p video with the full model. The 5090 can only do 4 seconds. It will conceptually do up to 20 seconds, but you can safely do 10 with an H100 or another high-VRAM card. Use the fp8 model if you want to use a cheaper card, with the usual caveat: you're limiting the possibility space when the model is quantized. So it depends on what you want to test! The official requirements are as follows:
Minimum Requirements
GPU: NVIDIA GPU with a minimum 32GB+ VRAM - more is better
RAM: 32GB system memory
Storage: 100GB free space
CUDA: 11.8 or higher
Python: 3.10 or higher
Recommended Configuration
GPU: NVIDIA A100 (80GB) or H100
RAM: 64GB+ system memory
Storage: 200GB+ SSD
CUDA: 12.1 or higher
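If you want to sanity-check what a pod actually gives you before pulling models, here's a minimal check you can run from a Jupyter cell or terminal on the pod. It assumes PyTorch is installed (it ships with the ComfyUI container):

```python
import shutil
import torch

# Report total VRAM on the first GPU (the full LTX-2 model wants 32GB+).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU visible - check the pod configuration.")

# Report free disk space (the LTX-2 model downloads are large).
total, used, free = shutil.disk_usage("/workspace")
print(f"Free disk under /workspace: {free / 1024**3:.1f} GB")
```

The /workspace path is the usual RunPod volume mount; point it elsewhere if your storage lives somewhere else.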
Why LTX-2?
Speed-to-Quality. Sound.
Why RunPod?
From my guide: "If you're not using it, please use my link here. We'll both get some free credit. But why use it at all? GPUs are extremely expensive these days, and fast GPUs even more so. Once you break down the usage, the cost of hardware plus electricity doesn't make much sense. (I'm paying less than $1 an hour for access to an RTX 5090.) You're not locked into your investment when you rent. When the next gen comes, you'll get a faster card for less than you would have paid upfront. Chances are you're not running your card 100% of the time, so for most folks, buying a card is absolutely the wrong answer in this market.
I've got some cost estimates on the workflow page, if you'd like to read a bit more."
This is especially true for LTX-2, because of its memory hunger.
If you really want to take this for a ride and see more videos more quickly, the H100 SXM is ~$2.71 per hour and has 80GB of VRAM. (As a point of reference, 5-second clips at 1280x720 were finishing in about 80+ seconds when I was testing yesterday.)
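To put that in rough dollar terms, here's the back-of-the-envelope math using the rate and render time above (my observed numbers, not benchmarks, so treat the result as a ballpark):

```python
# Rough cost-per-clip arithmetic using the H100 SXM rate and render time quoted above.
rate_per_hour = 2.71      # USD per hour, approximate
seconds_per_clip = 80     # observed: ~80s for a 5-second 1280x720 clip

cost_per_clip = rate_per_hour / 3600 * seconds_per_clip
clips_per_hour = 3600 / seconds_per_clip
print(f"~${cost_per_clip:.2f} per clip, ~{clips_per_hour:.0f} clips per hour")
```

That works out to roughly six cents per test clip, which is why renting makes sense for poking at a brand-new model.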
Getting Started (Newbie)
If you are new to Runpod, you can use my guide here with the following changes:
3) Select the LTX-2 Testing template.
4) / 5) You can use less storage (already done by default), as there are effectively no LoRAs yet (though I did include the demo camera dolly LoRA in the downloads).
6) Under the edit button for Environment Variables, you can set the models you want to download to true. Mind the model sizes and the container disk size. The template has all 4 models available: dev and distilled, as well as their fp8 quantizations. You can pick which you want to download in the template settings. (They recommend the distilled model; I'm guessing it's more forgiving on prompting.) There's a sketch of how these flags typically look after this list.
9) The workflows folder will contain the official examples. You need to correct the default model paths in the loading nodes. Additionally, bypass the "Enhancer" node; then it should work. You will also need to either download the Gemma 3 model or replace the text encoder node (see below for both).
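For step 6, the variable names below are placeholders I made up for illustration; the real names are listed in the template's Environment Variables panel. This is just a sketch of the idea: each flag is a true/false switch that gates one model download when the pod starts.

```python
import os

# Hypothetical flag names - check the template's Environment Variables panel for the real ones.
# Each flag gates one model download at container start; "true" means download it.
FLAGS = {
    "DOWNLOAD_LTX2_DEV": "full dev model",
    "DOWNLOAD_LTX2_DEV_FP8": "dev model, fp8 quantization",
    "DOWNLOAD_LTX2_DISTILLED": "distilled model",
    "DOWNLOAD_LTX2_DISTILLED_FP8": "distilled model, fp8 quantization",
}

for name, description in FLAGS.items():
    enabled = os.environ.get(name, "false").strip().lower() == "true"
    print(f"{name} = {enabled} ({description})")
```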
Getting Started (Runpod Savvy)
Pick your GPU (at least CUDA 12.1) in a favorable region.
Pick the LTX-2 Testing template.
Adjust the Environment Variables for the models you want.
Open either the T2V or I2V template, adjust the model paths on the loaders, and bypass the "Enhancer" node.
(Optional) Fix for "Enhancer" node
Looks like someone made a patch for it, if you want to utilize it. I may update the image to apply the patch later, but in the interest of time, and since it's not a critical fix, I'd just like to make you aware of it. (Jupyter Notebook can be used to apply it.) I haven't tested this, so YMMV.
Some quick (opinionated) notes...
It's very fast for the resolution. An H100 can do a 10-second 1080p video in about 7 or 8 minutes, and it's even faster at lower resolutions and lengths.
The sound is weird. A little better than MMAudio, but not dramatically? Lipsync is decent, but the sound quality and intonation are often very unnatural. It loves to add weird pop/Bollywood background music if you are not careful in your prompting.
The quality for T2V and I2V is good at a glance. The motion gets janky and you get some horror shows depending on your prompt, like older AI video models, but it is reasonably high fidelity.
I found prompt coherence funky.
Character coherence is both better and worse than Wan 2.2: better end to end, but it has trouble if a character leaves the frame and the like.
It doesn't know naughty things. Like Wan, it's uncensored but "censored via omission". It vaguely understands nipples, but is stubborn or incorrect about anything else. (I tried extensively.)
The text to speech is also uncensored, as one should expect.
I was testing with the full dev model. I'll save further thoughts for a different space.
Resources
If you're hoping to do it yourself or troubleshoot, you'll want to look here.
There's also a prompting guide here.
Gemma 3 model?
Weird quirk: The official workflows use a version of Gemma 3 that requires a HuggingFace account, so you can agree to the terms and download it from there. There's a different safetensors file you can get from a ComfyUI repo that works too, but it requires a different node than the one in the provided template. The template will download this ComfyUI version automatically; for the proper file from the Google repo, you'll need to use an authenticated Hugging Face command-line tool. (I've not investigated whether they produce meaningful differences or if it's just a packaging difference.)
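If you want the official Google version, the huggingface_hub Python library can pull it once you've accepted the terms on the model page and created an access token. A minimal sketch; the repo id is my assumption about which Gemma 3 variant the workflow expects (check the LTX-2 docs for the exact one), and the destination path assumes the standard RunPod ComfyUI layout:

```python
from huggingface_hub import snapshot_download

# Prerequisite: accept the Gemma license on the Hugging Face model page,
# then create a read-scope access token at huggingface.co/settings/tokens.
snapshot_download(
    repo_id="google/gemma-3-12b-it",  # assumption - use the variant the LTX-2 docs specify
    local_dir="/workspace/ComfyUI/models/text_encoders/gemma-3",  # assumes the usual RunPod ComfyUI layout
    token="hf_...",  # your Hugging Face token
)
```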
Additionally, the example workflows' "Enhance" node doesn't work for me. This is just an LLM pass that alters your positive prompt, so you can just bypass it.
How do I use the included Gemma 3 model?
Because this just includes the default templates, you can make a small change to use it: add an "LTX Audio Video Text Encoder Loader" node (LTXAVTextEncoderLoader), select the safetensors file, and then attach it to the yellow CLIP inputs. That should do it!

The green node (LTXV Audio Text Encoder Loader) at the bottom should replace the one highlighted in pink (LTX Gemma 3 Model Loader) if you use the included Gemma 3 model.

