LTX2 Img and Audio to Video

Type: Workflows
Stats: 143, 0
Published: Mar 2, 2026
Base Model: LTXV2
Hash (AutoV2): 50359B1D0B

This workflow is a pipeline for the LTX-2 model. It generates video from a source image and an audio track, producing synchronized content such as music videos with lip-sync and dance.

Key Features & Architecture

The workflow is organized into distinct logical stages using subgraphs to manage complexity and optimize hardware resources:

  • Multimodal Input Processing:

    • Image Handling: Uses ImageResizeKJv2 to prepare a source image, which acts as the visual foundation for the video.

    • Audio Integration: Employs a VHS_LoadAudioUpload node to bring in external audio files, which guide the timing and motion of the generation.

  • Dual-Stage Sampling Pipeline:

    • Stage 1 (Initial Generation): Focuses on establishing the core motion and structure.

    • Stage 2 (Refinement): A secondary pass that refines the video and audio latents for higher quality.

  • VRAM Optimization:

    • Gemma API Text Encode: Instead of loading the Gemma-3 12B model locally, this workflow uses an API-based text encoder. Because the text encoder weights never occupy local memory, VRAM requirements drop significantly, allowing the workflow to run on GPUs with 12 GB to 16 GB of VRAM.

  • Creative Controls:

    • Camera LoRAs: Includes dedicated slots for LTX-2 Camera Control LoRAs (e.g., Dolly Left), allowing for precise cinematic movement.

    • Latent Upscaling: Incorporates a spatial upscaler to enhance the resolution of the final output.
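
The input preparation and the VRAM argument above can be illustrated with a little arithmetic. The following is a minimal Python sketch, not part of the workflow itself. All three helper names are hypothetical. The 32-pixel size multiple, the 8n + 1 frame-count rule, the 25 fps frame rate, and the 1280-pixel long side are assumptions borrowed from constraints commonly documented for earlier LTX-Video releases; check the values used by the actual nodes in this workflow before relying on them. The memory estimate assumes 16-bit weights and counts weights only.

```python
import math


def snap_dimensions(width, height, long_side=1280, multiple=32):
    """Scale an image so its longer side is at most `long_side`, then round
    both sides down to a multiple of `multiple` (assumed model constraint)."""
    longest = max(width, height)
    # Integer arithmetic keeps the result deterministic.
    new_w = width * long_side // longest
    new_h = height * long_side // longest
    new_w = max(multiple, new_w // multiple * multiple)
    new_h = max(multiple, new_h // multiple * multiple)
    return new_w, new_h


def frames_for_audio(duration_s, fps=25, block=8):
    """Smallest frame count of the form block * n + 1 (assumed rule) that
    covers the full audio duration at the given frame rate."""
    needed = math.ceil(duration_s * fps)
    n = max(1, math.ceil((needed - 1) / block))
    return block * n + 1


def weights_gib(params_billion, bytes_per_param=2):
    """Approximate memory for model weights alone, in GiB.
    bytes_per_param=2 assumes 16-bit weights; quantized variants need less."""
    return params_billion * 1e9 * bytes_per_param / 2**30


if __name__ == "__main__":
    print(snap_dimensions(1920, 1080))   # target size for a 1080p source image
    print(frames_for_audio(10.0))        # frames needed for 10 s of audio
    print(round(weights_gib(12), 1))     # weights-only footprint of a 12B model
```

Under these assumptions, a 12B-parameter text encoder at 16-bit precision needs roughly 22 GiB for its weights alone, which is more than the 12 GB to 16 GB cards the workflow targets. That is why moving text encoding to an API makes the difference described under VRAM Optimization.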