This article details a straightforward Text-to-Video (T2V) workflow using LTX 2.3, aimed at users with limited GPU resources. It has been tested and confirmed to work on a system with 10GB of VRAM and 32GB of RAM. ComfyUI’s memory usage fluctuated between 8GB and 10GB during testing; the fluctuation was due to running Windows 11 with a dual-monitor setup while other applications (video playback and Chrome) were active. Under these conditions, generating a 10-second video took approximately 7 minutes.
Workflow Components & Checkpoints
The following checkpoint, VAEs (Variational Autoencoders), CLIP models, and latent upscale model were used in this workflow:
Unet (GGUF): https://huggingface.co/unsloth/LTX-2.3-GGUF/blob/main/ltx-2.3-22b-dev-Q4_K_M.gguf
Video VAE: https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors
Audio VAE: https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors
Dual CLIP:
Latent Upscale Model: https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-spatial-upscaler-x2-1.1.safetensors
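The links above are Hugging Face file pages (the `/blob/` form), which open a web page rather than the file itself. As a minimal sketch, assuming Hugging Face's usual `/resolve/` URL convention for direct downloads, the snippet below converts each link into a URL that can be passed to a browser or a download tool. The dictionary keys and function names are illustrative only, and the Dual CLIP entry is left out because no link is given for it. Where each file belongs inside ComfyUI depends on your install and loader nodes, so that part is not covered here.

```python
from urllib.parse import urlparse

# File pages copied from the list above (keys are illustrative labels).
MODEL_PAGES = {
    "unet_gguf": "https://huggingface.co/unsloth/LTX-2.3-GGUF/blob/main/ltx-2.3-22b-dev-Q4_K_M.gguf",
    "video_vae": "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors",
    "audio_vae": "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors",
    "latent_upscaler": "https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
}


def parse_blob_url(url):
    """Split a Hugging Face file-page URL into (repo_id, revision, path_in_repo)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<revision>/<path...>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a Hugging Face blob URL: {url}")
    return "/".join(parts[:2]), parts[3], "/".join(parts[4:])


def to_download_url(url):
    """Rewrite a file-page URL into its direct-download ('resolve') form."""
    repo_id, revision, path = parse_blob_url(url)
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{path}"


if __name__ == "__main__":
    for name, page in MODEL_PAGES.items():
        print(name, to_download_url(page))
```

Running the script prints one direct-download URL per file. The same `(repo_id, revision, path)` triple can also be handed to a Hugging Face download client if you prefer cached downloads over plain URLs.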
Feedback & Questions
I hope this workflow proves helpful for those working with limited GPU resources. If you have any questions or run into any issues, please leave a comment below.
