
AI Video Clipper & LoRA Captioner v5.0 - The Ultimate Multi-Modal Dataset Tool


Apr 3, 2026

(Updated: a month ago)


Hey Civitai Community! 👋

If you are training LoRA models (especially for modern video architectures like LTX-Video) or building complex image/video datasets, you know how tedious cutting and captioning can be.

I want to share an open-source tool I've been developing that fully automates this process locally: AI Video Clipper & LoRA Captioner. We just released v5.0, and it is a massive leap forward, evolving into a fully modular, multi-modal pipeline.

Why this tool? Most auto-captioners just look at frames. Our tool listens and sees. It merges audio transcription, environmental sound analysis, and visual description into one perfectly aligned dataset.

🔥 Core v5.0 Features:

  • True Multi-Modal Pipeline: Uses WhisperX for exact word-by-word speech timestamping, Qwen2-Audio (in FP8) to describe environmental sounds/music, and Qwen3-VL / Gemma3 for deep visual descriptions. The final .txt output is a single, naturally flowing paragraph.

  • LTX Hard Cut Mode: Essential for LTX-2 training! It forces strict clip durations (e.g., exactly 5.0s) from the start of detected speech. It includes a custom "Global Word Search" that ensures your text file only contains the exact words spoken in that timeframe. Zero text hallucinations at the cut points!

  • Modular GGUF Support: Full support for any GGUF+mmproj pair. Pick the exact quants that fit your VRAM!

  • Bulletproof Memory Management: A strict 3-Stage Pipeline (Speech -> Audio -> Vision) that aggressively clears VRAM between steps, preventing OOM errors even on heavy processing loads.

  • Bulk Processing 2.0: Have your clips already cut? Use the Bulk mode to process entire folders with independent toggles for vision and speech.
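To make the LTX Hard Cut idea more concrete, here is a minimal Python sketch of how a fixed-length window and its matching transcript could be derived from word-level timestamps. This is not the tool's actual code: the `Word` class, `hard_cut_window`, and `words_in_window` are hypothetical names, and the rule of keeping only words that fall entirely inside the window is an assumption about how boundary words might be handled.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its timestamps (hypothetical structure)."""
    text: str
    start: float  # seconds from the start of the source video
    end: float    # seconds from the start of the source video


def hard_cut_window(words, clip_len=5.0):
    """Return (start, end) of a fixed-length clip that begins at the
    first detected word, or None if no speech was found."""
    if not words:
        return None
    start = min(w.start for w in words)
    return start, start + clip_len


def words_in_window(words, start, end):
    """Join the words spoken entirely inside [start, end].

    Assumption: a word that crosses the cut point is dropped, so the
    caption never contains speech that the clip does not fully include.
    """
    ordered = sorted(words, key=lambda w: w.start)
    kept = [w for w in ordered if w.start >= start and w.end <= end]
    return " ".join(w.text for w in kept)
```

For example, with words at 1.0-1.4s ("hello"), 1.5-2.0s ("world"), and 5.8-6.3s ("again"), a 5.0s hard cut starts at 1.0s and ends at 6.0s, and "again" is left out of the caption because it runs past the cut.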

Hardware Ready: The tool is built for performance. It supports multithreaded downloading, runs portable via uv, and includes under-the-hood patches to fully support NVIDIA's Blackwell architecture (RTX 5090).
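For anyone curious how a staged Speech -> Audio -> Vision pipeline like the one in the feature list can keep memory in check, here is a rough Python sketch of the general pattern: load one model, run it over every clip, release it, then move on to the next stage. It illustrates the idea only and is not the project's implementation. `run_staged`, `release_gpu_memory`, and the `(name, load_fn, run_fn)` stage format are hypothetical; `torch.cuda.empty_cache()` is a real PyTorch call and is only used if PyTorch is installed.

```python
import gc


def release_gpu_memory():
    """Drop unreferenced Python objects, then ask PyTorch (if present)
    to return its cached GPU memory."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # No PyTorch available: nothing more to release.


def run_staged(clips, stages):
    """Run each stage over all clips before loading the next model.

    stages: list of (name, load_fn, run_fn) tuples (hypothetical format),
    where load_fn() returns a model and run_fn(model, clip) returns text.
    Only one model is alive at any time.
    """
    results = {clip: {} for clip in clips}
    for name, load_model, run in stages:
        model = load_model()
        for clip in clips:
            results[clip][name] = run(model, clip)
        del model             # Drop the only reference to the model...
        release_gpu_memory()  # ...then free memory before the next stage.
    return results
```

The trade-off in this layout is that every clip is visited once per stage, in exchange for never holding the speech, audio, and vision models in memory at the same time.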

Take the headache out of your dataset creation. It's completely free and open-source.

https://github.com/cyberbol/AI-Video-Clipper-LoRA
