Timothy and I open-sourced our video dataset curation toolkit. It handles everything before training: scene detection, CLIP-based visual triage, captioning, reference frame extraction, validation, and trainer-specific formatting.
Two things worth knowing:
Visual triage. Drop a reference image into a folder. Klippbok uses CLIP to find every scene containing that character across hours of raw footage. Tested on Breakfast at Tiffany's: 162 character matches out of ~1700 total scenes. You skip splitting and captioning the ~1500 scenes you don't need.
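Under the hood this is plain embedding similarity: embed the reference, embed one representative frame per scene, keep scenes above a cosine threshold. A minimal sketch of the idea, not Klippbok's actual code; the checkpoint, folder layout, and 0.75 threshold are assumptions for illustration:

```python
# Sketch of CLIP-based visual triage. Assumes one thumbnail per detected scene
# has already been extracted; model, paths, and threshold are illustrative.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path: Path) -> torch.Tensor:
    """Return a unit-normalized CLIP image embedding for one frame."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

reference = embed(Path("reference/character.png"))             # hypothetical reference image
matches = []
for frame in sorted(Path("scenes/thumbnails").glob("*.jpg")):  # hypothetical layout
    if (embed(frame) @ reference.T).item() > 0.75:             # cosine similarity threshold
        matches.append(frame)

print(f"{len(matches)} scenes kept for splitting and captioning")
```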
Captioning methodology. Four templates that encode what to omit per LoRA type. Character LoRA captions describe action and setting, never appearance. Style captions describe content, never aesthetics. The model learns visuals from pixels; captions handle context. This is the methodology behind our published models, released as tooling for the first time.
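The omission rules are easy to picture as prompt templates handed to whichever vision-language model does the captioning. A hypothetical sketch of two of the four templates; the wording below is illustrative, not the shipped templates:

```python
# Hypothetical caption templates encoding what to omit per LoRA type.
# Klippbok ships four; this wording is illustrative, not the actual templates.
CAPTION_TEMPLATES = {
    # Character LoRA: caption the action and setting, never the appearance.
    "character": (
        "Describe what the subject is doing and where the scene takes place. "
        "Do not mention the subject's face, hair, clothing, or body."
    ),
    # Style LoRA: caption the content, never the aesthetics being learned.
    "style": (
        "Describe the people, objects, and actions in the scene. "
        "Do not mention color grading, lighting mood, grain, or art style."
    ),
}

def captioning_prompt(lora_type: str) -> str:
    """Instruction passed to the captioning backend (Gemini, Replicate, or Ollama)."""
    return CAPTION_TEMPLATES[lora_type]
```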
Works with musubi-tuner, ai-toolkit, and kohya. Captioning via Gemini (free), Replicate, or Ollama. Six documented pipelines. Windows-friendly.
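For the trainer-specific formatting step, kohya-style trainers conventionally read captions from a .txt sidecar with the same basename as each clip or image. A rough sketch of that convention, assuming a hypothetical captions.json manifest; Klippbok's actual manifest format and output layout may differ:

```python
# Write kohya-style caption sidecars: clip_0001.mp4 -> clip_0001.txt.
# The manifest file and dataset directory are assumptions for illustration.
import json
from pathlib import Path

dataset_dir = Path("datasets/my_character")               # hypothetical output dir
manifest = json.loads(Path("captions.json").read_text())  # {"clip_0001.mp4": "caption ...", ...}

for clip_name, caption in manifest.items():
    (dataset_dir / clip_name).with_suffix(".txt").write_text(caption.strip() + "\n")
```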
This is the data prep side of Dimljus, the video trainer we're building. Data first.
github.com/alvdansen/klippbok

