I couldn't find a captioning tool that runs Qwen3-VL, is simple to use, and has the features I need, so I vibecoded my own.
It now has exactly the features I need and is quite a timesaver.
Qwen3-VL is the first really good vision model I've tried that I can easily run locally (IMO CogVLM, LLaVA, Moondream, etc. all suck).
So if you happen to be looking for something like this, you can find it here:
https://github.com/realjoschek/joschekscaptioner
It should be relatively easy to install on Linux; no idea if it works on Windows. If you have problems installing, I probably can't help you.
If you find there's a feature missing, let me know.
If you find a bug, drop it into a good LLM and ask it to fix it, please. That's all I would do as well XD
Have a good one!


