
Taz's Ultimate Image/Video Easy Captioning Tool : Gemini + Qwen VL


Dec 26, 2025

(Updated: 3 months ago)


I've created quite a few style LoRAs now. To streamline the most tedious part, the dataset preparation stage, I built this tool to do it all automatically :)

https://huggingface.co/spaces/comfyuiman/loracaptionertaz

The app was vibe-coded with Google AI Studio Build, and I'm still changing it every now and then.

So some of the UI or instructions may become outdated over time. Also, every time I update it on Hugging Face I have to nuke the whole space and re-upload it, so bookmark the URL rather than following the space on Hugging Face.

You can also download it and run it locally without Hugging Face Spaces.

What this tool does

It lets you organize your dataset, caption it using local or online AI models, make refinements in bulk, and finally sanity-check the output and export it for training. It supports both image and video captioning, and it connects to ComfyUI if you want to preview the captioning. It can also use the AI to check a caption's accuracy against the source image, and you can import existing captions directly for further refinement.

How to set up

If you're just using Gemini, it's very simple: enter your API key.

If you want to use a local AI like Qwen VL, you need to download the setup script:

image.png

Once you run the setup script, it sets up a virtual environment and installs vLLM and some dependencies. After that, you just run the startup command every time you want to use it (no need to install more than once).

To use Qwen, the easiest way is to get the URL of a Qwen VL model and put it here. The model is downloaded on vLLM server startup (first time only), and you can use the launch command each time to run it. Once running, it communicates with the app automatically.

image.png
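For reference, what the setup and launch scripts do under the hood looks roughly like this. This is a sketch, not the script's exact contents; the model name is just an example of a Qwen VL model on Hugging Face:

```shell
# One-time setup: create a virtual environment and install vLLM
python -m venv .venv
source .venv/bin/activate
pip install vllm

# Each session: serve a Qwen VL model with an OpenAI-compatible API.
# The model is downloaded from Hugging Face on first launch and cached.
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000
```

The app then talks to the server's OpenAI-compatible endpoint (e.g. http://localhost:8000/v1), which is why no extra wiring is needed once the server is up.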

If the model is stored locally, or has been removed from Hugging Face, just use the "offline" feature and point to the model's download directory in root/hf_cache/hub/model-name/snapshots/hash#. It will run regardless of whether it's still on Hugging Face.
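If you want to locate that snapshot directory programmatically, a minimal helper might look like this. This is a hypothetical sketch (`find_snapshot_dir` is not part of the app), assuming the cache layout described above:

```python
from pathlib import Path

def find_snapshot_dir(hf_cache: str, model_dir_name: str) -> Path:
    """Return the snapshot directory for a locally cached model.

    Mirrors the layout described above:
    <hf_cache>/hub/<model-name>/snapshots/<hash>
    Picks the most recently modified snapshot if several exist.
    """
    snapshots = Path(hf_cache) / "hub" / model_dir_name / "snapshots"
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"No snapshots found under {snapshots}")
    return max(candidates, key=lambda p: p.stat().st_mtime)
```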

WARNING: USING NSFW MATERIAL WITH GOOGLE GEMINI AI MAY GET YOUR ACCOUNT BANNED. USE QWEN FOR NSFW DATASETS.

How to caption a dataset

Simply drag and drop your dataset into the app after selecting your AI provider. Select the images/videos you want to caption, then set your system instructions and prompts (or use my defaults).

image.png

image.png

You can set some other optional parameters: if you want to tag characters from a show, enable character tagging; if you want a trigger keyword, enter it; and you can also set a filename prefix.

image.png

Then simply generate the captions (Qwen or Gemini will process them one by one).

image.png

It will spit out captions like so:

image.png

You can then select the captions you want to keep and click download. The tool matches each caption to its image by filename, so the set is ready for training:

image.png
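The filename-matching convention is the same one most LoRA trainers expect: each caption goes in a .txt file sharing the image's base name. A minimal sketch of that pairing (this helper is illustrative, not the app's actual code):

```python
from pathlib import Path

def write_caption(image_path: str, caption: str, out_dir: str) -> Path:
    """Save a caption as a .txt file sharing the image's base filename,
    e.g. img001.png -> img001.txt, the pairing LoRA trainers expect."""
    txt_path = Path(out_dir) / (Path(image_path).stem + ".txt")
    txt_path.write_text(caption.strip(), encoding="utf-8")
    return txt_path
```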

Dataset clean up tools

You can also do the following to check the quality of your captions / dataset:

Accuracy quality check:

The AI compares your captions against its own reading of the image/video and gives a score out of 5. For example:

image.png

You can see a badly captioned dataset got 1/5, while an accurate one got 5/5. This feature isn't thoroughly tested, so take the scores with a grain of salt.
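One reason to treat the scores cautiously is that an "N/5" rating has to be parsed back out of free-form model text. A hypothetical sketch of how such a reply could be parsed (the app's actual prompt and parsing may differ):

```python
import re
from typing import Optional

def parse_accuracy_score(reply: str) -> Optional[int]:
    """Extract an 'N/5' accuracy score from a model's free-form reply.

    Returns None when no score is found, so the caller can flag
    unparseable replies instead of silently scoring them."""
    match = re.search(r"\b([0-5])\s*/\s*5\b", reply)
    return int(match.group(1)) if match else None
```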

Refine the captions using AI:

You can refine the created captions too. For example:

image.png

And now you can see the AI removed mentions of the character's clothing from the captions:

image.png

Preview the captions via ComfyUI integration:

Run ComfyUI and enter the URL of your instance:

image.png

Upload the JSON file of your workflow. The default one is a Qwen Image workflow, but it works with anything ComfyUI can run. One note: it currently may not work if your workflow contains a "Note" node.

(You can ignore the "secure bridge" option. I keep my ComfyUI closed off from external connections; if you do the same, you can use it to set up a bridge server between ComfyUI and the app. Otherwise, just run as-is without this checked.)

image.png

Clicking on a dataset item brings up a popup so you can review your dataset in detail. Click "preview" to send a request to ComfyUI to generate an image/video from your caption.

image.png

You can preview how your dataset would be rendered without the LoRA active, which I think is a good way to gauge how accurate your captioning is. You can also scroll through each item in the set from this popup, or just select the ones you want and click "preview selected".
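Under the hood, a preview request means injecting your caption into the workflow JSON's text-prompt node and submitting it to ComfyUI's POST /prompt endpoint. A minimal sketch of that payload construction, assuming an API-format workflow (the node id is workflow-specific; find your CLIPTextEncode node's id in the JSON):

```python
import copy

def build_preview_payload(workflow: dict, caption: str, text_node_id: str) -> dict:
    """Inject a caption into a ComfyUI API-format workflow's text node and
    wrap it in the shape ComfyUI's POST /prompt endpoint expects.

    The original workflow dict is left unmodified."""
    wf = copy.deepcopy(workflow)
    wf[text_node_id]["inputs"]["text"] = caption
    return {"prompt": wf}
```

You would then send the result with something like `requests.post(f"{comfyui_url}/prompt", json=payload)`.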

Closing thoughts

This app changes every now and then, so if you have questions, let me know.
