I've created quite a few style loras now, and in order to streamline the most tedious part, the dataset preparation stage, I created this tool to do it all automatically :)
https://huggingface.co/spaces/comfyuiman/loracaptionertaz
The app was vibe coded with Google AI Studio Build, and I'm changing it every now and then.
So some of the UI or instructions may get outdated over time. Also, every time I update it on Hugging Face I have to nuke the whole space and reupload it, so bookmark the URL rather than following the space on Hugging Face.
You can also download and run it locally without hugging face spaces.
What this tool does
It lets you organize your dataset, caption it using local or online AI models, and then make refinements in bulk. Finally you can do a sanity check on the output and export it for training. It supports both image and video captioning, and it connects to ComfyUI if you want to preview the captioning. It can also use the AI to check the accuracy of the captioning against the source image. You can also import existing captions directly into it for further refinement.
How to setup
If you're just using Gemini, it's very simple: put in your API key.
If you want to use local AI like Qwen VL, you need to download the setup script:

Once you run the setup script, it will set up a virtual environment and install vLLM and some dependencies. After it's set up, you just run the startup command every time you want to use it (no need to install more than once).
To use Qwen, the easiest way is to get the URL of a Qwen VL model and put it here. It will download the model on vLLM server startup (first time only), and you can use the launch command each time to run it. Once running, it will communicate with the app seamlessly.
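vLLM serves an OpenAI-compatible API, so once the server is up, captioning is just chat-completion requests with the image attached. As a rough sketch of what a single caption request looks like (the port, model name, and prompt below are placeholder assumptions, not the app's actual values):

```python
import base64

def build_caption_request(image_path, model, prompt):
    """Build an OpenAI-style chat-completion payload with an inline image.

    vLLM's /v1/chat/completions endpoint accepts images as base64 data URLs.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

# Sending it is then a plain HTTP POST, e.g.:
# requests.post("http://localhost:8000/v1/chat/completions",
#               json=build_caption_request("img.jpg",
#                                          "Qwen/Qwen2-VL-7B-Instruct",
#                                          "Caption this image."))
```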

If the model is local, or removed from Hugging Face, just use the "offline feature" and point to the model's download directory in root/hf_cache/hub/model-name/snapshots/hash# and it will run regardless of whether it's still on Hugging Face.
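If you'd rather locate that snapshot folder programmatically, the HF hub cache layout is predictable: each repo lives under `models--<org>--<name>/snapshots/<commit-hash>/`. A small sketch (the cache root is whatever your setup script used; `find_snapshot` is a helper name I made up):

```python
import os

def find_snapshot(cache_root, repo_id):
    """Return the path of a downloaded model snapshot inside an HF hub cache.

    HF stores repos as hub/models--<org>--<name>/snapshots/<commit-hash>/.
    Picks the most recently modified snapshot if there are several.
    """
    folder = "models--" + repo_id.replace("/", "--")
    snaps = os.path.join(cache_root, "hub", folder, "snapshots")
    if not os.path.isdir(snaps):
        raise FileNotFoundError(f"No local copy of {repo_id} under {snaps}")
    hashes = [os.path.join(snaps, h) for h in os.listdir(snaps)]
    return max(hashes, key=os.path.getmtime)
```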
WARNING: USING NSFW MATERIAL WITH GOOGLE GEMINI AI MAY GET YOUR ACCOUNT BANNED. USE QWEN FOR NSFW DATASETS.
How to caption dataset
Simply drag and drop your dataset into the app after selecting your AI provider, select the images/videos you want to caption, and set your system instructions and prompts (or use my defaults).


You can set some other optional parameters: if you wanna tag characters from a show you can enable character tagging, or you can add a keyword, or a filename prefix.

Then simply generate the captions (Qwen or Gemini will process them one by one).

It will spit out captions like so:

You can then select the captions you want to keep and click download. It will match the captions with the images by file name so they're ready for training:
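The matching itself is just by file stem: `photo_01.jpg` pairs with `photo_01.txt`, which is the image/caption pairing most trainers expect. If you ever need to redo that pairing outside the app, it's only a few lines (a sketch, not the app's actual export code):

```python
from pathlib import Path

def export_captions(image_dir, captions):
    """Write one .txt per image, named after the image's stem.

    `captions` maps image filenames (e.g. "photo_01.jpg") to caption text.
    Trainers pick up pairs by matching the image and .txt stems.
    """
    written = []
    for name, text in captions.items():
        img = Path(image_dir) / name
        if not img.exists():
            continue  # skip captions that have no matching image
        txt = img.with_suffix(".txt")
        txt.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(txt)
    return written
```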

Dataset clean up tools
You can also do the following to check the quality of your captions / dataset:
Accuracy quality check:
It uses the AI to compare your captions against its own read of the image/video and gives a score out of 5. For example:

You can see a badly captioned dataset got a 1/5 and an accurate one got 5/5. This feature isn't thoroughly tested, so take it with a grain of salt.
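Part of why scores like this are flaky is that the number has to be parsed back out of the model's free-text reply. Just to illustrate the kind of parsing involved (the "N/5" reply format is my assumption about the model output, not the app's actual format):

```python
import re

def parse_score(reply):
    """Pull an 'N/5' style score out of a free-text model reply.

    Returns an int 0-5, or None if no score is found.
    """
    m = re.search(r"\b([0-5])\s*/\s*5\b", reply)
    return int(m.group(1)) if m else None
```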
Refine the captions using AI:
You can refine the created captions too. For example:

And now, you can see the AI removed mention of the character's clothing in the captions:

Preview the captions via comfyui integration:
Run your ComfyUI and input the URL of the instance.

Upload the JSON file of your workflow. The default one is Qwen Image, but it works with anything ComfyUI supports. One note: currently it may not work if you use a "note" node.
(You can ignore the "secure bridge". I have my ComfyUI closed off from external connections; if yours is too, you can use it to set up a server between ComfyUI and the app. Otherwise just run as is without this checked.)

Clicking on a dataset will bring up a popup so you can review your dataset in detail. Click "preview" to send a request to ComfyUI to generate an image/video using your caption.

You can preview how your dataset would be rendered without the lora active; I think it's a good way to gauge how accurate your captioning is. You can scroll through each individual item in the set in this popup as well, or just select the ones you want and click "preview selected".
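Under the hood, ComfyUI previews like this work by swapping your caption into the workflow's prompt text and POSTing the API-format workflow JSON to the instance's /prompt endpoint. A rough sketch of the caption-injection half, assuming the prompt lives in a CLIPTextEncode node's `text` input (node IDs and layout vary per workflow):

```python
import copy

def inject_caption(workflow, caption):
    """Return a copy of an API-format ComfyUI workflow dict with the caption
    written into the first CLIPTextEncode node's `text` input."""
    wf = copy.deepcopy(workflow)
    for node in wf.values():
        if node.get("class_type") == "CLIPTextEncode":
            node["inputs"]["text"] = caption
            break
    return wf

# Queueing it is then one POST of {"prompt": wf} as JSON to
# http://<comfyui-host>:8188/prompt (ComfyUI's standard API endpoint).
```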
Closing thoughts
This app gets changed every now and then, so if you have questions let me know.

