Effortless Image Captioning that Doesn't Suck NOW EXISTS!

As you may know, if you want to create your own custom AI models, such as LoRAs for Stable Diffusion, you must train the model on a dataset of images.

Ideally, each image should also have a text description in the same folder, saved as a text file with the same name as the image but ending in '.txt' instead of '.webp', '.jpeg', '.png', etc.

During training, the text description is encoded and associated with the image's vector data, so that Stable Diffusion can later use that text (or variations of it) to generate images.
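The image-to-caption pairing convention described above is easy to sanity-check before training. Here is a minimal sketch (the function name and extension list are my own, not from any particular tool) that lists images in a dataset folder that are missing their same-name '.txt' caption file:

```python
from pathlib import Path

# Common image extensions used in training datasets (assumed list).
IMAGE_EXTS = {".webp", ".jpeg", ".jpg", ".png"}

def missing_captions(dataset_dir):
    """Return image files that lack a same-name .txt caption beside them."""
    root = Path(dataset_dir)
    return sorted(
        p for p in root.iterdir()
        if p.suffix.lower() in IMAGE_EXTS and not p.with_suffix(".txt").exists()
    )
```

Running this before a training session catches any images that would otherwise be trained without a caption.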

Having a good description was something I never spent much time on. For one thing, training a model on a single object (rather than a style, for example) meant I didn't have to worry about the text much: you can just put the name of the object in the text file, then control the LoRA's influence with weights (from -2 to 2) rather than with text when using it later. But the truth is, I stuck to training object or character LoRAs because captioning and editing the text is my least favorite part of curating datasets. But NO LONGER!

TagUI is the best tool I've ever found for captioning image datasets. I've honestly spent too many hours over the last year searching GitHub and Hugging Face for something like this.

Don't you LOVE IT when the best software tool available is also Free & Open Source? Me too!

The best part of this tool, for me, is the huge selection of image captioning models: all the latest ones, such as BLIP-2, Vicuna, LLaVA, and CogVLM, to name a few.

Note that the first time you use a new model, the tool has to download it, which arrives as 'shards' (multi-part files). That first run takes FOREVER; after that it's much faster, though still quite slow compared with earlier models I've used. I recommend CogVLM, which took over an hour to caption 90 images, but just read the caption in my screen cap... a no-cap cap-cap, if you will. I bet there's not a single person who was putting descriptions like that in their models before now. Finally, we can make use of the new SDXL text-encoding functionality and really improve how our models respond to language of much greater depth and detail. Many of the captions captured emotions, or the meaning the image was meant to convey. These new high-quality descriptions will add a new dimension to model training!
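The batch-captioning workflow described above boils down to a simple loop: for each image, ask a model for a caption and write it to the matching '.txt' file. Here is a minimal sketch of that loop; `caption_image` is a hypothetical placeholder for whichever vision-language model you load (CogVLM, BLIP-2, etc.), not an actual API of TagUI:

```python
from pathlib import Path

def caption_image(image_path):
    # Placeholder: in practice this would call a vision-language model
    # (e.g. CogVLM or BLIP-2) to describe the image.
    raise NotImplementedError

def caption_dataset(dataset_dir, captioner=caption_image):
    """Write a same-name .txt caption file next to each image."""
    exts = {".webp", ".jpeg", ".jpg", ".png"}  # assumed extension list
    for image in sorted(Path(dataset_dir).iterdir()):
        if image.suffix.lower() not in exts:
            continue
        caption = captioner(image)
        image.with_suffix(".txt").write_text(caption, encoding="utf-8")
```

Because the model call is the slow part (over an hour for 90 images with CogVLM, in my case), the file-writing side of the loop is essentially free.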

Thanks to the devs of TagUI for such a useful tool! If you agree with me about their tool, please be sure to star their repo on GitHub:

No need to clone the repo, as they provide release candidates, so installation is fully automated: