I tried a lot of different vision models for captioning images for model training. However, most vision models are not trained with model-training image tagging in mind, so I decided to train my own.
The PromptGen model is fine-tuned from Microsoft's Florence-2 caption model, focused mainly on tagging images for model training. A new <GENERATE_PROMPT> instruction is trained so that the model can tag images just like WD14, but because the model is trained on SD prompts from civitai, its output is closer to how we normally prompt in SD. <DETAILED_CAPTION> and <MORE_DETAILED_CAPTION> are also enhanced in the v0.9 release, so they can handle uncensored image content better than Florence-2's original captioning.
Feel free to try it out. I've also created a ComfyUI node called ComfyUI MiaoshouAI Tagger, so you can now do batch image captioning directly in ComfyUI, or even frame captioning for videos.
You can use a workflow like the one below to feed a Tagger node into your Prompt Text Encoder. Tagger supports batch input, so it can tag multiple images, or even each frame of a video if that is what you want. The prefix/suffix text boxes let you add text like "masterpiece, best quality" before or after the generated caption, so that your i2i process can be fully adapted and automated.
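
To make the prefix/suffix behaviour concrete, here is a rough Python sketch of the idea outside ComfyUI. It is not the node's actual code: the function names add_prefix_suffix and tag_batch are made up for illustration, and joining the parts with ", " is an assumption about how the text is combined.

```python
from typing import Iterable, List


def add_prefix_suffix(caption: str, prefix: str = "", suffix: str = "") -> str:
    """Join prefix, caption and suffix with ", ", skipping empty parts.

    Hypothetical helper: the ", " separator is an assumption, not
    necessarily what the Tagger node does internally.
    """
    # Trim whitespace and stray leading/trailing commas from each part.
    parts = [p.strip().strip(",").strip() for p in (prefix, caption, suffix)]
    return ", ".join(p for p in parts if p)


def tag_batch(captions: Iterable[str], prefix: str = "", suffix: str = "") -> List[str]:
    """Apply the same prefix/suffix to every caption in a batch."""
    return [add_prefix_suffix(c, prefix, suffix) for c in captions]


# Example captions standing in for the tagger output of two images.
captions = ["1girl, solo, long hair", "a cat sitting on a sofa"]
for line in tag_batch(captions, prefix="masterpiece, best quality"):
    print(line)
```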

The tagger has 3 modes: tags, simple, and detailed. They give you captions of different lengths and formats.


There is also another node called Saver, which allows you to save your captions, so you can now use it to tag your image folders in ComfyUI. If you want a caption describing the scene followed by tags, connect a tagger in simple/detailed mode and then another tagger in tags mode, as in the example given below:
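
For anyone who wants the same caption-then-tags result in a plain script, here is a minimal Python sketch of the idea. It is only an illustration under a few assumptions: combine_caption_and_tags and save_caption are hypothetical helpers, the ", " separator is my choice, and writing a .txt file next to each image is the common training-dataset convention rather than a confirmed description of what the Saver node writes.

```python
import tempfile
from pathlib import Path


def combine_caption_and_tags(caption: str, tags: str) -> str:
    """Put the scene caption first, then the tags, separated by ", "."""
    # Drop the caption's trailing period/comma so the result reads as one prompt.
    caption = caption.strip().rstrip(".,").strip()
    tags = tags.strip().strip(",").strip()
    return ", ".join(p for p in (caption, tags) if p)


def save_caption(image_path: Path, text: str) -> Path:
    """Write the text to a .txt file with the same base name as the image.

    Assumption: image/.txt pairs are the usual layout for training data;
    the Saver node's real output format may differ.
    """
    txt_path = image_path.with_suffix(".txt")
    txt_path.write_text(text, encoding="utf-8")
    return txt_path


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical image path; the image itself is not needed for this demo.
    image = Path(tmp) / "photo_001.png"
    text = combine_caption_and_tags(
        "A woman walking a dog in a park.",  # stand-in for simple/detailed mode output
        "1girl, dog, outdoors, park",        # stand-in for tags mode output
    )
    out = save_caption(image, text)
    print(out.name, "->", out.read_text(encoding="utf-8"))
```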

In addition to the prefix/suffix, the new ComfyUI version adds a text box called replace_tag, which lets you replace any text in the generated result. The replace_tag value should be in the following format: search_text1:replace_text1;search_text2:replace_text2...
For example, if you want to replace "a woman" in the text with "1girl", put a woman:1girl in replace_tag, and the output will use 1girl instead of a woman.
Anyway, give it a try and have fun playing around with it. Leave me a comment if you have any feedback : )
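
The replace_tag format is easy to get wrong, so here is a small Python sketch of how a string in that format could be parsed and applied. This is my own illustration, not the node's source: parse_replace_tag and apply_replace_tag are hypothetical names, and details such as trimming spaces around each entry, skipping malformed entries, and applying the pairs in order with plain substring replacement are assumptions.

```python
from typing import List, Tuple


def parse_replace_tag(spec: str) -> List[Tuple[str, str]]:
    """Parse "search1:replace1;search2:replace2" into (search, replace) pairs."""
    pairs = []
    for item in spec.split(";"):
        if ":" not in item:
            continue  # skip empty or malformed entries (assumed behaviour)
        # Split on the first ":" only, so the replacement may contain ":".
        search, replace = item.split(":", 1)
        search = search.strip()
        if search:
            pairs.append((search, replace.strip()))
    return pairs


def apply_replace_tag(text: str, spec: str) -> str:
    """Apply every search/replace pair to the text, in the order given."""
    for search, replace in parse_replace_tag(spec):
        text = text.replace(search, replace)
    return text


print(apply_replace_tag("a woman standing in a park, a dog", "a woman:1girl;a dog:dog"))
```

Because the pairs are applied in order, an earlier replacement can affect what a later one matches, so it is safest to list the longer or more specific search texts first.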

