
Workflow to generate image descriptions on Apple Silicon Mac

Updated: Dec 29, 2024
Tags: tool, comfyui, workflow, lora, vlm
Type: Workflows
Published: Dec 25, 2024
Base Model: Other
Hash (AutoV2): 7D73EBB7D5
Creator: edwios

About

This is a workflow that uses multiple image-to-text tools and an LLM to produce the final image descriptions for a batch of images in a folder, writing out a corresponding .txt file for each image.

This is especially helpful when captioning/describing NSFW images for LoRA training or fine-tuning, hence the choice of the following vision models (a sketch of how their outputs are combined follows the list):

  • Florence2, WD1.4 tagger

  • JoyCaption Alpha 2

  • huihui-ai/Qwen2-VL-7B-Instruct-abliterated
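To make the fan-in concrete: the tagger and the two captioners each produce text for the same image, and those outputs are merged into a single prompt for the composing LLM. A minimal sketch of that merge step, with illustrative function name and wording (the actual combining happens inside the ComfyUI nodes):

    # Hypothetical sketch of the composition input: merge the tagger
    # output and the two captions into one prompt for the LLM.
    # Names and wording are illustrative, not the node internals.
    def build_composition_prompt(wd14_tags: str, joycaption: str, qwen_caption: str) -> str:
        return (
            "Combine the following sources into one coherent image "
            "description. Keep concrete visual details and resolve "
            "any contradictions between them.\n\n"
            f"Tags (WD1.4): {wd14_tags}\n"
            f"Caption A (JoyCaption): {joycaption}\n"
            f"Caption B (Qwen2-VL): {qwen_caption}\n"
        )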

The LLM part, the final composition of the image description, is done via the Ollama node. I would say it is one of the easiest ways to run a local LLM.

You will get amazing results by using an uncensored large model such as huihui-ai/Llama-3.3-70B-Instruct-abliterated.

(Abliterated/uncensored models should be used for both Qwen2-VL and the LLM to achieve the best results, NSFW or not.)
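If you want to try the composition step outside ComfyUI, a local Ollama server exposes a simple HTTP API. Below is a minimal sketch against Ollama's documented /api/generate endpoint; the model name is just an example and must be pulled first (e.g. ollama pull llama3.3:70b), and you would swap in an abliterated variant for the results described above.

    import requests

    # Minimal sketch: ask a local Ollama server (default port 11434) to
    # compose the final description from the merged VLM outputs.
    def compose_description(prompt: str, model: str = "llama3.3:70b") -> str:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=600,
        )
        resp.raise_for_status()
        return resp.json()["response"]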

Installation

Install any missing nodes using the ComfyUI Manager, except for the ComfyUI_Qwen2-VL-Instruct and ComfyUI_JC2 nodes, which require the forks described below.

ComfyUI_Qwen2-VL-Instruct

You will need to use the Qwen2-VL-Instruct node from this fork for this workflow to work:

https://github.com/edwios/ComfyUI_Qwen2-VL-Instruct

This fork incorporates two major changes: it allows image input in the same way as the other VLM tools, and it uses the Mac GPU (mps) with Python 3.12 and PyTorch versions up to and including 2.6.
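For reference, the mps support boils down to the standard PyTorch device check; a minimal sketch of how code typically selects the Mac GPU:

    import torch

    # Standard PyTorch device selection on Apple Silicon: prefer the
    # Metal backend (mps) when available, otherwise fall back to CPU.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.randn(2, 3, device=device)  # tensors created here live on the chosen device
    print(device, x.device)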

Comfyui_JC2

You may also want to use this ComfyUI_JC2 fork to utilise the Mac GPU for JoyCaption Alpha 2.

How to use

Everything you need to interact with in this workflow is on the leftmost part of the graph.

The simplest way to start is to enter the path to the directory containing the images. The results will be written to the same directory, with the same file names as the images but with a .txt extension.
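That naming convention is easy to verify or reproduce by hand; here is a minimal sketch of the same mapping (the directory path and extension list are examples):

    from pathlib import Path

    # Sketch of the output convention: each image gets a sibling .txt
    # file with the same stem, e.g. cat_01.png -> cat_01.txt.
    image_dir = Path("/path/to/images")  # example path
    for img in sorted(image_dir.iterdir()):
        if img.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
            txt = img.with_suffix(".txt")
            # txt.write_text(description)  # the workflow writes the caption here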

Optionally, you can do the following:

  • Change the VLM prompt to have Qwen2-VL focus on a specific aspect of the image or images

  • Change the LLM prompt, for example for better reasoning, or if you want it to write the descriptions in an SFW way (use at least a 70B instruct model for this). [No, this is NOT the same as using a 'safe' model.]

Credits

Credits go to everyone who contributed to making ComfyUI and all these nodes available for all of us.

Especially ComfyUI_Qwen2-VL-Instruct, ComfyUI_JC2, ComfyUI-WD14-Tagger, ComfyUI-Ollama, ComfyUI-Florence2 and Ollama, for making these amazing machine learning models available on mps, or at least not forcing them into an Nvidia-only solution.