
Workflow to generate image descriptions on Apple Silicon Mac

Updated: Dec 29, 2024
Tags: tool, comfyui, workflow, lora, vlm
Type: Workflows
Published: Dec 25, 2024
Base Model: Other
Hash (AutoV2): 7D73EBB7D5
Creator: edwios

About

This is a workflow that uses multiple image-to-text tools and an LLM to produce the final image descriptions for a batch of images in a folder, writing out a corresponding .txt file for each image.

This is especially helpful when captioning/describing NSFW images for LoRA training or fine-tuning, hence the choice of the following vision models (a sketch of how their outputs are combined follows the list):

  • Florence2, WD1.4 tagger

  • JoyCaption Alpha 2

  • huihui-ai/Qwen2-VL-7B-Instruct-abliterated
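To make the fan-in concrete: the tagger and the two captioners each produce text for the same image, and those outputs are merged into a single prompt for the composing LLM. A minimal sketch of that merge step, with illustrative function name and wording (the actual combining happens inside the ComfyUI nodes):

    # Hypothetical sketch of the composition input: merge the tagger
    # output and the two captions into one prompt for the LLM.
    # Names and wording are illustrative, not the node internals.
    def build_composition_prompt(wd14_tags: str, joycaption: str, qwen_caption: str) -> str:
        return (
            "Combine the following sources into one coherent image "
            "description. Keep concrete visual details and resolve "
            "any contradictions between them.\n\n"
            f"Tags (WD1.4): {wd14_tags}\n"
            f"Caption A (JoyCaption): {joycaption}\n"
            f"Caption B (Qwen2-VL): {qwen_caption}\n"
        )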

The LLM part, the final composition of the image description, is done via the Ollama node. I would say it is one of the easiest ways to run a local LLM.

You will get amazing results by using an uncensored large model such as huihui-ai/Llama-3.3-70B-Instruct-abliterated.

(Abliterated/uncensored models should be used for both Qwen2-VL and the LLM to achieve the best results, NSFW or not.)
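If you want to try the composition step outside ComfyUI, a local Ollama server exposes a simple HTTP API. Below is a minimal sketch against Ollama's documented /api/generate endpoint; the model name is just an example and must be pulled first (e.g. ollama pull llama3.3:70b), and you would swap in an abliterated variant for the results described above.

    import requests

    # Minimal sketch: ask a local Ollama server (default port 11434) to
    # compose the final description from the merged VLM outputs.
    def compose_description(prompt: str, model: str = "llama3.3:70b") -> str:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=600,
        )
        resp.raise_for_status()
        return resp.json()["response"]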

Installation

Install any missing nodes using the ComfyUI Manager, except for the ComfyUI_Qwen2-VL-Instruct and ComfyUI_JC2 nodes, which require the forks described below.

ComfyUI_Qwen2-VL-Instruct

You will need to use the Qwen2-VL-Instruct node from this fork for this workflow to work:

https://github.com/edwios/ComfyUI_Qwen2-VL-Instruct

This fork incorporates two major changes: it allows image input in the same way as the other VLM tools, and it uses the Mac GPU (mps) with Python 3.12 and PyTorch versions up to and including 2.6.
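For reference, the mps support boils down to the standard PyTorch device check; a minimal sketch of how code typically selects the Mac GPU:

    import torch

    # Standard PyTorch device selection on Apple Silicon: prefer the
    # Metal backend (mps) when available, otherwise fall back to CPU.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.randn(2, 3, device=device)  # tensors created here live on the chosen device
    print(device, x.device)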

Comfyui_JC2

You may also want to use this ComfyUI_JC2 fork to utilise the Mac GPU for JoyCaption Alpha 2.

How to use

Everything you need to interact with in this workflow is on the leftmost part of the graph.

The simplest way to start is to enter the path to the directory containing the images. The results will be written to the same directory, with the same file names as the images but with a .txt extension.
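That naming convention is easy to verify or reproduce by hand; here is a minimal sketch of the same mapping (the directory path and extension list are examples):

    from pathlib import Path

    # Sketch of the output convention: each image gets a sibling .txt
    # file with the same stem, e.g. cat_01.png -> cat_01.txt.
    image_dir = Path("/path/to/images")  # example path
    for img in sorted(image_dir.iterdir()):
        if img.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
            txt = img.with_suffix(".txt")
            # txt.write_text(description)  # the workflow writes the caption here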

Optionally, you can do the following:

  • Change the VLM prompt to have Qwen2-VL focus on a specific aspect of the image or images

  • Change the LLM prompt, for example for better reasoning, or if you want it to write the descriptions in an SFW way (use at least a 70B instruct model for this). [No, this is NOT the same as using a 'safe' model.]

Credits

Credits go to everyone who contributed to making ComfyUI and all these nodes available for all of us.

Especially ComfyUI_Qwen2-VL-Instruct, ComfyUI_JC2, ComfyUI-WD14-Tagger, ComfyUI-Ollama, ComfyUI-Florence2 and Ollama, for making these amazing machine learning models available on mps, or at least not forcing them into an Nvidia-only solution.