
Blip2: Advanced Automatic Image Tagging (Photo Captioning with Blip2)



Why use Blip2 for captioning? Especially on large datasets, it produces much richer captions that need far fewer manual revisions than Blip1 captions do. Consider the following examples.

Example 1:

With blip1 using default settings, the generated caption is:

a woman in a skirt and sweater posing for a picture

With blip2 using the jiwenjiWorkflow.json file attached to the article, the caption is:

blip2_t5 This person is asian. This person is a girl. a woman. The person is normal weight. The subject is posing. The person is smiling.. a black sweatshirt and yellow plaid skirt. The scene is inside a bedroom. The room contains a chair and lamp. The image is realistic. blip2_t5

It is not perfect: it fails the hotdog/not hotdog test by mistaking a banana for a lamp and a side table for a chair. Even so, the caption is clearly of higher quality than the blip1 one.

Different workflows, different captions

Example 2:

With jiwenjiWorkflow.json:

blip1: a pink computer desk with a pink chair and a computer

blip2 (jiwenjiWorkflow.json): blip2_t5 This person is white. This person is a girl. a woman. full body, The person is skinny. no The subject is sitting. The subject is sitting in a chair. The person is smiling.. this person is wearing a pink t-shirt with bunny on it. The scene is inside a bedroom. The room contains a monitor, keyboard, and mouse. The image is realistic. blip2_t5

This caption contains more incorrect tokens because the questions in jiwenjiWorkflow.json are oriented towards images with human subjects. With Blip2, it is important to create new workflows that control how captions are generated for other kinds of images. With the included jiwenjiWorkflow_reduced.json, no captions related to human subjects are output.

Blip2 with jiwenjiWorkflow_reduced.json:

blip2_t5 closeup shot of The scene is inside a bedroom. The room contains a monitor, keyboard, and mouse. The image is realistic. colors: pink. blip2_t5

Overall, the caption from this reduced and very basic workflow is neither better nor worse than the blip1 caption. With a result like this, the size of the dataset decides whether time is better spent tagging manually or revising and improving the workflow JSON.

The next sections will tell you how to install and use the Blip2 captioner.


  1. Open a terminal window and git clone

  2. Change to the project folder: cd described

  3. Create a new virtual environment in the project folder: python -m venv venv (you only have to do this the very first time; after that, just activate the environment as in step 4)

  4. Activate the environment: ./venv/scripts/activate

  5. Install the requirements (you only have to do this once): pip install -r requirements.txt

  6. Download jiwenjiWorkflow.json from the attachments to this article and put the file in the /workflows folder

Running the captioner

  1. Activate the Python virtual environment: ./venv/scripts/activate

  2. Change to the workflow directory: cd workflows

  3. Run the captioner on your training data, replacing the --path value in this command:

    • python ..\ --path "c:/path/to/the/training/images" --workflow jiwenjiWorkflow.json --suffix ", uniqueToken"

      • Make sure you are in the /workflows folder from step 2 when you run the command.

      • The --suffix flag is optional but useful for adding a trigger word.

  4. It will load the blip checkpoint and caption the images. You can watch the progress in the terminal and view the caption files as they are generated. Captioning takes significantly more time than with blip1. Captions that already exist are not overwritten.

  5. You may need to modify the workflow and remove irrelevant questions or add new questions to differentiate other aspects of the images. Open jiwenjiWorkflow.json in a text editor and get your hands dirty!
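Because existing captions are not overwritten (step 4), it can help to check which images still lack a caption before or after a long run. The following is only a minimal sketch and not the captioner's own code: it assumes captions are saved as a .txt file with the same name as the image, and the helper names, the extension list, and the prefix/suffix handling are illustrative assumptions.

```python
from pathlib import Path

# Assumed set of image types; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def images_missing_captions(folder):
    """Return image files in `folder` with no same-named .txt caption beside them."""
    folder = Path(folder)
    missing = []
    for image in sorted(folder.iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip caption files and anything else that is not an image
        if not image.with_suffix(".txt").exists():
            missing.append(image)
    return missing


def write_caption(image, body, prefix="", suffix=""):
    """Write prefix + body + suffix next to the image unless a caption already exists."""
    caption_file = Path(image).with_suffix(".txt")
    if caption_file.exists():
        return False  # leave existing (possibly hand-edited) captions untouched
    caption_file.write_text(f"{prefix}{body}{suffix}", encoding="utf-8")
    return True
```

Calling images_missing_captions on the training folder after a run shows what is left to caption. If you want the captioner to redo an image, delete its caption file first.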

Here is a brief explanation of additional arguments:

  • --workflow : This is the path to the workflow file that will be used. By default, it is set to use the standard.json5 workflow file in the workflows directory.

  • --model_name : This option lets you specify the name of the model to be used. You can choose between "blip2_opt", "blip2_t5", and "blip2". By default, "blip2_t5" is used.

  • --model_type : This option defines the type of the model you want to use. The types are organized by the model names and they include:

    • For "blip2_opt", possible model types are "pretrain_opt2.7b", "caption_coco_opt2.7b", "pretrain_opt6.7b", "caption_coco_opt6.7b"

    • For "blip2_t5", possible model types are "pretrain_flant5xl", "caption_coco_flant5xl", "pretrain_flant5xxl"

    • For "blip2", possible model types are "pretrain", "coco"

    By default, it is set to "pretrain_flant5xl".

  • --path : This option is required. It's the path to the images that you want to be captioned.

  • --prefix : This is a string that will be applied at the beginning of each caption. By default, it's set to "blip2_t5".

  • --suffix : This is a string that will be applied at the end of each caption. By default, it's also set to "blip2_t5".

To use these arguments, append them to the command when you run it from the command line.
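To see how these options fit together, here is a hedged sketch of an argument parser that mirrors the flags and defaults listed above. It is not the captioner's actual source. The model-type check in particular is an assumption about how a mismatched --model_name and --model_type could be caught early, and the function names are illustrative.

```python
import argparse

# Model names and their model types, copied from the list above.
MODEL_TYPES = {
    "blip2_opt": ["pretrain_opt2.7b", "caption_coco_opt2.7b",
                  "pretrain_opt6.7b", "caption_coco_opt6.7b"],
    "blip2_t5": ["pretrain_flant5xl", "caption_coco_flant5xl", "pretrain_flant5xxl"],
    "blip2": ["pretrain", "coco"],
}


def build_parser():
    """Build a parser with the documented flags and their documented defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the documented captioner options")
    parser.add_argument("--path", required=True, help="folder of images to caption")
    parser.add_argument("--workflow", default="standard.json5", help="workflow file to use")
    parser.add_argument("--model_name", default="blip2_t5", choices=sorted(MODEL_TYPES))
    parser.add_argument("--model_type", default="pretrain_flant5xl")
    parser.add_argument("--prefix", default="blip2_t5", help="text added before each caption")
    parser.add_argument("--suffix", default="blip2_t5", help="text added after each caption")
    return parser


def parse_args(argv):
    """Parse argv and reject a model type that does not belong to the chosen model name."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.model_type not in MODEL_TYPES[args.model_name]:
        parser.error(
            f"--model_type {args.model_type!r} is not valid for --model_name {args.model_name!r}"
        )
    return args
```

Note that the default model type belongs to "blip2_t5", so when you switch --model_name to "blip2_opt" or "blip2", pass a matching --model_type as well.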

Please share workflows! Figuring out what questions to use in a workflow is the most difficult aspect of using Blip2. Even with this obstacle, the results are far superior to blip1. Happy captioning!

Next steps?

Learn how to properly train a model using tensorboard.