Why use Blip2 for captioning? Especially for large datasets, it lets you create much richer captions that need far fewer manual revisions than Blip1 captions. Consider the following examples.
With blip1 using default settings, the generated caption is:
a woman in a skirt and sweater posing for a picture
With blip2 using the jiwenjiWorkflow.json file attached to the article, the caption is:
blip2_t5 This person is asian. This person is a girl. a woman. The person is normal weight. The subject is posing. The person is smiling.. a black sweatshirt and yellow plaid skirt. The scene is inside a bedroom. The room contains a chair and lamp. The image is realistic. blip2_t5
It is not perfect: it fails the hotdog/not hotdog test by mistaking a banana for a lamp and a side table for a chair, but the captions are clearly of higher quality than those from blip1.
Different workflows, different captions
a pink computer desk with a pink chair and a computer
blip2_t5 This person is white. This person is a girl. a woman. full body, The person is skinny. no The subject is sitting. The subject is sitting in a chair. The person is smiling.. this person is wearing a pink t-shirt with bunny on it. The scene is inside a bedroom. The room contains a monitor, keyboard, and mouse. The image is realistic. blip2_t5
There are more incorrect tokens in this caption because the questions in jiwenjiWorkflow.json are oriented towards images with human subjects. With Blip2, it is important to create new workflows so that captions are generated in a more general way. With the included jiwenjiWorkflow_reduced.json, no captions related to human subjects are output.
blip2_t5 closeup shot of The scene is inside a bedroom. The room contains a monitor, keyboard, and mouse. The image is realistic. colors: pink. blip2_t5
Overall, this reduced and very basic caption workflow is neither better nor worse than blip1. Given this result, whether time is better spent manually tagging or revising and improving the workflow JSON depends on the size of the dataset.
The next sections will tell you how to install and use the Blip2 captioner.
1. Open a terminal window and clone the repository:
git clone https://github.com/tjennings/described
2. Change to the project folder.
3. Create a new virtual environment in the project folder (you only have to do this the very first time; next time just activate the environment in step 4):
python -m venv venv
4. Now activate the environment.
5. Install the requirements (you only have to do this once):
pip install -r requirements.txt
6. Download jiwenjiWorkflow.json from the attachments in this article and put the file in the
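To confirm that the virtual environment from the steps above is really active (for example, before running pip install), a small check like the following can help. This is a minimal sketch that assumes a standard venv environment; the function name in_virtualenv is only illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if in_virtualenv():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("No virtual environment active; activate it before installing.")
```

If the second message appears, packages would be installed into the system Python instead of the project's environment.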
Running the captioner
1. Activate the Python virtual environment.
2. Change to the workflows directory.
3. Run the captioner on your training data, replacing the --path value in this command:
python ..\described.py --path "c:/path/to/the/training/images" --workflow jiwenjiWorkflow.json --suffix ", uniqueToken"
Make sure you're in the /workflows folder from step 2 when you run the command.
The --suffix flag is optional but useful for adding a trigger word.
The script will load the blip checkpoint and caption the images. You can watch the progress in the terminal and view the caption files as they are generated. It will take significantly more time than captioning with blip1. It will not overwrite captions that already exist.
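Because existing captions are not overwritten, an interrupted run can simply be restarted. To see how much work is left, a sketch like the following lists images that do not yet have a caption. It assumes captions are saved as .txt files next to each image with the same base name; that is a common convention but is not confirmed by this article. The function name and the extension list are also illustrative.

```python
from pathlib import Path

# Assumed image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def images_missing_captions(folder):
    """Return sorted image paths in `folder` that lack a sibling .txt file."""
    missing = []
    for image in sorted(Path(folder).iterdir()):
        if not image.is_file() or image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        # Assumption: the caption for photo.png is stored as photo.txt.
        if not image.with_suffix(".txt").exists():
            missing.append(image)
    return missing
```

Calling images_missing_captions("c:/path/to/the/training/images") would return the images still waiting for a caption under that assumption.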
You may need to modify the workflow and remove irrelevant questions or add new questions to differentiate other aspects of the images. Open jiwenjiWorkflow.json in a text editor and get your hands dirty!
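Hand-editing a workflow makes it easy to leave a stray comma or a missing quote, so a quick syntax check before a long captioning run can save time. The sketch below only verifies that the file parses as strict JSON. It assumes nothing about the workflow's keys, and it would reject a file that relies on JSON5 extras such as comments. The helper name check_workflow_json is hypothetical.

```python
import json


def check_workflow_json(path):
    """Return (True, "ok") if `path` parses as JSON, else (False, reason)."""
    try:
        with open(path, encoding="utf-8") as handle:
            json.load(handle)
    except json.JSONDecodeError as err:
        # lineno and colno point at the place where parsing failed.
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    except OSError as err:
        # The file is missing or unreadable.
        return False, str(err)
    return True, "ok"
```

For example, check_workflow_json("jiwenjiWorkflow.json") run from the folder containing the file reports where the first syntax error is, if there is one.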
Here is a brief explanation of additional arguments:
--workflow: This is the path to the workflow file that will be used. By default, it is set to use the standard.json5 workflow file in the workflows directory.
--model_name: This option lets you specify the name of the model to be used. You can choose between "blip2_opt", "blip2_t5", and "blip2". By default, "blip2_t5" is used.
--model_type: This option defines the type of the model you want to use. The types are organized by the model names and they include:
For "blip2_opt", possible model types are "pretrain_opt2.7b", "caption_coco_opt2.7b", "pretrain_opt6.7b", "caption_coco_opt6.7b"
For "blip2_t5", possible model types are "pretrain_flant5xl", "caption_coco_flant5xl", "pretrain_flant5xxl"
For "blip2", possible model types are "pretrain", "coco"
By default, it is set to "pretrain_flant5xl".
--path: This option is required. It's the path to the images that you want to be captioned.
--prefix: This is a string that will be applied at the beginning of each caption. By default, it's set to "blip2_t5".
--suffix: This is a string that will be applied at the end of each caption. By default, it's also set to "blip2_t5".
To use these arguments, append them to the described.py command when running it from the command line.
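The options above can also be assembled programmatically, for example when captioning several folders in a row. The sketch below builds the argument list for described.py using only the flags described in this article and checks that --model_type matches --model_name. The table of valid combinations is copied from the list above rather than from the project's source, and build_command is a hypothetical helper, so treat this as an illustration.

```python
import sys

# Model names and their model types, as listed in this article.
MODEL_TYPES = {
    "blip2_opt": ["pretrain_opt2.7b", "caption_coco_opt2.7b",
                  "pretrain_opt6.7b", "caption_coco_opt6.7b"],
    "blip2_t5": ["pretrain_flant5xl", "caption_coco_flant5xl",
                 "pretrain_flant5xxl"],
    "blip2": ["pretrain", "coco"],
}


def build_command(path, workflow=None, model_name=None, model_type=None,
                  prefix=None, suffix=None, script="described.py"):
    """Build an argument list; options left as None keep the tool's defaults."""
    if model_name is not None and model_name not in MODEL_TYPES:
        raise ValueError(f"unknown model name {model_name!r}")
    if model_type is not None:
        # The article gives blip2_t5 as the default model name.
        name = model_name or "blip2_t5"
        if model_type not in MODEL_TYPES[name]:
            raise ValueError(f"{model_type!r} is not listed for {name!r}")

    command = [sys.executable, script, "--path", str(path)]
    options = {"--workflow": workflow, "--model_name": model_name,
               "--model_type": model_type, "--prefix": prefix,
               "--suffix": suffix}
    for flag, value in options.items():
        if value is not None:
            command += [flag, str(value)]
    return command
```

The returned list can be passed to subprocess.run(command, check=True) from the folder that contains the workflow file, with script pointing at the location of described.py.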
Please share workflows! Figuring out what questions to use in a workflow is the most difficult aspect of using Blip2. Even with this obstacle, the results are far superior to blip1. Happy captioning!