(Note: I just found out that Claude3 Haiku may be forcibly interrupted after continuously processing more than 100 images; I'm not sure why yet. Also, Gemini Flash 1.5 may fail to return any data when processing images larger than 600,000 pixels.)
If Pillow is not installed on your computer, open a command prompt and run 'pip install pillow' first, then launch the script (gui.py).
__________________________________________________________________________________________________
UpdateV1.1_2024.08.04:
1. I've written a more detailed instruction and updated it in the script. I tested it on Claude 3.5; it gets the models to output detailed but not verbose content.
2. A new model has been added: Yi-vision. In my tests, its performance significantly surpasses Gemini Pro Vision 1.0 and approaches or even exceeds Gemini Flash 1.5, while its price is similar to Gemini Pro Vision 1.0 and its ethical constraints are much looser than other models'. The actual usage cost (for images of 600,000 pixels and a moderate number of output tokens) is $0.18/1k images.
3. The random time interval logic has been updated; the new intervals are now drawn from a normal distribution (see the sketch after this list).
4. A new checking mechanism has been added: if you manually close the program while the script is running, the next run will continue with the unfinished images rather than reprocessing the completed ones.
5. Various other modifications and optimizations.
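For anyone curious how points 3 and 4 might be implemented, here is a minimal sketch, not the script's actual code: it assumes a hypothetical tag_image helper, mean and standard deviation values of my own choosing, and that finished images are detected by the presence of a same-named txt file.

import os
import glob
import random
import time

def wait_normal_interval(mean=4.0, sigma=1.5, minimum=0.5):
    # Sleep for a random interval drawn from a normal distribution,
    # clamped to a small positive minimum so we never sleep a negative time.
    delay = max(minimum, random.gauss(mean, sigma))
    time.sleep(delay)

def process_dataset(dataset_dir, tag_image):
    # Tag every image that does not already have a same-named .txt file,
    # so an interrupted run can resume without repeating finished images.
    images = sorted(glob.glob(os.path.join(dataset_dir, "*.jpg")) +
                    glob.glob(os.path.join(dataset_dir, "*.png")))
    for image_path in images:
        txt_path = os.path.splitext(image_path)[0] + ".txt"
        if os.path.exists(txt_path):      # already tagged in a previous run
            continue
        tags = tag_image(image_path)      # hypothetical: calls the model API
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(tags)
        wait_normal_interval()            # randomized gap between requests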
___________________________________________________________________________________________________
UpdateV1.1_In-context few-shot learning_2024.08.10:
This update introduces a new feature: in-context few-shot learning. It allows large language models to quickly adapt to your needs by learning from a small number of examples, greatly improving the user experience. Detailed information can be found on this page: OpenRouter自动打标器的重大更新 || The major update of "Automated tagger with openrouter api" | Civitai
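As a rough illustration of what in-context few-shot learning means for a chat-style vision API (my own sketch, not the script's actual payload): the example images and their hand-written captions are sent as prior user/assistant turns before the image you actually want tagged. The file names and caption below are hypothetical.

import base64

def image_as_data_url(path):
    # Encode a local image as a data URL accepted by OpenAI-compatible vision endpoints.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Hypothetical few-shot conversation: each example pair teaches the model
# the captioning style before the real image is submitted.
messages = [
    {"role": "system", "content": "Describe the image as a detailed caption."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_as_data_url("example1.jpg")}}]},
    {"role": "assistant", "content": "A girl with silver hair in a school uniform stands under cherry blossoms."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_as_data_url("target.jpg")}}]},
]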
____________________________________________________________________________________________________
UpdateV1.2_2024.08.11:
This update expands the supported model service providers from OpenRouter alone to any provider with an OpenAI-compatible API. Now, as long as you have an OpenAI-compatible API key, you can use this script to automatically tag your datasets.
The GUI of the new script looks like this:
Usage instructions:
1. V1.2 lets you enter a custom API URL. For example, if you purchased API service from OpenRouter, you need to enter https://openrouter.ai/api/v1/chat/completions in the API URL box (note: you must append /v1/chat/completions to the end of the API URL).
2. Next, fill in your API key in the API Key box; use the key from whichever provider you purchased the service from.
3. V1.2 removes the model menu, so you need to type the model name manually. Note that it must be the standard model name the provider recognizes. For example, if you are using OpenRouter's API and want to tag a dataset with GPT-4o, you must enter openai/gpt-4o, not GPT4o, Gpt4o, or GPT4O, which the provider cannot recognize. Even for the same model, the standard name may differ between providers; you can find it on the website of the provider you bought the API from. A request sketch using these three fields is shown after this list.
4. The rest is basically unchanged; refer to the V1.1 instructions.
The in-context few-shot learning version has received the same update. You can download it and read the instructions here: OpenRouter自动打标器的重大更新 || The major update of "Automated tagger with openrouter api" | Civitai
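To make the API URL / API key / model name relationship concrete, here is a minimal sketch of a single OpenAI-compatible request using the requests library. The URL, key placeholder, model name, and prompt are examples only; replace them with whatever your provider specifies.

import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"  # custom URL, must end in /v1/chat/completions
API_KEY = "sk-..."                                          # your provider's key (placeholder)
MODEL = "openai/gpt-4o"                                     # the provider's standard model name

payload = {
    "model": MODEL,
    "temperature": 1.0,
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},  # base64 of the image
        ]},
    ],
}

resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": f"Bearer {API_KEY}"},
                     timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])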
Preface
Just a few days ago, I wrote an article about how to use MiNiCPM-Llama3 to tag datasets. Using local LLMs has many advantages, such as better privacy and lower cost, but if your GPU can't handle running LLMs, or you want the most capable closed-source LLMs to tag your dataset accurately, then online LLMs become necessary. In this article I'll cover two things: how to use a Python script with OpenRouter's API (a site that aggregates many LLMs) to automatically tag your dataset, and how good these popular closed-source multimodal LLMs are at image understanding, plus their value for money.
The Python script discussed below uses OpenRouter's API. To avoid looking like I'm advertising OpenRouter, I won't put their website in this article. I'm mainly writing it for people who already use OpenRouter; I use it myself because its unified API makes it easy to call various LLMs, and the pricing matches those LLMs' official sites. If you're interested, you can look it up online yourself.
Usage of the script
First, make sure Python 3.8 or newer is installed on your computer. Then download the Tagger_With_OpenRouter.zip I provided and unzip it. After unzipping, you'll see this:
Click gui.py to start the script. On first launch, if your computer is missing libraries such as numpy or requests, make sure you are connected to the internet; wait a moment for them to install and you'll see the following GUI:
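A minimal sketch of how such a first-run dependency check might work (my own illustration, not necessarily how gui.py does it):

import importlib
import subprocess
import sys

def ensure_packages(packages=("numpy", "requests", "pillow")):
    # Install any missing third-party packages with pip before the GUI starts.
    for name in packages:
        module = "PIL" if name == "pillow" else name   # pillow installs as the PIL module
        try:
            importlib.import_module(module)
        except ImportError:
            subprocess.check_call([sys.executable, "-m", "pip", "install", name])

ensure_packages()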
Here's how to use it:
1. When you open the GUI, first click "Browse" and select your dataset.
2. Fill in the "Temperature (0-2)" field. Lower values make the LLM's output more stable and more obedient to your instructions, but also more rigid; higher values make the output less stable and less obedient, but more creative. I usually set it to 1.0. (Based on my testing, 0.85 works better for Gemini Flash 1.5.)
3. Choose the model you want in the "Model" field. I'll explain how to pick the right model at the end of the article.
4. Enter the API key you obtained from OpenRouter in the "API key" field; you only need to do this the first time you use the script. (***It's a good idea to rotate your API keys regularly and deactivate retired ones on the OpenRouter website so they can't be stolen and misused.)
5. Most LLMs currently accept images of at most about 1 million pixels, and the cost of using them is closely tied to the input image resolution, so overly large images must be compressed. I use LANCZOS resampling for compression; in the "Image pixels" field, enter the maximum pixel count you want images compressed to, within the range [400,000, 1,000,000] — values outside this range will make the program raise an error. Generally, 400,000 pixels is enough for GPT4o and Claude3.5 to recognize images reliably, and once you pick a model the script will also suggest a value. (The script creates a temp folder to temporarily store the compressed copies, so compression never touches your dataset. A compression sketch is shown after this list.)
***If you leave this field empty, the script uploads the images from your dataset to OpenRouter as-is. If they are too large (over 1 million pixels), you may get no data back, or wrong data.
6. Finally, fill in the instruction. This tells the LLM how to describe your images. I've already filled in my usual instruction, but you can replace it with your own. You also only need to do this the first time.
7. After filling in all the parameters, click "Run". The script will start automatically tagging the images in your dataset with the model you chose. For each image processed, it creates a txt file with the same name in your dataset and stores the tags there.
8. One more note: to keep OpenRouter from flagging this as an automated script, I set a random time interval between sending each image. The interval is within 3 seconds 50% of the time and within 6 seconds 80% of the time.
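Here is a minimal sketch of the kind of LANCZOS downscaling described in step 5 (my own approximation, assuming a hypothetical max_pixels value and temp folder layout; the script's actual resizing code may differ):

import os
from PIL import Image

def compress_to_max_pixels(image_path, max_pixels=400_000, temp_dir="temp"):
    # Downscale an image so width*height <= max_pixels, keeping the aspect ratio,
    # and save the result into a temp folder so the original dataset is untouched.
    os.makedirs(temp_dir, exist_ok=True)
    with Image.open(image_path) as img:
        w, h = img.size
        if w * h > max_pixels:
            scale = (max_pixels / (w * h)) ** 0.5
            img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))),
                             Image.LANCZOS)
        out_path = os.path.join(temp_dir, os.path.basename(image_path))
        img.convert("RGB").save(out_path, "JPEG", quality=95)
    return out_path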
The script runs like this:
After the script finishes, the dataset looks like this:
At this point, each image's tags are stored in the txt file with the same name as the image.
Models Assessment
Based on the performance of these models, I have categorized them into the following levels, with details for each:
Level 0: Claude3.5 Sonnet, GPT4o, Gemini pro 1.5
Description - These have the strongest image perception, the most accurate descriptions, and the richest corpus. They almost never make mistakes when describing images, making them suitable for datasets with relatively few images (under 500) or tasks requiring extremely high accuracy (Gemini pro 1.5 is slightly less capable than Claude3.5 Sonnet and GPT4o).
Actual usage cost (400,000-pixel input images, moderate output tokens) - Claude3.5 Sonnet: $4/1k images, GPT4o: $6.5/1k images, Gemini pro 1.5: $2.8/1k images.
In my tests, Claude3.5 Sonnet and GPT4o have roughly the same image processing capability, but according to my billing, GPT4o costs more than 1.5 times as much as Claude3.5 Sonnet and Gemini pro 1.5, so I recommend Claude3.5 Sonnet and Gemini pro 1.5.
Level 1: Gemini Flash 1.5
Description - In my tests, Gemini Flash 1.5 delivers about 90% of Gemini pro 1.5's performance at only 1/10 of the price. With clear images and explicit instructions, it only occasionally makes minor mistakes when describing images. It is suitable for datasets with a larger number of images (500 to 5,000) or tasks that don't need exceptionally high accuracy, and it is currently the best value-for-money model you can use.
Actual usage cost (400,000-pixel input images, moderate output tokens) - $0.3/1k images.
Level 2: Claude3 Haiku, Gemini pro vision 1.0
Description - Claude3 Haiku and Gemini pro vision 1.0 have roughly 70% of Claude3.5 Sonnet's performance, or 80% of Gemini pro 1.5's. With clear images and explicit instructions, they may frequently make minor mistakes when describing images. They are suitable for datasets with a very large number of images (over 5,000) or tasks that only need moderate accuracy; Gemini pro vision 1.0 is the cheapest model currently available. (Note: Claude3 Haiku and Gemini pro vision 1.0 only work well when the input images are larger than 600,000 pixels.)
Actual usage cost (600,000-pixel input images, moderate output tokens) - Claude3 Haiku: $0.35/1k images, Gemini pro vision 1.0: $0.15/1k images.
Recommendations:
Small-scale datasets (under 500 images) - Claude3.5 Sonnet;
Medium-scale datasets (500 to 5,000 images) - Gemini Flash 1.5;
Large-scale datasets (over 5,000 images) - Gemini pro vision 1.0.
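As a quick back-of-the-envelope check, here is a sketch of estimating the total tagging cost for a dataset from the per-1k-image figures above; the dataset size in the example is an arbitrary number of my own, and the prices simply restate the ones quoted in this section.

# Rough cost estimate (USD) from the per-1k-image figures quoted above.
PRICE_PER_1K = {
    "Claude3.5 Sonnet": 4.0,
    "GPT4o": 6.5,
    "Gemini pro 1.5": 2.8,
    "Gemini Flash 1.5": 0.3,
    "Claude3 Haiku": 0.35,
    "Gemini pro vision 1.0": 0.15,
}

def estimate_cost(model, num_images):
    # Estimated total cost in USD for tagging num_images images.
    return PRICE_PER_1K[model] * num_images / 1000

print(estimate_cost("Gemini Flash 1.5", 3000))  # e.g. a 3,000-image dataset -> about $0.90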