LlamaScribe: Instruction-Based, Local AI Image Captioning Tool for LoRA Training
LlamaScribe is a GUI-based batch captioning tool that uses Ollama running locally to automatically prepare and caption images for LoRA training, guided by natural-language instructions you provide.
LlamaScribe lets you type, in plain English, what you want the AI to do while captioning images.
This means that if you want to caption, say, 50 images for SDXL, you would type something like this into the Advanced tab:
"You are an AI captioning tool. Caption the images using keywords separated by commas. Use common tags found on online image-sharing sites; some example tags to use may be 1woman, 1man, high quality, anime, fox, cat. Describe the images in as much detail as possible while maintaining this comma-separated keyword structure in all captions."
If you wanted to caption your images with a template, for example for a snow globe concept LoRA, you could type something like:
"You are an AI captioning tool that captions all images using the following template: An image of a snow globe, The scene inside the snow globe is (describe what is in the snow globe), The snow globe is made of (describe what the snow globe is made of), (Describe any text that is in the image), The background shows (describe the background), (Describe any additional details seen in the image). You should respond with only this template and nothing else; do not add any details other than what is directly seen in the image."
And if you wanted to get a little crazy and have it caption your images with poems, you could do that too by typing something like:
"You are a poetry expert tasked with captioning images. Describe the images in the form of a poem; be detailed and describe as much of what is seen in the image as possible, but your response has to be a poem. Only respond with the poem and nothing else."
(It works, I tried it)
The Intention
The intention of this tool is to give you complete control over how the captioning is done without spending hours writing captions by hand, or wasting time feeding images into an AI captioning tool one by one. You can batch process hundreds of images at a time, and all of them will follow the same instructions you type into the Advanced tab. The only limits are your imagination and your computer's processing power. Be aware that the time it takes to caption the images depends on your machine's processing power; if you have a slow machine it may take a long time (this is true for all captioning tools that use AI).
This program allows you to select a folder containing images and copies the contents to the working folder (where the LlamaScribe.exe and LlamaScribe.pyw files are on your computer); by copying the contents of the folder it prevents any potential loss of data. The AI then generates captions using Ollama, and optionally refines the captions with custom instructions you give it in natural language, as written text, in the Advanced tab. If you type nothing into the Advanced tab it will default to captioning in natural language, ideal for Flux-style captioning.
If you wish to caption for SDXL, Pony, or other captioning styles, make sure to enter instructions in the Advanced tab.
This tool is ideal for generating captions for LoRA training datasets, but it can also be used for anything you need image descriptions for. Need a product description for a website? It can do that; just type the instructions into the Advanced tab.
This program works alongside Ollama. To download and install Ollama, go to https://ollama.com/; further installation instructions are below.
Features:
Image Conversion: Automatically copies and converts all images in the selected folder to .png format.
Batch Image Processing: Process multiple images at once. The tool lets you select an entire folder and copies its contents to ensure no accidental loss of data.
Automatic Captioning: Generate captions using an AI model through Ollama.
Caption Refinement: Refine captions with custom instructions using a second model (the instructions you type into the Advanced tab).
Customizable Trigger Words: Add text before or after each caption. Ideal for trigger words, meaning you can just zip up the folder, upload it to the Civitai model trainer, and not have to add anything to the captions.
Custom Prompts: Write your own caption templates or use the default. This gives you complete control over how the AI will caption your images; you can tell it anything and it will follow the instructions. This allows for any type of captioning, such as tagging for SDXL/Pony, natural language descriptions, captions written as poems, descriptions written as if they were insulting you, literally anything; just type your instructions into the Advanced tab.
LlamaScribe processes images by:
Copying the contents of the folder you select to the working folder, leaving the originals unaltered.
Converting the images to .png.
Generating captions using an AI vision model (e.g., LLaVA).
Refining the captions using an optional refiner model (e.g., Qwen) while following the instructions you typed into the Advanced tab.
Adding custom prepend/append text to tailor the captions (trigger words for LoRAs).
The tool outputs image captions as .txt files alongside the copied images in a newly created folder called "Processed_images_with_captions". Each .txt file has the same name as its corresponding image. This makes it ideal for use with the Civitai on-site model trainer: you can just compress the images and .txt files into a zip file and drag the zip file into the Civitai trainer, and the site will show all images with the captions attached. This is also the same format required by ostris's ai-toolkit and most locally run LoRA trainers.
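To make that flow concrete, here is a minimal Python sketch of the kind of loop described above. It is an illustration only, not LlamaScribe's actual code: the function names, prompts, and default model names are assumptions, and only the Ollama /api/generate endpoint and the output layout follow from the description above.

import base64
from pathlib import Path

import requests
from PIL import Image

API_ENDPOINT = "http://localhost:11434/api/generate"   # default local Ollama endpoint
OUTPUT_DIR = Path("Processed_images_with_captions")    # created next to the launcher

def ask_ollama(model, prompt, images=None):
    # One non-streaming request to the local Ollama server; the reply text is in "response".
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images:
        payload["images"] = images  # list of base64-encoded image data
    return requests.post(API_ENDPOINT, json=payload, timeout=600).json()["response"]

def caption_folder(source_folder, vision_model="llava", refiner_model="qwen2.5-coder",
                   instructions="", prepend="", append=""):
    OUTPUT_DIR.mkdir(exist_ok=True)
    for image_path in sorted(Path(source_folder).iterdir()):
        if image_path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp", ".bmp"}:
            continue

        # Copy and convert to .png, leaving the original file untouched.
        png_path = OUTPUT_DIR / (image_path.stem + ".png")
        Image.open(image_path).convert("RGB").save(png_path)

        # Caption with the vision model.
        image_b64 = base64.b64encode(png_path.read_bytes()).decode()
        caption = ask_ollama(vision_model, "Describe this image in detail.", images=[image_b64])

        # Optionally refine with the instructions typed into the Advanced tab.
        if instructions:
            caption = ask_ollama(refiner_model, instructions + "\n\nCaption to rewrite:\n" + caption)

        # Add prepend/append text (trigger words) and write the .txt next to the .png.
        caption = ", ".join(part for part in (prepend, caption.strip(), append) if part)
        (OUTPUT_DIR / (image_path.stem + ".txt")).write_text(caption, encoding="utf-8")

Calling something like caption_folder("my_lora_images", instructions="...") would then produce the image-plus-.txt layout shown in the Outputs section below.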
Installation:
Windows:
Ollama: Install by going to https://ollama.com/, downloading the installer, and following the instructions.
Once Ollama is installed you need to download your choice of models. You can select from the available models here: https://ollama.com/search. To install a chosen model, open a cmd window and type "ollama run (model name)", e.g. "ollama run llava"; it will then download the model from the Ollama site.
You need any one vision model: https://ollama.com/search?c=vision (llava is recommended as it is lightweight and fast).
You can then select any language model as an optional refiner model: https://ollama.com/search (you can use llava for both vision and refining, but I recommend Qwen coder models for best results).
Download the attached file on this page (the LlamaScribe zip file), extract the zip file, and run LlamaScribe.exe.
To run it in the future, simply ensure Ollama is running and double-click the LlamaScribe.exe file.
Linux/macOS (and Windows using .pyw):
Ollama: Install by going to https://ollama.com/, downloading the installer, and following the instructions.
Once Ollama is installed you need to install your choice of models. You can select from the available models here: https://ollama.com/search. To install a chosen model, open a terminal and type "ollama run (model name)", e.g. "ollama run llava".
You need any one vision model: https://ollama.com/search?c=vision (llava is recommended as it is lightweight and fast).
You can then select any language model as an optional refiner model: https://ollama.com/search (you can use llava again, but I recommend Qwen or llama3 for best results; if you just want basic captioning llava alone is fine, but better results can be achieved).
Install Python 3.8 or greater: https://www.python.org/
Install the following Python packages: PyQt5 (for the GUI), requests (for API calls), and Pillow (for image processing).
To install the Python packages, after installing Python open a terminal and type "pip install PyQt5 requests Pillow".
Download the attached file on this page (the LlamaScribe zip file), extract the zip file, and run LlamaScribe.pyw.
To run it in the future, simply ensure Ollama is running and double-click the LlamaScribe.pyw file.
Usage Instructions:
Ensure you have Ollama running before launching the UI.
Launch Ollama.
Launch the .exe (Windows) or the .pyw (other platforms); to launch, just double-click the file.
Select Folder: In the main tab, click the "Select Folder" button and choose the folder containing only the images you wish to caption.
Prepend/Append Text: (Optional) Add text to include before or after each caption. LlamaScribe will automatically add spaces and commas appropriately.
Select Models:
Choose a Vision Model for image captioning (e.g., llava) from the dropdown menu, which shows all of the Ollama models installed on your machine. The options for this are here: https://ollama.com/search?c=vision. I highly recommend llava here.
### MAKE SURE TO SELECT A VISION MODEL HERE!!! ###
Choose a Refiner Model for caption refinement (e.g., Qwen coder or llama3). This is optional; you could just select the vision model again and it will work as intended, but for best results I highly recommend a coder model like Qwen, as coder models are better at following instructions. Qwen coder can be found here: https://ollama.com/library/qwen2.5-coder
Advanced Tab (optional and for advanced users)
Enter a Custom Prompt for the refiner model. Example:
"You are an AI captioning expert. Describe the image in extreme detail, describe only what can be seen in the image and nothing else."
Start Processing: Click "Start" to begin the captioning process.
Progress will display on the progress bar.
Outputs:
Captions are saved as .txt files alongside the images in a new folder named "Processed_images_with_captions". The newly created folder will be in the same location as the .exe and .pyw launchers.
Files inside the new folder will be structured like so:
├── image1.png
├── image1.txt
├── image2.png
├── image2.txt
(Note: running the captioner again will replace the contents of "Processed_images_with_captions" with the new results, deleting the old images and captions. This is by design, to save storage space. Make sure to copy any outputs you wish to keep to a different location (anywhere other than the "Processed_images_with_captions" folder) before running again, to keep the previously generated captions. If people wish for this to be changed I can look at it, but it's the safest way to ensure the program does not clog up your storage.)
Advanced options:
In the Advanced Tab, write a detailed custom system prompt for the refiner model to guide how captions are generated.
Examples:
Formal Description: "Describe the image in professional detail, including people, objects, and text."
Narrative Style: "Write a short story about what is happening in the image."
SDXL captioning: "Describe the image using only keywords separated by commas. Lean towards common tags used for filtering on image forums such as... Include tags such as 1woman, 1man, high quality, 4k resolution, anime,... "
These instructions can be as complex or as simple as you want. Just type what you want the AI captioner to do in plain English and it will follow your instructions.
Ollama Local API Endpoints:
The script connects to the local Ollama API server at http://localhost:11434/api/generate. This is the default when installing Ollama. If the program has issues connecting to the local endpoint, check that you are running Ollama: click the Start menu, type "ollama", and click the Ollama application.
### OLLAMA MUST BE RUNNING FOR LLAMASCRIBE TO WORK!!! ###
Changing The API Endpoint:
If you wish to change where the API endpoint is running, as of LlamaScribe v1.2 and above you can open config.txt and edit the line API_ENDPOINT=http://localhost:11434, replacing the URL with your endpoint URL. This works for both Python-file and .exe users running v1.2 or above. It is intended for users running Ollama on a second machine and anyone who has changed the default location. If you have any issues with this feature please let me know and I will see what I can do to help. This is an advanced feature and is intentionally not included in the UI to avoid issues.
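For anyone curious how a config line like that can be read, here is a minimal Python sketch. It is an illustration only, assuming a config.txt containing a line of the form API_ENDPOINT=...; LlamaScribe's actual parsing code may differ.

from pathlib import Path

DEFAULT_ENDPOINT = "http://localhost:11434"

def read_endpoint(config_path="config.txt"):
    # Return the API endpoint from config.txt, falling back to the default.
    try:
        for line in Path(config_path).read_text(encoding="utf-8").splitlines():
            if line.strip().startswith("API_ENDPOINT="):
                # everything after '=' is the endpoint; no quotes or spaces expected
                return line.split("=", 1)[1].strip()
    except FileNotFoundError:
        pass
    return DEFAULT_ENDPOINT

print(read_endpoint())  # e.g. "http://localhost:11434"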
Best Practices:
Ensure High-Quality Images:
Clear, high-resolution images improve caption accuracy and overall LoRA training.
Use Consistent Model Pairing:
Use LLaVA (or any vision model) for captioning and Qwen (or any LLM) for refining to get the best results. There are other models that pair well together, such as llava and llama3.2; it's really a matter of what you prefer and what your system can run. Just play around with the models and see what works best. Note that the quality of the results depends on the quality of the AI models: the better the models used, the better the captions and instruction following.
Write Clear Custom Instructions:
Be specific about the captioning style or details you want. Just tell it directly what its aim is, give it a template it should follow, and tell it which words you don't want included in the captions.
Test Small Batches:
Start with a small folder of images to ensure everything works correctly before running on large datasets.
Verify API Availability:
You can confirm that the API server is running and models are available by typing this in to a cmd window:
curl http://localhost:11434/api/tags
If it is running in the correct location you should see a list of installed model names; if you get an error message then you are likely not running Ollama or have changed the default Ollama settings.
If you are running v1.2 or above and have changed the API endpoint in config.txt, check that you have typed the correct URL where your Ollama is running. Make sure the config says API_ENDPOINT=(your API endpoint URL here) with no spaces or quote marks.
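The same check can also be done from Python if you prefer. This is a small sketch using the requests package against the standard Ollama /api/tags endpoint, whose reply contains a "models" list; the endpoint value is an example and should match your own setup.

import requests

API_ENDPOINT = "http://localhost:11434"  # or the value from config.txt

try:
    resp = requests.get(API_ENDPOINT + "/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Installed models:", ", ".join(models) or "none found")
except requests.RequestException as err:
    print("Could not reach Ollama - is it running?", err)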
Troubleshooting:
No Models Found: Ensure Ollama is running and models are installed. Open the Start menu, type "ollama", and click the Ollama app.
Processing Errors: Check terminal logs for specific file errors or API issues.
Caption Refinement Issues: Ensure the refiner model supports text input, and make sure your instructions are written in plain English.
Captions don't seem to be looking at the images or are unrelated to the image: Make sure you are using a vision model in the Vision Model dropdown menu. If you are not using llava, test with llava; if that works, you have likely chosen a model that is not a vision model. Only a few models in Ollama have image processing available; they can be found here: https://ollama.com/search?c=vision
Example Workflow:
Start Ollama
Start LlamaScribe.
Select a folder containing Lora training images.
Choose llava as the Vision Model and qwen as the Refiner Model from the respective dropdown menus (or any models you have chosen; just make sure to select a proper vision model).
Add trigger words in the Prepend text field if you want a trigger word for your lora.
Click Start.
Review the generated .txt files for each image.
Copy the generated images and captions to a new location.
Contact & Support
If you encounter any issues or have suggestions for improvements, please feel free to DM me on-site or on Discord, or leave a comment in the discussion section. I am now using this project as my main captioning tool and have been testing it, but I am not an expert coder and there may be issues I have missed. I will do my best to fix any issues with the program as soon as they crop up and will update this page as soon as I have the fix.
FAQ:
I ran the .exe through an antivirus scanner, e.g. VirusTotal or other software, and it reports back as a virus. Why is this? The .exe is a Python script, the exact same one as the .pyw file, with the Python dependencies packaged together. The packaging was done using PyInstaller, which some vendors report as malware outright. Also, looking at the .pyw file you will see it uses import os, which can often trigger some antivirus software. The .exe file was intended as a bonus for Windows users, to make things easier and avoid needing to install Python. If you do not like this or feel uncomfortable running the .exe, you can delete the .exe file and follow the installation instructions for using the .pyw file. You can read the code the .pyw file contains directly by opening it in a notepad and see that the only code included is the code necessary for the program to run.
How can I see what the program does or read the code? This is simply a Python script that has been packaged into a .exe with all required Python dependencies, for easy running on Windows without the need to install Python. You can see the script by opening LlamaScribe.pyw in a notepad.
Why is the Python file extension .pyw and not .py? In Python, the .pyw file extension just prevents the terminal from opening when launching the script. If you wish to see the terminal while using the software, you can change the file extension to .py by renaming the file, and it will launch the terminal when running the script.
I'm using Windows but don't want to use the .exe file, can I use the Python script instead? Yes, you can follow the same installation instructions listed for the Linux/Mac platforms and use the .pyw file, and/or rename it to .py to have the terminal visible while using the script.
Can I alter the script or change things to be how I want them? Yes, I'm putting this out there for anyone to use. I created it for my own use, but feel free to change and improve it; acknowledgements and credit are appreciated but not required should you repost with alterations. All I ask is that, if you improve it significantly, please let me know so I can use it too :).
I'm going to be using the Python file version, not the .exe. Do I still need to keep the .exe file? No, you can safely delete the .exe file if you are not using it and the .pyw file will still work. If you want to keep the nice-looking llama logo you will need to keep the icon folder containing the icon.ico file, but it is not required for the tool to work; you will get an error if you delete the .ico, but the program will still work. If you are reading this and do not know what this means, do not delete anything.
The UI loads when I double-click the .exe and/or the .pyw launcher and I get no error, but I cannot see my installed models in the Vision/Refiner dropdown menus. How do I fix this? First: check that you are running Ollama: click Start, type "ollama", and click the Ollama application. Second: if you are still having issues, where are you running Ollama? If you have Ollama installed on the same machine you are using it on, then by default Ollama is located at http://localhost:11434. If you have changed this or are not running it there, open config.txt and edit the line that says API_ENDPOINT=http://localhost:11434, changing the URL to where your API endpoint is located. See the Ollama documentation for more details on how to find where your API endpoint is located.
Hope this is of use to people. Thank you to anyone who takes a look, and feel free to drop a like if you found this helpful.
Happy Captioning! 🎉