Update 4 - 9/23/24: Updates, updates, updates.
GIF and MP4 extraction

This took a bit longer than I wanted it to, but it does what it's supposed to do.
GIF and MP4 extraction. It uses gfps to cap the number of frames sourced per second, with a similarity check for a few reasons. Frames that are too similar are automatically pruned, unless the image threshold assigned to the request is higher.
minimum_resolution, maximum_resolution -> automatically resizes smaller animations and videos after sourcing.
require_segments -> a list of required segmented class_ids for keeping images or continuing down the chain to try to find more.
crop_position -> assigns a value based on a list of offset string identifiers: "crop to subjects", "crop to objects", "crop to faces", "crop to torsos", etc. There will be a list that uses segmentation and pooling for the cropping job.
forced_slices: int >= 0 -> 0 disables manual slicing; any other value forces slicing at a specific interval. If sliced, a video is treated as individual sub-videos and the tagging thresholds apply to those instead of the full video.
gfps / vfps per slice -> assigns a maximum number of frames sourced per second from an animated sequence.
minimum_frames, maximum_frames per slice -> the minimum and maximum number of frames sourced from a single request.
similarity_threshold: float 0 to 1 -> determines the similarity value for pruning the current frame against the last, meant to prevent overlapping and near-duplicate frames. Turn it off for more subtle animations.
shift_detection_sensitivity: float 0 to 1 -> 0 disables; 1 is the inverse of the similarity threshold. Automatically detects drastic video shifts and then attempts to segment based on those shifts.
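The similarity pruning could be sketched roughly like this; `prune_similar_frames`, the frame format (uint8 numpy arrays), and the mean-absolute-difference metric are my assumptions, not the actual implementation:

```python
import numpy as np

def prune_similar_frames(frames, similarity_threshold=0.9):
    """Drop frames that are too similar to the last kept frame.

    `frames` is a list of same-shaped uint8 numpy arrays; similarity is
    1 - normalized mean absolute pixel difference, so a threshold of 1.0
    prunes only exact duplicates and 0.0 prunes everything after the
    first frame.
    """
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        diff = np.abs(frame.astype(np.float32) - kept[-1].astype(np.float32))
        similarity = 1.0 - diff.mean() / 255.0
        if similarity < similarity_threshold:  # different enough to keep
            kept.append(frame)
    return kept
```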
Image File Sorter
I ran into an issue when sourcing more than about 50k images. Windows explorer lags out and forces me to close the process.
My solution is simple, move everything into folders of X amount of images.
V1 - alphabetically sorts the images and their tags into a series of subfolders.
Everything can be configured.

Automated GIF and MP4 parsing is enabled by default and can be turned off.
images_1, ... images_n are the folders containing the images.
missing_tags_1, ... missing_tags_n hold any images without tag or caption files.
unknown_files_1, ... holds all unknown or unprocessed files that fall outside the supported list; a catch-all for everything else.
gifs_1, ... gifs_n hold all original gifs parsed or not with their tag files.
processed_gifs_1, ... processed_gifs_n holds all processed gifs and their duplicated tag files.
processed_videos_1, etc same as gifs.
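The chunking logic itself is simple; a minimal sketch (function names and the default chunk size are mine, not the project's):

```python
import os
import shutil

def chunk_files(filenames, chunk_size=10000, prefix="images"):
    """Alphabetically sort filenames and assign them to numbered
    subfolders (images_1, images_2, ...) of at most chunk_size files."""
    assignments = {}
    for index, name in enumerate(sorted(filenames)):
        assignments[name] = f"{prefix}_{index // chunk_size + 1}"
    return assignments

def move_into_chunks(source_dir, chunk_size=10000):
    """Physically move files according to chunk_files()."""
    names = [n for n in os.listdir(source_dir)
             if os.path.isfile(os.path.join(source_dir, n))]
    for name, folder in chunk_files(names, chunk_size).items():
        target = os.path.join(source_dir, folder)
        os.makedirs(target, exist_ok=True)
        shutil.move(os.path.join(source_dir, name), target)
```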
Update 3 - 9/21/24: NSFW detection and GPT4 requests.
I've further refined the tokenization and definition process. The core underlying structure has been revamped into three core utils, one of which is completely jammed with tokens so copilot doesn't stop generating due to NAUGHTY NAUGHTY words, even though they're literally just a giant list of things that can be identified. Naughty naughty dear programmer, tragic that we can't help you with your horrible abomination experiment because you used naughty blacklisted words IN YOUR PROGRAMMING!!!
In any case, I've identified a useful NSFW checker that lets the configuration define a threshold for sending images to GPT4o-mini, or defaulting to the local LLM, offline only.

The testing using this template seems to show some promise, so I'm going to work with it a bit and work out a simple JSON return template to parse. Hopefully that'll be a bit more concise than what I'm currently trying to do.
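The threshold routing described above might look something like this; `choose_captioner` and its parameters are hypothetical names of mine:

```python
def choose_captioner(nsfw_score, nsfw_threshold=0.5, allow_remote=True):
    """Route an image to GPT4o-mini when its NSFW score is below the
    threshold (and remote calls are allowed); otherwise fall back to
    the offline local LLM."""
    if allow_remote and nsfw_score < nsfw_threshold:
        return "gpt-4o-mini"
    return "local-llm"
```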
Update 2 - 9/21/24: Current Progression
Efforts are steady.

I've incorporated a paraphraser that increases caption fidelity at an overhead cost. It'll be on by default but can be toggled off. The tests have shown about a 20% increase in classifier inclusion based on the paraphraser, and about a 4% increase in direct caption to tag inclusion.
Caption to tag inclusion still suffers. The LLMs really don't like using them, so I've looked into various ways to convert tags to plain English. The final route will likely be to feed the entirety of safebooru's character tag list into GPT4o and hope it's kind to me or attempt to run it on a 100b+ unchained model and hope for the best. I don't have all eternity and that alone would solve a whole lot of problems.

It runs the entire detect suite from the imgutils systems, which allows for a fair list of assessments of what is there, including things like pony and fox faces with body structures and an approximate size based on ratio of image coverage.

I know it's not the most elegant, but it gets the job done. There's a function in the util that converts the classifiers to either a plain English inclusion or a booru tag equivalent.

Supports both GPT4 and the local vision LLM, as well as JoyCaption and JoyTag in conjunction.

Supports a full caption fixation subsystem with accuracy detection and adjustment to the LLM per loopback. Getting dangerously close to me wanting to finetune my own English language tiny LLM with booru to plain english conversions at this point just to cut down on some of this pain.
The identifying marker is currently treated more as a "pre-caption" injector, which forces something in before a caption. I'll include a more specific and useful way to do this.
Automatically preserves existing captions + tags and merges new tags if the flag is set, even using the original captions if they exist. It identifies a caption by its tag sequence being longer than a certain value, so short captions won't be picked up by default and will likely just be pruned. The project will include a full compounded list of all major booru tags with attention to artist, character, and series.

Ignore the comments for direct number % values, I had it set to different numbers when I populated those.

Caption accuracy is an accumulative calculation based on WHAT IS in the system, WHAT WAS identified, and WHAT IS TO BE expected from the system on the next caption loopback. So whatever is missing is expected; when that expectation isn't met, the values are adjusted. Whatever is missing from the caption via segmentation validation, booru tagging, original tagging, and classifier fixation is all calculated to determine overall accuracy.
overall_english_similarity_accuracy
based on the english language itself's similarity to the identified required tags.
girl != 2girls, 2 missing characters with 5 uses, penalize accordingly.
Usually less than 5%, I'm honestly not sure what to expect with this one so I'll be treating the threshold as 1 when calculating the final average. Best left at a low value. Planning a whole English ruleset is far beyond the scope of this project, but it's not impossible.
overall_tag_inclusion_accuracy
based on the tags' direct inclusion in the caption or not; this is often very low.
2girls < classifier -> gender female, multiple subjects, shared context exist in caption?
Accuracy is between 0 and 25% depending on how commonly used the tags are and how many plain English derivatives match what can be pluralized and compared to within the core tokenizer.
overall_segmentation_accuracy
calculates the number of identified captions and their found classifiers to determine if the caption itself does indeed hold a grasp on the desired and observed segmentation.
people list < 2, caption expects details for 2 subjects, whether it be two types of overlapping character trait, or two types of inferenced conceptualized actions.
This is often very accurate. You'll often find a 100% accuracy based on caption to similarity with segmentation.
overall_caption_accuracy
The full calculation formula is pretty simple; we average and then use that.
(a + b + c) / 3
Using the average of the three gives us an overall accuracy based on the goal, which gives us a threshold for our image's accuracy tag.
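As a sketch of that final average (the function and parameter names are mine):

```python
def overall_caption_accuracy(english_similarity, tag_inclusion, segmentation):
    """Average the three component accuracies (each in [0, 1]) into the
    overall score used for the accuracy tag threshold."""
    components = (english_similarity, tag_inclusion, segmentation)
    return sum(components) / len(components)
```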
Does the caption and tagging actually work? I have no fucking clue at this point. The outcomes are so sporadic and chaotic that I have no idea if it'll actually work. They often look like absolute abominations of wordplay. They look like someone wiped with the paper and left it in the stall, and yet somehow the math aligns and the English parser has no beef with the language.
It's highly probable that the first batch I train with this will be a complete dud, but the accuracy of the tag-to-caption ratio is high enough to pique my interest. Am I actually learning the math of the T5? Is this actually how T5 translates what we send it?
Only time will tell. I am using T5 to translate to T5, so it's possible that the finetuning from Quora is gutting it, but it's also possible that I'm losing a bunch of context with the Llama-based JoyCaption. We'll see what happens.
I plan to caption and train the original 2k (which turned out to be more like 1400?) images with 80% overall accuracy threshold captions today and then use the images to determine if this process is valid or not.
In the process I've been slowly sourcing important tagged images from the boorus and I'm up to about 100k images from the boorus.
Update 1 - 9/18/24: Large Segmentation Prep Expansion
I haven't released yet due to such a drastic pivot when I noticed how many segmentation options are available in a very simple-to-install manner.
https://dghs-imgutils.deepghs.org/main/tutorials/installation/index.html
A task managed batching pipeline sequence that loads, identifies, and cobbles segmentation into viable and useful systems.

all listed segmentation options and their args contained within the dghs-imgutils module.
task name -> args
Directly saves to a defined subdirectory with a core .aseg plain-text file holding all the identified segmentations, their positions, and their math. Task-requested sub-masks will be saved, such as a full segs mask and a full bbox mask.
Direct incorporation with this is far beyond expectations for segmentation and I plan to FULLY incorporate a multiprocessing asynchronous segmentation queue that uses workers and proper loaders with accelerate.
The prototype is already ready and I will begin testing phases very soon.
Incorporation of all segmentation options and their args as individual assignable and sequence capable tasks.
Incorporation of load/unload models on pool task completion for certain segmentation models to save vram.
Full unload of segmentation models based on task model requirements, after the model's task pool is fully complete and the system expects no more uses of that model during the task sequence.
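The load/unload bookkeeping above could be sketched like this; `ModelPool` and its interface are my invention, not the project's API:

```python
class ModelPool:
    """Track how many queued tasks still need each segmentation model,
    and unload a model as soon as its task pool is drained."""

    def __init__(self, loader, unloader):
        self._loader = loader      # model_name -> model object
        self._unloader = unloader  # model object -> None (frees VRAM)
        self._models = {}
        self._pending = {}

    def schedule(self, model_name, count=1):
        """Register `count` upcoming tasks that need this model."""
        self._pending[model_name] = self._pending.get(model_name, 0) + count

    def acquire(self, model_name):
        """Lazily load the model on first use."""
        if model_name not in self._models:
            self._models[model_name] = self._loader(model_name)
        return self._models[model_name]

    def task_done(self, model_name):
        """Mark one task complete; unload once no tasks remain."""
        self._pending[model_name] -= 1
        if self._pending[model_name] <= 0 and model_name in self._models:
            self._unloader(self._models.pop(model_name))
```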
I'm uncertain how many batches per second this can handle, but it's quite a few. These are not intense nor large models and this code here is formatted with multiprocessing async as a centerpiece.
There are many utilities for mask preparation in this system, many of which are useful for bbox training, identification, segmentation training, and so on. Be advised, though: when sampling data using already-trained segmentation mask systems, you can finetune them into even more powerful models, but the error rate will increase due to the error already introduced by the current model. You may introduce more information, but some human intervention is needed to allow additional new information and details to be included.
We have reached a point with auto-taggers like LARGE_TAGGER_V3 and JoyTagger that allows for a more... deterministic language attention to be applied on top of everything. Let me break this down into a few simplistic concepts so we can all understand what my over-complex analysis really means. I know I'm impossibly verbose at times, so bear with me.
We can USE the outputs of these zero_shot taggers for information to determine WHAT we want an LLM to do by shaping the information the LLM receives, and then shaping the information returned from the LLM and feeding it back into the same LLM again or into another for re-structuring and condensing.
This is a very advanced analysis of what I plan to build, so be COMPLETELY SURE you're ready to at least try to understand it.
I am currently building this project locally; I will host it on HuggingFace at no cost, with the source code fully open source.
What is automated DETERMINISM in this case?
This is essentially identifying specific tags and determining their utility to the overall image in question, automatically. It shares a direct kinship with tokenizing and parsing a programming language, which is something I'm highly familiar with, having done this for 6 of them over the years. I think it was 6? They weren't very good, but this is much simpler than what I built.
In a highly pragmatic sense, it takes an important tag of attention such as 1girl; determines a combination of simpler tags, for example 1girl, blonde hair, red dress, pink shoes, red eyes; and then formats them into a plain English prompt. Something HIGHLY deterministic, and this stage does not require AI, just a tokenizer:
a girl with blonde hair and red eyes wearing a red dress and pink shoes.
This alone should be acceptable enough to actually form a caption capable of identifying to T5 what it's looking at, but not good enough for more complex images.
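That deterministic stage can be sketched with a plain tokenizer-style lookup; the category tables here are tiny hypothetical stand-ins for the real tag lists:

```python
# Hypothetical lookup tables; the real lists would be far larger.
SUBJECTS = {"1girl": "a girl", "1boy": "a boy", "2girls": "two girls"}
FEATURE_WORDS = ("hair", "eyes")
WEAR_WORDS = ("dress", "shoes", "shirt", "skirt")

def tags_to_sentence(tags):
    """Deterministically turn a simple booru tag list into plain
    English. Tags matching no table are ignored in this sketch."""
    subject = next((SUBJECTS[t] for t in tags if t in SUBJECTS), "a subject")
    features = [t for t in tags if t.split()[-1] in FEATURE_WORDS]
    worn = [t for t in tags if t.split()[-1] in WEAR_WORDS]
    parts = [subject]
    if features:
        parts.append("with " + " and ".join(features))
    if worn:
        parts.append("wearing " + " and ".join(worn))
    return " ".join(parts) + "."
```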
Setup - LARGE_TAGGER_V3 or Booru sourced tags.
There are two possibilities here: you sourced the image and it has tags, or you have an image with no tags and you require tagging. In either case it would be wise to run LARGE_TAGGER_V3 anyway and then remove duplicate tags afterward. Keep characters, artists, or whatever you want to keep. We want to completely and automatically prune tag combinations that are impossible, such as 1girl and multiple girls. There are quite a few of these combinations thanks to boorus having overlapping tags, but not so many that it would take a long time to do automatically.
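The impossible-combination pruning could look roughly like this; the exclusion table here is a tiny hypothetical sample:

```python
# Hypothetical exclusion table: each frozenset is mutually exclusive.
IMPOSSIBLE_PAIRS = [
    frozenset({"1girl", "multiple girls"}),
    frozenset({"1boy", "multiple boys"}),
    frozenset({"solo", "2girls"}),
]

def prune_impossible(tags):
    """Keep the first tag seen from each mutually exclusive pair and
    drop later contradicting tags."""
    kept = []
    for tag in tags:
        conflict = any(
            tag in pair and any(k in pair for k in kept)
            for pair in IMPOSSIBLE_PAIRS
        )
        if not conflict:
            kept.append(tag)
    return kept
```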
Setup - JoyTagger:
I have fully integrated both JoyTagger and JoyCaption into the same project with toggles, switches, and a systemic approach to analyzing and feeding the caption system the correct prompting.

JoyTagger takes a fraction of a second, while JoyCaption, when set to about 50 tokens, takes nearly one second.
The optimizations of this current setup on my 4090 are quite good. I broke the multiprocessing and accelerate optimizations when including too many diverse and divergent models. I'll need to spend a day or so working out the logistics of ensuring everything fits together within the multiprocessing paradigm I originally came up with, and iteratively modify based on errors and model stubbornness.
Image Preparation Segmentation
We want to figure out what our segmentation system sees in an image. We take a nice gigantic heap of onnx models that all run in sequence and take about a second to segment an image into upwards of 100 different identified humanoid and human features, including things like:
Faces
Feet
Hands
NSFW elements
Clothes
Whole humans
Pre-Determinism Segmentation Setup:
We run segmentation looking for things like subject counts, hair color, clothing color, and so on. Basically, sample everything it'll let you sample, and then we can determine the average color programmatically.
These models are pretty fast, so it shouldn't take too long.
Post-Determinism Segmentation Error Checking:
Let's go ahead and figure out whether what is in the image actually matches normalization and the various color-name hex values built into PIL, based on sampling the average color of the segmented area.
We segment hair and find that none of the hair is blonde, so we determine the new color automatically. Nothing special here, just multi-subject fixation.
If the pieces cannot be found, we simply flag it for error percentage. Simple, we didn't find hair, there's probably no hair. These systems aren't the best at analysis, so there will be error rate here as well.
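The check against PIL's built-in color names might be sketched like this, assuming the segmented region's average RGB has already been sampled:

```python
from PIL import ImageColor

def nearest_color_name(rgb):
    """Return the PIL built-in color name closest to an (R, G, B)
    sample, by squared distance in RGB space."""
    best_name, best_dist = None, float("inf")
    for name in ImageColor.colormap:
        candidate = ImageColor.getrgb(name)[:3]
        dist = sum((c - s) ** 2 for c, s in zip(candidate, rgb))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```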
Staged Determinism:
Stage 1 - Pre-Caption
We take what is, to build a plain English prompt.
1girl, blonde hair, red dress, pink shoes, red eyes: becomes
A woman with blonde hair and red eyes is wearing a red dress with pink shoes.

Our subject counter is ready.
Then we take this information here, and include it into the JoyCaption prompt for additional solidity.
Stage 1 - Post-Caption Loopback
We take this output caption and determine if it holds the same necessary identification tags that we used originally by running another deterministic word-by-word parsing pass 2 with error correction.
"An elegant woman wearing a pink dress with blue shoes and red hair."
Well, that match rate is less than 60%, so we'll run it again with fewer tags. Loopback isn't that simple.
Error Correction: after an Nth number of runs defined by the setup process, it'll hit a maximum number of loopbacks and simply accept what is there as different, altering the core tags to match, determining that LARGE_TAGGER_V3 or JoyTagger is inaccurate and that JoyCaption has resolved the problem.
Based on fail rate, the seed will be changed and the caption shuffled for re-submission based on simple noise and our favorite RNG system. We calculate error rate logically and deterministically.
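The word-by-word match rate driving the loopback decision could be computed roughly like this (the tokenization is deliberately naive, and the function name is mine):

```python
def tag_match_rate(caption, required_tags):
    """Fraction of required tags whose every word appears in the
    caption, compared case-insensitively, word by word."""
    words = set(caption.lower().replace(".", " ").replace(",", " ").split())
    if not required_tags:
        return 1.0
    hits = sum(
        1 for tag in required_tags
        if all(word in words for word in tag.lower().split())
    )
    return hits / len(required_tags)
```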
Stage 2 - Post Caption Determinism
If the tag match accuracy rate is below a defined percentage after the loopback structure, segmentation analysis, English similarity, and tag count, the image should be flagged for additional determinism. This entire basis has been restructured and reworded.
We do it again, in a 2x2 quadrant. We identify and segment to determine validity within bboxes. We then use the bbox to determine the position within the image and calculate the probability of ownership based on a parsed person.
Even after that system, it will still have a fairly high error rate for a lot of things such as hair color, hair styles, and so on, but it won't be so bad that it hits below a 60% threshold most of the time. Essentially, if segmentation is enabled we can simply replace a lot of things automatically, but occasionally it will be dramatically bad, ranging from entire people included who shouldn't be in the image, which is something we need to avoid when running automated tagging at a large scale, to entirely incorrectly formatted captions.
So what we'll do is sample the image in an Nth number of slices, where the image is simply divided vertically and horizontally and then captioned by section. A small amount of the image on the edges and borders will be blurred, and a percentage of the nearby sections is blurred based on an allocated rate and included. The smaller images should caption faster, but this requires experimentation on more powerful GPUs to be sure. This will allow us to section our prompt into four key locations, as the mechanisms within Flux respond highly accurately to 2x2 grids and semi-accurately to 3x3 grids. We are going to use this to identify images based on quadrant. This process is similar to how you would Img2Img loopback when you want to replace sections of an image without completely destroying and introducing noise to the sections.
We take the output tags and prune them using the impossible-combination list, then normalize the tags based on probability. Finally, we caption it again and run Stage 1 one last time. We loopback as many times as necessary, using the ratios of accuracy to determine the outcome.
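The quadrant slicing could be sketched like this; the blur of borders and neighbor sections is omitted, and the overlap parameter is my assumption:

```python
from PIL import Image

def quadrant_slices(image, overlap=0.1):
    """Split an image into a 2x2 grid of crops, each extended by an
    `overlap` fraction into its neighbors so border context survives."""
    w, h = image.size
    mx, my = int(w * overlap / 2), int(h * overlap / 2)
    boxes = []
    for row in range(2):
        for col in range(2):
            left = max(0, col * w // 2 - mx)
            top = max(0, row * h // 2 - my)
            right = min(w, (col + 1) * w // 2 + mx)
            bottom = min(h, (row + 1) * h // 2 + my)
            boxes.append((left, top, right, bottom))
    return [image.crop(box) for box in boxes]
```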
Stage 3 - Rebuilding simple logic.
Depending on whether Stage 2 is enabled or not, we may have completely butchered our prompt with Stage 2, or it may simply make no sense. This is where our OTHER LLM models come in handy. There are quite a few of them around, so that's going to help out a bit. We simply feed our caption into the 8b LLM of our choosing and have it reformat it while retaining its elements. Any missing key elements such as hair color, eye color, position, and so on can be post-deterministically applied again to retain solidity. We now have our final caption and tags.
Final Stage - Post Caption Assessment and Tagging
These can be completely omitted with a bool flag or reworded as desired internally.
auto_best_caption, auto_good_caption, auto_okay_caption, auto_bad_caption, auto_worst_caption.
accuracy_90, accuracy_80, accuracy_70, accuracy_60, accuracy_50, accuracy_40, accuracy_30, accuracy_20, accuracy_10.

Based on the accuracy of the caption against the sampled tags and any included tags required by the user, we can assume there will never be a 100% accuracy. At most we can assume the accuracy is 95%, which means "auto_best_caption" or "accuracy_90".
If the accuracy rate is below 40% (the default, adjustable rate), the image will be flagged, placed into a folder called "", and tagged with the "auto_worst_caption" tag along with its accuracy tag. You can deal with that all you want.
Manually tagged images should be tagged with "auto_best_caption" and "accuracy_90_up" due to you being human and determining the accuracy.
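The tag assignment might look like this; the intermediate quality cut-offs are hypothetical, and only the 40% worst-case floor comes from the text above:

```python
# Hypothetical cut-offs; only the 0.40 "worst" floor is from the spec.
QUALITY_TAGS = [
    (0.90, "auto_best_caption"),
    (0.75, "auto_good_caption"),
    (0.60, "auto_okay_caption"),
    (0.40, "auto_bad_caption"),
    (0.00, "auto_worst_caption"),
]

def accuracy_tags(accuracy):
    """Map an overall accuracy score in [0, 1] to a quality tag plus a
    decile accuracy tag (clamped to accuracy_10 .. accuracy_90)."""
    quality = next(tag for cutoff, tag in QUALITY_TAGS if accuracy >= cutoff)
    decile = min(max(int(accuracy * 10) * 10, 10), 90)
    return [quality, f"accuracy_{decile}"]
```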
This may sound a bit overkill, and it really is. However, some images simply don't want to caption. Some simply cannot be identified without a series of passes and error checking. Some simply cannot be tagged automatically, and this system should flag those for direct intervention from the user.
This provides a blueprint for the progression of caption solving, and I'll hopefully be finishing the first iteration of this project by this weekend, with a LoRA to play with based on the outcomes from it.


