JoyCaption: Alpha Two Release

The second JoyCaption release, Alpha One (https://civitai.com/articles/7540/joycaption-alpha-one-release), went well. But it's already time for Alpha Two.

The Demo

https://huggingface.co/spaces/fancyfeast/joy-caption-alpha-two

WARNING

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

This is a preview release, a demo, alpha, highly unstable, not ready for production use, not indicative of the final product, may irradiate your cat, etc.

JoyCaption is still under development, but I like to release early and often to garner feedback, suggestions, and involvement from the community. So, here you go!

What's New

This release builds off of Alpha One with two minor tweaks and one bug fix.

  • More Training: I've increased the training time by 20% this go around, to help improve accuracy a little bit more (details below).

  • More modes, more flexibility: JoyCaption Pre-Alpha had one mode. Alpha One had four modes. Alpha Two? Nine modes! And 17 extra instructions to guide the captions! You do the math.

  • Bug Fix: I never meant for JoyCaption to generate captions longer than 256 tokens (its training limit to date), but sometimes it goes a little over. That's okay. What isn't okay is ending generation in the middle of a sentence! A bug introduced in Alpha One's training caused the model to truncate captions that approach the limit instead of finishing them cleanly. This should now be fixed.

More Training (Details)

One commenter mentioned on the last release (Alpha One) that the quality felt a little lower compared to Pre-Alpha, and I agreed. Of course, it's hard to tell with stochastic models like these. Importantly, Alpha One tended to write shorter captions, even in Very Long mode, which can make it feel like it is less accurate simply because its errors stand out more.

So, I wanted to address two things. How was Alpha One actually performing compared to Pre-Alpha? And, can we make it better?

To answer the first question, I have painstakingly and very manually scored the models against a set of 15 Validation Images. No version of JoyCaption has ever been trained on these validation images, so they are the gold standard I use to validate the accuracy of any given version. When scoring, I always run JoyCaption in Very Long, Formal, Descriptive mode, which matches Pre-Alpha most closely. Each image has a carefully written, 100% accurate human caption associated with it, about 200 words long. I score a given caption by counting each mentioned detail, adding +1 if the detail is correct and -1 if it is incorrect. This produces a scoring system that heavily punishes mistakes. The human caption scores tend to be very high, since there are no mistakes and humans are far better than SOTA AI models at cramming details into a caption. With all that said, here are the results:

              | Human | Pre-Alpha | Alpha 1 | GPT4o | Alpha 2 |
Average Score |  41.9 |      22.6 |    21.5 |  22.7 |    22.8 |
Std. Dev.     |  6.30 |      5.32 |    6.41 |  3.88 |    5.32 |

JoyCaption goes toe-to-toe here with GPT4o, the strongest image captioning model I've personally tested so far. It only loses on variance, with GPT4o being far more consistent in its quality.

Most importantly, Alpha Two, this release, recovers the accuracy of Pre-Alpha.
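
For the curious, the scoring described above boils down to something like the sketch below. This is illustrative only, not the exact tooling used; the function name and example numbers are mine.

```python
# Illustrative sketch of the +1/-1 detail scoring described above: each detail
# a caption mentions is judged correct or incorrect, and the caption's score
# is the sum of those judgements.
def score_caption(detail_judgements: list[bool]) -> int:
    """detail_judgements[i] is True if the i-th mentioned detail is correct."""
    return sum(1 if correct else -1 for correct in detail_judgements)

# A caption that mentions 25 details and gets 24 of them right scores 23.
print(score_caption([True] * 24 + [False]))  # -> 23
```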

More Modes (Details)

You'll find that the demo above has a lot more modes and a lot more options. It was always my plan to slowly move JoyCaption toward being more like a traditional VLM where you can ask it, in natural language, for what you want. This provides the ultimate flexibility to users. But of course, I'm just one dev, so JoyCaption v1 was meant to be a simple captioner with a handful of modes to cover some common use cases.

But, based on feedback, there was a really important use case that I didn't want to miss. The user StableLlama on Reddit made two points that really stood out to me: for training character LoRAs, it can be helpful if the caption doesn't mention physical characteristics of the character, since we want those "summed up" by the character name/trigger word (e.g. the model should already know Lola Bunny has rabbit ears); and it would be nice if JoyCaption itself could help incorporate the name/trigger word into the caption.

So, I said FUCK IT and re-trained JoyCaption with 100k more captions based on a quick set of instructions I threw together. The instructions are nowhere near comprehensive, so Alpha Two isn't a general instruction follower. But the new Caption Types you'll find in the demo are part of this expansion of capabilities. The Extra Options are additional instructions that can be appended to JoyCaption's prompt to influence how it writes the caption. This includes, of course, "If there is a person/character in the image you must refer to them as {name}." and "Do NOT include information about people/characters that cannot be changed (like ethnicity, gender, etc), but do still include changeable attributes (like hair style)." Don't forget to fill in the name field if you're using that feature!
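
To make that concrete, here's a rough sketch of how a final prompt comes together: a base caption-type prompt, plus any selected Extra Options, with the Name field substituted into options that contain {name}. The helper function and the base prompt string are illustrative assumptions; the two option strings are the ones quoted above, and the demo shows you the exact prompt it actually used.

```python
# Hypothetical helper showing how Extra Options get appended to the base
# prompt and how the Name field is filled in. Illustrative only.
def build_prompt(base_prompt: str, extra_options: list[str], name: str = "") -> str:
    prompt = " ".join([base_prompt, *extra_options])
    return prompt.replace("{name}", name) if name else prompt

prompt = build_prompt(
    "Write a long descriptive caption for this image in a formal tone.",
    [
        "If there is a person/character in the image you must refer to them as {name}.",
        "Do NOT include information about people/characters that cannot be changed "
        "(like ethnicity, gender, etc), but do still include changeable attributes "
        "(like hair style).",
    ],
    name="Lola Bunny",
)
print(prompt)
```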

For transparency, you can see the prompt that was fed to JoyCaption, listed above the caption. And you can, of course, use a custom prompt. But as I mentioned, Alpha Two isn't very general and won't work well outside of the instructions it has been trained on. For example, asking it "Write a caption for this image using only emojis" does "work", even though it was never trained on that request, but I have yet to get it to successfully write a caption "in the style of a pirate."
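
If you'd rather script the hosted demo than click through the UI, something like the gradio_client package can drive the Space. This is a sketch under my own assumptions: the Space's endpoint and argument names aren't documented in this post, so query them first with view_api() before calling anything.

```python
# Sketch of scripting the hosted demo with gradio_client. The predict() call is
# left commented out because its argument names and api_name are placeholders
# that must be confirmed via view_api().
from gradio_client import Client, handle_file

client = Client("fancyfeast/joy-caption-alpha-two")
client.view_api()  # prints the Space's callable endpoints and their parameters

# Once you know the signature, a call looks roughly like this:
# result = client.predict(
#     handle_file("my_image.png"),
#     "Descriptive",            # caption type (placeholder)
#     "long",                   # caption length (placeholder)
#     api_name="/...",          # fill in from view_api()
# )
# print(result)
```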

This is a small step toward making it easier to add new modes in the future, and eventually toward making JoyCaption a generalist VLM.

Caveats

Like Alpha One, expect Training Prompt, Booru tags, and MidJourney modes to be very unreliable and experimental. Informal style remains ... interesting. And all the caveats from before regarding accuracy and OCR quality still apply.

The new Extra Options aren't silver bullets and won't be 100% reliable.

Feedback

Please let me know what you think of the new features! Feedback is always welcome and crucial to helping me improve JoyCaption for everyone. Are there specific instructions you want to see added? Other modes? Or other vision-related tasks?
