JoyCaption: Alpha One Release

Following up on my previous article on JoyCaption (https://civitai.com/articles/7383?highlight=513741), training has completed with all the tweaks, so: Alpha One demo time.

The Demo

https://huggingface.co/spaces/fancyfeast/joy-caption-alpha-one

WARNING

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

This is a preview release, a demo, alpha, highly unstable, not ready for production use, not indicative of the final product, may irradiate your cat, etc.

JoyCaption is still under development, but I like to release early and often to garner feedback, suggestions, and involvement from the community. So, here you go!

What's New

First and foremost, based on feedback, I expanded the dataset in various directions to hopefully improve: anime/video game character recognition, classic art, movie names, artist names, watermark detection, male NSFW understanding, and more.
Second, and perhaps most importantly, you can now control the length of captions JoyCaption generates! You'll find in the demo above that you can ask for a number of words (20 to 260 words), a rough length (very short to very long), or "Any", which gives JoyCaption free rein.
Third, you can now control whether JoyCaption writes in the same style as the Pre-Alpha release, which is very formal and clinical, or a new "informal" style, which will use such vulgar and non-Victorian words as "dong" and "chick".
Fourth, there are new "Caption Types" to choose from. "Descriptive" is just like the pre-alpha, purely natural language captions. "Training Prompt" will write random mixtures of natural language, sentence fragments, and booru tags, to try and mimic how users typically write Stable Diffusion prompts. It's highly experimental and unstable; use with caution. "rng-tags" writes only booru tags. It doesn't work very well; I don't recommend it. (NOTE: "Caption Tone" only affects "Descriptive" captions.)

The Details

Check my previous article for more details on what's been going on behind the scenes, but the short version is that I spent a long time trying to get Training Prompt mode working and failing miserably. It remains unstable and will tend to go haywire, falling into a spiraling repetition loop. So while it kinda works sometimes, I can't recommend using it yet. Also, it has picked up on some idiosyncrasies in the training data due to the lack of data in that mode, so more work is needed anyway.

That said, I'm quite happy with the new length conditioning controls on Descriptive captions. They help a lot with reducing the verbosity of the captions. And for training Stable Diffusion models, you can randomly sample from the different caption lengths to help ensure that the model doesn't overfit to a particular caption length.
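
As a rough sketch of that random sampling idea: the snippet below just picks a length setting per image before captioning. It does not call JoyCaption at all, and the function names, the 10-word step, and the in-between rough-length labels are assumptions for illustration (only the 20 to 260 range, "very short", "very long", and "Any" come from the demo description above).

```python
import random

# Assumed option values. Only the 20-260 word range, the "very short" and
# "very long" endpoints, and "any" are taken from the demo description;
# the step of 10 and the middle labels are guesses.
WORD_COUNTS = list(range(20, 261, 10))
ROUGH_LENGTHS = ["very short", "short", "medium-length", "long", "very long"]


def sample_caption_length(rng):
    """Pick one caption length setting at random (hypothetical helper)."""
    kind = rng.choice(["words", "rough", "any"])
    if kind == "words":
        return str(rng.choice(WORD_COUNTS))
    if kind == "rough":
        return rng.choice(ROUGH_LENGTHS)
    return "any"


def assign_lengths(image_paths, seed=0):
    """Map each image path to a randomly chosen length setting."""
    rng = random.Random(seed)  # seeded so a captioning run is reproducible
    return {path: sample_caption_length(rng) for path in image_paths}
```

You would then caption each image with whatever length it was assigned, so the training set ends up with a mix of short and long captions instead of one uniform length.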

Caveats

As stated, Training Prompt mode is still not working very well, so use with caution. rng-tags mode is mostly just there to help expand the model's understanding; I wouldn't recommend actually using it.

Informal style is ... interesting. For training Stable Diffusion models, I think it'll be helpful because it greatly expands the vocabulary used in the captions. But I'm not terribly happy with the particular style it writes in. It very much sounds like a boomer trying to be hip. Also, the informal style was made by having a strong LLM rephrase half of the existing captions in the dataset; they were not built directly from the images they are associated with. That means that the informal style captions tend to be slightly less accurate than the formal style captions.

And the usual caveats from before still apply. I think the dataset expansion did slightly improve some things, like movie, art, and character recognition. OCR is still meh, especially on difficult-to-read stuff like artist signatures. And artist recognition is ... quite bad at the moment. I'm going to have to pour more classical art into the model to improve that. It should be better at calling out male NSFW details (erect/flaccid, circumcised/uncircumcised), but accuracy there still needs improvement.

Feedback

Please let me know what you think of the new features, if the model is performing better or worse for you, and anything else! Feedback, like before, is always welcome and crucial to helping me improve JoyCaption for everyone to use.