JoyCaption: Progress Report

Wow, it's been over a month since I released the Pre-Alpha of JoyCaption. And despite being a pre-alpha, it seems to have been well received, even getting integrated into CivitAi recently. Very cool!

Work continues on JoyCaption, so here's an update on what has been going on behind the scenes since that early release.

Trials and Tribulations

My original plan for JoyCaption was to pre-train it on ~500k high-quality, long-form (~200 word) captions. I figured that packing lots of information into each example would give the pre-trained model the best understanding and maximize the efficiency of training. From there, the model could be fine-tuned on a smaller set of carefully chosen and written captions to guide it toward the desired final state: a captioning model that can operate in different modes. Long Descriptive mode would behave much like the Pre-Alpha does now, while Training Prompt mode would write captions more like Stable Diffusion prompts, with varying length (between 10 and 200 words), a mixture of natural language and booru-style tags, and a variety of writing styles, all meant to mimic how users typically write prompts for diffusion models.

The Pre-Alpha release was more or less that pre-training phase, which is why the currently released model is very verbose and clinical in its phrasing. The next step was to focus on Training Prompt mode, which I had hoped wouldn't take too long.

After writing 1,000 training prompt examples and retraining the model, I noticed two problems: 1) the model was struggling to follow the new writing style, and 2) the model would very frequently go haywire, falling into a repetition loop.

I tried doubling the number of training examples by writing another 1,000 prompts, but the problems persisted. It takes me about a day to write 100 prompts, so at this point it had been about 20 days since the Pre-Alpha's release.

Well, long story short, I ended up having to make some major training tweaks:

  • Repeating the training prompt examples during training to artificially inflate their representation.

  • Training the vision module.

  • Running the training for twice as long.

  • Enabling a LoRA on the LLM module.

  • Introducing a third mode, Booru Tags Only, with 100k examples, where the model is trained on captions that are purely booru tags.
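In PyTorch-ish terms, the first and fourth of those tweaks look roughly like the sketch below. This is simplified for illustration: the dataset variables, repeat factor, and LoRA target modules are placeholders, not my actual configuration.

```python
from torch.utils.data import ConcatDataset
from peft import LoraConfig, get_peft_model

# Assume `pretrain_captions` and `training_prompts` are existing torch
# Datasets, and `model.llm` is the language-model half of the network.

# Oversample the small hand-written Training Prompt set so it isn't
# drowned out by the ~500k pre-training captions.
REPEATS = 50  # placeholder repeat factor
train_data = ConcatDataset([pretrain_captions] + [training_prompts] * REPEATS)

# Attach a LoRA to the LLM module so it can adapt during fine-tuning
# without updating all of its weights. Target modules here are the
# usual attention projections; the real choice may differ.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model.llm = get_peft_model(model.llm, lora_cfg)  # only LoRA weights get gradients
```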

Many, many days of training later, the current in-progress model is ... mildly okay at writing Training Prompts. It still suffers from the previously mentioned problems, but less often.

So it seems that the pre-trained JoyCaption just isn't adapting to new modes as quickly as I had hoped, and it's likely that upwards of 10k manually written Training Prompts would be needed to get JoyCaption where I want it. And that, quite frankly, is exhausting work.

From the Top Again

What I suspect is going on is that while 500k long-form captions are great for building a strong model, they cause the Image->Text adapter (the only part trained during pre-training) to over-train on that specific task instead of more generally passing information to the LLM module, and so the model as a whole struggles to adapt and generalize during fine-tuning.
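To make "only the adapter is trained" concrete, pre-training amounts to freezing the other two modules, something like the sketch below (the module names are placeholders for whatever the real architecture exposes):

```python
import torch

# Freeze the vision encoder and the LLM; only the image->text adapter
# stays trainable during pre-training.
for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.llm.parameters():
    p.requires_grad = False
for p in model.adapter.parameters():
    p.requires_grad = True

# The optimizer then only ever sees the adapter's weights.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```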

To solve this I'm going back to the beginning and revamping JoyCaption's pre-training data and training:

  • Keeping the original 500k long-form captions (~200 words each), the same strong backbone.

  • Adding 200k medium-to-short-length captions (10-100 words), to help generalize caption length.

  • Conditioning on the caption length.

  • Informal rephrasing: Half of the captions will be rephrased to use informal language, to help the model generalize beyond its current clinical style.

This should hopefully provide several benefits, chiefly improving the pre-trained model's generalization and thus its ability to handle the illustrious Training Prompt mode. But it also means that JoyCaption will get some knobs that end users can turn to tweak its output, instead of everyone being stuck with the same thing. You'll be able to pick the approximate length of caption that you want, the type (descriptive, tags only, training prompt), and whether to use formal/clinical phrasing ("Photograph of a dog running in a lush field of grass") or more casual/informal phrasing ("Pic of a dog running in some green grass").
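To give a rough feel for how those knobs could come together, here's a toy prompt builder. The interface and wording are made up for illustration; the actual conditioning format isn't final.

```python
def build_request(caption_type: str, target_words: int, informal: bool) -> str:
    """Assemble a conditioning request from the three knobs: caption
    type, approximate length, and formality. Placeholder format only."""
    style = "casual, informal" if informal else "formal, clinical"
    return (
        f"Write a {caption_type} caption for this image, "
        f"roughly {target_words} words long, in a {style} style."
    )

print(build_request("descriptive", 40, informal=True))
# Write a descriptive caption for this image, roughly 40 words long,
# in a casual, informal style.
```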

Other Updates

In addition to those changes, I've also expanded and tweaked other aspects of JoyCaption's dataset to help address some of the initial feedback from the Pre-Alpha release. The following should hopefully be improved a bit: phallus state (erect/flaccid, circumcised/uncircumcised), attractiveness, anime/video game character names, classic art recognition, movie names, director names, artist names, watermark detection.

All of that will go into the next training runs and, if the AI gods are kind, I'll see some good progress being made. Fingers crossed?

Thank you to everyone who has provided feedback on the Pre-Alpha. ❤️
