Universal CLIP
The number of possible token combinations with the FP32 model's 47k-token vocab is more than the number of atoms in the universe. That is to say, after removing the start and end tokens (77 - 2 = 75 usable positions), there are on the order of 10¹³² possible outcomes per seed.
The Basics
The LAION model is incredibly well-trained, so much so that only those with access to 500–750 A100 GPUs could realistically fine-tune it properly.
So why would I use any CLIP-L model other than the base one?
Take a look at the attached list of over 800 North American birds. Impressive, right? What's even more impressive is that each bird on that list has an image associated with it in LAION.
But here's the catch: the LAION model likely wasn't trained on diverse text associations for those birds.
What does this mean?
If a concept was captioned with little more than its name, the model learned those details mainly through image feature extraction. However, for diffusion tasks we use the text model, not the image feature model (the vision tower).
For example, if the base LAION model learned "Acadian Flycatcher" from little more than some metadata tied to a bird society, it will still show a high cosine similarity when tested against the image features of that bird.
But that won't help with text-to-text descriptions, or with using an LLM to guide image creation.
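If you want to see that retrieval-style match for yourself, here is a minimal sketch using the stock OpenAI CLIP-L checkpoint through Hugging Face transformers. The image path and the decoy name are placeholders, not part of any specific dataset; it simply scores a bare name trigger against the image features of a photo.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"   # stock CLIP-L; swap in any CLIP-L variant
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("acadian_flycatcher.jpg")         # placeholder image path
texts = ["Acadian Flycatcher", "Great Blue Heron"]   # name trigger vs. a decoy

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between each text embedding and the image embedding;
# a well-learned name trigger scores clearly higher than the decoy.
sims = torch.nn.functional.cosine_similarity(text_emb, image_emb)
print(dict(zip(texts, sims.tolist())))
```

A high score here only tells you the name maps well onto the image features; it says nothing about how much the text tower can do with that name on its own.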
Which has more power to guide an LLM:
A simple name trigger, relying on visual feature mapping, or
A detailed text description of the bird, like the caption below?
Photograph of an Acadian Flycatcher perched on a dark brown branch. The bird has a white underbelly, light brown upperparts with darker streaks, and a pale pink beak. Its small, dark eyes are focused upward. The blurred green and yellow background suggests a natural, forested setting. The bird's tail is slightly raised, and its feathers appear soft and well-groomed. The overall composition emphasizes the bird's delicate and subtle beauty.
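Here is a sketch of what the diffusion model actually receives from each of those prompts, again against the stock CLIP-L text tower (swap in whichever CLIP-L you are using; the shortened caption is just an excerpt). The bare name occupies only a handful of the 77 token positions, while the detailed caption fills most of them with distinct per-token states.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "openai/clip-vit-large-patch14"   # or your fine-tuned CLIP-L
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

prompts = {
    "name trigger": "Acadian Flycatcher",
    "detailed caption": ("Photograph of an Acadian Flycatcher perched on a dark brown "
                         "branch, white underbelly, light brown upperparts with darker "
                         "streaks, pale pink beak, blurred green forest background."),
}

for label, prompt in prompts.items():
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = text_encoder(**tokens).last_hidden_state   # (1, 77, 768) conditioning
    used = int(tokens["attention_mask"].sum())               # non-padding positions
    print(f"{label}: {used}/77 token positions carry prompt information, "
          f"conditioning shape {tuple(hidden.shape)}")
```

Those per-token hidden states are the conditioning signal; the richer the caption, the more of that signal the text tower has to offer.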
When to use this CLIP (a loading sketch follows the list):
PONY: 90+% improvement with JoyCLIP-L or this model.
SDXL: Better for NSFW tasks and compatible with JOY-PONY-G.
FLUX: Improved NSFW handling but some loss in base FLUX text abilities.
Avoid using it with Illustrious: Illustrious embeddings don’t align with any known vision models.
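If you want to swap a different CLIP-L into an SDXL/PONY pipeline in code, it looks roughly like this. This is a diffusers sketch, not a specific recommended pairing; the repo IDs are placeholders for whichever checkpoint and CLIP-L you actually use.

```python
import torch
from transformers import CLIPTextModel
from diffusers import StableDiffusionXLPipeline

# Placeholder IDs: point these at your SDXL/PONY checkpoint and your CLIP-L.
base_model = "stabilityai/stable-diffusion-xl-base-1.0"
custom_clip = "path/to/your-finetuned-clip-l"

text_encoder = CLIPTextModel.from_pretrained(custom_clip, torch_dtype=torch.float16)

# SDXL carries two text encoders; this replaces only the CLIP-L tower
# (text_encoder). The OpenCLIP-G tower (text_encoder_2) is left untouched.
pipe = StableDiffusionXLPipeline.from_pretrained(
    base_model,
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("Photograph of an Acadian Flycatcher perched on a branch").images[0]
image.save("flycatcher.png")
```

The FLUX pipeline in diffusers exposes its CLIP-L under the same text_encoder slot (with T5 as text_encoder_2), so the same kind of swap applies there.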
Further Training Details
How far did PONY CLIP deviate from the base vision model? About 10x further than you'd ever want to see. It hit a massive gradient-norm spike (around 50) during the first epoch. To put that into context, a gradient norm of 10 usually signals a "gradient explosion."
I didn't save the training graphs for the 100k PONY model, but it was the first training run to align PONY with LAION, and it required an intermediary vision model. That makes it a good base for training PONY CLIP, since you'll only face gradient norms in the range of 5–10.
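If you reproduce this kind of run, it helps to log and clip the global gradient norm every step so a spike like that doesn't wreck the weights. Below is a minimal PyTorch sketch with a toy linear layer standing in for the text tower; the threshold, optimizer, and fake data are illustrative, not the exact settings from these runs.

```python
import torch
from torch import nn

# Toy stand-in for the CLIP text tower; use your real model and data loader.
model = nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
max_grad_norm = 10.0   # illustrative threshold; the spike described above hit ~50

for step in range(100):
    x = torch.randn(32, 768)          # fake batch for the sketch
    loss = model(x).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()

    # clip_grad_norm_ returns the total pre-clipping norm, then rescales in place.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm).item()
    if grad_norm > max_grad_norm:
        print(f"step {step}: gradient norm {grad_norm:.1f} clipped to {max_grad_norm}")

    optimizer.step()
```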
JoyCLIP was a huge improvement for PONY, but the dataset didn’t work well with base CLIP-L. This failure was actually useful, as it allowed me to refine the dataset further.
As I mentioned earlier, improving the FP32 base model is a tough (if not impossible) goal, because it was trained at a scale of roughly 1–2 MWh per day with a batch size of 80–120k.
So, why train the base model at all?
In the case of FLUX, it's all about guiding LLMs on NSFW tasks. While most of us don’t have supercomputers or power plants, we can still refine the output of LLM-guided NSFW content.