
On SDXL and its captioning data (and all other public models)

Some of you may recognize me from my LoRAs that manipulate overfitting in SDXL to apply strong changes with very small image sets and to improve overall prompt adherence.

Over time, these LoRAs have improved, but this weekend I may have had an unrelated breakthrough.

Currently, all existing models face a generalization issue involving two key components:

  1. Proper Noun Pollution: Models cannot generalize proper noun data, which weakens their semantic understanding. Various methods can suppress overfitting on certain proper nouns, but these names still permeate major models (for instance, Taylor Swift persists like a ghost in the machine if you pay attention).

    1. This problem becomes evident if, for example, you want to generate an image of a woman named Lisa who is moaning. SDXL tends to bias toward the Mona Lisa. It won't directly generate the Mona Lisa, but you'll notice something odd: generating a series of different moaning women produces a higher rate of aberrations (wrong finger counts, messed-up eyes, and other anatomical issues), especially when the prompt semantically or physically relates to overfitted proper nouns.

  2. Resultant Overfitting: This overfitting occurs when a model fixates on specific concepts or images because of an overabundance of similar data in its training set. The fixation degrades the model's ability to generate diverse and accurate representations, so attempts to create variations of an image produce outputs with significant defects, such as incorrect anatomy or distorted features. These issues are most pronounced in areas heavily influenced by overfitted data.

Here's the solution:

  1. Generate Variations: Create prompts with significant variation to "smooth out" overfitting. For instance:

    • Envision a cat whose fur mimics the vibrant red and dotted texture of a strawberry, without any actual strawberries present.

    • Picture a feline whose coat looks remarkably like strawberry skin, complete with a rich, red hue and tiny seed-like speckles.

    • Imagine a cat with a unique coat pattern that resembles the outer surface of a strawberry—bright red with small, seed-like details.

  2. Use a Strong Prompt for Creativity: Give an LLM a strong creativity prompt and have it break the concept into 32+ prompts that describe the same image in varied ways. This smooths out overfit areas by quieting overly weighted regions.

  3. Boost Signal: Once overfitting is reduced, you can increase the signal (e.g., crank the CFG to 20) to achieve the desired result. If the model is no longer overfitted, it won't "burn in," and it will likely generalize (see the sketch after this list).

    1. https://civitai.com/images/12391521

    2. Notice the CFG is 25 with no burn in.
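
To make steps 2 and 3 concrete, here is a minimal sketch using the diffusers library and the public SDXL base checkpoint. The short prompt list stands in for the 32+ LLM-generated variations, and the model ID, guidance value, and step count are illustrative assumptions, not the exact settings behind the linked image.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Minimal sketch: cycle through varied phrasings of the same concept and
# generate at a high guidance scale. In practice the prompt list would come
# from an LLM asked to reword the concept 32+ different ways.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt_variations = [
    "a cat whose fur mimics the vibrant red and dotted texture of a strawberry",
    "a feline whose coat looks like strawberry skin, rich red with tiny seed-like speckles",
    "a cat with a coat pattern resembling the outer surface of a strawberry, bright red with seed-like details",
    # ...the remaining LLM-generated variations go here
]

for i, prompt in enumerate(prompt_variations):
    image = pipe(
        prompt,
        guidance_scale=20.0,       # boosted signal; only viable once overfitting is smoothed out
        num_inference_steps=30,
    ).images[0]
    image.save(f"strawberry_cat_{i:02d}.png")
```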

By applying these steps, you can mitigate the impact of overfitting, leading to improved prompt adherence and more coherent image generations.

Once the overfitting is smoothed out, boosting the signal (e.g., CFG 20) no longer "burns in," and the model is likely to generalize. This is also why Fooocus works and why prompt engineering can turn a previously incoherent image into a coherent one: stacking a wide range of elements into the tokenizer, assuming most concepts are not overfit, moves in the right direction in terms of smoothing out overfit tokens. The method above is more effective, though, because it targets the desired image specifically while smoothing at the same time, enabling image generations that are not possible in vanilla SDXL.
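
As a rough illustration of that stacking idea, here is a hypothetical prompt-expansion helper in the spirit of Fooocus-style expansion. The fragment list and function name are my own for illustration, not Fooocus internals (Fooocus uses a learned expansion model).

```python
import random

# Hypothetical illustration: pad a short prompt with varied, non-overfit
# descriptive fragments so the tokenizer sees a broad mix of concepts instead
# of a few heavily weighted tokens. Fooocus achieves a similar effect with a
# learned prompt-expansion model; this hand-rolled version only shows the idea.
FRAGMENTS = [
    "soft diffused lighting", "richly detailed textures", "natural color palette",
    "balanced composition", "subtle depth of field", "crisp focus on the subject",
]

def expand_prompt(base_prompt, n_fragments=4, seed=0):
    rng = random.Random(seed)
    extras = rng.sample(FRAGMENTS, k=min(n_fragments, len(FRAGMENTS)))
    return ", ".join([base_prompt] + extras)

print(expand_prompt("a cat whose fur resembles strawberry skin"))
```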

Training on artist names, titles, and photograph names has done significant damage to the generalization of these models. Conversely, if fine-tunes carefully remove non-generalizable data from captions, models will become much more intelligent within a few generations.

The attached PDFs contain a primitive instruction set for building prompts that help models "smooth" out overfitting. I fed these prompts recursively through a model to make them less "overfit" themselves.

I hope the community will pick this up so we can improve it together, both the band-aid instructions and the actual models. Fixing captions can be done by machines, and if the fixed captions are used in future fine-tunes, older models could surpass current ones in prompt adherence.
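
As a rough sketch of what machine caption fixing could look like, the following uses spaCy's named-entity recognizer to strip proper nouns (people, titles of works, organizations) from captions before training. The model choice and label set are my assumptions; a real cleaning pipeline would need something more careful than this.

```python
import spacy

# Rough sketch: drop non-generalizable proper nouns from training captions.
# Assumes the small English spaCy model; the chosen entity labels are one
# possible filter, not a definitive list.
nlp = spacy.load("en_core_web_sm")
STRIP_LABELS = {"PERSON", "WORK_OF_ART", "ORG"}

def clean_caption(caption: str) -> str:
    doc = nlp(caption)
    kept = [t.text_with_ws for t in doc if t.ent_type_ not in STRIP_LABELS]
    return "".join(kept).strip()

# Recognized names are removed; leftover punctuation would still need tidying.
print(clean_caption("Portrait of a woman painted in the style of the Mona Lisa"))
```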

People need to stop underestimating the capacity of these models to generalize and instead focus carefully on what they can actually generalize before feeding it to them. Without multimodality, titles are nonsensical. Even humans struggle with this, and we are multimodal (consider racism, sexism, or judging intelligence by hairstyle).

If we want models to generalize concepts like "inside of" or "outside of," numbers, etc., we need to consider:

Is tagging "70" on a photo from the 70s going to interfere with signal clarity when generating an image with "70 bottle caps"? Models without time coherence cannot generalize temporal information properly.

Numbers should generalize. Multi-individual prompts should generalize. Concepts like the strawberry cat should generalize.

We need a collective, careful discussion about what these models are capable of learning to avoid wasting resources on minutiae that amount to machine superstition.

This also means that Fooocus can be improved with a concerted community effort, although if models are properly trained, it may not be necessary.
