A Visual Guide to Training and AI
About the author:
I have four thousand working hours and 100k training hours. Am I an AI expert? No.
I had put in over two thousand working hours before I learned that selecting CLIP training did not actually train the CLIP (directly).
At the four thousand hour mark I learned about Byte Pair Encoding
After thousands of trainings and hundreds of failures, what have I learned?
Know Your Model and Your Tools
What do most of the popular models have in common?
UNET
CLIP
VAE
Some models also have an LLM or additional text encoder (TE), like Flux.
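If you want to see these pieces for yourself, here is a minimal sketch assuming the Hugging Face diffusers library; it loads the public SDXL base checkpoint and prints the components most popular models share:

```python
# Minimal sketch: inspect the shared components of an SDXL pipeline.
# Assumes the Hugging Face diffusers library is installed.
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)

print(type(pipe.unet).__name__)            # UNet2DConditionModel -> the UNET
print(type(pipe.text_encoder).__name__)    # CLIPTextModel -> CLIP-L
print(type(pipe.text_encoder_2).__name__)  # CLIPTextModelWithProjection -> the second CLIP SDXL adds
print(type(pipe.vae).__name__)             # AutoencoderKL -> the VAE
```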
The primary focus of LORA training is the UNET.
It is important to note that even in a full finetune the CLIP model is never directly trained, and 99% of the site's user base has never trained a VAE or a CLIP.
So you may ask: why should I know about CLIP and the VAE if I just want to train LORAs? And what the hell, isn't this supposed to be a visual article? Well, I commend you for reading and not skimming. Let's get to that.
The Apple Test
For the purpose of this article, imagine that apple and dog are poorly trained subjects.
So how would you caption this image for training a LORA? I think the majority are using an AI tagging or captioning system, so let's look at the output from the onsite tagger.
food, apple, no humans, fruit, white background, simple background, food focus, still life, realistic
So what could be improved with these tags?
"Red apple" - color information is important for text-to-image pairs.
Remove "fruit" and "food focus"; this is not a compilation of fruit.
An optimal caption would still need human direction, and would contain lighting and color descriptions as well as how close the object is to the camera. "White background" is a good tag; "simple background" could be removed, in my opinion.
My suggested Tag
A close up high definition photo of a realistic red apple, apple lit from above against a white background.
The Dog Test (The Perfect example of a bad caption)
Onsite Tagger Results
no humans, animal focus, realistic, animal, tongue, whiskers, tongue out, solo, black eyes, open mouth
The onsite results for this image are a perfect example of what not to train on. Let me clarify.
Assuming you are training UNET/CLIP (CLIP-UNET projection), what information are you giving the LORA?
A generic "animal focus" tag, no information about the ball, "tongue out" when the mouth is open but the tongue is not out, and no depth-of-field or bokeh tag.
This one image with that caption could ruin the LORA. While it is not a high-quality image, if we want to use it anyway (it's of a pet, or needed for some other reason), my suggested tag:
A photo of a golden retriever with a green ball in its mouth, mouth focus image with depth of field.
Commas = Tokens
You might have noticed I am using natural language and not comma-separated tokens. "But I want to train PONY, I need WD14 tags, right?" That is not the case. Although most of my PONY LORAs use WD14 captions due to the ease of use, the better practice is to let the CLIP model do its work.
So we need at least a basic understanding of how the UNET, CLIP and VAE work together (a sketch of a single training step follows the three descriptions below).
The UNET
The UNet alone is not a diffusion model, but it is an essential component of the reverse diffusion process in diffusion models. Role - Images
The CLIP
The CLIP model provides contextual information about image/text pairs. For Stable Diffusion models the vision part of the model is removed, providing text guidance only. Role - Text Guidance
The VAE
The VAE has multiple functions: it encodes images into the latent space and decodes latents back into images, and it generates the latents used for training. Role - Multifunction
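Putting the three roles together, here is a hedged sketch of one training step in the style of the common diffusers training scripts (simplified, single text encoder as in SD1.5; SDXL adds a second text encoder and pooled embeddings). The components passed in are assumed to be loaded elsewhere; this is illustrative, not the exact code any particular trainer runs:

```python
import torch
import torch.nn.functional as F

def training_step(vae, tokenizer, text_encoder, unet, noise_scheduler, pixels, captions):
    """One simplified diffusion training step (illustrative only)."""
    # 1) VAE: compress the image into the latent space the UNET actually sees.
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor

    # 2) Forward diffusion: add noise at a random timestep.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # 3) CLIP: turn the caption into the text guidance the UNET is conditioned on.
    input_ids = tokenizer(
        captions, padding="max_length", truncation=True,
        max_length=tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(latents.device)
    text_embeds = text_encoder(input_ids)[0]

    # 4) UNET: predict the noise, conditioned on the text embeddings.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeds).sample

    # 5) The loss only ever touches the UNET's prediction.
    return F.mse_loss(noise_pred, noise)
```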
The Most Important Consideration
Your 1024x1024 image, or even your 2048x2048 image, gets broken down into smaller chunks for SDXL/PONY; imagine this is what the model "sees".
When tagging an image you are not describing the positional information to the UNET; the UNET is provided that information via the VAE. Otherwise, how would it reconstruct the apple out of those little chunks?
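To make those chunks concrete, here is a minimal sketch assuming the diffusers library (the VAE repo ID is just one commonly used SDXL VAE): a 1024x1024 image is encoded into a 4-channel 128x128 latent, which is the representation the UNET is actually reconstructing.

```python
import torch
from diffusers import AutoencoderKL

# One commonly used SDXL VAE; any SDXL-compatible VAE applies the same 8x compression.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix")

# Stand-in for a 1024x1024 RGB training image, scaled to the [-1, 1] range the VAE expects.
pixels = torch.rand(1, 3, 1024, 1024) * 2 - 1

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()

print(pixels.shape)   # torch.Size([1, 3, 1024, 1024])
print(latents.shape)  # torch.Size([1, 4, 128, 128]) -- the smaller chunks the UNET works on
```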
But why do we need to tell the model the dog has a ball in its mouth if the UNET already knows where the ball is? For this we need to consider CLIP training.
Projected CLIP training
I argue for using the term "projected CLIP training", as the CLIP-L is not directly trained; rather, the UNET model or LORA has "TEXT" blocks that influence the CLIP.
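One way to see this is to look at what a typical LoRA file actually stores. A hedged sketch, assuming a kohya-style .safetensors LoRA (the filename is hypothetical): the low-rank "text" weights live inside the LoRA file alongside the UNET weights, while the base CLIP checkpoint itself is never modified.

```python
from collections import Counter
from safetensors import safe_open

lora_path = "my_lora.safetensors"  # hypothetical path to a kohya-style LoRA file

prefix_counts = Counter()
with safe_open(lora_path, framework="pt") as f:
    for key in f.keys():
        # kohya-style keys look like "lora_unet_..." or "lora_te_..."
        prefix_counts["_".join(key.split("_")[:2])] += 1

# e.g. Counter({'lora_unet': ..., 'lora_te': ...}) -- the "text" weights live in the
# LoRA file itself; the base CLIP checkpoint is untouched.
print(prefix_counts)
```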
How does the CLIP work? This is quite complex but we need to understand it to guide our captioning.
CLIP-L has the ability to tokenize from natural language. However, if you feed it a single word followed by a comma it will use that as a token, such as "dog" or "apple". Long or out-of-vocabulary words might take up several tokens.
Given natural language it will likely develop better paths for its text guidance. It is also likely to use the natural language to choose more accurate tokens that guide the UNET.
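A quick way to see what CLIP-L actually receives is to run captions through its tokenizer. A minimal sketch, assuming the transformers library and the standard openai/clip-vit-large-patch14 tokenizer (the made-up character name is just a hypothetical out-of-vocabulary word):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Comma-separated tags: each short in-vocabulary word becomes a single token.
print(tokenizer.tokenize("dog, ball"))

# Natural language: still a handful of tokens, but the relationships are spelled out.
print(tokenizer.tokenize("a photo of a golden retriever with a green ball in its mouth"))

# Out-of-vocabulary words are split into several BPE sub-tokens rather than one.
print(tokenizer.tokenize("glimmerfang"))  # hypothetical character name
```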
Example:
Imagine the golden retriever image and the image above both had only the tags "dog" and "ball". This is a very likely outcome. What effect will it have?
As mentioned, the UNET knows from the training image that the dog is on top of the ball, but how will the users of your LORA prompt for that image?
What use was training the projected CLIP if you're providing it with the same two tokens?
Could the LORA work with just dog and ball tokens?
Yes, if the dataset was just dogs on top of balls; by training just those two tags you would guide the CLIP to make that association.
But could you have had a far superior LORA by describing each image? In my experience, yes. Higher-quality captioning yields better results than hundreds or thousands of images.
Should you only use the onsite natural language captioner?
The short answer is no. For smaller models without LLM guidance, the captions provided by BigAsp are far too wordy. It is likely you would hit the 77-token limit of CLIP-L before any useful tokens were inferred.
Unless you are training a model with a T5 or Llama text encoder, which have a 500-token limit and guide the CLIP-L, it is very easy to hit the 77-token mark.
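If you want to check whether a caption fits that window, here is a minimal sketch using the same CLIP-L tokenizer from transformers (the caption string is just a placeholder for your own):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

caption = "A close up high definition photo of a realistic red apple, apple lit from above against a white background."

ids = tokenizer(caption).input_ids      # includes the start and end tokens
limit = tokenizer.model_max_length      # 77 for CLIP-L

print(f"{len(ids)} of {limit} tokens used")
if len(ids) > limit:
    print(f"{len(ids) - limit} tokens would be truncated and never guide the UNET")
```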
Key Points
Have a basic understanding of UNET, CLIP and VAE
CLIP is never directly trained even during a finetune - "Text" blocks in the UNET are
CLIP-L was intended to tokenize from natural language, not to be fed hundreds of single tokens
CLIP-L has BPE and can make new tokens for words not in its vocabulary (such as characters or celebrities)
Better captioning is superior to a larger dataset
Note: The goal of this article is for those who come after to avoid repeating the mistakes that I have made. I have trained many poorly captioned LORAs that are popular, but they could have been better.