
What makes a good dataset?

May 23, 2024
training guide

There's an old saying in machine learning: A data scientist spends 90% of their time on the data, and 10% on the model.

When working on a deep-learning model (or any AI/ML project in general), the most important aspect, usually by a wide margin, is the quality of your dataset!

All the finely tuned hyperparameters in the world don't mean anything if your data is shit!

Whether you're looking to train your own LoRA or planning to commission one, keep these concepts in mind to help guide the model to the best possible outcome during training.

TLDR;

I've got waifus to train, just cut to the chase.

  • Have around 30-50 images.

  • Don't use small or low-quality images.

  • JPEG, PNG or WEBP.

  • Make sure to vary the background, poses, style, etc. Anything you don't want to be learned should be as varied as possible.

  • No transparent images.

Cardinality

How many images do I need?

The answer differs dramatically depending on whether you are finetuning a model or training a LoRA. Finetuning is a lot more open-ended, and a "more is more" approach generally works.


For LoRA, things get more interesting. The vast majority of LoRA are trained using the DreamBooth method these days, and DreamBooth is remarkable enough that it can produce decent models from as few as 3 or 4 images! That said, the flexibility of the model (its ability to extrapolate new situations, themes or styles for your subject) will suffer if there are too few images. On the other hand, the sensitivity, or strength, of the model can suffer if you feed it too many examples and those examples start to conflict with or confuse the training. The perfect number is an impossible goal, but for LoRA training, these rough guides have served me well over more than a year of model training:

Note: The following numbers are per concept. If you want to train a LoRA on multiple concepts, multiply these amounts for each concept you want to train. The more concepts you train, the fewer images you may need per concept, but these remain a good starting point (there's a quick sketch for tallying per-concept counts after this list).

  • 3 - 10 images - The minimum for a simple subject, when you're not too worried about flexibility.

  • 10 - 30 images - A good range for training characters or simple concepts.

  • 30 - 50 images - A good range for complex characters, clothing and concepts.

  • 50 - 100 images - A good range for styles and highly complex concepts.

  • 100+ - Diminishing returns. Yes, you can train with as many images as you want, but generally adding more images at this point has dramatically reduced returns, or can be actively harmful as it becomes harder to balance the dataset (see below).
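If your dataset is split into one folder per concept (an assumed layout, not a requirement), a quick tally like this sketch makes it easy to check your counts against the ranges above:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def count_images_per_concept(dataset_dir: str) -> dict:
    # One subfolder per concept is an assumed layout; adapt to your own structure.
    counts = {}
    for concept_dir in sorted(Path(dataset_dir).iterdir()):
        if concept_dir.is_dir():
            counts[concept_dir.name] = sum(
                1 for p in concept_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS
            )
    return counts

if __name__ == "__main__":
    for concept, n in count_images_per_concept("dataset").items():
        print(f"{concept}: {n} images")
```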

Image size

What size images should I use?


Stable Diffusion is pretty clever... while its UNet is trained at a fixed size (512x512 for SD 1.5 and 1024x1024 for SDXL), it's actually capable of producing images at any size or aspect ratio (provided you've got the VRAM for the task).

The same is true of training. You can include images of any size in a dataset and SD will be able to train on them, but there are some sizes and aspects you will not want to include, as they can be actively harmful to the quality of the model.


Too small

If you include an image that is too small for the SD UNet, it will upscale the image (using a fairly simple upscaler, likely bicubic or something equally horrible), and you end up training on pixelated images.

Typically, a dataset can handle a few of these (particularly if you tag them as "blurry" or "pixel art"), but too many and your model will tend to produce blurry or artifact-ridden outputs as a result.

For reference:

  • SD 1.5 - 512x512 or 0.25 megapixels

  • SDXL/Pony - 1024x1024 or 1 megapixel


The next point will cover more on aspect ratios, but it's worth pointing out here that images don't have to be square, or meet a minimum for each of those dimensions per side... what matters is that the total number of pixels in the image (height times width) is not dramatically lower than the megapixel size of the model you are training against. This means an image that's, say, 432x576 still has enough pixel data to work well with SD 1.5.
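If you'd rather check this programmatically than eyeball it, here's a rough sketch (using Pillow; the folder path and extension list are assumptions) that flags images falling below the model's pixel budget:

```python
from pathlib import Path
from PIL import Image

# Minimum pixel budgets from the reference list above.
MIN_PIXELS = {"sd15": 512 * 512, "sdxl": 1024 * 1024}

def find_undersized(dataset_dir: str, model: str = "sd15") -> list:
    flagged = []
    for path in Path(dataset_dir).rglob("*"):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
            continue
        with Image.open(path) as img:
            w, h = img.size
        if w * h < MIN_PIXELS[model]:
            flagged.append(f"{path} ({w}x{h})")
    return flagged

for line in find_undersized("dataset", "sd15"):
    print("too small:", line)
```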

Too skinny, or a bad mix of aspects

It's a common misconception that you can't/shouldn't train SD models on non-square aspect ratios. In fact, training on only square aspects can result in overfitting on certain aspects of composition, like placement of subjects in the frame... variety can be useful here. What you do want to watch out for is images that are incredibly skinny or tall, or too many differing aspect ratios that prevent the training scripts from bucketing and batching the images with other similar aspects.


Bucketing is fairly complex and is more of an optimization than anything else, so don't worry too much about the distribution of your aspect ratios initially; only adjust later if you run into issues.
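For the curious, here's a heavily simplified sketch of the idea: each image gets assigned to the bucket whose aspect ratio is closest to its own, with every bucket holding roughly the same total pixel count. Real training scripts (kohya-ss, for example) build bucket lists automatically; the buckets below are illustrative only.

```python
from PIL import Image

# Hypothetical SD 1.5 buckets, each roughly 0.25 megapixels.
BUCKETS = [(512, 512), (448, 576), (576, 448), (384, 640), (640, 384)]

def nearest_bucket(image_path: str) -> tuple:
    # Pick the bucket whose aspect ratio best matches the image's own.
    with Image.open(image_path) as img:
        w, h = img.size
    aspect = w / h
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - aspect))

print(nearest_bucket("example.png"))  # e.g. (448, 576) for a portrait image
```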

Too big

This isn't usually a major issue, but if your dataset images far exceed the typical image size for the model you are training against, you can run into problems: it wastes disk space, the images have to be downsampled prior to training (which can introduce artifacts), and generating and caching the latents takes extra time.
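If you'd rather shrink oversized images up front, a rough sketch like this will downscale anything far over the target while preserving aspect ratio (the 2x threshold and the target budget are arbitrary assumptions):

```python
from PIL import Image

TARGET_PIXELS = 1024 * 1024  # SDXL budget; use 512 * 512 for SD 1.5

def shrink_if_oversized(path: str, threshold: float = 2.0) -> None:
    img = Image.open(path)
    w, h = img.size
    if w * h <= TARGET_PIXELS * threshold:
        img.close()
        return
    # Scale so the total pixel count lands on the target budget.
    scale = (TARGET_PIXELS / (w * h)) ** 0.5
    resized = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    img.close()
    resized.save(path)

shrink_if_oversized("huge_scan.png")
```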

Image quality

Garbage in, garbage out.

Things not to include in the dataset:

  • Compression artifacts (jaggies, speckle, etc.)

  • Noise - clean up the images if you can.

  • Adversarial noise... this is a sinister one that image creators can add to their images to screw with ML training. It's beyond the scope of this document, but there are ways to denoise an image non-destructively (a basic cleanup sketch follows this list).
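As a starting point for the ordinary noise case, here's a minimal sketch using OpenCV's non-local means filter. To be clear, this is conventional noise cleanup, not a guaranteed counter to adversarial perturbations, and the strength values are assumptions you'd tune per image.

```python
import cv2

def denoise(in_path: str, out_path: str) -> None:
    # Non-local means denoising; h / hColor control filter strength.
    img = cv2.imread(in_path)
    cleaned = cv2.fastNlMeansDenoisingColored(img, None, 5, 5, 7, 21)
    cv2.imwrite(out_path, cleaned)

denoise("noisy.png", "clean.png")
```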

Image Format

What's our vector, Victor?

Stick to formats like JPEG, PNG and WEBP, and for lossy formats like JPEG use a high quality setting to avoid artifacts, as mentioned above.

Stable Diffusion doesn't currently support any vector formats, like SVG... so convert them to one of the raster formats above before training.
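Normalizing a mixed bag of files to PNG or high-quality JPEG is easy with Pillow; here's a small sketch (the file paths are hypothetical, and SVG rasterization would need a separate tool such as cairosvg):

```python
from pathlib import Path
from PIL import Image

def to_png(src: str) -> str:
    # PNG is lossless, so no quality setting is needed.
    dst = str(Path(src).with_suffix(".png"))
    with Image.open(src) as img:
        img.save(dst)
    return dst

def to_jpeg(src: str, quality: int = 95) -> str:
    # High quality to keep compression artifacts out of the dataset.
    dst = str(Path(src).with_suffix(".jpg"))
    with Image.open(src) as img:
        img.convert("RGB").save(dst, quality=quality)
    return dst

print(to_png("sample.webp"))
```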

Variety and balance

How does the model know what to learn and what to ignore?

Generally speaking, tagging is the most important tool when it comes to directing the learning of your new model. Tagging can be complex and finicky, and is outside the scope of this document, but what we will touch on is the variety in the images themselves. Regardless of how well you tag up your datasets, concepts that are heavily repeated in the dataset will overfit into the model. This is a good thing for your subject, and a bad thing for things like the background, style, color balance, etc.

When you have assembled your collection of images that make up your dataset, the first thing you want to do is zoom out and look at the folder with thumbnails on, or some other way to view images and look for patterns.

Too many white backgrounds? Only 1 pose? No alternate angles? Backgrounds are all the same setting?

It's extremely difficult to guess exactly what the model will focus on and what it might ignore, but a good rule of thumb is that anything that appears too frequently in the dataset (other than your subject) runs the risk of overfitting into the output model.
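If your images already have caption files, a quick tag tally can surface those over-represented elements. This sketch assumes the common convention of a comma-separated .txt caption file sitting next to each image; adapt it to however you tag.

```python
from collections import Counter
from pathlib import Path

def tag_frequencies(dataset_dir: str) -> Counter:
    # Count how often each tag appears across all caption files.
    counts = Counter()
    for caption in Path(dataset_dir).rglob("*.txt"):
        tags = [t.strip() for t in caption.read_text(encoding="utf-8").split(",")]
        counts.update(t for t in tags if t)
    return counts

for tag, n in tag_frequencies("dataset").most_common(20):
    print(f"{n:4d}  {tag}")
```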

Vary everything, as much as you can. If you can't find enough images to use, make them. Inpainting and img2img are your friends here... but be careful... overreliance on AI-generated images introduces its own biases into the model...

Transparency

Stable Diffusion does not like the alpha channel. If you have images with a transparent background, you will want to photoshop a solid background (or get creative) into the image. Even one or two transparent images can ruin a model because of the way SD views the colors of semi-opaque pixels.

Try popping a transparent image into img2img if you want to see what I mean.
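Flattening the alpha channel onto a solid background is quick with Pillow. A minimal sketch (the white background is an arbitrary choice; vary it if white backgrounds are already over-represented in your dataset):

```python
from PIL import Image

def flatten_alpha(src: str, dst: str, background=(255, 255, 255)) -> None:
    # Composite the image over an opaque canvas, then drop the alpha channel.
    img = Image.open(src).convert("RGBA")
    canvas = Image.new("RGBA", img.size, background + (255,))
    flattened = Image.alpha_composite(canvas, img).convert("RGB")
    flattened.save(dst)

flatten_alpha("character_cutout.png", "character_flat.png")
```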
