Datasets for Dummies—now even dumberer than ever! (Updated 12/06/23)

This is a quick and dirty guide written at the request of an adoring fan on how I personally curate and create datasets for training. It's become a bit less quick and dirty since its initial publishing, but I've changed my workflow enough to warrant a few updates.

The first step is, of course, acquiring a dataset. Exactly how you do this will depend on what you're training, but—generally speaking—a bigger dataset is better. I recommend scraping data from boorus if possible, since their images are (more or less) tagged already, but the more obscure the concept you're wanting to train, the more work you'll have to put into finding suitable data.

Don't know how to scrape a booru? LOOK IT UP. Myself, I go through a booru by hand and download what appear to be suitable images as sort of an initial QA screening, but there are methods for automating the process.

After acquiring your dataset, the next step is your first QA pass: remove duplicate files, watermarked images, low-quality images, and other undesirable elements. Why? GIGO: garbage in, garbage out. We want to have a large dataset, but we also want it to be a quality dataset.
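
If you'd rather not hunt for duplicate files by eye, exact copies can be found with a few lines of Python. This is only a minimal sketch: the function name is made up for illustration, and it only catches byte-for-byte identical files, so resized or re-encoded copies of the same image will still need a human.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(folder):
    """Group files under `folder` by the SHA-256 hash of their bytes.

    Hypothetical helper: returns a list of groups, where each group is a
    list of paths whose contents are identical.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # Keep only hashes shared by two or more files.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling find_exact_duplicates("dataset") (where "dataset" is a placeholder folder name) gives you groups of identical files; keep the first of each group and delete the rest once you've looked them over.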

Once this initial QA pass is done, it's time for tagging! This is where most of your time will be spent finagling a dataset. If you used a booru, your images came pretagged; congratulations! If not, run your images through Stable Diffusion to have it auto-populate tags.

Edit: Future AInonymous here (take 2)! I've come to ~~kill~~ save Sarah Connor; I've also updated my processes and suggest AI tagging even a dataset collected from a booru, merging the files together, removing duplicate entries, and then proceeding with the rest of the guide. How do you do these things?

First, for the AI tags, I suggest using Hollowstrawberry's Dataset Maker, which can be found here:

https://colab.research.google.com/github/hollowstrawberry/kohya-colab/blob/main/Dataset_Maker.ipynb#scrollTo=WBFik7accyDz

Look up his guide if you want more details on its use. For this stage of the guide, you only need to upload your images to Google Drive, then run cells 1 and 4 in Hollowstrawberry's Dataset Maker. On top of the default blacklisted tags, add the following:

1girl, 2girls, 3girls, 4girls, 5girls, 6+girls, 1boy, 2boys, 3boys, 4boys, 5boys, 6+boys, multiple girls, multiple boys

We won't need the AI to tag these, since any decent booru will have this much done already; besides, the AI doesn't know how to count. We'll also blacklist:

brother and sister, brothers, sisters, siblings, incest

Because the AI has no way of actually identifying these. Maybe also trap, reverse trap, and crossdressing.

Next, for merging these tags, create a folder somewhere on your PC with two subdirectories: one for the AI-tagged set, and one for the booru-tagged set. Make sure the text files in each folder share the same name, and then, in the parent folder, create a .ps1 file that runs the following code:

https://pastebin.com/bFfmY6zZ

This will, for example, combine the contents of file1.txt in Folder A with the contents of file1.txt in Folder B, creating a new file1.txt in Folder C. Sadly, there will be some unwanted newlines, for which we'll use the following code (which is another .ps1):

https://pastebin.com/UAyRd6xG

This removes the final newline from each file. You'll still need to get the one that's used to separate each entry, but for that, you can just use Notepad++ to find and replace all newlines (\r\n) with a comma and a space (, ). There's probably a more elegant solution, but I'm not trying to become a certified Microsoft engineer, here—I just want to make anime titties.
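
If PowerShell and Notepad++ aren't your thing, the same merge can be sketched in Python. To be clear, this is not the code from the pastebin links above, just a rough equivalent under a few assumptions: tags are comma-separated, the two folders hold text files with matching names, and the function and folder names are invented for the example. It treats newlines as separators, so the stray line breaks never make it into the output, and it drops repeated tags as it goes.

```python
from pathlib import Path


def merge_tags(*tag_strings):
    """Merge comma-separated tag strings, dropping blanks and repeats."""
    merged = []
    seen = set()
    for text in tag_strings:
        # Treat newlines as separators too, so stray line breaks vanish.
        for tag in text.replace("\n", ",").split(","):
            tag = tag.strip()
            if tag and tag not in seen:
                seen.add(tag)
                merged.append(tag)
    return ", ".join(merged)


def merge_folders(ai_dir, booru_dir, out_dir):
    """Merge same-named .txt files from two folders into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for ai_file in sorted(Path(ai_dir).glob("*.txt")):
        booru_file = Path(booru_dir) / ai_file.name
        if not booru_file.exists():
            continue  # no counterpart in the booru folder; skip it
        merged = merge_tags(
            booru_file.read_text(encoding="utf-8"),
            ai_file.read_text(encoding="utf-8"),
        )
        (out / ai_file.name).write_text(merged, encoding="utf-8")
```

Something like merge_folders("ai", "booru", "merged") would then write one merged file per matching pair, with the booru tags listed first.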

Anyways, now that we've tagged the heck out of our images, upload those newly merged files to Google Drive and use Hollowstrawberry's Dataset Maker again, this time running cell 5. Leave all the options blank, except to toggle 'remove duplicates' (and optionally 'sort alphabetically' for human readability).

Now, continue on with the rest of the guide.

Now you're done, right? Wrong. AI is not perfect, and neither are the boorus. For the highest quality dataset possible, you must go through every individual file and confirm the tags for yourself, removing unnecessary or conflicting tags, and adding anything forgotten. This can be partially automated; for example, using Notepad++ to search through all files and remove 'creator:', 'character:', 'series:', and to mass remove or replace other tags as necessary. But again, do not think you are done after this; you still must manually check every file for the best possible output.
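
The prefix cleanup mentioned above can also be scripted instead of done through find and replace. A minimal sketch, assuming comma-separated tags and assuming you want to drop only the 'creator:' style prefix while keeping the name after it; the function and constant names are hypothetical.

```python
# Prefixes named in the guide; extend the tuple as needed.
PREFIXES = ("creator:", "character:", "series:")


def strip_prefixes(tag_line, prefixes=PREFIXES):
    """Remove known namespace prefixes from each comma-separated tag."""
    cleaned = []
    for tag in tag_line.split(","):
        tag = tag.strip()
        for prefix in prefixes:
            if tag.startswith(prefix):
                # Drop the prefix but keep the name that follows it.
                tag = tag[len(prefix):].strip()
                break
        if tag:
            cleaned.append(tag)
    return ", ".join(cleaned)
```

Run it over the text of each tag file and write the result back. It won't catch conflicting or missing tags, so the manual check still applies.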

Personally, I recommend tagging for absolutely everything; every article of clothing a character is wearing, the color of their eyes and hair, etc. Some guides suggest otherwise, but tying a concept to a single keyword reduces its flexibility.

And then, it's time for... another QA pass! If you're smart, you'll always be checking your dataset to see if there's some garbage data you've previously overlooked, but if you haven't, then stop and do it now. Even if you HAVE, stop and do it again.

Edit: Future AInonymous here again! The following paragraphs about cropping and resizing only apply to embeddings/textual inversions. You do not need to do this for other models, nor should you.

Now that everything is tagged and checked for quality, the images need to be resized and cropped! At this point, I suggest putting everything into a tidy little compressed file: rar, zip, or whatever else you prefer. This is not strictly necessary, but it will allow you to reuse the dataset in the future for other AI models that may release. SDXL, for example, works best with 1024x1024; if you crop everything to 512x512 before making an archive, you'll have to repeat the above steps to create a new dataset. So, if you instead create a backup while everything is at its original quality, you'll have less work to do later on. Of course, you should archive the processed dataset as well.
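
Any archiver will do for the backup, but if you're already scripting things, Python's standard library can make the zip for you. A minimal sketch; the function name and the example folder names are placeholders.

```python
import shutil


def archive_dataset(folder, archive_name):
    """Zip `folder` into `archive_name`.zip and return the archive path."""
    # Write the archive outside of `folder` so it doesn't try to include itself.
    return shutil.make_archive(archive_name, "zip", root_dir=str(folder))
```

For example, archive_dataset("dataset", "dataset_original") would produce dataset_original.zip next to the dataset folder.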

Moving on from that tangent, it's time to crop those images. Fortunately, this process can be largely automated by firing up Stable Diffusion and setting the appropriate parameters, but—you guessed it—we still have to manually confirm everything afterwards. Again, AI is not perfect, and it will make mistakes. If an image comes out poorly cropped, typically by focusing on the wrong part of the original file, then you'll have to manually crop it yourself.

Even before you fire up a cropper, I suggest going through the dataset and manually adjusting images with simple backgrounds in order to retain the full image (just make the image a square, then resize for whatever model you're training it on).
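
One way to "make the image a square" without losing any of it is to centre it on a square canvas and fill the rest with the background color. That reading is my assumption, and the arithmetic is a short sketch with a made-up function name:

```python
def square_canvas(width, height):
    """Return (side, left, top) for centring an image on a square canvas.

    `side` is the canvas edge length; (left, top) is where the original
    image's top-left corner goes so that it ends up centred.
    """
    side = max(width, height)
    left = (side - width) // 2
    top = (side - height) // 2
    return side, left, top
```

A 600x900 image, for instance, needs a 900x900 canvas with the image pasted 150 pixels in from the left. If you have Pillow installed, Image.new, Image.paste, and Image.resize can do the actual pixel work with these numbers.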

So, everything is tagged and cropped, which means it's time for... another QA pass! I'm sure you're tired of this by now, but your diligence in this matter helps to ensure a good final product.

Finally, once you're certain you've done everything you can to create the finest dataset man has ever laid eyes upon, it's time to train it. After everything you've done so far, this is the easy part! If you're unfamiliar with the process, there are plenty of guides for training models; unlike dataset curation, I don't do much different than what everyone else does. One thing I will suggest here is to aim for a higher number of steps than what most guides will estimate, since we're aiming to have a larger dataset than normal, and make sure to save the intermediate checkpoints, as well.

Now, you have a trained model. Think you're done, right? That this guide is finished? WRONG! DROP AND GIVE ME 20 QA CHECKS, MAGGOT!

Seriously though, put your newly trained model through its paces and make sure you're satisfied with it; you may need to adjust your dataset and retrain it if there are any issues. It's not uncommon that a model will have a strong preference for something, like a pose or a hair color, in which case you'll need to recheck your dataset to see if the related images have been properly tagged, or maybe to trim some of those images from your dataset.
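
To figure out where a preference like that might be coming from, it can help to count how often each tag shows up across the dataset. A minimal sketch, assuming one comma-separated .txt tag file per image in a single folder; the function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def tag_frequencies(folder):
    """Count how many tag files in `folder` contain each tag.

    Returns (counts, total_files). Each tag is counted at most once per file.
    """
    counts = Counter()
    total = 0
    for path in Path(folder).glob("*.txt"):
        tags = {t.strip() for t in path.read_text(encoding="utf-8").split(",")}
        tags.discard("")
        counts.update(tags)
        total += 1
    return counts, total
```

counts.most_common(20) will list the twenty most frequent tags. A pose or hair color that appears in a large share of the files is a likely suspect, though only a look at the images themselves will tell you whether to retag or trim.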

Repeat this last step until, finally, thank the gods above, you have a result that you're happy with. You could try to skimp on the processes, to release a subpar model and say 'good enough', but at the end of the day, the man in the mirror will judge you, so be prepared.

And finally, archive it! That is some QUALITY DATA, and you may be able to reuse it someday! Otherwise, you might have to do all of this again, and that would simply be a waste of time.