Zero f***s given cherrypicked and 100% tagged dataset gathering | CherryScraper and DPTs

Intro

If you're here, you either like training or want to get off to a good start. Well, I think I've got you covered.

We'll go over some struggles that come with scraping only images you like, cherrypicking data and tagging hell.

Ever find yourself browsing Boorus to cherrypick images?

Then this is for you.
*DPTs stands for Dataset Processing Tools

Struggles, suffering and hatred

Have you ever gathered over 1000 images just to then go through all of them, either after half-assed autotagger work or completely from zero, to get the most accurate data for your own use? I have. Both variants. I can't emphasize enough how painful it is to manually tag each image, if your particular use case requires it...
I spend about an hour per 50 images on average when manual tagging is required.

Since I got access to GPT-4, I've been developing some useful tools that help me fix dataset errors and add new information based on various metrics. The one I'll share with you today (apart from the Chrome extension) has shown improved editability of the trained model in my few tests. It removes unwanted tags from text files. I've included some tag presets, but your mileage may vary.

I suggest going over the ~11k tags provided in it yourself; it will take maybe an hour or two.

Those tags are taken from the WD1.4 tagger and are not extensive enough to cover the whole Danbooru tag database, but they do cover the most frequently used tags. The tool is very fast and is suitable for datasets numbering over 10k, but I would suggest more extensive filtering with the full list of Danbooru tags, which you can find somewhere in your A1111 install, as it is downloaded with the autotaggers.

Speedrunning datasets

Well, if you trust me and have read this far, go and install my extension.

https://chrome.google.com/webstore/detail/cherryscraper/dmgibfaenicepbmjcbejibgbaohfkido?hl=en-GB&authuser=0
Google finally approved it, so now you can install it from the official Chrome Web Store! It also worked on Opera!

The Firefox version is also finished and submitted for checks (17/06/2023), so I expect to have an official Firefox version soon. Meanwhile, you can get it on GitHub.
https://github.com/Anzhc/Cherryscraper/tree/main

Chrome Web Store checks were taking too long, so I released it on GitHub. Now it's finally, officially out.

Then you might want to go to my GitHub and download the Tag Filtering tool that I described above.
https://github.com/Anzhc/Anzhc-s-Dataset-Processing-Tools

I will add more tools as time goes on, but let's start with this. Just download the files; instructions are inside.

Fixed an issue with Unwanted mode; it works now. (16/06/2023)

After you've got everything, go to the supported site of your choice. The extension will pick up on which site you're on and switch automatically, if it's supported.
Important supported sites: Danbooru, Safebooru, rule34, e621
There are ~4-6 more, and the list will be expanded in the future.
Open an image you like to test it, press "c" and look at the result. It should download the image file and create a matching .txt with tags alongside it in the "dataset" folder inside your browser's downloads folder. "dataset" is the default folder; you can change it to anything you like.
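If you want to double-check that every download really got its tag file, here is a minimal Python sketch. It is not part of the extension. It assumes (my assumption, check your own folder) that the .txt shares the image's base name, and the function name and extension list are made up for illustration.

```python
from pathlib import Path

# Image extensions to look for; adjust to whatever your sites serve (assumption).
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}


def find_unpaired(dataset_dir):
    """Return (images_without_txt, txts_without_image) as sorted file-name lists.

    Hypothetical helper: assumes an image and its tag file share a base name,
    e.g. 12345.png and 12345.txt.
    """
    folder = Path(dataset_dir)
    files = [p for p in folder.iterdir() if p.is_file()]
    images = {p.stem: p.name for p in files if p.suffix.lower() in IMAGE_EXTENSIONS}
    texts = {p.stem: p.name for p in files if p.suffix.lower() == ".txt"}
    missing_txt = sorted(name for stem, name in images.items() if stem not in texts)
    missing_img = sorted(name for stem, name in texts.items() if stem not in images)
    return missing_txt, missing_img
```

Point it at your "dataset" folder; two empty lists mean every image has a tag file and the other way around.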

If it worked - good. It should, I did use it after all. Now you can go through tens of pages, opening the images you want in new tabs, then switch to the first of them, press "c" and close it almost instantly, which will switch you to the next one. Repeat as much as you need.

Extermination of excess

At this point the dataset is practically ready to train, but it will still have irrelevant metadata tags, plus artist and character tags. If you don't need them, remove them with the Tag Filtering tool, as it removes characters and artists (if you're using Wanted tags mode).

I am including 2 modes: Unwanted and Wanted tags.

In Unwanted mode it will remove the tags that are specified in the respective list.
In Wanted mode it will remove everything except the tags that are specified in the respective list. Use this for scraped images. (Keep in mind that it will remove almost all Japanese tags as well, as I don't know the meaning of 95% of them.)
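The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual code of the Tag Filtering tool. The function names, the comma-separated caption format and the example tags are all my assumptions here.

```python
def parse_caption(text):
    """Split a caption into tags (assumes comma-separated tags in the .txt)."""
    return [tag.strip() for tag in text.split(",") if tag.strip()]


def filter_tags(tags, tag_list, mode):
    """Filter tags against a list in "unwanted" or "wanted" mode.

    unwanted: drop every tag that IS in tag_list.
    wanted:   drop every tag that is NOT in tag_list.
    Hypothetical helper, not the tool's real API.
    """
    listed = {tag.strip().lower() for tag in tag_list}
    if mode == "unwanted":
        return [tag for tag in tags if tag.lower() not in listed]
    if mode == "wanted":
        return [tag for tag in tags if tag.lower() in listed]
    raise ValueError(f"unknown mode: {mode!r}")


tags = parse_caption("1girl, blue hair, maid, highres")
print(filter_tags(tags, ["highres"], "unwanted"))      # drops only listed tags
print(filter_tags(tags, ["1girl", "maid"], "wanted"))  # keeps only listed tags
```

This is also why Wanted mode suits scraped images: anything not on the list (artists, characters, metadata) disappears without you having to name it.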

There are 3 general options for each of them:

SFW(0) - removes most NSFW tags as well. This is useful if you're training on an SFW dataset but fear that the autotagger could imagine something lewd.
NSFW(1) - removes (or keeps, in Wanted mode) tags, but excludes NSFW tags as well.

Data duplication(2) - removes redundant tags, which are often present. For example:

If you have shirt and white shirt in the same image, it will remove the shirt tag, as it's redundant. It works only on exactly matching words in tags, and it doesn't touch tags of the same "power" (same level of detail), so your multi-character images are safe: it will not remove red shirt if it's present in the same image as white shirt.

Keep in mind that Mode(2) is always on, and works alongside tag removal in modes 0 and 1.
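Here is a rough Python sketch of how that kind of redundancy check could work. It is my guess at the logic from the description above, not the tool's actual code: a tag counts as redundant when all of its words appear, as whole words, inside a longer tag on the same image. Splitting words on spaces and underscores is also an assumption.

```python
def remove_redundant_tags(tags):
    """Drop tags whose words are a strict subset of another tag's words.

    Hypothetical helper. "shirt" is dropped next to "white shirt", while
    "red shirt" and "white shirt" both stay, since neither contains the other.
    """
    # Compare whole words only, so "shirt" never matches inside "t-shirt".
    word_sets = [set(tag.lower().replace("_", " ").split()) for tag in tags]
    kept = []
    for i, tag in enumerate(tags):
        redundant = any(
            i != j and word_sets[i] < word_sets[j]  # "<" is a strict subset test
            for j in range(len(tags))
        )
        if not redundant:
            kept.append(tag)
    return kept


print(remove_redundant_tags(["shirt", "white shirt", "red shirt", "1girl"]))
# ['white shirt', 'red shirt', '1girl']
```

Using a strict subset is what keeps tags of the same "power" safe: two tags with the same number of words can never swallow each other.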

Final Touches

If it's not a fine-tune, don't forget to load the files in Dataset Tag Editor after that to remove tags you don't want to see. If you're doing a fine-tune, you're free to go training right after tag filtering, or even without it. Be aware, though, that most images will have a total token length far exceeding 75, and sometimes 225, tokens. (Wooohooo! 100% tagging-free dataset, amirite???)

For characters - remove traits that are always bound to the character's appearance (most should be in the top 10-20 tags).

For style - remove things like watercolor, if any appear in the dataset. That's unlikely, though, and otherwise you can continue as if you were doing a fine-tune.
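If you'd rather see the top 10-20 tags as a list before opening an editor, a small Python sketch like this can count them. It assumes comma-separated tags in the .txt files, and the function name is made up; Dataset Tag Editor shows similar information, so treat this as an optional extra.

```python
from collections import Counter
from pathlib import Path


def top_tags(dataset_dir, n=20):
    """Return the n most frequent tags as (tag, image_count) pairs.

    Hypothetical helper: assumes one .txt per image with comma-separated tags.
    Each tag is counted at most once per file.
    """
    counts = Counter()
    for txt in Path(dataset_dir).glob("*.txt"):
        caption = txt.read_text(encoding="utf-8")
        tags = {tag.strip() for tag in caption.split(",") if tag.strip()}
        counts.update(tags)
    return counts.most_common(n)
```

Tags that show up on nearly every image of a character dataset (hair color, eye color and so on) are the usual candidates for removal.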

Showcase

Here I'll show just one example training, which was not even done in Kohya_ss (which is deemed the superior trainer by most, I think). I will also provide a sample model for you to test yourself (strictly for research purposes). In this case I trained Rem.

Banner, of course :D

(Yes, the eyes are red, because I used a very high weight on the red theme tag)

And just common Rem images, made with various popular models, and unpopular ones, like my own xd

Prompt for all images that are not abstract, from behind, or having armor:
masterpiece, best quality, rem, 1girl <lora>

with negative:
(worst quality, low quality:1.4)

It performed extremely well not only across a wide range of models, including less anime-like ones, but also in more complex poses (which I will not show, as they are a bit NSFW), and it shows a high degree of editability, as seen in the very first images. It also showed good compatibility with other LoRAs (for example S12). I might also note that consistency is high, not just in terms of the character appearing at all (it appeared in ~100% of generations), but in terms of outfit consistency, such as the hair ornament. It doesn't seem to appear on the wrong side (I don't recall seeing that) when the whole head is in view, and it also has a high chance of being handled properly when the correct side for it is not in view, as shown 2 images above (in that image it's not rendered at all).

If you want to test it yourself: https://civitai.com/models/82259?modelVersionId=87348

Models used, in no particular order: ACMAR G Mix, ElusiveDreamS Full (not released), ElusiveDreamS OHalf (released), ChilloutMix, RevAnimated, Counterfeit3.0, CetusMix, NeverendingDream. (I might've forgotten one or two.)
Most images use the first 3, as they are my own models, but the rest were included for research purposes.

Dataset Information

This dataset consists of 125 images, processed with the Tag Filtering script in Wanted mode, after which I manually removed tags that are relevant to the character, like blue hair.
This took around 20 minutes, or less.

I shall remind you that this is a cherrypicked, completely properly tagged (to a point) dataset that is sufficient for training a character without any repeats (which are used in Kohya) or edits into multiple images. And I wasn't very speedy about it; you can do better.

Instruments(again)

https://github.com/Anzhc/Cherryscraper/tree/main or https://chrome.google.com/webstore/detail/cherryscraper/dmgibfaenicepbmjcbejibgbaohfkido?hl=en-GB&authuser=0

https://github.com/Anzhc/Anzhc-s-Dataset-Processing-Tools

Ending

Can't wait for you guys to try this stuff out. It already saved hours of my life just from using it to train Rem for this article.

Of course, it's not very useful if you just want to train a single very large tag; for that it's better to use a proper automatic scraper. Though if you're willing to look through thousands of images yourself, my extension is a good choice for you, even in large-scale trainings.

Addendum

I also ran a multi-subject training with this tagging: 5 characters, to be exact, 50 images each. It gave me a pretty good result, especially for just 50 images per character in a multi-subject training, but it seems I didn't remove all the relevant tags, so I need to add a bit more description for some of the characters. As I understand it, Extended LoRA in the interface I'm using is LoCon for everyone else, so I guess you can consider this way of tagging more than sufficient for multi-character trainings without having to use LoHa (which is good for concept separation, as I understand it, at least from the little I have heard).

It is likely to be extremely performant for fine-tuning tasks, even if the dataset initially is not 100% tagged, as I found out later. Though, obviously, you should manually tag outliers that end up with few or no tags, and add additional data separation as you see fit.

I'm desperate and poor, pls buy me a coffee or smth.

I had to spend $5 to apply as a developer for the Chrome Web Store, and I'm already poor, so if this extension and my tools save you even an hour of pain, I would appreciate it if you donated what you're likely to earn in 20 minutes <3

I need money to pay for subscriptions to the tools I use to develop stuff and create datasets.

https://ko-fi.com/anzhc

https://www.patreon.com/anzhc
