Useful online tools for Datasets, and where to find data

When making a LoRA, you may find it difficult to find the perfect image, or the image you do find may have a watermark. This guide covers data collection: where to find images, how to clean them, and how to get them dataset ready.
Finding them
I'm personally an enjoyer of anime, so most of these will help you find anime-style pictures and give you some ideas on where to look when collecting images for training. I'll list both common and unique methods.
Some common sites for hunting are Danbooru and Safebooru. They are easy to navigate and have a huge tag list to assist in searching. There is also Gelbooru, which is pretty much the same as the first two. Another one with a similar tagging system is Sankaku Complex. I discovered it by accident recently, and it seems to have a lot of mature themes, but it may be worth checking.
Some sites that are a little harder to navigate are Pixiv, DeviantArt, Twitter, and Zerochan. You probably know DeviantArt: it has a wide variety of images, and every once in a while some decent 3D models or artists set up shop there, but the quality of images is all over the place. Pixiv is like the Japanese DeviantArt. It has a tagging system, but it's organized more by series and character than by actual features. I'll also link to the Pixiv downloader, which makes image collection there so much easier. Twitter is a pain to navigate without proper tools (which I will link to), but it is quite popular if you are looking for the work of a certain artist. Twitter is a pain because it drops images from the search feed after a certain time, even though they are still on the site. To get around this, if you know the Twitter tag, you can search for it on sites like buhitter, which are made for searching, or use the Twitter media downloader to download the entirety of an artist's page. I discovered Zerochan recently. It's like Danbooru but even more Japanese, with a smaller tag list, and it should have some different pictures from the previously mentioned sites.
Some sites have decent images but are clearly meant for adult content. An obvious one in this category is E-Hentai. E-Hentai is almost entirely adult content, but it does carry some game CGs, some non-hentai material, and artist picture packs. You're going to have to sort out all the adult content first. I don't have a suggestion for bulk downloading on E-Hentai because I refuse to make an account, but if you're okay with making one, bulk downloading should be built into the site. For other sites that lack a bulk download option, you can always try the Kellyc downloader extension; just be sure to read the tutorial.
If you need raw screenshots from the anime itself but are worried about subtitles, here is what you can do. I won't link to it directly, but you can google where to download raw (unsubtitled) anime. Most sites of this nature require a torrent client; I personally recommend Deluge. Once you have your raw anime playing, you can take all the reference screenshots you want, and no editing will be necessary.
For game artwork and sprites from various retro consoles, try checking out the resources below, as recommended by OhGodItsBroken. The sites are organized by either console or series, so they are better if you have a topic in mind than for general browsing.

Cleaning them
The tools below work for all images, not just anime. Since AI is improving everywhere, there are tons of free, easy-to-use tools to assist in image cleanup.

By Hand-
If it's a super simple thing, like a single line out of place or a white border you need to remove, just do it by hand. For by-hand projects I suggest GIMP: it's a good paint tool, it's free, and it has a wide variety of features to help you. You can also try combining GIMP with Stable Diffusion itself; a good tutorial on this is linked below.

Infinite use no questions-
Image cleaner experte is an AI image object remover. All you need to do is highlight a section, and it will blend into the rest of the image, removing whatever was there initially. It's free and keeps the original resolution, an amazing tool honestly. PineTools has a free online bulk image flipper that is also super easy to use. If you have a lot of same-size images and need to crop them (for example, to remove a banner), a good site for mass cropping is BIRME: just enter the size of the picture without the banner and it can easily clean your images for you.
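If you'd rather work out the "size of the pic without the banner" locally, here is a minimal Python sketch of the arithmetic. The function name `crop_box_without_banner` and its parameters are my own for illustration, not part of BIRME or any other tool. It returns a box in (left, upper, right, lower) order, which is the order Pillow's `Image.crop` expects, so it should slot into a Pillow script if you have that library installed.

```python
def crop_box_without_banner(width, height, banner_px, side="bottom"):
    """Return a (left, upper, right, lower) box that leaves out a banner strip.

    width, height: size of the original image in pixels
    banner_px:     thickness of the banner in pixels
    side:          which edge the banner sits on
    """
    if banner_px < 0:
        raise ValueError("banner_px must not be negative")
    limit = height if side in ("top", "bottom") else width
    if banner_px >= limit:
        raise ValueError("banner is as large as the image itself")

    if side == "bottom":
        return (0, 0, width, height - banner_px)
    if side == "top":
        return (0, banner_px, width, height)
    if side == "left":
        return (banner_px, 0, width, height)
    if side == "right":
        return (0, 0, width - banner_px, height)
    raise ValueError("side must be 'top', 'bottom', 'left' or 'right'")


# A 512x600 picture with an 88 pixel banner along the bottom edge
box = crop_box_without_banner(512, 600, 88, side="bottom")
print(box)
```

The cropped size is simply `right - left` by `lower - upper`, which is also the number you would type into a mass-cropping site.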
Infinite use some questions-
Okay, this refers to sites like imgupscaler, imglarger, and watermarkremover. These tools are also good but need some finagling. The two main rules are to use them in incognito and to remove cookies from the page; I'll explain why and how. These sites are only temporarily free, so if you want to use them on your entire dataset, you simply make it seem as if it's your first time every time. Go to the site in incognito and, once you're done, click the little lock in the address bar, click "cookies and site data", and delete all the cookies. Then you can use the site again (open the article image in a new tab for a view of this). As for how to use them, I suggest the following order: watermarkremover (removes watermarks), imglarger (removes image noise), and finally imgupscaler (increases the resolution of the image, making it bigger).
Tagging is long and tedious, but I personally recommend the extension below; it's capable of bulk tagging and easy to use. I suggest using its "wd14-convnextv2-v2-git" model from the drop-down, and I also recommend turning on escape brackets in your settings. When using the negative tagger, remember to include the underscore, since that's how all tags are originally written, e.g. "long_hair". Positive prompts, on the other hand, can be tagged normally, e.g. "long hair". This extension is added to your SD setup through the Extensions tab.
GitHub - toriato/stable-diffusion-webui-wd14-tagger: Labeling extension for Automatic1111's Web UI
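To make the underscore rule concrete, here is a small Python sketch that converts between the two tag styles. The function names are made up for illustration and are not part of the tagger extension. I'm also assuming that "escape brackets" means putting a backslash in front of parentheses so the Web UI doesn't read them as emphasis; check the extension's own settings if you need the exact behaviour.

```python
import re


def to_booru_tag(tag):
    """Turn a prompt-style tag into booru style: 'long hair' -> 'long_hair'."""
    return "_".join(tag.strip().split())


def to_prompt_tag(tag, escape_brackets=True):
    """Turn a booru-style tag into prompt style: 'long_hair' -> 'long hair'.

    With escape_brackets on, '(' and ')' get a leading backslash
    (assumed meaning of the setting, see the note above).
    """
    text = tag.strip().replace("_", " ")
    if escape_brackets:
        text = re.sub(r"([()])", r"\\\1", text)
    return text


print(to_booru_tag("long hair"))       # style used by the negative tagger
print(to_prompt_tag("long_hair"))      # style used in a positive prompt
print(to_prompt_tag("saber_(fate)"))   # a tag containing brackets
```

This only handles one tag at a time; for a comma-separated caption you would split on commas first and run each piece through the function.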
Hopefully this assists with your future datasets. If you have any questions, be sure to mention them in the comments.