Hi, urbanlegendwiki here. This article is about sources for finding images and tags/text for your datasets, whether it be for a trained checkpoint, a LoRA, etc.
Hope you enjoy! Made with Love by urbanlegendwiki.
Mentioning credits, first!
I have to thank guy90 for their article. It inspired me to write this one, as it lists many image sources in its "Data Collection:" section. If that article didn't exist, I couldn't have created this one. Thank you!
List of Image Sources
Search Engines
Search engines don't only provide information; they also surface images and videos from websites. They have no blocking feature that prevents people from viewing or accessing content, which makes them a highly recommended image source.
Baidu: Baidu is a Chinese AI search engine. The site is in Chinese, so you may want to translate it into English if you don't read the language.
Bing: Bing is a search engine that can yield fruitful image search outcomes.
Google: Google is a search engine that provides relevant image results.
Startpage: Startpage is a search engine with a built-in proxy.
Yandex: Yandex is a search engine known for displaying images that may not be shown by Google or Bing, making it a useful alternative.
Websites
DeviantArt: DeviantArt is a popular art site with many talented artists. Visit this site at your own risk, because it is famous for hosting fetish art.
Hairstyles Library: Despite the name, it is actually an art site. The downside is that there is pornographic content on this site.
ibisPaint: An art site that hosts art made with ibisPaint (X). There are many talented artists and wonderful art.
Pinterest: A social media platform that allows users to share and discover ideas. There are many images on Pinterest.
Pixiv: A Japanese online community for artists. It is hard to navigate without knowing Japanese, but you can translate the site.
Sketchers United: An art site with many talented amateur artists who draw in simple art styles, plus tags for finding the images you want. This site can be great for finding amateur art for a dataset to train a LoRA on amateur artists' styles.
Twitter: Many of the images on Twitter are not indexed by Google, so having an account to search for images and fan art is essential.
Wikimedia Commons: A free online repository of images and videos that are in the public domain or released under free licenses such as Creative Commons. The most common kind of image you can find on this website is photos.
Video Websites
Since datasets for most AI image models can only contain still-image file types (e.g. PNG, JPG, etc.), you may need to take screenshots of videos to use them in a dataset.
Dailymotion: A video-sharing website that allows users to browse videos. It's similar to YouTube, but Dailymotion's content focuses more on professionally-produced videos, while YouTube is primarily user-generated.
Kwai: A Chinese video-sharing platform that hosts short videos.
TikTok: A popular video sharing website where you can find shorts to screenshot.
YouTube: A popular video-sharing website where you can watch online videos. There are a lot of creative designs on the website, but sounds that many people find disruptive (like the kids cheering sound effect, the baby crying sound effect, etc.) are commonly used in YouTube videos, so if you have misophonia, you'll want to turn the volume down or off.
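When taking screenshots from a video, it can help to spread them evenly across its length so the dataset doesn't end up with many near-identical frames. The snippet below is only a minimal sketch of that idea: the function name `screenshot_timestamps` is made up for illustration, and it only computes the moments to capture, it does not take the screenshots itself.

```python
def screenshot_timestamps(duration_s: float, count: int) -> list[float]:
    """Return `count` evenly spaced timestamps (in seconds) inside a video.

    Each timestamp sits in the middle of an equal-length segment, so the
    very first and very last moments of the video are skipped.
    """
    if duration_s <= 0 or count <= 0:
        return []
    step = duration_s / count
    return [round(step * (i + 0.5), 3) for i in range(count)]


# A 60-second short, 4 screenshots.
print(screenshot_timestamps(60.0, 4))  # [7.5, 22.5, 37.5, 52.5]
```

You can then pause the video at each of those times and screenshot it by hand.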
Boorus
Boorus are imageboard sites where images are categorized with tags. I highly recommend using boorus as a source of images for your dataset, since most boorus don't have a blocking feature that prevents users from seeing or accessing others' content.
Danbooru: A booru that focuses on anime images and hosts an enormous number of them. The downside is that there are many explicit and questionable images, so you might want to either blacklist such content or use Safebooru if you don't want to see NSFW content or want to make a SFW model.
Derpibooru: A Philomena booru that focuses on My Little Pony images.
E621/E926: A Danbooru-inspired booru that hosts a myriad of furry images. E621 shows NSFW content, while E926 shows only SFW content.
Furbooru: A Philomena booru that focuses on furry images.
OSC Booru: A booru that focuses on OSC images (e.g. BFDI, BFB, Inanimate Insanity, etc.)
Rule34.lol: A Rule 34 site that hosts pornographic and obscene content, although most of these images are high quality.
Rule34.xxx: A Rule 34 site that has many high-quality images. Visit and use this site at your own risk, because it mainly focuses on pornographic and obscene content.
Safebooru: A SFW-only booru that focuses on anime images. You can use this as an alternative for Danbooru, if you are uncomfortable with NSFW, want to make a SFW model, or just don't want to see it.
Screamer Gallery: A booru that hosts screamer images. The site contains uncensored screamer images, including unnerving images, so I recommend proceeding with caution.
Fandom Wikis
Fandom wikis typically have high-quality images, uploaded on both the wiki and forum mode. If you want to find other wikis, search for a Fandom wiki or type its URL directly: replace the "www" in https://www.fandom.com/ with the wiki's name (for example, https://createnewwiki.fandom.com/).
Plants vs. Zombies Wiki: A wiki about a tower defense game called Plants vs. Zombies. You can find high-quality artstyles on the forum mode in this category.
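The URL trick described above (swapping "www" for the wiki's name) can be sketched in a few lines. The function name `fandom_wiki_url` and the rule for what counts as a valid name are my own assumptions for illustration; the snippet doesn't check whether the wiki actually exists.

```python
import re


def fandom_wiki_url(wiki_name: str) -> str:
    """Build a Fandom wiki URL by replacing "www" with the wiki's name."""
    subdomain = wiki_name.strip().lower()
    # Assumed rule: letters, digits, and single hyphens between them.
    if not re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", subdomain):
        raise ValueError(f"not a valid wiki name: {wiki_name!r}")
    return "https://www.fandom.com/".replace("www", subdomain, 1)


print(fandom_wiki_url("createnewwiki"))  # https://createnewwiki.fandom.com/
```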
Torrenting Sites
Torrenting sites are sites where you can download files through the BitTorrent network.
These are the important things you should remember about torrenting sites:
Be careful when you download torrent files, as some of them can be malicious. Check the reviews/comments to find out whether a torrent file is safe.
Use a VPN when you download torrent files.
Torrented files can be copyrighted. Using torrents to download copyrighted material can be against the law in most countries. That being said, you are less likely to get sued for using official art.
I will make a list for it. Stay tuned.
Miscellaneous
Decals (Roblox): Roblox has a lot of creative digital media and high-quality photos, and this is the page where you can find those images.
Image Galleries (Soundeffects Wiki): Soundeffects Wiki is a Fandom wiki with a bunch of screencaps taken at the moment a sound effect occurs or is heard, making it a perfect wiki for finding screencaps and screenshots. Some of them contain onomatopoeia (words that imitate sounds) and can be used as dataset images for onomatopoeia concept models (e.g. a LoRA for a cat meowing with the "meow" onomatopoeia).
A List of Lists of Image Sources
Art Websites (Wikipedia): A Wikipedia category for pages about art websites. Most of these pages link to the websites themselves in the "External links" section.
Awesome-Booru: A GitHub list by celriseup. There are many boorus listed here (including dead websites).
lxfly2000's list: A GitHub list by lxfly2000, of Search Engines, Boorus, and Art Platforms.
Top Boorus list (Booru Project): A list of boorus with plenty of entries. Most of the boorus listed run the Gelbooru engine.
Features, Pros, & Cons of Sources
DeviantArt
DeviantArt is an art site with many inspiring talented artists. It features photography, videos, and artwork.
Features
DreamUp (AI, you can use it if you can't find more images)
Searching
Interests (when you make an account, it asks which topics interest you. You can pick some, and it will show you images that match your interests).
Pros
Plenty of talented artists
Typically has vibrant art
Cons
Other users are likely to block you even if you did nothing wrong
Sketchers United
Sketchers United is an art site founded by former members of the team that made Sony Sketch. You can find simple amateur art styles and wonderful, talented artists on this site.
Features
Tags (for searching)
Image channels
Collaboration
Pros
Has very few explicit images (the site doesn't allow porn), so it can be a great source of images of a character for models made to generate that specific character.
Cons
Has a blocking feature that prevents users from viewing your content. It is common for users to block you even if you did nothing wrong or unacceptable.
Doesn't show sensitive images on tag pages if you're not logged in (so having an account is essential for finding dataset images here).
Plenty of artists here don't allow their art to be downloaded. You have to keep your dataset completely private if their art is in it, to prevent them from knowing you downloaded their art.
Some artists don't allow their art style to be used or replicated, and this site can punish you for replicating an art style against the artist's wishes. Presumably, there are obscure countries where art styles can be copyrighted or trademarked (it may sound odd, but I believe this because some countries have unique and unusual laws).
Art on this platform may have obtrusive watermarks, and removing them can be annoyingly time-consuming.
Twitter
Twitter, also known as X, is a social networking service where users can share short text messages, images, and videos.
Features
Posts (called Tweets)
Built-in AI
Pros
Has many professional artists.
States that art posted here will be used to train its built-in AI, which suggests that artists who post and keep their art there don't mind it being used for AI.
Cons
Some artists on Twitter may choose to delete their art to prevent Twitter from using their art for AI.
What features are useful in dataset sources?
Blacklist/Filter: A blacklist/filter lets you hide the images you don't want to see. You can, for example, turn the "Show NSFW" setting off when looking for dataset images for your character models.
Prohibition of Porn: This rule is common on social media and can be useful when you're looking for dataset images for your character models. Porn images in a dataset will affect learning unless they are cropped. However, this rule may not help much in practice, because it is still possible for people to post porn on sites that don't allow it (don't do this, or you may get punished).
Collection/Pools: Collections can be found on art sites like DeviantArt, and pools are collections that can be edited by their creator and by other people, like wiki pages. They are extremely helpful in case tags fail you: some sites forbid overly specific tags, and images can also be tagged incorrectly. On art sites especially, collections cannot be edited by anyone other than the creator and collaborators, which keeps random unrelated images out. Unfortunately, on most art sites a collection may not have the images you want, and you cannot add images to it unless you are its creator or a collaborator, so you might want to make your own collection.
AI Usage Prohibition Disclaimer: This is not a tool for gaining dataset images. Rather, it's something you have to watch for when getting dataset images from a source. If an artist states that they don't want their art to be used for AI, respect them and don't use their art in your training dataset. That said, it is hard for them to verify whether you used their art, even if you stated that you trained your models on it (since it is possible to lie, pretend, or joke).
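If a site's own blacklist isn't enough, the same idea can be applied to images you have already collected, as long as you kept their tags. This is just a rough sketch: the function name `apply_blacklist` and the shape of the post records (a dict with an "id" and a list of "tags") are assumptions for illustration, not any site's real format.

```python
def apply_blacklist(posts: list[dict], blacklist: list[str]) -> list[dict]:
    """Keep only the posts that carry none of the blacklisted tags."""
    blocked = {tag.lower() for tag in blacklist}
    return [
        post
        for post in posts
        if blocked.isdisjoint(tag.lower() for tag in post["tags"])
    ]


# Made-up example records.
posts = [
    {"id": 1, "tags": ["cat", "solo"]},
    {"id": 2, "tags": ["cat", "Explicit"]},
    {"id": 3, "tags": ["dog", "watermark"]},
]
kept = apply_blacklist(posts, ["explicit", "watermark"])
print([post["id"] for post in kept])  # [1]
```

Matching is case-insensitive here, so "Explicit" and "explicit" count as the same tag.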
What are the useful services/tools you can use?
Boorusama
Boorusama is an unofficial, feature-rich client for boorus. You can install Boorusama from the Play Store.
Pros:
Has a bulk download feature (you can download many images with this feature).
Cons:
Is only easy to install on Android and other mobile devices (you have to use BlueStacks if you're on PC).
Doesn't (and won't) support sites that aren't boorus, not even art sites (proof).
Grabber
Grabber is an imageboard/booru downloader with lots of image sources.
Pros:
Supports many image sources.
I can't find images!
Don't worry! There are solutions for that!
Try other languages: Pixiv allows tags in non-English languages, so it is a good platform for this. For example, if you didn't get enough images of Peashooter from Plants vs. Zombies, you can try "豆鉄砲" (Japanese) or "豌豆射手" (Chinese). You can try other languages too and see if you get lucky.
Use a booru: If you use a non-booru website (e.g. DeviantArt) and search for something specific like "an anthro female cat wearing a blue shirt and green pants", the search results may not be accurate. You might want to search a booru instead, where results for specific queries are much more accurate. This is because images on boorus use searchable tags, and an image can have an unlimited number of tags. Boorus have a downside, though: images can have mismatching tags, due to incorrect tagging or vandalism. So, if you search for "solo, anthro, female, cat, clothed, blue_shirt, green_pants" on E621, you can still sometimes see a drawing of a feral cat or a painting of an anthro fox.
Use another site: If you can't find many images of something, try another site. If something has only a few images on a certain site, it could be due to rules, DMCA/copyright takedowns, deletion of images, etc.
Use Yandex: As mentioned above, Yandex is a search engine that shows images Google or Bing may not. If Google or Bing doesn't show you an image you want, try Yandex.
Do you accept requests/suggestions for adding sources?
Yes! Have you ever found an image source that I missed? You can request/suggest an image source that I'll add to the list.
What images should I not use?
You should not use an image in a dataset, in a manner that would let someone know it's in the dataset, if:
the creator of it doesn't want it to be used for AI.
it's licensed under CC-BY-SA (stands for Creative Commons Attribution-ShareAlike). This is because ShareAlike requires derivatives to carry the same license, and models cannot have the same license as the images.
it's licensed under CC-BY-ND (stands for Creative Commons Attribution-NoDerivatives). This is because the model could replicate the images in its dataset.
You should not use an image in the dataset of a model meant for commercial purposes, in a manner that would let someone know it's in the dataset, if:
you don't have the rights to use the image commercially.
it's licensed under CC-BY-NC (stands for Creative Commons Attribution-NonCommercial).
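The two checklists above can be written down as a small filter for sorting through a dataset. This is only a sketch of the rules as listed in this article: the function name `may_use` and its parameters are made up for illustration, and it assumes license labels are written like "CC-BY-NC-SA" or "CC BY-SA 4.0".

```python
def may_use(
    license_id: str,
    commercial: bool,
    creator_opted_out: bool = False,
    has_commercial_rights: bool = True,
) -> bool:
    """Apply this article's checklist to a single image."""
    if creator_opted_out:
        return False
    # "CC BY-NC-SA 4.0" -> {"CC", "BY", "NC", "SA", "4.0"}
    parts = set(license_id.upper().replace(" ", "-").split("-"))
    if parts & {"SA", "ND"}:  # ShareAlike or NoDerivatives
        return False
    if commercial and ("NC" in parts or not has_commercial_rights):
        return False
    return True


print(may_use("CC-BY", commercial=True))      # True
print(may_use("CC-BY-NC", commercial=False))  # True
print(may_use("CC-BY-NC", commercial=True))   # False
print(may_use("CC BY-SA 4.0", commercial=False))  # False
```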
You might say "It's tricky for people to know if their image has been used for AI", and that's right! If the dataset is large and private, it is tricky to know which images are in it, even for an artist who minds AI datasets using their art without permission.
This is because the more images a dataset contains, the more likely the model is to generate images distinct from the dataset images. Also, if a dataset is private, only you and anyone using your device can see it.
In my experience, the more images in the dataset, the more distinct the generated images are from them. I trained a LoRA for Wubbox from My Singing Monsters, used it, and the output images didn't look much like the dataset images. I trained another LoRA, this time for a Water Island Epic Wubbox, with only a few images in its dataset, and its example images looked much like the dataset images. So how distinct a model's output is from its dataset depends on the number of dataset images.
Some artist-style LoRAs have a description saying they were trained on the artist's art, but that doesn't make verification easier, because misinformation exists. Most artist-style LoRAs also have a private dataset, which makes verification even trickier and can lead the artist to believe the description is misinformation.
One might also think that it is easy to verify whether your art has been used for an artist-style model, because the model can replicate your art style. That's not true, because it's possible for someone to imitate another artist's style, so a similar style doesn't prove your art was used.
Metadata viewers are the closest thing to a way of knowing whether an image has been used, but they may not be completely accurate. For example, a metadata viewer can show a model with details missing. This inaccuracy hinders verification.
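For the curious, here is roughly what a metadata viewer does under the hood. The sketch assumes the model is saved in the safetensors format (my assumption, since no file format is named above), where the file starts with an 8-byte little-endian length followed by a JSON header that may contain an optional "__metadata__" entry. The function name `read_safetensors_metadata` is made up for illustration.

```python
import json
import struct


def read_safetensors_metadata(path: str) -> dict:
    """Return the free-form "__metadata__" entry of a .safetensors file.

    The entry is optional and holds whatever the training tool chose to
    write, so an empty or partial result is normal.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})
```

Calling it on a model file gives a plain dict of strings. Since that dict can be empty, partial, or edited after training, this kind of metadata can only hint at what a model was trained on.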