Here I will recount the software I use for preparing datasets for training my LoRAs.
My remarks apply only to artist-style LoRAs for Pony, which is my current focus.
1. Collection of images
For any LoRA you obviously need to collect images, but an artist's images are likely spread across multiple sites.
https://github.com/Bionus/imgbrd-grabber - very good for collecting from booru-like sites, e621, kemono, etc. Especially since you can download with all or only specific tags, so I don't waste time captioning in character names.
Discrub - Chrome extension to grab stuff from a Discord server you are on.
WFDownloaderApp - looks sketchy af, but it works where other scrapers failed
If it is all GIFs and WebMs, I use ffmpeg from a batch file with this command:
for /r %%v in (*.gif *.webm) do ffmpeg -i "%%v" "%%~nv_%%04d.png"
Then I remove overly similar frames with czkawka, or just take snapshots in VLC manually if there aren't that many animations.
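The same extraction can be sketched cross-platform in Python, as a rough equivalent of the batch file above. The fps value and naming scheme here are my own assumptions (the fps filter also thins out near-identical frames before deduplication), tune to taste:

```python
import subprocess
from pathlib import Path

def ffmpeg_cmd(clip: Path, out_dir: Path, fps: float) -> list[str]:
    """Build the ffmpeg command for one clip: one PNG per kept frame,
    named <clipname>_0001.png, _0002.png, ..."""
    pattern = out_dir / f"{clip.stem}_%04d.png"
    return ["ffmpeg", "-i", str(clip), "-vf", f"fps={fps}", str(pattern)]

def extract_frames(root: str, out_dir: str, fps: float = 2.0) -> None:
    """Walk `root` recursively and dump frames from every .gif/.webm."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for clip in Path(root).rglob("*"):
        if clip.suffix.lower() in {".gif", ".webm"}:
            subprocess.run(ffmpeg_cmd(clip, out, fps), check=True)
```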
2. Image preparation
https://github.com/qarmin/czkawka - a godsend for easily removing duplicate images, or images that are merely similar enough.
An image editor of your choice - remove unnecessary stuff from images, like excessive text (why train on small text if SDXL cannot reproduce it anyway?), and split complex images into simpler ones. Splitting is especially important when panels of a comic depict different interactions between the characters, as training on them whole adds instability. You wouldn't want the tag "sleeping" attached to a dance party.
My rule: if I cannot understand what's going on in an image at a glance, I split it or discard it - if it's hard for me, it's barely possible for training. Bad quality image - discard. Style not aligned with the desired one - discard.
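One mechanical pre-filter I'd sketch for the "bad quality - discard" pass: quarantine anything whose resolution is too low to be worth training on. The 512 px cutoff and folder names are my own assumptions, not a rule from above:

```python
from pathlib import Path
from PIL import Image

MIN_SIDE = 512  # hypothetical cutoff; pick what suits your base resolution

def too_small(path: Path, min_side: int = MIN_SIDE) -> bool:
    """True if the image's shorter side is below the cutoff."""
    with Image.open(path) as im:
        return min(im.size) < min_side

def quarantine_small(root: str, discard_dir: str = "_discard") -> list[str]:
    """Move undersized images into a discard subfolder for manual review."""
    src = Path(root)
    dst = src / discard_dir
    dst.mkdir(exist_ok=True)
    moved = []
    for p in src.glob("*.png"):
        if too_small(p):
            p.rename(dst / p.name)
            moved.append(p.name)
    return moved
```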
If you need to remove text or a specific watermark/pattern (wink-wink, nudge-nudge) from images, read these:
Using the articles above, I can now train a YOLOv8 model for object detection. It takes about 5 minutes to create bounding boxes for 20-25 images in labelImg and 10 minutes to train the model. Then I generate masks with that model, and throw the images and masks into a ComfyUI workflow to inpaint the masked areas. Not a golden bullet, but a silver one for sure - on 200+ images it's a massive time saving. It could probably be automated further, but I'm not that savvy at the moment.
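The mask-generation step can be sketched like this with the ultralytics package. The model path, the padding amount, and the file naming are placeholders of mine (and the inpainting itself still happens in ComfyUI):

```python
from pathlib import Path
from PIL import Image, ImageDraw

def boxes_to_mask(size: tuple[int, int], boxes, pad: int = 8) -> Image.Image:
    """White-on-black inpainting mask from xyxy boxes, padded a little
    so the inpaint also covers the mark's soft edges."""
    mask = Image.new("L", size, 0)
    draw = ImageDraw.Draw(mask)
    w, h = size
    for x1, y1, x2, y2 in boxes:
        draw.rectangle(
            [max(0, x1 - pad), max(0, y1 - pad),
             min(w, x2 + pad), min(h, y2 + pad)],
            fill=255,
        )
    return mask

def make_masks(weights: str, image_dir: str, mask_dir: str) -> None:
    """Run the detector over a folder and save one mask per image."""
    from ultralytics import YOLO  # pip install ultralytics
    model = YOLO(weights)
    out = Path(mask_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in Path(image_dir).glob("*.png"):
        result = model(str(img_path))[0]
        boxes = result.boxes.xyxy.tolist()  # [[x1, y1, x2, y2], ...]
        with Image.open(img_path) as im:
            boxes_to_mask(im.size, boxes).save(out / f"{img_path.stem}_mask.png")
```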
For marks that can appear rotated in the images, I build the dataset with Label Studio and train a YOLOv11x model instead, then use a mangled version of the script from the article above to generate the masks. OBB also takes about 2-3 times longer to train.
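For the rotated-mark case, the only change on the mask side is drawing a polygon instead of a rectangle. A minimal sketch, assuming each detection comes as four (x, y) corner points (the xyxyxyxy layout OBB detectors output):

```python
from PIL import Image, ImageDraw

def obb_to_mask(size: tuple[int, int], polygons) -> Image.Image:
    """Inpainting mask from oriented boxes, each given as four (x, y)
    corner points of the rotated rectangle."""
    mask = Image.new("L", size, 0)
    draw = ImageDraw.Draw(mask)
    for pts in polygons:
        draw.polygon([tuple(p) for p in pts], fill=255)
    return mask
```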
3. Captioning
https://github.com/jhc13/taggui - very good for working with tags, and it has an auto-captioner with downloadable models like Vit3Large, LLaVA, and Kosmos. I use only Vit3Large at the moment, as I had limited success with InternLM and LLaVA (too innocent or censored for my degenerate tastes).
Currently I use this with the SmilingWolf/wd-eva02-large-tagger-v3 interrogator model for auto-captioning. (Standing on the shoulders of giants.)
https://github.com/Particle1904/DatasetHelpers/releases - has auto-captioning like the one above; not as easy to use, but it has a killer feature for auto-captioning: redundancy removal, where a more descriptive tag consumes the more generic one, for example: "tail, dragon tail" -> "dragon tail". There is also an [experimental] tag consolidation feature, but it works funny sometimes, for example: "purple hair, orange hair, striped hair, multicolored hair" -> "purple orange striped multicolored hair" - yeah, that was two characters in one image.
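The redundancy-removal idea can be approximated with a word-subset check. This is my own sketch of the behavior, not that tool's actual algorithm: a tag is dropped when some other tag contains all of its words plus more.

```python
def drop_redundant(tags: list[str]) -> list[str]:
    """Drop a tag when a more specific tag contains all of its words,
    e.g. "tail" is consumed by "dragon tail"."""
    keep = []
    for t in tags:
        words = set(t.split())
        consumed = any(
            t != other and words < set(other.split())  # strict word subset
            for other in tags
        )
        if not consumed:
            keep.append(t)
    return keep
```

Note that "striped hair" and "multicolored hair" survive each other under this rule (neither word set contains the other), which is why the consolidation case above is the genuinely hard one.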
There is also an article from its author: https://civitai.com/articles/2079/dataset-all-in-one-tools-windows-and-linux
I tried captioning images with JoyTag + ViT + SwinV2 simultaneously at a threshold of >0.8 in ComfyUI, with tag filtering and deduplication. It felt good at first, but now it feels like it was a footgun. JoyTag is good at guessing character names, but otherwise not as good as ViT for me.
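A sketch of what that multi-tagger merge amounts to: union the tag sets, keep each tag's best confidence across models, then apply the threshold. The model names and numbers here are illustrative only:

```python
def merge_taggers(per_model: list[dict[str, float]],
                  threshold: float = 0.8) -> list[str]:
    """Union the tag sets from several taggers, keeping each tag's best
    confidence and dropping anything under the threshold."""
    best: dict[str, float] = {}
    for scores in per_model:
        for tag, conf in scores.items():
            best[tag] = max(conf, best.get(tag, 0.0))
    return sorted(t for t, c in best.items() if c >= threshold)
```

The footgun is visible in the code: taking the max means any single model's false positive above the threshold survives the merge.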
4. Training
For training I use kohya_ss; there is nothing to it, and there are multiple guides on the matter. And I have no idea what I am doing. I tried OneTrainer but got only noise every time, no idea why.
After training I get a file over 600 MB in size, and I resize it with sv_fro 0.97 to reduce it to ~300 MB.
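For reference, the resize step is done with the resize script shipped in kohya's sd-scripts. A hedged example invocation - the file paths and the rank cap are placeholders, and the flag names are as I remember them from sd-scripts, so check `--help` against your version:

```shell
python networks/resize_lora.py \
  --model my_lora.safetensors \
  --save_to my_lora_resized.safetensors \
  --dynamic_method sv_fro \
  --dynamic_param 0.97 \
  --new_rank 32 \
  --save_precision fp16
```

With sv_fro, each layer keeps just enough rank to retain 97% of the Frobenius norm, which is why the file roughly halves without a visible quality drop.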
I would like to know more about the tools you use for any of the steps above - share them so I can try them and maybe incorporate them into my workflow.