The Full Automation of the Three-Stage Cropping Process for Character Datasets


Updated on 2023-8-29:

Here is another successful attempt:

The former preserves details much better than the latter.

Updated on 2023-8-28:

GOOD NEWS on training automation.

We ran an experiment, training a LoRA of the character Ithea Myse Valgulious:

The whole training takes about 15 epochs.

This is what the character actually looks like:

This is the result of the first LoRA (native training):


And this is the result of the second LoRA:



It is easy to see that the second LoRA shows a significant improvement in the overall fidelity of the character, and that details such as the eyes, ears, and hair are better preserved. This indicates that the three-stage cropping method is indeed an effective way to improve training quality. We have successfully automated it, producing a LoRA that potentially surpasses the quality of all previously trained LoRAs.

Perhaps we are truly ready to start preparing a massive update for our existing models!

About the Full Automation of the Three-Stage Cropping Process for Character Datasets

In a previous article, we mentioned that the three-stage cropping method (full-body, upper-body, head close-up) for character images can effectively improve the training quality of the model. In technical terms, the focus areas are repeated in the dataset, which effectively adds weight to them and yields better facial and upper-body details. Moreover, for manual LoRA training it is often hard to prepare hundreds of images (in reality, there are usually no more than twenty to thirty original images). This method can therefore also mitigate the issues caused by small datasets, such as overfitting.
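To make the "adding weight" idea concrete, here is a rough sketch that measures how much of the training pixels the head occupies, with and without the extra crops. The box coordinates are invented for illustration, and using pixel share as a stand-in for training weight is our simplifying assumption (it presumes every crop is resized to the same training resolution), not a measurement from the experiment.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def area(box: Box) -> int:
    """Pixel area of a box; degenerate boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def head_share(crop_box: Box, head_box: Box) -> float:
    """Fraction of a crop covered by the head (head assumed to lie inside the crop)."""
    return area(head_box) / area(crop_box)


# Hypothetical detections for one character, nested inside each other.
full = (0, 0, 400, 1000)
upper = (50, 0, 350, 500)
head = (120, 20, 280, 200)

# Training on the full-body image only.
full_only = head_share(full, head)

# Training on all three crops: average head share across the three samples.
three_stage = sum(head_share(c, head) for c in (full, upper, head)) / 3

print(f"head share, full-body only: {full_only:.3f}")
print(f"head share, three-stage:    {three_stage:.3f}")
```

With these made-up boxes the head goes from a small fraction of a single image to a much larger fraction of the training samples, which is the effect described above.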

To achieve this, we trained models for full-body and head detection on character images a long time ago. Recently, we also completed training a model for upper-body detection in single-person images (online demo). Building on this, we integrated these three object detection models and achieved full automation of the three-stage cropping process. Here is an example we provide, shown in the image below:

As you can see, each character in the scene has been detected and cropped individually in full-body, upper-body, and head close-up segments. This process has been successfully implemented and will be used for processing new model datasets. For those of you who come across this and are familiar with Python, you can check out the code on this Hugging Face space and use it for local batch dataset processing.
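For readers who want the overall shape of the pipeline before opening the real code, below is a minimal sketch of the control flow. It is not the code from the Hugging Face space: the three detectors are passed in as plain callables (hypothetical stand-ins for the real models), the image is a toy nested list so the example runs without any dependencies, and running the upper-body and head detectors on each person crop is our assumption, based on the upper-body model being trained on single-person images.

```python
from typing import Callable, Dict, List, Optional, Tuple

Image = List[List[int]]              # toy image: rows of pixel values
Box = Tuple[int, int, int, int]      # (x0, y0, x1, y1), right/bottom exclusive
Detection = Tuple[Box, float]        # (box, confidence score)
Detector = Callable[[Image], List[Detection]]


def crop(image: Image, box: Box) -> Image:
    """Cut a rectangular region out of the toy image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def best_box(detections: List[Detection]) -> Optional[Box]:
    """Return the highest-scoring box, or None if nothing was detected."""
    if not detections:
        return None
    return max(detections, key=lambda d: d[1])[0]


def three_stage_crop(
    image: Image,
    detect_person: Detector,
    detect_halfbody: Detector,
    detect_head: Detector,
) -> List[Dict[str, Image]]:
    """For each detected person, return full-body, upper-body and head crops."""
    results = []
    for person_box, _score in detect_person(image):
        person = crop(image, person_box)
        crops = {"full": person}
        # The remaining detectors run on the single-person crop,
        # so their boxes are relative to that crop.
        upper_box = best_box(detect_halfbody(person))
        if upper_box is not None:
            crops["upper"] = crop(person, upper_box)
        head_box = best_box(detect_head(person))
        if head_box is not None:
            crops["head"] = crop(person, head_box)
        results.append(crops)
    return results


# Usage with stub detectors that return fixed, made-up boxes.
image = [[0] * 8 for _ in range(10)]  # 8 pixels wide, 10 pixels tall
people = three_stage_crop(
    image,
    detect_person=lambda img: [((1, 0, 7, 10), 0.9)],
    detect_halfbody=lambda img: [((0, 0, 6, 5), 0.8)],
    detect_head=lambda img: [((2, 0, 4, 2), 0.7)],
)
```

In a real setup the toy image and `crop` would be replaced by an image library, and the stubs by the actual detection models.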

Additionally, we've noticed that characters with fewer images might suffer from a higher proportion of low-quality images, which significantly impacts LoRA's training quality. Therefore, we're considering a new fully automated training method:

  • In the previous approach, we directly trained LoRA using 200 character images (after filtering out sketches, monochrome images, line art, unrelated characters, etc.).

  • In subsequent automated LoRA training, we could use a relatively smaller number of images (e.g., 100 original images), which, after undergoing the three-stage cropping process, would result in a total of 300 images for training.
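The 300-image figure is an upper bound that assumes every original yields all three crops. A small sketch of the count, with a hypothetical case in which some images lose their head crop (the 90/10 split is invented for illustration):

```python
from typing import List, Sequence

STAGES = ("full", "upper", "head")


def count_training_images(crops_per_image: List[Sequence[str]]) -> int:
    """Total number of training images, given the stages obtained for each original."""
    return sum(len(stages) for stages in crops_per_image)


# Ideal case: 100 originals, all three stages detected in each.
ideal_size = count_training_images([STAGES] * 100)

# Hypothetical case: the head detector misses on 10 of the 100 originals.
partial_size = count_training_images([STAGES] * 90 + [("full", "upper")] * 10)

print(ideal_size, partial_size)  # 300 290
```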

However, it's evident that some image websites don't support sorting by image quality, so we need a model that can evaluate image quality. Regarding this model, we've started data annotation, involving annotators with backgrounds in art or design-related fields to ensure the dataset's quality.
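Once such a model exists, picking the originals could look roughly like the sketch below. The quality model is not available yet, so `score_fn` is a hypothetical placeholder for it, and the file names and scores in the usage example are made up.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def select_best(images: Iterable[T], score_fn: Callable[[T], float], keep: int) -> List[T]:
    """Keep the `keep` highest-scoring images, best first."""
    return heapq.nlargest(keep, images, key=score_fn)


# Usage with made-up scores standing in for the future quality model.
fake_scores = {"a.png": 0.91, "b.png": 0.35, "c.png": 0.78, "d.png": 0.12}
selected = select_best(fake_scores, score_fn=fake_scores.__getitem__, keep=2)
print(selected)  # ['a.png', 'c.png']
```

The selected originals would then go through the three-stage cropping step before training.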