
[2023-8-31] Release of v1.4 Training Automation Process


Long time no see! Let me update you on the latest developments:

In a nutshell:

  1. We have launched version 1.4 of the fully automated LoRA process.

  2. We've implemented an algorithm for extracting and clustering character images from anime videos.

  3. For the v1.4 process, we tested both a web-based dataset (collected and cleaned from image websites) and an anime-based dataset (built from anime video keyframes using object detection). Both datasets yielded impressive models:

    • For the web-based dataset, the v1.4 process produces LoRA models with significantly improved detail quality compared to previous versions, while maintaining sufficient generalization.

    • For the anime-based dataset, v1.4 achieves extremely high fidelity on large or massive datasets while preserving a high level of generalization.

Dataset Scale

Let's define datasets based on their scale:

  1. 1-5 images: Few-shot dataset

  2. 5-20 images: Tiny dataset

  3. 20-60 images: Small dataset

  4. 60-150 images: Medium dataset

  5. 150-350 images: Large dataset

  6. 350+ images: Massive dataset
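
The scale definitions above can be written as a small lookup. This is only a sketch: the function name `dataset_scale` is ours, and because the listed ranges share their endpoints (5, 20, 60, ...), we assume a shared boundary belongs to the larger category, in the same way that "350+" places 350 in the massive category.

```python
from bisect import bisect_right

# Lower bound of each category, taken from the list above.
# Assumption: a shared boundary (5, 20, 60, ...) belongs to the larger
# category, mirroring how "350+" places 350 in the massive category.
_LOWER_BOUNDS = [1, 5, 20, 60, 150, 350]
_NAMES = ["few-shot", "tiny", "small", "medium", "large", "massive"]


def dataset_scale(num_images: int) -> str:
    """Return the scale category for a dataset with `num_images` images."""
    if num_images < 1:
        raise ValueError("a dataset needs at least one image")
    # bisect_right counts how many lower bounds are <= num_images.
    return _NAMES[bisect_right(_LOWER_BOUNDS, num_images) - 1]


if __name__ == "__main__":
    for n in (3, 45, 200, 500):
        print(n, dataset_scale(n))
```

Under this convention a 200-image dataset is "large" and a 500-image dataset is "massive", which matches how these terms are used in the rest of this article.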

As far as we know, the majority of manually trained LoRA datasets (including those widely considered to have high quality on civitai) fall within the tiny to small dataset range. In fact, collecting and processing datasets of medium size and above through manual efforts is usually impractical due to the tremendous amount of human labor required.

However, when it comes to character datasets extracted from web images or anime videos, these limitations are less significant. The main challenge lies in some characters having fewer images available online, often due to lower popularity, or in cases where characters have limited screen time in the anime, leading to a scarcity of usable images.

About the v1.4 Process

Process and Versions

It's important to clarify that the versions (such as v1.0, v1.3, etc.) associated with the models released by this account do not refer to individual model or character versions. Instead, they represent the process versions used in the automated training pipeline for the models. In simple terms, all models labeled as v1.0 were trained using the same process, and similarly, models labeled as v1.3 used a distinct but consistent process.

Let's briefly describe the currently available process versions:

  1. v1.0 Process:

    • Dataset Source: Character datasets automatically collected and cleaned from various websites (including zerochan, anime-pictures, danbooru, and over a dozen more), capped at 200 images per character (many characters have more images available), with no additional augmentation.

    • Training Approach: NAI model training; all images resized to 640x880 for training; fixed training steps at 1500 regardless of dataset size.

    • Preview Images: Generated using the anything-v5 model; prompts for preview images are mainly clustered based on dataset tags, plus 1-2 general prompts.

    • Most of the models previously uploaded by this account belong to the v1.0 process.

  2. v1.3 Process:

    • Dataset Source: Same as v1.0.

    • Training Approach: Trained for 12 epochs regardless of dataset size; other aspects remain the same as v1.0.

    • Preview Images: Generated using meinamix_v11; additional prompts for changing clothing (miko, maid, suit, yukata) and NSFW prompts to test model generalization.

    • This is the result of the first round of process improvements, showing some level of quality enhancement.

  3. v1.4 Process:

    • This is the latest process, and the focus of this article.
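
To make the difference between the v1.0 and v1.3 training lengths concrete, here is a rough sketch of the arithmetic. It assumes a batch size of 1 and no image repeats, so that one step consumes one image; the article does not state the batch size, and the function names are ours.

```python
def passes_per_image_v10(num_images: int, total_steps: int = 1500) -> float:
    """v1.0: the step budget is fixed, so smaller datasets are repeated more often."""
    if num_images < 1:
        raise ValueError("a dataset needs at least one image")
    # Assumption: batch size 1, so one step sees exactly one image.
    return total_steps / num_images


def total_steps_v13(num_images: int, epochs: int = 12) -> int:
    """v1.3: the epoch count is fixed, so the step count grows with the dataset."""
    if num_images < 1:
        raise ValueError("a dataset needs at least one image")
    # Assumption: batch size 1, so one epoch takes num_images steps.
    return num_images * epochs


if __name__ == "__main__":
    for n in (20, 200):
        print(n, passes_per_image_v10(n), total_steps_v13(n))
```

Under these assumptions, a 20-image dataset is seen 75 times per image in v1.0 while a 200-image dataset is seen only 7.5 times. In v1.3 every image is seen 12 times, and the total step count scales from 240 to 2400 instead.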

To visually demonstrate the quality of LoRA models produced by previous processes, let's take a look at a few images.

The following examples roughly represent the typical quality of models produced by the v1.0 process:

Similarly, the following examples roughly represent the typical quality of models produced by the v1.3 process:

Extraction of Characters from Anime Videos

Firstly, the v1.4 process now supports LoRA training on characters from anime videos. It involves a complete automated pipeline from the original video to a character dataset, as outlined below:

  • Obtain the magnet link or torrent file for the anime video resources.

  • Automatically download video resources to the cluster.

  • Automatically extract keyframes using anime video keyframe extraction techniques.

  • Automatically capture all characters from the keyframes using AI techniques like object detection.

  • Automatically clean the data.

  • Automatically cluster the characters based on the extracted CCIP feature vectors.

  • Automatically package the clustered results and upload them to Hugging Face.
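
The clustering step in the list above can be illustrated with a minimal sketch. This is not the pipeline's actual algorithm, which the article does not specify: it is a simple single-linkage grouping, where `euclidean` is a stand-in for a CCIP feature comparison and `threshold` is a hypothetical parameter.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Stand-in difference function; the real pipeline compares CCIP features."""
    return math.dist(a, b)


def cluster_by_threshold(
    features: List[Vector],
    threshold: float,
    difference: Callable[[Vector, Vector], float] = euclidean,
) -> List[int]:
    """Assign a cluster index to every feature vector.

    Two images share a cluster when a chain of pairwise differences
    below `threshold` connects them (single-linkage grouping).
    """
    parent = list(range(len(features)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if difference(features[i], features[j]) < threshold:
                parent[find(j)] = find(i)

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(features))]


if __name__ == "__main__":
    toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.05, 0.1)]
    print(cluster_by_threshold(toy, threshold=0.5))
```

The pairwise loop is quadratic in the number of images, which is acceptable for a sketch but would need a more scalable approach for the keyframes of a full series.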

An example of an uploaded dataset can be found here, as shown in the image below:

You can observe that nearly all character images have been extracted.

However, it's worth noting that the current clustering algorithm is still not perfect (in reality, CCIP was mainly trained on illustration data and performs less effectively on anime videos), which may result in some impurities and confusion. The packaged character data is not guaranteed to be 100% accurate. Nevertheless, this is not a major concern, as the errors are well within an acceptable range, and subsequent secondary refining can effectively eliminate them.

After this, all that remains is to perform secondary processing on the character data package and associate the index (the leftmost first column) with the character name, as demonstrated here:

v1.4 Process

The main improvements in the latest v1.4 process are:

  • Dataset:

    1. Implemented a 3-stage cropping approach (full body - upper body - close-up of head) for characters on top of the existing dataset.

    2. After removing small-sized images, saved the dataset as three separate copies.

    3. As a result, an original large dataset of 200 images often grows to around 500 images after expansion, becoming a massive dataset.

    4. An example of a processed dataset can be found here: Dataset Example

  • Training Approach:

    1. Trained using the NAI model.

    2. Clustered all images based on aspect ratio into several buckets used directly for training.

    3. Trained for 15 epochs by default (for the massive datasets mentioned above, this means over 7000 steps and up to 45 minutes of training).

    4. Trained small and medium-sized datasets for at least 3000 steps (micro datasets are currently out of scope).

  • Preview Images:

    1. Still generated using MeinaMix V11.

    2. Designed a CCIP-based evaluation metric for character fidelity, called Recognition Score (RecScore). It ranges from 0.0 (not similar at all) to 1.0 (fully similar to the dataset). It is computed by using CCIP to compare, batch-wise, how recognizable the preview images are against the dataset images.

    3. With RecScore, it's possible to assess the quality of models at various training steps and automatically select the best quality step.
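
The step counts in the training approach above can be reproduced with a short sketch. It assumes a batch size of 1, which is consistent with roughly 500 images × 15 epochs giving "over 7000 steps" but is not stated in the article. It also applies the 3000-step minimum to every dataset size for simplicity, and the function name is ours.

```python
def training_steps_v14(num_images: int, epochs: int = 15, min_steps: int = 3000) -> int:
    """Steps for one v1.4 run: `epochs` passes over the data, but never fewer than `min_steps`."""
    if num_images < 1:
        raise ValueError("a dataset needs at least one image")
    # Assumption: batch size 1, so one epoch takes num_images steps.
    return max(num_images * epochs, min_steps)


if __name__ == "__main__":
    for n in (100, 500):
        print(n, training_steps_v14(n))
```

Under these assumptions, a 500-image massive dataset trains for 7500 steps, while a 100-image dataset would only reach 1500 steps in 15 epochs and is therefore raised to the 3000-step minimum.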

With the multiple improvements mentioned above, confirmed enhancements include:

  • Significantly better facial detail fidelity, especially the quality of pupils.

  • Because the evaluation metric makes it possible to pick the best step automatically, a large number of training steps can be used with confidence to ensure high fidelity.

  • For web-based LoRA, there is a significant improvement in both overall quality and detail quality. For anime-based LoRA, both the character and the style are reproduced closely enough that generated images look like screenshots from the video.

  • At the same time, the original model's generalization ability has not decreased. It's still possible to use generic prompts for outfit changes, and there's almost no overfitting observed.
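
The idea of scoring each saved step and keeping the best one can be sketched as follows. This is an illustration, not the actual RecScore implementation: here the score is simply the fraction of (dataset image, preview image) pairs that a comparison function judges to show the same character, and `is_same_character` is a hypothetical callable standing in for a CCIP comparison. Breaking ties in favor of the earliest step is also our assumption.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


def rec_score(
    dataset_items: Sequence[T],
    preview_items: Sequence[T],
    is_same_character: Callable[[T, T], bool],
) -> float:
    """Fraction of (dataset, preview) pairs judged to show the same character."""
    if not dataset_items or not preview_items:
        raise ValueError("both image sets must be non-empty")
    matches = sum(
        is_same_character(d, p) for d in dataset_items for p in preview_items
    )
    # The result always lies between 0.0 and 1.0.
    return matches / (len(dataset_items) * len(preview_items))


def best_step(scores_by_step: Dict[int, float]) -> int:
    """Pick the step with the highest score; ties go to the earliest step."""
    # max() keeps the first maximum it meets, and the steps are visited in ascending order.
    return max(sorted(scores_by_step), key=lambda step: scores_by_step[step])


if __name__ == "__main__":
    # Toy data: 1-D "features" that match when they are closer than 0.5.
    close = lambda a, b: abs(a - b) < 0.5
    print(rec_score([0.0, 0.1], [0.2, 3.0], close))
    print(best_step({1000: 0.6, 2000: 0.9, 3000: 0.9}))
```

In the toy run, half of the pairs match, so the score is 0.5, and step 2000 is selected because it is the earliest step that reaches the highest score.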

To illustrate the effects more clearly, let's look at some comparisons.

v1.0 vs. v1.4

Here are two characters under the v1.0 training process:

And here are the same characters, using the same original datasets (the 3-stage cropping in v1.4 starts from the same training dataset used in v1.0), under the v1.4 training process:

A significant improvement in facial detail is evident, with no observable loss of generalization. In fact, v1.4 might have a stronger generalization ability due to the use of massive datasets.

You can try the models mentioned above yourself:

v1.3 vs. v1.4

Here's an anime character trained under the v1.3 process:

And here is the same character, using the same original dataset, under the v1.4 training process:

The blurriness issue has been resolved, significant facial detail improvement is noticeable, and sufficient generalization is retained.

You can try the model mentioned above yourself:

Limitations of Existing Work

Despite the substantial improvement in automatically generated model quality, there are still some limitations in the pipeline:

  • Video Processing:

    • There are a few instances of failed character detection, indicating that the object detection model still needs further refinement.

    • CCIP's accuracy on anime videos can still be improved.

  • LoRA Training:

    • The issue of dataset image quality filtering remains unsolved, which might result in low-quality images entering the dataset.

    • The clothing clustering problem for characters is yet to be resolved, requiring the training of a contrastive learning model similar to CCIP.

    • RecScore has difficulty with characters that have highly distinctive features (e.g., characters with horns in Arknights): it often yields scores close to 1.0 even when underfitting is evident.

    • RecScore only evaluates the character fidelity of a LoRA model. We still lack a metric for assessing the controllability or the degree of overfitting of a LoRA model. One possible approach is to use CLIP to extract features from the generated images and compare them with the input prompts. Several issues remain open, however, and we plan to study and replicate the relevant papers. If this succeeds, combining a controllability metric with RecScore would allow the step that performs best in all aspects (fidelity, controllability, etc.) to be selected automatically.

Addressing these points will be the direction of our ongoing efforts.

Please continue to follow our work.