
Working on my first anime-style character LoRA, but a limited training set and a complex outfit lead to inconsistent output. Advice on workflow, parameter settings, and tagging?

I'm currently working on a LoRA for Etrian Odyssey 4's Dancer character: https://civitai.com/posts/270348

It can generate good images if I leave it running for long enough, but it's too janky and unreliable, including when inpainting. My two main problems are that my dataset is limited to 33 images, almost all of them featuring the same outfit, and that the LoRA almost never gets the belt hoop extension and golden disk counts right, often generating extra hoops and usually generating either 1 or 3+ rows of disks. I feel like I'm stuck choosing between overfitting and overly inconsistent output.

What I've tried to mitigate the first problem is to scan the data set with the WD 1.4 tagger, then split each image up into a pair of cropped square images and prune any tags that no longer apply to each cropped image. I have the bucketed originals at 6x repeats and the cropped pairs at 3x for a 50:50 total mix. Of course, I have no idea whether this is the right approach. Should I be doing something different?
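
For reference, here's a rough sketch of that crop step; the folder names follow kohya's "<repeats>_<name>" convention, and the paths and glob pattern are placeholders for my actual layout. The tag pruning itself I still do by hand, so the script only copies each caption next to its crops.

    # crop_pairs.py: split each training image into two square crops and copy its
    # caption so tags that no longer apply can be pruned by hand afterwards.
    from pathlib import Path
    from PIL import Image
    import shutil

    SRC = Path("dataset/6_eo4dancer")        # bucketed originals at 6x repeats
    DST = Path("dataset/3_eo4dancer_crops")  # cropped pairs at 3x repeats
    DST.mkdir(parents=True, exist_ok=True)

    for img_path in sorted(SRC.glob("*.png")):  # adjust the glob for other formats
        img = Image.open(img_path)
        w, h = img.size
        side = min(w, h)
        if w >= h:  # wide image: take the left and right squares
            boxes = [(0, 0, side, side), (w - side, 0, w, side)]
        else:       # tall image: take the top and bottom squares
            boxes = [(0, 0, side, side), (0, h - side, side, h)]
        for i, box in enumerate(boxes):
            out = DST / f"{img_path.stem}_crop{i}.png"
            img.crop(box).save(out)
            cap = img_path.with_suffix(".txt")
            if cap.exists():  # caption is pruned manually after copying
                shutil.copy(cap, out.with_suffix(".txt"))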

My main approach so far to tagging has been to try brute-forcing things by consolidating tags for hair/eye color, each half of the bikini, and the harem pants/outfit into the "eo4dancer", "dancertop", "dancerbottom", and "danceroutfit" tags, respectively, all left unshuffled at the front of my caption files for LoRA generation. The problem is that there's so much overlap in the training images that they all effectively act as activation tags, requiring me to experiment with tweaking tag weights to reliably swap out parts of outfits.
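
Concretely, the consolidation I'm describing could be scripted along these lines; the member tags in each group below are illustrative placeholders rather than my exact lists, and the folder path is a placeholder too.

    # consolidate_tags.py: fold WD 1.4 tagger output into the custom activation tags
    # and pin them to the front of each caption file.
    from pathlib import Path

    # Illustrative groupings only; the real member lists come from the tagger output.
    GROUPS = {
        "eo4dancer":    {"brown hair", "brown eyes"},       # hair/eye colour
        "dancertop":    {"bikini", "bikini top"},           # top half of the bikini
        "dancerbottom": {"bikini bottom"},                  # bottom half of the bikini
        "danceroutfit": {"harem pants", "harem outfit"},    # harem pants/outfit
    }

    for cap_path in Path("dataset/6_eo4dancer").glob("*.txt"):
        tags = [t.strip() for t in cap_path.read_text().split(",") if t.strip()]
        front = [name for name, members in GROUPS.items()
                 if any(t in members for t in tags)]
        rest = [t for t in tags
                if not any(t in members for members in GROUPS.values())]
        cap_path.write_text(", ".join(front + rest))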

Another thing I'm trying is to just have the one activation tag and try to boil the clothing down into a consistent set of tags to account for various combinations ("harem outfit", "harem pants", "bikini", "panties"). I haven't tested this as much, but it doesn't seem to do much to help.

A potential problem is that I'm not sure how to tag the hoop-shaped belt extensions and the rows of disks, assuming I should do so at all. I'm split between:

  • "belthoops", "beltdisks"

  • "hoops hanging from belt", "disks hanging from belt hoops"

  • "belt hoop extensions", "belt extension disks"

I'm also not sure:

  • whether to tag the belt itself, as it's got a non-standard design

  • when to tag the shawl for ideal quality; my current approach is to tag it if less than 25% of it is cropped out

  • when to manually tag the armlet/bracelet/necklace instead of letting the autotagger handle them, as they're very small or lack detail in a lot of the source images, potentially degrading quality, and I've had excess armlets pop up after overtagging (a rough post-processing sketch follows this list)

  • whether to tag the boots, as most of the data set has the character wearing them, potentially turning it into a de facto activation tag

  • whether tagging the few images where the character is holding a sword would help with generating such images

  • whether tagging the leaf would help at all with it disappearing at low LoRA weights
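
To keep whatever I decide consistent across the data set, the kind of rule-based post-processing I'm picturing looks like the sketch below; all of the tag lists and filename stems are illustrative placeholders, not my actual data.

    # retag_rules.py: apply a consistent accessory policy after autotagging by
    # stripping tags that should only be set manually and force-adding tags per image.
    from pathlib import Path

    STRIP_ALWAYS = {"jewelry"}                         # vague autotagger tags to drop
    MANUAL_ONLY  = {"armlet", "bracelet", "necklace"}  # keep only via OVERRIDES below
    OVERRIDES = {                                      # filename stem -> tags to force-add
        "dancer_012": {"armlet", "holding sword"},
        "dancer_027": {"necklace", "leaf"},
    }

    for cap_path in Path("dataset/6_eo4dancer").glob("*.txt"):
        tags = [t.strip() for t in cap_path.read_text().split(",") if t.strip()]
        keep = [t for t in tags if t not in STRIP_ALWAYS and t not in MANUAL_ONLY]
        keep += sorted(OVERRIDES.get(cap_path.stem, set()) - set(keep))
        cap_path.write_text(", ".join(keep))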

Overall, how should I change my tagging practices for better LoRA reliability?

The relevant parameters I've set in my local kohya-based LoRA trainer are listed below (a rough command-line sketch comes after the list):

  • NAI base model

  • 32 Dim/16 Alpha

  • AdamW8bit optimizer with cosine scheduler

  • 5e-4 UNet learning rate, 1e-4 text encoder learning rate, 0.05 warmup ratio, and 0.01 noise offset
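
For completeness, here's roughly how those settings translate into a kohya sd-scripts call; the flag names assume a recent train_network.py, and the paths, resolution, and warmup step count are placeholders (the 0.05 ratio is expressed as a step count).

    # train_eo4dancer.py: the settings above as a train_network.py invocation.
    import subprocess

    BASE_ARGS = [
        "accelerate", "launch", "train_network.py",
        "--pretrained_model_name_or_path=/models/animefull-final-pruned.ckpt",  # NAI base
        "--train_data_dir=dataset", "--output_dir=output", "--output_name=eo4dancer",
        "--network_module=networks.lora",
        "--network_dim=32", "--network_alpha=16",
        "--optimizer_type=AdamW8bit", "--lr_scheduler=cosine",
        "--unet_lr=5e-4", "--text_encoder_lr=1e-4",
        "--lr_warmup_steps=100",          # stand-in for the 0.05 warmup ratio
        "--noise_offset=0.01",
        "--prior_loss_weight=1.0",        # the value I keep experimenting with (see below)
        "--max_train_epochs=8", "--save_every_n_epochs=1",
        "--resolution=512,512", "--enable_bucket",
        "--caption_extension=.txt",       # captions stay unshuffled, activation tags first
        "--logging_dir=logs",             # writes tensorboard event files
    ]

    if __name__ == "__main__":
        subprocess.run(BASE_ARGS, check=True)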

The main thing I've experimented with adjusting has been the prior loss weight. A value of 1.0 seems to lead to overfitting, making it hard to change backgrounds, poses, body shape, and such; 0.7 generates extra limbs; and 0.3, while working best overall, exacerbates the outfit inconsistency issues. Is this worth trying to tweak further, or should I keep it at one of the above values?
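
If it is worth sweeping further, I'd script it roughly like this; it assumes the argument list from the sketch above is importable as BASE_ARGS, and the specific values are just examples.

    # sweep_prior_loss.py: retrain with several prior-loss-weight values under
    # distinct output names so the checkpoints can be compared on the same seeds.
    import subprocess
    from train_eo4dancer import BASE_ARGS  # hypothetical module from the earlier sketch

    for plw in (0.3, 0.5, 0.7, 1.0):
        run_args = [a for a in BASE_ARGS
                    if not a.startswith(("--prior_loss_weight", "--output_name"))]
        run_args += [f"--prior_loss_weight={plw}", f"--output_name=eo4dancer_plw{plw}"]
        subprocess.run(run_args, check=True)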

An 8-epoch LoRA seems to consistently give better results (not too overfit or weak) with the above settings than anything else regardless of which tagging pattern I use. Should I leave this alone from here on, or might I need to change it alongside other stuff?

A LoRA weight of 0.7-1.0 seems to work best with the Hassaku model, but above 0.8 I start seeing overfitting artifacts, while below 0.9 the shapes of accessories and fine details tend to morph or disappear, so there's no clean window. I've also noticed that each 0.1 step up from 0.7 tends to add one extra row of gold disks and mess with the hoop pattern, making the ideal weight for each seed effectively random. Am I doing something wrong here, or is the problem located elsewhere?
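
A scripted way to check this outside the WebUI would be to render the same seed at several LoRA weights with diffusers, something like the sketch below; the paths and prompt are placeholders, and the exact LoRA-scaling call depends on the diffusers version.

    # lora_weight_grid.py: render one seed at several LoRA weights to see where the
    # disk rows and hoop pattern start to break.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_single_file(
        "/models/hassaku.safetensors", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_lora_weights("/output/eo4dancer.safetensors")

    prompt = "eo4dancer, danceroutfit, standing, simple background"
    for w in (0.6, 0.7, 0.8, 0.9, 1.0):
        image = pipe(
            prompt,
            num_inference_steps=28,
            generator=torch.Generator("cuda").manual_seed(1234),  # fixed seed per column
            cross_attention_kwargs={"scale": w},                  # LoRA weight
        ).images[0]
        image.save(f"grid_w{w:.1f}.png")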

Would having different LoRA versions for txt2img, img2img, and/or inpainting be a viable way of working around these limitations?

Finally, is there anything I'm completely missing here?

I know I'm asking a lot, but I really have no idea where to get answers to all this.

2 Answers

that the LoRA almost never gets the belt hoop extension and golden disk counts right, often generating extra hoops and usually generating either 1 or 3+ rows of disks

Getting counts of specific things right will be a challenge. If your model gets this right every time, it's probably overtrained, so you can treat it as a metric for your training.

What I've tried to mitigate the first problem is to scan the data set with the WD 1.4 tagger, then split each image up into a pair of cropped square images and prune any tags that no longer apply to each cropped image. I have the bucketed originals at 6x repeats and the cropped pairs at 3x for a 50:50 total mix. Of course, I have no idea whether this is the right approach. Should I be doing something different?

Each image should only appear in your dataset once. I don't create mirrored copies when training LoRAs.

I'm not going to comment individually on the sections related to tagging; my impression is that there are possibly too many concepts in a single model. Reducing the scope of this model to a smaller set of costume variations may yield better results, and in that case you wouldn't need to spend as much effort tagging each part of the costume.

If you're not already familiar with it, I recommend reading my article on using TensorBoard to analyze training runs. It can be really helpful for comparing different settings when training a difficult model.
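
As a quick example of the kind of comparison I mean, here's a minimal sketch that reads the loss curves kohya writes to its logging_dir; the scalar tag name and run directories are assumptions, so print ea.Tags() to see what your runs actually logged.

    # compare_runs.py: pull loss curves out of two logging_dir runs for comparison.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    def load_loss(run_dir, tag="loss/average"):  # tag name varies by sd-scripts version
        ea = EventAccumulator(run_dir)
        ea.Reload()
        return [(e.step, e.value) for e in ea.Scalars(tag)]

    for run in ("logs/run_plw1.0", "logs/run_plw0.3"):  # placeholder run directories
        curve = load_loss(run)
        print(run, "final loss:", curve[-1][1] if curve else "no data")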

1: Remove from the .txt captions the tags for the clothing you want baked into the character.

2: Add individual high-resolution images of the accessories you want and write the accessory names in their .txt captions.

3: If you're overfitting, reduce the learning rate appropriately and lower the dim/alpha values (see the sketch below).
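
As an illustration of point 3 only, here's one way to step down from the question's settings; the specific numbers are placeholders, not a recommendation.

    # Illustrative reduction of learning rates and dim/alpha if checkpoints look overfit.
    CURRENT = {"unet_lr": 5e-4, "text_encoder_lr": 1e-4, "network_dim": 32, "network_alpha": 16}
    REDUCED = {"unet_lr": 2.5e-4, "text_encoder_lr": 5e-5, "network_dim": 16, "network_alpha": 8}

    for key in CURRENT:
        print(f"{key}: {CURRENT[key]} -> {REDUCED[key]}")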
