In the mysterious realm of LoRA training I'll never be able to provide testable proof that I'm right about something or that someone else is wrong. There's more than one way to accomplish most things, and I'm going to describe one of them, though I have tested everything I could think of. You will undoubtedly find some other method that works; I may even have tried it and found that it didn't work for me, but that evaluation doesn't mean it doesn't work, only that it didn't work for me, and that's all it means.
The goal? I want to train a LoRA with enough steps to introduce a concept, and to summon that concept through prompting, but I don't want the model to produce data that I didn't intend to train into it.
The problem? Given enough steps to produce the intended visual response, a concept may not be learned fully, or well enough to be useful, without also picking up artifacts or unrelated data, unless some method is used to obscure or eliminate that fringe data from the training process.
The solution? There's a technique called "masking" for training. I don't use the traditional form of it because it's more work than the approach I describe here, and I've never tried it for the same reason, though my understanding is that it's intended to produce the effect I'm asserting here.
There's another approach, but it's a delicate process, a lot more time consuming and, in some cases, simply not viable because of limited available data: use a different subject for each image. Sometimes, though, the trainer will lock onto a single image and carry those characteristics along, so it isn't a consistent solution for that reason either, and there are other reasons I believe it didn't work well for me.
The masking idea is that we restrict areas of training images, eliminating data so that it can't be learned. The use case I'll describe here is facial characteristics: for instance, if one wanted to train a model to learn a specific eyebrow shape, this technique could be applied.
When using this non-traditional method of masking, one simply paints over the areas that shouldn't be learned, leaving enough detail for the model to know where the concept fits in (this is important for human subjects, though there are interesting approaches that skip this consideration). It does have drawbacks, with the mask color permeating generations, but it lends itself well to the intent, and using an alpha of 1, disparate from the rank, seems to almost completely get rid of the mask bleed-through. It also helps to tell the training process what to exclude, and I do this directly inside my captions. Yes, captions: in my experience, with newer architectures, captions are not triggers, they are directives and concept weight. I may write something about that later, but for this article I'll cap it by saying I simply tell the process what the painted areas are, why they are there, and not to learn them.
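To make that concrete, here is a minimal sketch of the painting step using Pillow. The file names, region coordinates, and caption wording are made-up examples, not my actual dataset or tooling; it only illustrates painting an area solid white and writing a caption that says what the paint means.

```python
# Minimal sketch (Pillow): paint an exclusion area solid white before training.
# File names, coordinates, and caption text below are hypothetical examples.
from PIL import Image, ImageDraw

img = Image.open("train_0001.png").convert("RGB")
draw = ImageDraw.Draw(img)

# Hypothetical region: everything below y=300 (lower face and body) is painted
# white, leaving the eyebrows and enough forehead for the model to place them.
draw.rectangle([0, 300, img.width, img.height], fill=(255, 255, 255))
img.save("train_0001_masked.png")

# The caption tells the process what the white areas are and not to learn them.
caption = (
    "close-up of a forehead with thick arched eyebrows; "
    "the solid white painted regions are masking, not part of the subject, "
    "do not learn the white areas"
)
with open("train_0001_masked.txt", "w") as f:
    f.write(caption)
```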
From my testing (which I love, and which is probably the most motivating part of my LoRA training hobby), I discovered that shapes get learned, so using different painting patterns can be crucial. I do all the painting by hand, which ensures that every pattern/shape is different. If all that's required is cropping, I'll do that instead; sometimes a combination of both is required.
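Since I paint by hand, there's no script involved, but as a rough illustration of the idea that no two painted shapes should match, here's a sketch that jitters a white polygon differently for every image. The file names, center point, and radius are placeholders, not recommended values.

```python
# Sketch: generate an irregular, different-every-time white blotch so the painted
# shape itself can't be learned as a pattern. A stand-in for hand-painting.
import math
import random
from PIL import Image, ImageDraw

def paint_irregular_blotch(img, center, radius):
    """Paint a randomly jittered white polygon around `center`."""
    cx, cy = center
    steps = random.randint(10, 18)                 # different vertex count each time
    points = []
    for i in range(steps):
        angle = 2 * math.pi * i / steps
        r = radius * random.uniform(0.6, 1.3)      # jitter the radius per vertex
        points.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    ImageDraw.Draw(img).polygon(points, fill=(255, 255, 255))
    return img

img = Image.open("train_0002.png").convert("RGB")
img = paint_irregular_blotch(img, center=(256, 400), radius=120)   # hypothetical region
img.save("train_0002_masked.png")
```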
After extensive testing I've settled on white. One might get the idea of using transparency (an alpha channel) or black instead of white; I've already tried both, and while I won't discourage anyone from testing them, they didn't work for me. I didn't try other colors.
I mentioned above that it's important to leave enough data in an image so the training process knows where the concept belongs, so letting it see that the eyebrows sit on a forehead can matter, and has in my experience. I've done this with fingernails as well; those are a bit more obvious, but I still followed the same guideline of leaving enough of the area surrounding the fingernail that the process knows where the nail belongs. You've probably noticed, and not just with Z-Image, that a body part can show up on an unrelated, unexpected area of the body, so this visual location directive seems rather logical.
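For the cropping side of this, here's a minimal sketch of cropping with a generous margin so the surrounding context survives and the detail keeps its placement. The coordinates and margin are made-up examples, not measured values.

```python
# Sketch: crop around a target region while keeping a context margin, so the
# model still sees where the detail (e.g. a fingernail) belongs on the body.
from PIL import Image

img = Image.open("train_0003.png").convert("RGB")

# Hypothetical bounding box of the fingernail itself.
left, top, right, bottom = 310, 420, 360, 470

margin = 150  # keep surrounding finger/hand so placement context isn't lost
crop = img.crop((max(0, left - margin), max(0, top - margin),
                 min(img.width, right + margin), min(img.height, bottom + margin)))
crop.save("train_0003_cropped.png")
```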
I have one more note to add: training a LoRA this way provides a guidance to the model that we may not initially expect, and I didn't. It's something we don't see, but the model does. Given any amount of data, the model forms an idea of what a subject looks like, even when the "subject" is completely missing from the image. For instance, if I've masked consistently, and well enough that the subject can't be identified by any means whatsoever, not even a characteristic like their nationality, the model will still assume a type, a shape, a race, a gender, or maybe even an unknown, nameless embedded person, and that will become part of the results of triggering the concept from the LoRA. This mysterious apparition can be helpful, but I haven't found a way to guide it; I just see what happens.

