
SDXL loras: 8 dim is enough for real human training. STOP using 128+ dim.

Disclaimer: This is not a scientific paper but rather a piece that includes some of my experiential insights and personal sentiments.

It's time to dispel the superstition, held by many in the community, that Loras need high-dim training.

Since the release of SDXL, we have witnessed the emergence of the SDXL Lora community over the past few months, and I respect everyone's exploration and trial and error throughout this process. However, from the beginning, when the first group of people released those enormous Loras, this question has always lingered in my mind:

Do we really need such large Loras to reproduce the character traits we are aiming to train?

Think about it: a full SDXL pre-trained model is less than 7 GB, and a pruned 1.5 model can be reduced to around 2 GB. Yet those 256-dim character Loras take up 1.7 GB? That is astonishing and does not seem reasonable.

In recent months, I've been fascinated by training anime character Loras. We've found that even for characters the model has no prior knowledge of, a 4-dim Lora is enough for the model to learn most of the character's traits accurately. With our efforts, anime character Loras of around 20 MB have finally become mainstream.

However, when we look at Loras for real people, gigantic Loras of 800MB each still dominate the scene. It seems as though time has stood still over these months, with no technical progress. I am determined to change this situation, which is why I have decided to write this article.

What better way to prove one's theory than to take action oneself? I immediately set to work, collected around 200 photos of the famous Ana de Armas from the internet, and trained the following Lora.

First, I trained a 16-dim version. Without pausing, I then trained an 8-dim version, which was slightly underfit due to the small number of training steps.

Below is a comparison between my undertrained 8-dim version and, with no disrespect intended, the 64-dim one from razzz.

As you can see, there are some differences in the outputs of the two models, but it is hard to say which one reproduces the character more accurately. Additionally, my 8-dim version appears to "pollute" the original model's style less. Don't forget, a 64-dim Lora for SDXL is about 484 MB, roughly ten times larger than mine. (I don't train text encoders.)
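Why does file size track dim so directly? Each adapted layer stores two low-rank matrices whose sizes grow linearly with dim, while the base weights are untouched. A back-of-envelope sketch (the 1280-wide projection layer is just an illustrative assumption, not an exact SDXL layer inventory):

```python
# Parameters added by one LoRA-adapted linear layer: a down-projection
# A (dim x in_features) plus an up-projection B (out_features x dim).
# Total grows linearly with dim, so a 64-dim Lora is ~8x an 8-dim one.

def lora_params(in_features: int, out_features: int, dim: int) -> int:
    """Extra parameters a LoRA adds to one linear layer."""
    return dim * in_features + out_features * dim

# Hypothetical 1280 -> 1280 attention projection, for illustration only.
p8 = lora_params(1280, 1280, 8)
p64 = lora_params(1280, 1280, 64)
print(p8, p64, p64 / p8)  # -> 20480 163840 8.0
```

The same linear scaling applies across every adapted layer, which is why dropping from 64 to 8 dim shrinks the file by roughly a factor of eight.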

Of course, the comparison is not scientific since the training sets for the two versions are completely different, but the results can still illustrate some points.

(Using Ana as an example is somewhat of a shortcut because Ana is a figure that the base model already has some recognition of. But this also shows that you shouldn't waste the model's prior knowledge, which means using trigger words like 'ohwx' is not quite reasonable.)

In the following period, I will use the same materials to train 8-dim and 128-dim versions of Lora for comparison, and I will use materials featuring individuals not recognized by the model. If you find this interesting, you might consider bookmarking this article.

Now I want to share some training experiences I've already written about on my Lora page. I am not a professional AI expert; these are merely personal experiences, and corrections are welcome.

My Lora here takes about 200 images and roughly half an hour to train: 1 repeat, 12 epochs for this one. The entire process, from gathering and captioning images to training, took no more than 2 hours. I used gradient accumulation to simulate a batch size of 64, which allowed a rather high UNet learning rate, around 4e-4, with Lion.
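To make the gradient-accumulation arithmetic concrete, here is a small sketch using the numbers above. The per-device micro-batch of 4 is an assumption for illustration; only the dataset size, repeats, epochs, and effective batch of 64 come from the run described.

```python
# Gradient accumulation: sum gradients over several micro-batches and
# take one optimizer step, simulating a large batch on limited VRAM.

images = 200
repeats = 1
epochs = 12
micro_batch = 4           # assumed: what actually fits on the GPU
accumulation_steps = 16   # accumulate 16 micro-batches per update

effective_batch = micro_batch * accumulation_steps        # simulated batch
steps_per_epoch = (images * repeats) // effective_batch   # optimizer steps
total_updates = steps_per_epoch * epochs
print(effective_batch, total_updates)  # -> 64 36
```

Note how few optimizer updates this run actually takes; the large effective batch is what makes the relatively aggressive 4e-4 learning rate with Lion tolerable.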

The key is to use a competent VLM to generate captions for your dataset, like LLaVA or CogVLM or, even better, GPT-4V. Use natural language, because it works well with SDXL. Do not bother training text encoders (please check update 2).
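As one way to do this, here is a sketch of building a GPT-4V captioning request for a single training image. The message shape follows the OpenAI vision chat format; the prompt wording and model name are my assumptions, and the actual network call is omitted.

```python
import base64

# Assumed prompt; tune it to the traits you want captured in captions.
PROMPT = ("Describe this photo in one natural-language sentence, "
          "covering the person's pose, outfit, and background.")

def build_caption_request(image_bytes: bytes,
                          model: str = "gpt-4-vision-preview") -> dict:
    """Build a chat-completion payload with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

req = build_caption_request(b"\xff\xd8\xff")  # stand-in JPEG bytes
print(req["messages"][0]["content"][0]["type"])  # -> text
```

You would send this payload through the OpenAI client and save the returned sentence as the image's caption file; local models like LLaVA or CogVLM need their own prompting code but the idea is the same.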

For people or characters known by the base model, do NOT use a separate trigger token. SDXL knows who Ana de Armas is, as well as Taylor Swift and Jenna Ortega... NEVER resort to something like "ohwx."

For those not included in the base model, using their names as a trigger token is sufficient. The model might infer some details from their name, such as race or nationality...

This article will be updated regularly with my ongoing practices. Discussions and corrections are welcome. I am not a native English speaker, so part of this article was translated using GPT.

January 21, 2024 update: After testing on multiple characters myself, I believe that when the SDXL base model is not familiar with a person, it is difficult to train a Lora on it and then restore that person's appearance across different fine-tuned models. This is because a Lora works like subtraction: given target person C and the model's prior knowledge A, what the Lora learns is B = C - A. However, the prior knowledge of different fine-tuned models may vary significantly, so the reconstruction of C can change greatly. (This especially affects the name-as-trigger training method I suggested earlier.) Perhaps training the text encoder could help, but the dual-text-encoder design of the XL model makes its training difficult to control. I will continue to test and update.
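The subtraction intuition above can be illustrated with a toy numerical sketch. This uses random matrices standing in for model weights, not real SDXL parameters: the delta B recovered against the base prior A lands exactly on C, but the same delta applied to a fine-tune with a shifted prior A' misses C.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 4))   # weights that would reproduce the target person
A = rng.normal(size=(4, 4))   # base model's prior knowledge
B = C - A                     # what the Lora effectively stores

A_prime = A + 0.5 * rng.normal(size=(4, 4))  # fine-tune with drifted prior

on_base = A + B               # exactly recovers C on the base model
on_finetune = A_prime + B     # off by (A' - A) on the fine-tune

print(np.allclose(on_base, C))       # -> True
print(np.allclose(on_finetune, C))   # -> False: likeness drifts
```

The error on the fine-tune is exactly the drift in its prior, which is why the likeness degrades more on checkpoints that have moved further from the base.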

Update 2: After some testing, I realize that for people who weren't in the SDXL base model, training the text encoder is still necessary; otherwise it is hard to maintain the likeness across different checkpoints.