Comparing SDXL and SD3.5 medium training

This is a follow up from my previous article: https://civitai.com/articles/13012?highlight=948622#comments

You can download the latest model here: https://civitai.com/models/1404024/miso-diffusion-m-11

or download from Hugging Face if you prefer to separate the text encoder: https://huggingface.co/suzushi/miso-diffusion-m-1.1

In this article, I will focus on showcasing the differences between the two models during training.

You might see some bad results from early samples, so please be aware as you scroll through this article.

Starting with VRAM consumption: while SD3.5 medium uses less VRAM, DiTs are in fact more computationally expensive. The same goes for local generation. On a 4GB GPU, SDXL needs to offload part of the model to RAM and takes around 30 seconds to generate an image; SD3.5 medium, on the other hand, needs roughly 90 seconds while using half the VRAM.

The two models also differ significantly when it comes to textures.

SD3.5 medium with 160k 5 epoch + 600k 2 epoch

SDXL trained with 360k 5 epoch + 600k 2 epoch

Originally, noise offset was introduced for SDXL because the model lacked the capability to generate particularly bright or dark images; it would "expose" images toward medium pixel values. While I haven't gotten to test SD3.5m thoroughly, the results suggest this issue is already fixed: using any sort of noise offset makes the model generate images that are too bright. This may have contributed to many failed training runs, since in the beginning many people carried over their original SDXL training settings. For instance, here is what SD3.5m trained with noise offset looks like:
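For readers unfamiliar with the trick: noise offset simply adds a small per-sample, per-channel constant to the Gaussian noise used during training, which is what lets a model learn overall brightness shifts. A minimal numpy sketch of the idea (function name and values are my own, not the trainer's actual code):

```python
import numpy as np

def noise_with_offset(latents, noise_offset=0.1, rng=None):
    """Sample training noise with the SDXL-style noise offset trick.

    A small random constant, drawn per (sample, channel) and broadcast
    over all spatial positions, is added to the base Gaussian noise so
    the model learns to shift overall image brightness.
    """
    rng = rng or np.random.default_rng(0)
    b, c, h, w = latents.shape
    noise = rng.standard_normal(latents.shape)
    # Broadcast one offset value per (sample, channel) across H x W
    noise += noise_offset * rng.standard_normal((b, c, 1, 1))
    return noise

# With noise_offset=0 this reduces to plain Gaussian noise.
noise = noise_with_offset(np.zeros((2, 4, 8, 8)), noise_offset=0.1)
```

Since SD3.5m apparently already covers the full brightness range, this extra shift just biases generations toward bright images.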

Moving on, there are other challenges when training SD3.5m: it is very slow at recognizing characters, at a point where SDXL would already have been able to replicate parts of the character.

Shiroko Terror from Blue Archive

Despite this, one of the most requested features for newer models is artist tags. However, SD3.5m is also slower at learning these. So far it has been trained with roughly 80 artist tags, but they seem to have no impact on generated images at all.

Another thing often discussed when training SD3.5m is model collapse. In the beginning I tried to train it directly on a larger dataset; however, beyond a certain number of steps the model would start to generate more artifacts on my test prompt. Pushing training past that point, the model would completely lose the ability to generate images. This has nothing to do with using a wrong scheduler or the like.

It is unclear to me why this happens despite using nearly identical settings. So instead I reverted to a smaller dataset and added on top of it. This time the results were rather promising, and I can slowly see how the model adapts to anime images.
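One knob I keep tight against runaway updates is the aggressive gradient clipping in the config below (max_grad_norm = 0.01). Global-norm clipping rescales all gradients by a common factor whenever their combined L2 norm exceeds the limit; a minimal numpy sketch of the mechanism, not the trainer's actual code:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.01):
    """Global-norm gradient clipping, as max_grad_norm does in the config.

    If the combined L2 norm of all gradient tensors exceeds max_norm,
    every gradient is scaled by the same factor so the total norm equals
    max_norm. Oversized updates are capped uniformly, direction kept.
    """
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads, total_norm

# Two toy gradient tensors with a combined norm far above 0.01
clipped, norm = clip_grad_norm([np.ones((4, 4)), np.ones(8)], max_norm=0.01)
```

Whether clipping alone prevents the collapse I saw is an open question; it only bounds the step size, it does not fix whatever makes the loss landscape unstable.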

And the last problem is SD3.5 medium itself: I think the model struggles with trees and flowers.

At first I thought my training parameters caused this, but after seeing some sample images from https://huggingface.co/bghira/sd35m-sfwbooru, I think it's clear that, at least when generating trees and flowers, the model cannot create clean pixels and tends to blur the borders.

But even with all of this, I think the shortcomings are outweighed by the model's advantages. At the moment it shows better understanding than SDXL, which can be seen in better hand placement and the poses it attempts to generate.

Latest 1.1

SDXL, in comparison, despite having seen more samples and training, can't correctly position "lying on bed" and generates a lot of deformed bodies. The same shows up with other prompts such as hands behind the back, arms behind the head, etc., which I consider very basic poses for diffusion models. I also think the model might benefit from being trained at 1440x1440 directly, since SD3.5 medium natively supports it, but that's a plan for the future.

As I discover more details, I will try to write more articles in my free time to share some insights. I decided not to train T5 for now, as that would require more VRAM.

Below is the training config for anyone interested:

loss_type = "l2"
huber_c = 0.1
huber_schedule = "snr"
max_bucket_reso = 4096
min_bucket_reso = 512
max_grad_norm = 0.01
enable_scaled_pos_embed = true
pos_emb_random_crop_rate = 0.2
weighting_scheme = "flow"
learning_rate = 3.5e-6
learning_rate_te1 = 2.5e-6
learning_rate_te2 = 2.5e-6
sdpa = true
mem_eff_attn = true
Train CLIP: true, Train T5-XXL: false
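For context on weighting_scheme = "flow": SD3-family models are trained with a rectified-flow objective, where the noisy input is a straight-line interpolation between clean latents and noise, and the network regresses the constant velocity between them. A minimal numpy sketch of my understanding of that formulation (shapes and names are illustrative, not the trainer's code):

```python
import numpy as np

def flow_matching_pair(x0, noise, sigma):
    """Rectified-flow training pair as used by SD3-family models.

    x_t linearly interpolates between clean latents x0 (sigma=0) and
    pure noise (sigma=1); the regression target for the L2 loss is the
    constant velocity (noise - x0) along that straight line.
    """
    sigma = sigma.reshape(-1, 1, 1, 1)          # broadcast over (C, H, W)
    x_t = (1.0 - sigma) * x0 + sigma * noise    # noisy input at time sigma
    target = noise - x0                         # velocity target
    return x_t, target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 16, 8, 8))         # 16-channel SD3-style latents
noise = rng.standard_normal(x0.shape)
x_t, target = flow_matching_pair(x0, noise, np.array([0.0, 1.0]))
# At sigma=0 the input is the clean latent; at sigma=1 it is pure noise.
```

The "flow" weighting in the config, as I understand it, applies uniform weighting over sigma rather than the logit-normal sampling some other schemes use.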
