XLSD: SD1.5 + SDXL VAE (part 3)

Intro

Holy moley this project is dragging on! (I have so much to learn...)

Continued from https://civitai.com/articles/9551, which has gotten too long to keep editing.
But this is a bit of a restart, so you can just keep reading here if you like.

What is XLSD

This is the saga of a noob with no formal "create a base model" experience or training, attempting to do something close to that anyway: replacing the SD1.5 VAE with the better-trained one from SDXL.

The two VAEs are somewhat compatible, but the specifics of the "new" VAE effectively invalidate the entire contents of the SD1.5 unet. Which means I need to basically retrain EVERYTHING in the model just to get proper colors.
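For anyone wondering what "replacing the VAE" looks like mechanically: both VAEs are the same AutoencoderKL architecture with the same 4-channel latent space, so the swap itself is trivial. A minimal diffusers sketch (the model IDs are just illustrative; the actual training happens in OneTrainer, and this swap alone is exactly what wrecks the colors):

```python
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Load the SDXL VAE by itself; it shares the SD1.5 VAE's architecture and
# 4-channel latent space, it was just trained better.
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")

# Drop it into an SD1.5 pipeline in place of the original VAE.
# This runs, but without retraining the unet the colors come out wrong,
# which is the whole reason this finetune project exists.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative SD1.5 model ID
    vae=vae,
).to("cuda")

pipe("a photo of a cat").images[0].save("swapped-vae-test.png")
```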

This is in some ways a good thing to be doing anyway, since many things in SD1.5 sucked. Or at least were inadequately trained.

Training and usage tip

I recently discovered that, since I am using the default "aspect ratio bucketing" in OneTrainer and almost none of the training images are square, it makes the most sense to take samples in portrait mode. I am using a 3:4 aspect ratio now.
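For reference, here is roughly how a 3:4 bucket resolution falls out for a 512-base model. This is just the generic aspect-bucketing arithmetic (keep the area near 512x512, round both sides to a multiple of 64); it is not necessarily the exact bucket table OneTrainer uses:

```python
import math

def bucket_resolution(aspect_w: int, aspect_h: int, base: int = 512, step: int = 64):
    """Pick a width/height near base*base total pixels matching aspect_w:aspect_h,
    with both sides rounded to a multiple of `step` (64 is typical for SD models)."""
    target_area = base * base
    width = math.sqrt(target_area * aspect_w / aspect_h)
    height = width * aspect_h / aspect_w

    def snap(x: float) -> int:
        return max(step, int(round(x / step)) * step)

    return snap(width), snap(height)

# A 3:4 portrait sample for a 512px-base model lands around 448x576.
print(bucket_resolution(3, 4))
```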

2025/01/01 New year, new approach!.. somewhat.

Current phase 1

I am continuing from my earlier "phase 1" training, which I did at FP32 precision on part of my "CC12M-cleaned" dataset (a loosely refined version of CC12M), for 4 epochs, which took a few days. The actual subset used was around 1.5 million images, if I recall correctly.
At FP32 on a single 4090, that takes a while.

So the starting point is a model that generally has correct color, but renders everything very mushy.

Phase 2: 4mp resolution

4mp = 4 megapixels. Not to be confused with "4k" resolution, which is actually around 8mp.

A lot of the CC12M dataset is of terrible quality. It turns out, however, that if you just go with the 4mp-or-larger images, quality is much more consistent. Plus, even if the focus isn't quite right at 4mp... when it is downsized to 512px, it still looks awesome!

I went through the "cleaned" CC12M dataset mentioned above (which is actually around 8 million URLs), created a derived dataset of all the 4mp+ images, and then did some cursory additional filtering to try to remove watermarks, etc.


https://huggingface.co/datasets/opendiffusionai/cc12m-4mp
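If you just want to reproduce the size cut on images you have already downloaded, it is nothing fancier than a megapixel threshold. A rough sketch (the directory name is a placeholder, and the real pass also included the watermark filtering mentioned above):

```python
from pathlib import Path
from PIL import Image

MIN_MEGAPIXELS = 4.0  # the 4mp cutoff used for the derived dataset

def is_big_enough(path: Path) -> bool:
    # Image.open only parses the header here, so checking dimensions is cheap.
    with Image.open(path) as im:
        width, height = im.size
    return (width * height) / 1_000_000 >= MIN_MEGAPIXELS

# "cc12m-cleaned/" is a placeholder for wherever the downloaded images live.
keep = [p for p in Path("cc12m-cleaned").rglob("*.jpg") if is_big_enough(p)]
print(f"{len(keep)} images at or above {MIN_MEGAPIXELS:.0f}mp")
```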

This leaves around 160k images to work with.
According to some published formulas, that means the largest batch size I should use is 256.
Optimal batch size for AI training in general is allegedly 512, but that requires a minimum of 256K images in your training dataset to get the best quality results (and more would be better).
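(The rule of thumb here is roughly batch size ≈ sqrt(dataset size), rounded down to a power of two; at least, that is the version that matches both numbers above. A quick sketch of that arithmetic, as a reconstruction rather than a citation:)

```python
import math

def max_batch_size(num_images: int) -> int:
    # Rule of thumb only: batch size ~ sqrt(dataset size), rounded down to
    # a power of two. This is a reconstruction that happens to match the
    # numbers quoted above, not a formula from a specific paper.
    return 2 ** int(math.log2(math.sqrt(num_images)))

print(max_batch_size(160_000))  # -> 256
print(max_batch_size(2**18))    # "256K" images -> 512
```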

I just kicked off a "phase 2" finetune, using DLION (an adaptive optimizer) and BF16, at an effective batch size of 256. It should take around 5 days, since there are 5000 steps per epoch, and I am getting 1.3 s/it throughput.

Cover picture

The cover picture for this article is currently a comparison of my "phase 1" model output vs. epoch 5 of the "phase 2" training currently under way.

The phase 1 output at least has correct colors (unlike the phase 0 output, which you can see in the prior article). However, it obviously has a long way to go. It may not show up well here, but all the output is fuzzy, almost like a photo converted to old-school low-res 8-bit graphics.

The phase 2 output has some very promising realism starting to show up. Which is why I have the training run set for 100 epochs.

Here's hoping the epoch100 samples are truly impressive.

See you in a few days...

2025/01/03 epoch 30 of 100

Side note: I bumped the step count in OneTrainer's sample settings from 20 to 40.

2025/01/04 epoch 44 (total steps 220,000)

DLION has now adapted the LR to 7.5e-05 at this batch size of 256, whereas in the early steps it was 3e-05.

Obviously, some things are improperly shaped and balanced. But I think the lighting is really nice and proper.
You can see hints of arm hair. Plus the skin texture looks pretty durn good.
And overall, the subject looks detailed EVEN WHEN ZOOMED IN!

The benefits of training on 4mp images, even for a 512x512 model, seem to be showing through.

Things degrade for a few epochs, but then catch up a bit again at epoch 50:
(and the learning rate has now auto-adjusted to 8.3e-05!)

2025/01/10 - going all-in on Square

I didn't like where the above was going. The samples above are really nice! ... but they were not representative of the average.

I wanted to train exclusively on square(ish) ratio pics to improve the general case. But... I needed more pics!
So I had to resort to some 2mp photos, and... the bulk of the results were not pleasing.


So... I gave up, and started a long phase 2 run at FP32, square-aspect only.

Sadly, this means I am no longer getting portrait views that are quite as nice.
When I do some more filtering, I hope to do more extensive training.
That being said, the square-image training IS improving the portraits over the base model. It's just difficult to pick out my specific next steps.

It would be REALLY NICE if someone with a 4090 could volunteer to run some AI filtering for me on the remaining unsorted dataset, so I could get cleaner inputs.

(I give you the scripts and the images; you just run them.)

Teaser

This was a temporary training result that I can't reproduce properly yet.

It was done by training exclusively on 1:1 ratio images, with DLION at FP32.

But as I said, I can't reproduce it. Sigh.

Trying related things.

Now that I'm using FP32 (not to mention a 400k-image dataset), updates take days...

Current status...

Epoch 9...

and I have messed up the cropping. Starting again, after tweaking the dataset definitions.
Sample of the stupidity:

Civitai broken

It isn't taking my edits.

Resummarizing: I'm trying to integrate a new dataset into the training, so I'm restarting phase 2.

Temporary model output can be played with at

https://huggingface.co/opendiffusionai/xlsd16-phase2

Raw dataset is at https://huggingface.co/datasets/opendiffusionai/laion2b-en-aesthetic-square

but I'm in the process of cleaning it up a lot. Then I'll recaption it and redo phase 2.
