XLSD: SD1.5 + SDXL VAE (part 3)

Intro

Holy moley this project is dragging on! (I have so much to learn...)

Continued from https://civitai.com/articles/9551, which has gotten too long to keep editing.
But this is a bit of a restart, so you can just keep reading here if you like.

What is XLSD

This is the saga of a noob with no formal "create a base model" experience or training, attempting to do something close to that anyway: replacing the SD1.5 VAE with the better-trained one from SDXL.

The two VAEs are somewhat compatible, but the specifics of the "new" VAE effectively invalidate the entire contents of the SD1.5 unet. Which means I need to basically retrain EVERYTHING in the model just to get proper colors.
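For anyone wondering what "replacing the VAE" looks like mechanically: both VAEs are the same AutoencoderKL architecture with the same 4-channel latent space, so the swap itself is trivial. A minimal diffusers sketch (the model IDs are just illustrative; the actual training happens in OneTrainer, and this swap alone is exactly what wrecks the colors):

```python
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Load the SDXL VAE by itself; it shares the SD1.5 VAE's architecture and
# 4-channel latent space, it was just trained better.
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")

# Drop it into an SD1.5 pipeline in place of the original VAE.
# This runs, but without retraining the unet the colors come out wrong,
# which is the whole reason this finetune project exists.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative SD1.5 model ID
    vae=vae,
).to("cuda")

pipe("a photo of a cat").images[0].save("swapped-vae-test.png")
```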

This is in some ways a good thing to be doing anyway, since many things in SD1.5 sucked. Or at least were inadequately trained.

Training and usage tip

I recently discovered that, since I am using the default "aspect ratio bucketing" in OneTrainer and almost none of the training images are square, it makes the most sense to take samples in portrait mode. I am using a 3:4 aspect ratio now.
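For reference, here is roughly how a 3:4 bucket resolution falls out for a 512-base model. This is just the generic aspect-bucketing arithmetic (keep the area near 512x512, round both sides to a multiple of 64); it is not necessarily the exact bucket table OneTrainer uses:

```python
import math

def bucket_resolution(aspect_w: int, aspect_h: int, base: int = 512, step: int = 64):
    """Pick a width/height near base*base total pixels matching aspect_w:aspect_h,
    with both sides rounded to a multiple of `step` (64 is typical for SD models)."""
    target_area = base * base
    width = math.sqrt(target_area * aspect_w / aspect_h)
    height = width * aspect_h / aspect_w

    def snap(x: float) -> int:
        return max(step, int(round(x / step)) * step)

    return snap(width), snap(height)

# A 3:4 portrait sample for a 512px-base model lands around 448x576.
print(bucket_resolution(3, 4))
```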

2025/01/01 New year, new approach!.. somewhat.

Current phase 1

I am continuing from my earlier "phase 1" training, which I did at FP32 precision on part of my "CC12M-cleaned" dataset (a loosely refined version of CC12M), for 4 epochs, which took a few days. The actual subset used was around 1.5 million images, if I recall correctly.
At FP32 on a single 4090, that takes a while.

So the starting point is a model that generally has correct color, but renders everything very mushy.

Phase 2: 4mp resolution

4mp = 4 megapixels. Not to be confused with "4k" resolution, which is actually around 8mp.

A lot of the CC12M dataset is of terrible quality. It turns out, however, that if you just go with the 4mp-or-larger images, quality is much more consistent. Plus, even if the focus isn't quite right at 4mp... when it is downsized to 512px, it still looks awesome!

I went through the "cleaned" CC12M dataset mentioned above (which is actually around 8 million URLs), created a derived dataset of all the 4mp+ images, and then did some cursory additional filtering to try to remove watermarks, etc.


https://huggingface.co/datasets/opendiffusionai/cc12m-4mp
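If you just want to reproduce the size cut on images you have already downloaded, it is nothing fancier than a megapixel threshold. A rough sketch (the directory name is a placeholder, and the real pass also included the watermark filtering mentioned above):

```python
from pathlib import Path
from PIL import Image

MIN_MEGAPIXELS = 4.0  # the 4mp cutoff used for the derived dataset

def is_big_enough(path: Path) -> bool:
    # Image.open only parses the header here, so checking dimensions is cheap.
    with Image.open(path) as im:
        width, height = im.size
    return (width * height) / 1_000_000 >= MIN_MEGAPIXELS

# "cc12m-cleaned/" is a placeholder for wherever the downloaded images live.
keep = [p for p in Path("cc12m-cleaned").rglob("*.jpg") if is_big_enough(p)]
print(f"{len(keep)} images at or above {MIN_MEGAPIXELS:.0f}mp")
```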

This leaves around 160k images to work with.
According to some published formulas, that means the largest batch size I should use is 256.
Optimal batch size for AI training in general is allegedly 512, but that requires a minimum of 256K images in your training dataset to get the best quality results (and more would be better).
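(The rule of thumb here is roughly batch size ≈ sqrt(dataset size), rounded down to a power of two; at least, that is the version that matches both numbers above. A quick sketch of that arithmetic, as a reconstruction rather than a citation:)

```python
import math

def max_batch_size(num_images: int) -> int:
    # Rule of thumb only: batch size ~ sqrt(dataset size), rounded down to
    # a power of two. This is a reconstruction that happens to match the
    # numbers quoted above, not a formula from a specific paper.
    return 2 ** int(math.log2(math.sqrt(num_images)))

print(max_batch_size(160_000))  # -> 256
print(max_batch_size(2**18))    # "256K" images -> 512
```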

I just kicked off a "phase 2" finetune, using DLION (an adaptive optimizer) and BF16, at an effective batch size of 256. It should take around 5 days, since there are 5000 steps per epoch, and I am getting 1.3 s/it throughput.

Cover picture

The cover picture for this article is currently a comparison of my "phase 1" model output vs. epoch 5 of the "phase 2" training currently under way.

The phase 1 output at least has correct colors (unlike the phase 0 output, which you can see in the prior article). However, it obviously has a long way to go. It may not show up well here, but all the output is fuzzy, almost like a photo converted to old-school low-res 8-bit graphics.

The phase 2 output has some very promising realism starting to show up. Which is why I have the training run set for 100 epochs.

Here's hoping the epoch100 samples are truly impressive.

See you in a few days...

2025/01/03 epoch 30 of 100

Side note: I bumped the step count in OneTrainer's sample settings from 20 to 40.

2025/01/04 epoch 44 (total steps 220,000)

DLION has now adapted the LR to 7.5e-05 at this batch size of 256, whereas in the early steps it was 3e-05.

Obviously, some things are improperly shaped and balanced. But I think the lighting is really nice and proper.
You can see hints of arm hair. Plus the skin texture looks pretty durn good.
And overall, the subject looks detailed EVEN WHEN ZOOMED IN!

The benefits of training on 4mp images, even for a 512x512 model, seem to be showing through.

Things degrade for a few epochs, but then catch up a bit again at epoch 50:
(and the learning rate has now auto-adjusted to 8.3e-05!)

2025/01/10 - going all-in on Square

I didn't like where the above was going. The samples above are really nice! ... but they were not representative of the average.

I wanted to train exclusively on square(ish) ratio pics to improve the general case. But... I needed more pics!
So I had to resort to some 2mp photos, and... the bulk of the results were not pleasing.


So... I gave up, and started a long phase 2 run at FP32, square-aspect only.

Sadly, this means I am no longer getting portrait views that are quite as nice.
When I do some more filtering, I hope to do more extensive training.
That being said, the square-image training IS improving the portraits over the base model. It's just difficult to pick out my specific next steps.

It would be REALLY NICE if someone with a 4090 could volunteer to run some AI filtering for me on the remaining unsorted dataset, so I could get cleaner inputs.

(I give you the scripts and the images; you just run them.)

Teaser

This was a temporary training result that I can't reproduce properly yet.

It was done by training exclusively on 1:1 ratio images, with DLION at FP32.

But as I said, I can't reproduce it. Sigh.

Trying related things.

Now that I'm using FP32 (not to mention a 400k-image dataset), updates take days...

Current status...

Epoch 9...

and I have messed up the cropping. Starting again, after tweaking the dataset definitions.
Sample of the stupidity:

Civitai broken

It isn't taking my edits.

Resummarizing: I'm trying to integrate a new dataset into the training, so I'm restarting phase 2.

Temporary model output can be played with at

https://huggingface.co/opendiffusionai/xlsd16-phase2

Raw dataset is at https://huggingface.co/datasets/opendiffusionai/laion2b-en-aesthetic-square

but I'm in the process of cleaning it up a lot. Then I'll recaption it and redo phase 2.
