XLLsd: Part 1

Is that a typo? I thought it was XLsd?

The old model was indeed called XLsd.

But I am finally moving forward with my plan to replace the Text Encoder entirely, with Long-CLIP!

So: new model, (slightly) new model name.

Welcome to XLLsd v0

XLLsd: SDXL VAE, Long-CLIP, and sd1.5 unet model

This gives us a VAE that is more capable of reproducing fine details, and a CLIP/Text Encoder that can take up to 248 tokens instead of a paltry 77!
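
As a rough sketch of what the "frankenstein" swap looks like on the diffusers side (this is NOT my actual build script, and the specific SD1.5 and SDXL-VAE repo ids are just stand-ins), the VAE half is the easy part; the Long-CLIP swap is more involved because of the 248-token positional embeddings:

import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Start from a stock SD1.5 pipeline (repo id is a stand-in for whichever
# SD1.5 checkpoint you begin with), then swap in an SDXL VAE.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float32
)
pipe.vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix")

# The SDXL VAE uses a different latent scaling factor (0.13025 vs SD1.5's
# 0.18215), which is part of why the unet has to be retrained against the
# new latent space rather than just dropped in.
print(pipe.vae.config.scaling_factor)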

With the existing CLIP, if you wanted to train lots of details, or generate an image with lots of details, you had to use tag format (woman, flower, desk, kitchen) to squeeze them all under the 77-token limit. But with the new model, there is room for natural language prompts such as "A mature woman stands in a kitchen. There is a desk in the background, with a flower on it".

In theory, anyway. Now I have to train up the model and prove or disprove the theory!
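
To make the 77-token squeeze concrete, here is a quick count using the stock CLIP-L tokenizer that sd1.5 ships with (as far as I know, Long-CLIP keeps the same token vocabulary and only raises the positional limit):

from transformers import CLIPTokenizer

# Stock SD1.5 / CLIP-L tokenizer; counts include the BOS and EOS tokens.
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

tag_prompt = "woman, flower, desk, kitchen"
nlp_prompt = ("A mature woman stands in a kitchen. There is a desk in the "
              "background, with a flower on it.")

# The natural-language version uses several times more tokens for the same
# content, which is why the 77 cap pushes you toward tag-style prompts once
# you start stacking details.
for prompt in (tag_prompt, nlp_prompt):
    n = len(tok(prompt).input_ids)
    print(f"{n:3d} tokens: {prompt}")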

Starting from scratch though? Really?

In theory, I could have put LongCLIP on my latest XLsd model. I chose not to do that, because:

  • I am a little worried about the possibility of long-term overtraining

  • In theory, the longer CLIP lets me train on more complex data, faster. Which leads to,

    • I want to provide other people a faster path to "recreate from scratch", in the spirit of good Open Source releases.

Plan overview

  1. Train up initial frankenstein version so that it renders colors reasonably

    Keep it simple: Just WD tagging, using tag shuffling

  2. Give the Text Encoder a little fine-tuning, based on the step 1 model

  3. Using a patched OneTrainer (unlocked 248-token CLIP), train with NLP captions. Enable "tag shuffling" based on SENTENCES: period separator instead of comma separator (see the shuffling sketch below).

Somewhere in the above, also try adding smarts to the model like "right/left/up/down" understanding, "full shot/medium shot/close up shot", and also lighting style, etc., if those are not already picked up by the NLP training.
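
For reference, here is roughly what sentence-level shuffling means, as a standalone sketch. OneTrainer's own implementation may differ; the idea is just swapping the separator from comma to period:

import random

def shuffle_caption(caption: str, separator: str = ".") -> str:
    """Shuffle caption chunks; use "." for sentences, "," for tags."""
    chunks = [c.strip() for c in caption.split(separator) if c.strip()]
    random.shuffle(chunks)
    return (separator + " ").join(chunks)

caption = ("A mature woman stands in a kitchen. "
           "There is a desk in the background. A flower sits on the desk.")
print(shuffle_caption(caption))                                  # sentence-level shuffle
print(shuffle_caption("woman, flower, desk, kitchen", separator=","))  # classic tag shuffle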

The challenge here is that we want to improve text understanding, but we ALSO want to improve image output quality. So at some point, we will need to do additional runs at low LR to pull in the small details.

Some people like to blindly just pick an epoch count, set a COS or linear schedule, cross their fingers, and hope things turn out all right.

Instead of that (after step 2), my plan is to use a CONST schedule, run until it either converges or stops improving, then drop the LR and repeat.
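
In plain PyTorch terms that loop is basically ReduceLROnPlateau, just done by hand across separate training runs. A minimal sketch, with a stand-in model and stand-in validation numbers:

import torch

model = torch.nn.Linear(8, 8)                       # stand-in for the unet
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3   # drop LR after 3 stalled "epochs"
)

for epoch in range(30):
    val_loss = torch.rand(1).item()                 # stand-in for a real validation pass
    scheduler.step(val_loss)                        # LR only drops when val loss plateaus
    print(epoch, optimizer.param_groups[0]["lr"])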

Untrained base

You can find the initial untrained base at https://huggingface.co/opendiffusionai/xllsd-alpha0 along with full details on how I put it together.

You probably don't want to USE that. I'm just sharing that to keep my efforts open source, with full transparency!

Upcoming alpha versions

I hope to release an "alpha1" version soon. Unfortunately, my initial plans didn't go quite as well as I had hoped. But the cover image is an early teaser.

Remember, this looks worse than the prior XLsd images because I am starting from ZERO again, whereas that model had 1 million+ training steps added to it.

Details of iffy first try

Here's what went wrong with my first run.

For my first try at XLLsd phase1, I thought I would just use a "solo woman only" dataset, with ONLY long LLaVA natural language captioning, in hopes of getting to "pretty" results faster.

Plus, the ability to use longer captions was the primary reason for the transition to Long-CLIP, so I thought I would see what happens with the super long (and in theory more accurate) LLaVA captions.

Since this is a smallish dataset of "only" 22,000 images, I trained with D-LION at b16a4 (still fp32, of course).

Initial training seemingly went really well. It got down to insanely low smooth loss values: 0.11!

But... after 30 epochs, the validation loss curve had flattened out and was even starting to trend upwards, without reaching the quality level of the prior XLsd model. What you see above is a single cherry-picked sample at epoch 27, before things seemed to get worse for 3 epochs (and then I gave up at e30).

So, my next plan is to revert to more general dataset tuning as phase 1, just to get the VAE settled down. After that, I'll try the "solo woman" dataset to compare.

There is also the big difference that previously, I was running sets twice: once with WD14 captioning, and once with moondream natural language captioning. So it would probably benefit from doing that again.

Or, at the most extreme, 3 sets: WD14, moondream, AND LLaVA?

Too bad I don't have more than the one 4090 server to try ideas out in parallel.

2025/04/18 - batchsize 64?

I knew that LION really wants batch size 64 or larger. But the largest I could ever get with D-LION was 32, and I more typically used 16 with an accumulation factor.

However, I just discovered that with bf16, AND pure "LION" rather than D-LION, I can fit b64 on my 4090.

(I might be able to fit b64 in fp32 if I had a 5090, but... that ain't happening soon)

So I'm trying b64a4, starting from scratch yet again, to see what happens.

Calling this test model Xllsd16. It's gonna be a while until I have anything to show for it, especially since I'm trying tag shuffling on this one... but I didn't realize that text embeddings take up more space than latent images, so I ran out of disk space, TWICE.
Days of training wasted :(

--

One of the multiple reasons I wasn't using bf16 is that there isn't a public version of D-LION that supports stochastic rounding, so rounding errors at bf16 supposedly get much uglier than at fp32.
But on the positive side... to my surprise, I can fit bs64 with bf16/fallback-fp32, not just bf16/bf16.
So maybe it will work out well.
Maybe it won't, but... I need to be methodical and try it out!
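
For anyone unfamiliar: stochastic rounding randomizes which way the dropped mantissa bits round, so tiny bf16 updates don't all get truncated in the same direction. A rough standalone sketch of the idea (NOT D-LION's actual code, and ignoring edge cases like inf/NaN):

import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    """Round fp32 -> bf16 with probability proportional to the dropped bits."""
    bits = x.float().view(torch.int32)
    # bf16 keeps the top 16 bits of the fp32 pattern; adding random noise below
    # the cut line makes values round up or down with the "right" probability.
    noise = torch.randint(0, 1 << 16, bits.shape, dtype=torch.int32, device=bits.device)
    rounded = (bits + noise) & -65536            # zero out the low 16 bits
    return rounded.view(torch.float32).to(torch.bfloat16)

x = torch.full((1_000_000,), 1.0 + 1e-4)         # a value bf16 can't represent exactly
print(x.to(torch.bfloat16).float().mean())       # plain cast: everything rounds the same way, -> ~1.0
print(stochastic_round_to_bf16(x).float().mean())  # stochastic: the mean stays near 1.0001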

Part of the reason for this is that it may be possible for me to pick up a DGX Spark in May.

While the training speed will be poor... it will be able to run fp32 batch size 64, or perhaps 128, DIRECTLY. So any fp32 training I did now would probably just get redone for that anyway.

2025/04/20 update -- dealing with loss (function)

After the transition to bf16, I figured I might take advantage of the fact that I have to recache everything to also try out tag shuffling.

Turns out this means I need to increase the "number of text variations" value in OneTrainer for it to actually be used.

THEN I find out, when I bump it from 1 -> 4, that text encoding cache entries are HUGE, and it made me run out of disk space and killed the run.

Twice.
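
For a sense of scale, a rough back-of-envelope, assuming fp32 caching, CLIP-L's 768-wide hidden states padded to 248 tokens, 4-channel 64x64 latents for 512px images, and 4 text variations per image (OneTrainer's actual cache format may differ):

text_bytes   = 248 * 768 * 4            # one cached caption variation: ~0.73 MB
latent_bytes = 4 * 64 * 64 * 4          # one cached 512px latent:      ~64 KB
images, variations = 200_000, 4

print(f"text cache:   {text_bytes * variations * images / 1e9:6.0f} GB")   # ~600 GB
print(f"latent cache: {latent_bytes * images / 1e9:6.0f} GB")              # ~13 GB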

THEN I find out that the "loss function" choice is even more important than I originally thought, so I need to compare CONST vs minSNR vs Debiased Estimation.

THEN I find out that the default gamma value of 5 for minSNR in OneTrainer is probably wrong for realistic photo training, and it will probably be better off at 1. But... I need to test that, of course, so... more restarts.

There's at least a day's worth of training for each, to do a decent comparison :(

1 epoch = 200k images @ b64 ~= 3000 steps @ 2 s/it ~= 2 hours. Target is 8 epochs, so... multiple days left to find out which one is really best for all this :-/

I'll do a post with pics when I'm all done with the training rounds.


Edit: Orrrrr... based on the debiased estimation paper at https://arxiv.org/abs/2310.08442, it should pretty much ALWAYS be better than minSNR, for the same amount of effort and the same or a shorter amount of time. So... maybe I'll just let my current debias training run go all the way out to 100 epochs, and go with that if I don't hate it.
Life is short :)

Speaking of which: Debiased Estimation ignores the gamma setting, so there's less tuning to do!
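
For reference, here is roughly how the two weightings differ, written against the usual SNR definition. This follows the common kohya-style implementations, so OneTrainer's exact code may differ:

import torch

def snr(alphas_cumprod: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    """Signal-to-noise ratio of the noised latent at each timestep."""
    a = alphas_cumprod[timesteps]
    return a / (1.0 - a)

def min_snr_weight(snr_t: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    # Min-SNR-gamma (epsilon prediction): cap SNR at gamma, then normalize,
    # so the easy low-noise timesteps stop dominating the loss.
    return torch.clamp(snr_t, max=gamma) / snr_t

def debiased_estimation_weight(snr_t: torch.Tensor) -> torch.Tensor:
    # Debiased estimation (arXiv:2310.08442): weight ~ 1/sqrt(SNR).
    # Note there is no gamma to tune here.
    return 1.0 / torch.sqrt(snr_t.clamp(max=1000.0))

# The per-sample MSE then gets multiplied by the chosen weight before .mean().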

ChatGPT-suggested refinement cycle

Here's what o3 suggests, when I told it my desired batch size and that I was using a 200k dataset with NLP captions (which I will be, eventually!):

######## shared ########
precision: bf16
gradient_checkpointing: true
stochastic_rounding: true  # not currently supported by OT, but.. good to know
################################

# -------- Phase 1 --------
train_unet: true
train_text_encoder: false
batch_size: 64              # micro
gradient_accumulation_steps: 4
learning_rate: 2e-5
save_name: ckpt_unet_only

# -------- Phase 2 --------
# reload ckpt_unet_only
train_unet: false
train_text_encoder: true
batch_size: 32              # micro
gradient_accumulation_steps: 8
learning_rate: 2.5e-7
weight_decay: 0.01
save_name: ckpt_clip_only

# -------- Phase 3 --------
# reload ckpt_clip_only
train_unet: true
train_text_encoder: false
batch_size: 64
gradient_accumulation_steps: 4
learning_rate: 1e-5
steps: 2000
save_name: final_finetune
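
One thing worth noting about that suggestion: all three phases keep the same effective batch size, since 64 x 4 = 32 x 8 = 256; only the micro-batch shrinks in phase 2, presumably to leave VRAM room for the text encoder gradients.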

2025/04/12 - Just when I thought it was safe to go back in the water

I hear on the OneTrainer training discussion channel the notion that no, bf16 is NOT optimal for lower-precision training, and that fp16 is better.

But I've heard conflicting info about this. It for sure depends on which model you're training, but even specifically for SDXL, it seems like different people have different opinions on it. Soooo... now I have to go test this OTHER line of comparison. Sighhh...
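
While I test that, the core tradeoff is at least easy to see numerically: bf16 keeps fp32's exponent range but has fewer mantissa bits, while fp16 is the other way around:

import torch

# bf16: 8 exponent bits (fp32's range), only 7 stored mantissa bits -> coarse steps
# fp16: 5 exponent bits (max ~65504),  10 stored mantissa bits      -> finer steps, easier overflow
print(torch.finfo(torch.bfloat16))
print(torch.finfo(torch.float16))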
