(sorry, the banner is NOT generated by the model! :D )
Foreword
I am hereby documenting my path through developing the "new" XLsd model, so that the knowledge may not get lost to time.
XLsd is a model that puts the SDXL VAE in front of SD1.5, as a drop-in replacement for the original VAE. (The two VAEs are the exact same architecture!)
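For anyone curious what "dropping it in" actually looks like, here is a minimal sketch, assuming the diffusers library and the usual hub repo IDs (not my training code):

```python
from diffusers import StableDiffusionPipeline, AutoencoderKL

# Assumed hub repo IDs. The SDXL VAE is the same AutoencoderKL architecture
# as the SD1.5 VAE, so it loads straight in as a replacement.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
pipe.to("cuda")

image = pipe("a photo of a cat").images[0]
```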
If you just drop it in, though, its output looks like this:
So, there is a lot of retraining that needs to be done.
This is currently not a cleaned-up "how to" guide. This is basically a history documenting my journey, in all its mess.
SD1.5 XL rework
I should probably put this in its own section eventually. But for now I'll just slap it on here.
I'm attempting to retrain SD1.5 with the XL VAE.
(Recap: this doesn't make SD go to 1024x1024 resolution. What it will hopefully do is give more consistency when handling fine details.)
This task is ALMOST, but not quite, like a "train model from scratch" endeavour.
I'm trying to use this paper as a milestone marker for things. (Especially Appendix A, page 24, Table 5)
For example, they started out with 256x256 training (at LR=2e-04!!), and only did 512x512 for the last 25% or so?
I tried doing that, but... I'm a bit skeptical of results I am seeing. So I am currently just doing 512x512 training only.
THOSE guys did the very first round of 256x256 training with ADAMW cosine, but then after that, always used ADAMW constant. (batch 2048, LR=8e-05)
Then again, they also wove in use of EMA.
Specifics on XLsd tuning
For my experiments so far, it seems like I need to do at least two separate rounds of training.
.... Documenting my test history ...
I was messing around with my cleaned-up variant of the CC12M dataset. Tried a bunch of combinations:
1. 256x256 training, which got rid of the bad colors and odd artifacts kinda fast... but then it was a 256x256 model again. I think it regressed things too much, so that when I followed up with 512px training, the result wasn't as good as I liked.
2. A buncha 512px training variants: adafactor, adamw, lion, mostly at batchsize 32 (the largest I can fit on a 4090). Either it didn't quite get nice enough for my liking, even after going through a million images of training... or I put EMA on it, and then it wasn't dropping the transition artifacts fast enough. Even at LR=8e-05.
So I finally gave up on the above, and changed to a cleaner, smaller initial dataset, AND a smarter optimizer. I recalled that unlike SDXL finetuning, I can actually fit the Prodigy optimizer into VRAM for SD1.5 training!!
Round 1 plan
Train at 512x512. Get the colors fixed, and basic shapes brought to human standards. NO EMA
The EMA problem shouldn't be too surprising: after all, EMA in one sense prevents large changes... but what we want here is exactly to make large changes.
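To make that concrete, here is a rough sketch of the usual EMA update (generic torch-style code, not my trainer's exact implementation):

```python
import torch

def ema_update(ema_params, model_params, decay=0.999):
    """One EMA step: each shadow weight only moves (1 - decay) of the way toward
    the live weight, so big single-step changes get heavily damped."""
    with torch.no_grad():
        for ema_p, p in zip(ema_params, model_params):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```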
Best results so far for round 1 training: prodigy, bf16, batch=32, no EMA.
Only fits if you enable latent caching for some reason. Then it takes around 19GB.
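For reference, "latent caching" here just means running every image through the VAE once up front and training the UNet on the stored latents, so the VAE never has to sit in VRAM during the training loop. A rough sketch, assuming a diffusers AutoencoderKL and a dataloader of already-preprocessed image tensors:

```python
import torch

@torch.no_grad()
def cache_latents(vae, dataloader, device="cuda"):
    cached = []
    for pixel_values in dataloader:                      # tensors already resized/normalized
        dist = vae.encode(pixel_values.to(device)).latent_dist
        latents = dist.sample() * vae.config.scaling_factor
        cached.append(latents.cpu())                     # keep the cache off the GPU
    return cached
```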
(Currently training on https://huggingface.co/datasets/opendiffusionai/pexels-photos-janpf since it is all ultra high quality, zero watermarks or any other junk)
The most interesting thing is that Prodigy is an adaptive optimizer. After 4000 steps, it eventually adapted to 8e-05... exactly the same LR that the paper I mentioned earlier used.
(well, technically, prodigy picked 8.65e-05)
It is unclear to me whether the LR prodigy chose for this round was primarily due to the dataset, the tagging, the FP precision, the number of steps (130k images, batch=32, epochs=5), or...
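(For anyone wondering where numbers like 8.65e-05 come from: Prodigy keeps its adapted distance estimate as "d" in the param group, and with the usual lr=1.0 setting the effective step size is roughly d * lr. A hypothetical logging helper, assuming the prodigyopt package:)

```python
def prodigy_effective_lr(optimizer):
    # Prodigy (prodigyopt) stores its adapted "d" estimate in each param group;
    # d * lr is the effective learning rate worth logging.
    group = optimizer.param_groups[0]
    return group["d"] * group["lr"]
```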
Round 2 plan
Handle more like a finetune. Add in EMA, for more "prettiness" (I hope).
Note: for THIS round, prodigy eventually settled on LR=1.81e-05
This was for a dataset of 2 million images, batchsize=32, epoch=1
XLsd redo section (where I throw away my prior plans)
Phase 1 redo justification
I have my suspicions that I messed up phase 1 on multiple levels:
Using a bad WD14 tag set
Having my "fallback train type" set to bf16 when it should have maybe been float32?
So, I'm redoing it with Internlm7 NL tags, plus going overboard and training with full float32 (even though saving as bf16)
(PS: float32 didn't work out well first round. It would be too slow to figure out what values work right, so I'm going back to bf16.)
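For clarity, "training in full float32 but saving as bf16" just amounts to keeping the weights in fp32 for the optimizer updates and casting once at checkpoint time. A sketch, assuming diffusers and hypothetical local paths:

```python
import torch
from diffusers import UNet2DConditionModel

# Hypothetical paths. Keep the training copy in float32; cast only the saved file.
unet = UNet2DConditionModel.from_pretrained(
    "path/to/xlsd-checkpoint", subfolder="unet", torch_dtype=torch.float32
)
# ... training loop runs on the fp32 weights ...
unet.to(torch.bfloat16).save_pretrained("xlsd-phase1/unet")
```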
Interesting things of note:
I can just barely squeeze batch=8 onto my 4090. That takes a 24-hour 1-epoch run down to an estimated 7 hours. Oops, no I can't. It crashed after 170 steps. Back to batch=4.
Which starts at VRAM=22G, but rises to 23.1G around 200 steps in. Not too bad though: only a 9-hour run.
My smaller batchsize tests resulted in prodigy using a lower LR when everything else was the same. So it is indeed doing the "larger batch, compensate with a larger LR" thing after all.
But... on second rounds, it somehow forgets how to adjust LR. :(
Going to try EMA. The normal training gets rid of the bad colors fairly fast, but... distorts the human figures horribly. It takes a long time to work its way back to being close to normal. Hopefully using EMA will minimize the distortion.
The longer-term LR for the first training set, 130k images at batch=4, is LR=2.81e-05
(or is it?)
LR values (for Phase 1)
(currently 20 warmup steps)
Lion, b32, LRS=Batch, LR=3e-05, EMA100: TOO BIG!!
Lion, b32, LRS=(anything), LR=2e-05 (effective 1.13e-04), EMA100: usable... until 700 steps. Then it goes all grey.
Lion, b32, LRS=both, LR=1e-05 (effective 5.66e-05), EMA100: usable... pending full results. (See the note on "effective" LR just after this list.)
(It looks like it may converge too early. Need to do a prior run first?)
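(The "effective" numbers above look like the trainer scaling the base LR by the square root of the batch size. That is my reading, not something I have verified in the trainer's source, but the arithmetic lines up:)

```python
import math

base_lr = 1e-05
batch = 32
effective_lr = base_lr * math.sqrt(batch)   # ~5.66e-05, matching the Lion run above
print(f"{effective_lr:.2e}")
```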
For Prodigy, no LRS, batchsize=8. It seemed to reach peak value just after 2000 steps. (LR=7e-05)
Combining it with Cosine may be a good idea. With a learning period of 1, cosine vs constant looks like this (6.75e-05 peak):
A Redditor suggested that if I wanted to match prior papers, I might try LR=3.75e-6 for batch=32
That being said, prodigy with batchsize=32, e=4 seems to peak around 5000 steps, LR=8.48e-05
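For the cosine vs constant comparison above, the schedules in question are just the standard warmup helpers. A sketch, assuming diffusers' scheduler utilities, the 20 warmup steps mentioned earlier, and stand-in values for the optimizer and step count:

```python
import torch
from diffusers.optimization import (
    get_constant_schedule_with_warmup,
    get_cosine_schedule_with_warmup,
)

params = [torch.nn.Parameter(torch.zeros(1))]        # stand-in for unet.parameters()
optimizer = torch.optim.AdamW(params, lr=6.75e-05)   # Prodigy in the real runs
total_steps = 5000                                   # roughly where prodigy peaked above

constant_lr = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=20)
cosine_lr = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=20, num_training_steps=total_steps
)
```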
XLsd stage one quality notes
Using prodigy at batchsize=8, it took approximately 1000 steps until the horror that was the initial mashup was no longer wildly off in its colors and oddly pixellated.
It took 7000 steps until colors were reasonably "good".
Human figures at stage 1 training, however, remained pretty bad... all through 20,000 steps?
I suspect that when a model is this far out of whack, it is beneficial to keep batch size relatively small until you get closer to the target weights, because the higher step count gets you there faster.
But... waiting for confirmation on a redo with higher batch size.
I suspect that training on the same number of images with a larger batch size will give inferior results, even though prodigy will automatically raise the LR because of the larger batch size. So, I should probably do 1 run with epoch=4 (to match the number of steps) rather than 1 run with the same epoch (1).
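The arithmetic behind "epoch=4 to match number of steps", using the 130k-image figure from earlier:

```python
images = 130_000
steps_per_epoch_b8 = images // 8     # ~16,250 steps per epoch at batch 8
steps_per_epoch_b32 = images // 32   # ~4,062 steps per epoch at batch 32
epochs_to_match = round(steps_per_epoch_b8 / steps_per_epoch_b32)   # ~4 epochs at batch 32
```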
Early runs suggest batch32 converges to something non-pixellated faster... but also hits errors faster, such as the missing right lower arm at only 500 steps.
It has somewhat improved, for b32 at 5500 steps. But... not great.
Would EMA help? Does "cosine with hard restarts" reset EMA, I wonder? If so, that might help.
(edit: turns out that no, a hard restart does NOT reset the EMA curve)
For my phase1 base, I am currently going with the results of:
prodigy, b8, cosine hard restart (10 cycles)
No EMA
across approximately 160,000 images.
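For reference, a sketch of that phase-1 schedule using diffusers' helper (num_cycles=10 for the ten hard restarts, 20 warmup steps as noted earlier; the step count here is approximate and the parameter list is a stand-in for the real UNet):

```python
import torch
from prodigyopt import Prodigy
from diffusers.optimization import get_cosine_with_hard_restarts_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(1))]        # stand-in for unet.parameters()
optimizer = Prodigy(params, lr=1.0, weight_decay=0.0)
total_steps = 160_000 // 8                           # ~160k images at batch 8, one epoch

lr_scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer, num_warmup_steps=20, num_training_steps=total_steps, num_cycles=10
)
```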
XLsd stage 2 in progress...
The interesting thing is, when I fed the phase1 model into training for phase 2, prodigy started at its 1e-06 baseline.. and stayed there.
I lowered the baseline to 1e-07. It still stayed there.
This is for batch=8. I think I shall see what happens if I make batch=16, since prodigy tends to auto-scale LR based on batch size, as that is typically required.
(nope, no help)
I am now comparing prodigy to dadapt_lion. Surprisingly:
After 2000 steps with initial at 1e-08, b=8, prodigy did not adapt.
After 2000 steps with initial at 2e-08, b=8, dadapt_lion DID just start to adapt.
I am rather shocked by this, since prodigy self-adapted quite nicely in phase 1. So.. why does it stop now?
Potential cause: I discovered that I had weight decay set to 0.01 in the optimizer... probably for some other one... but it held over when I switched to prodigy. When I hit "restore defaults" for dadapt_lion, it changed the value to 0.0.
So at some point, I might try prodigy again.
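If/when I do retry prodigy here, the sketch below is what I would change, assuming the prodigyopt package: explicitly zero the weight decay that carried over, and set d0 (the initial distance estimate, i.e. the "1e-06 baseline" above) by hand. The safeguard_warmup flag is optional and my own addition, not something the earlier runs used.

```python
import torch
from prodigyopt import Prodigy

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for unet.parameters()
optimizer = Prodigy(
    params,
    lr=1.0,                  # let Prodigy pick the step size
    weight_decay=0.0,        # the 0.01 that held over is the suspect here
    d0=1e-6,                 # the baseline it was refusing to grow from
    safeguard_warmup=True,   # optional; sometimes suggested for diffusion finetunes
)
```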
Random update...
Holy smokes, it's actually generating somewhat human faces in samples now! (sometimes)
This is around... 500k images in?