Looooong time no see!
I'm bdsqlsz, and I implemented DAdaptation V2, V3, and Prodigy in kohya's sd-scripts.
Of course this is only minor work, so first of all, thanks to the authors of these optimizers. Without their papers and algorithms it would have been impossible to bring them into LoRA training.
You have probably heard of DAdaptation already, and someone has written a tutorial on lazy training with DAdaptation.
BUT... the situation has changed! The new optimizer puts an end to everything.
That is Prodigy!
What is it?
Well, let's start at the beginning.
Back in early May, I tried to use DAdaptation to train DyLoRA, but since the two adaptive algorithms apparently conflict, it predictably failed.
Prodigy: An Expeditiously Adaptive Parameter-Free Learner
OK, my issue was completely solved.
As you can see, unlike DAdaptation, Prodigy automatically adjusts the learning rate after each gradient-accumulation backward pass.
Simply put, the learning rate stays adaptive throughout training.
You no longer need to tune redundant hyperparameters, and that's what makes Prodigy so great.
↑ This is SDXL training. I dare say this is the best loss chart I have ever seen.
So the question arises,
How best to use Prodigy?
What many people don't know is that optimizers also have their own parameters, and with Prodigy at least a few of them have to be set to get the best results.
I generally train LoRA with a script I wrote myself, so it's completely different from the GUI version. Here I share the custom parameters I use.
$lr = "1"
$unet_lr = $lr
$text_encoder_lr = $lr
You may have seen many of these parameters when using DAdaptation, so I won't explain them again.
If you haven't, please see the DAdaptation tutorial mentioned at the top.
Briefly, here are a few new parameters.
d_coef: the coefficient used to scale the D value. The recommended range is 0.5 to 2, but I prefer to use 2 by default.
If you use a constant or warmup scheduler, you need to turn it on.
d0: the starting value of the learning rate. Some special LyCORIS types need a larger initial learning rate, such as IA3, LoKr, DyLoRA, and SDXL LoRA.
So if you use one of these special types, set a relatively high initial value so that Prodigy gets going faster.
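To make the roles of d_coef and d0 a bit more concrete, here is a simplified Python sketch of how the D value evolves, as far as I understand the reference prodigyopt implementation: D starts at d0, every step produces a candidate estimate that is multiplied by d_coef, and D never shrinks. The function name `track_d` and the raw estimate numbers are made up for illustration, and the sketch assumes the default unbounded growth rate. It is not the real optimizer code.

```python
def track_d(raw_estimates, d0=1e-6, d_coef=1.0):
    """Return the value of D after each step (simplified illustration).

    raw_estimates: hypothetical per-step estimates, invented for this
    example; the real optimizer derives them from gradients and weights.
    """
    d = d0
    history = []
    for raw in raw_estimates:
        d_hat = d_coef * raw   # d_coef scales the candidate estimate
        d = max(d, d_hat)      # D only grows, it never decreases
        history.append(d)
    return history


raw = [1e-5, 5e-5, 2e-5, 1e-4]                # made-up numbers
print(track_d(raw, d0=1e-6, d_coef=1.0))      # baseline
print(track_d(raw, d0=1e-6, d_coef=2.0))      # every candidate is doubled
print(track_d(raw, d0=1e-4, d_coef=1.0))      # larger d0: D starts high right away
```

With a larger d0 the early steps already run at a usable step size instead of waiting for D to grow, which matches the advice above for network types that need a larger initial learning rate.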
Finally, force the three learning rates to be equal, otherwise DAdaptation and Prodigy will go wrong. In my own tests the final adaptive result was exactly the same regardless of the learning rate value, so simply setting it to 1 is fine.
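As a rough sketch of how these settings could be passed to sd-scripts, the Python helper below builds the command-line arguments: the same value for all three learning rates, plus d_coef and an optional d0 under `--optimizer_args`. The helper name `build_prodigy_args` is hypothetical, and the d0 value in the second call is only a placeholder, not a recommendation.

```python
def build_prodigy_args(lr="1", d_coef=2.0, d0=None):
    """Build sd-scripts command-line arguments for Prodigy.

    The helper and its defaults are illustrative, not part of sd-scripts.
    """
    args = [
        "--optimizer_type=Prodigy",
        f"--learning_rate={lr}",      # all three learning rates are forced equal
        f"--unet_lr={lr}",
        f"--text_encoder_lr={lr}",
    ]
    optimizer_args = [f"d_coef={d_coef}"]
    if d0 is not None:
        # Only override d0 for network types that need a higher starting value.
        optimizer_args.append(f"d0={d0}")
    args.append("--optimizer_args")
    args.extend(optimizer_args)
    return args


print(" ".join(build_prodigy_args()))
print(" ".join(build_prodigy_args(d0=1e-4)))   # placeholder d0 value
```

The returned list can be appended to whatever launch command you already use for your training script.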
These are the results for SDXL LoRA training ↓
Batch size: 4
VRAM usage: 77 GB
Oh, I almost forgot to mention that I am using an H100 80G, the best graphics card in the world.
A 4090 can use the same settings, but with batch size = 1,
which uses almost 13 GB of VRAM.
If you use gradient_checkpointing and cache_text_encoder_outputs, it may use even less VRAM.
Don't use dropout or max norm regularization with Prodigy.
The two adaptive algorithms apparently conflict and give worse results.
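If you assemble your arguments in a script, a small guard can catch this combination before training starts. The Python sketch below assumes that "dropout" here means the sd-scripts `--network_dropout` option and that max norm regularization means `--scale_weight_norms`. That mapping is my reading, and the function name `find_prodigy_conflicts` is hypothetical.

```python
def find_prodigy_conflicts(args):
    """Return the flags in args that are discouraged together with Prodigy.

    Assumption: "dropout" maps to --network_dropout and max norm
    regularization maps to --scale_weight_norms.
    """
    discouraged = ("--network_dropout", "--scale_weight_norms")
    uses_prodigy = any(a.lower() == "--optimizer_type=prodigy" for a in args)
    if not uses_prodigy:
        return []
    # Compare only the flag name, ignoring any "=value" part.
    return [a for a in args if a.split("=", 1)[0] in discouraged]


example = ["--optimizer_type=Prodigy", "--network_dropout=0.1", "--unet_lr=1"]
for flag in find_prodigy_conflicts(example):
    print(f"warning: {flag} is not recommended with Prodigy")
```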
If you are interested in the training script (command line tool) that I wrote myself, you can leave a comment here. If more people are interested, I will consider translating it into English.