I've been experimenting with different optimizer setups for training diffusion models over the last few days.
The original motivation was pretty simple: reduce optimizer memory usage enough to fit larger diffusion models on a single GPU.
The Problem with Existing Optimizers
While working on this, I ran into specific limitations with the usual suspects:
AdamW8bit: Works great, but the optimizer state memory is still a huge bottleneck for larger models.
Lion8bit: Memory-efficient, but still not light enough for my purposes. In my experience with small-batch setups, training often becomes unstable and the model eventually diverges.
Standard Adafactor: Very memory efficient, but the existing implementations (PyTorch / HuggingFace) have rigid scheduling behavior. PyTorch couples the internal LR schedule to the training step count with no option to disable it. HuggingFace allows the LR to be decoupled, but the step-dependent second-moment decay (beta2) schedule still cannot be turned off. In long continuous training or fine-tuning, beta2 creeps toward 1, which "blunts" the second-moment estimate and weakens the optimizer's ability to adapt to new gradient distributions.
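To make the "blunting" concrete: the Adafactor paper schedules the second-moment decay as beta2_t = 1 - t^(-c), with c = 0.8 as the commonly used exponent. The sketch below is plain Python for illustration only, not code from any of the libraries above, and the function names are made up. It shows how the averaging window of the second-moment estimate keeps growing with the step count.

```python
def adafactor_beta2(step: int, decay_exponent: float = 0.8) -> float:
    """Step-dependent second-moment decay: beta2_t = 1 - t^(-c)."""
    return 1.0 - step ** (-decay_exponent)


def effective_window(beta2: float) -> float:
    """Rough number of recent steps an EMA with this decay averages over."""
    return 1.0 / (1.0 - beta2)


if __name__ == "__main__":
    for step in (10, 1_000, 100_000):
        b2 = adafactor_beta2(step)
        window = effective_window(b2)
        print(f"step {step:>7}: beta2 = {b2:.6f}, window ~ {window:,.0f} steps")
```

Under this schedule the window is about 250 steps at step 1,000 and about 10,000 steps at step 100,000, so late in a long run the estimate reacts slowly when the gradient distribution shifts. A fixed beta2 keeps the window constant instead.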
The Solution
To explore these issues further, I built an 8-bit Adafactor variant with:
Fully configurable scheduling and optional fixed beta2 behavior.
8-bit optimizer state representation.
CUDA-fused kernel implementation for performance.
Optional APOLLO-style low-rank subspace projection with a Fira limiter to speed up convergence and stabilize updates.
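For readers new to 8-bit optimizer states, here is a generic sketch of block-wise absmax quantization: each block of values is stored as int8 codes plus one float scale. This is an assumption-laden illustration of the general idea, not the quantization scheme this package or bitsandbytes actually uses, and the function names and block size are hypothetical.

```python
def quantize_absmax(values, block_size=4):
    """Quantize floats to int8-range codes with one float scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Scale so the largest magnitude in the block maps to 127.
        # An all-zero block falls back to a scale of 1.0.
        scale = max(abs(v) for v in block) / 127.0 or 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_absmax(codes, scales, block_size=4):
    """Recover approximate floats from codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    state = [0.5, -1.0, 0.25, 0.0, 100.0, 3.0, -50.0, 7.0]
    codes, scales = quantize_absmax(state)
    print(codes)
    print(dequantize_absmax(codes, scales))
```

Per-block scales keep one large value from wiping out the precision of everything else in the tensor, which is the main reason block-wise schemes are popular for optimizer state.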
Current Status & Experience
At the moment, this has largely become my default optimizer for diffusion model training, since it's much lighter than AdamW8bit and more stable than Lion8bit in my case.
A few interesting observations from my recent runs:
Faster Convergence: Enabling APOLLO-style projection seems to improve convergence speed and generalization compared to the standard Adafactor path.
Gradient Spikes: Switching projection subspaces occasionally introduces small gradient spikes, but the built-in Fira limiter on the APOLLO path handles them well, which lets me relax the external gradient clipping.
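A rough sketch of the idea behind a norm-growth limiter of the kind described for Fira: if the update norm grows by more than a factor gamma over the previous step, the update is rescaled so the growth is capped at gamma. The function name, the list-based gradient, and gamma = 1.01 are assumptions for illustration; this is not taken from the package's source.

```python
import math


def limit_norm_growth(grad, prev_norm, gamma=1.01):
    """Cap step-to-step growth of the gradient norm at a factor of gamma.

    Returns the (possibly rescaled) gradient and the norm to remember
    for the next step.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if prev_norm is not None and prev_norm > 0.0 and norm > gamma * prev_norm:
        scale = gamma * prev_norm / norm
        grad = [g * scale for g in grad]
        norm = gamma * prev_norm
    return grad, norm


if __name__ == "__main__":
    prev = None
    for step_grad in ([3.0, 4.0], [30.0, 40.0], [3.0, 4.0]):
        limited, prev = limit_norm_growth(step_grad, prev)
        print(limited, prev)
```

Unlike a fixed clipping threshold, the cap here is relative to the previous step, so it adapts as the typical gradient scale drifts during training.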
About the Memory Usage
Here are the optimizer checkpoint sizes from my setup (not a full benchmark yet, but it gives a good idea):
~1.7B Diffusion UNet:
AdamW8bit (bitsandbytes): ~3.22 GB
Adafactor (FP32): ~734 MB
Adafactor8Bit: ~188 MB (roughly a 17× reduction in optimizer state vs. AdamW8bit).
Text Encoder (Custom CLIP, mostly transformer weights):
AdamW8bit: ~1.04 GB
Adafactor (FP32): ~2.48 MB
Adafactor8Bit: ~1.75 MB (Standard Adafactor is already extremely memory efficient here, so quantization gains are smaller).
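The gap between the two models is easier to see by counting state entries per tensor. The sketch below assumes factoring over the last two dimensions (one row statistic and one column statistic), which is how the HuggingFace implementation lays out its factored state. The helper names and example shapes are hypothetical, and this is only one possible reason convolution-heavy UNets benefit less from factoring than transformer weights.

```python
from math import prod


def full_second_moment(shape):
    """Entries needed for an unfactored, per-parameter second moment."""
    return prod(shape)


def factored_second_moment(shape):
    """Entries needed when the last two dimensions are factored.

    Tensors with fewer than two dimensions cannot be factored and keep
    one entry per parameter.
    """
    if len(shape) < 2:
        return prod(shape)
    row_stats = prod(shape[:-1])               # reduce over the last dim
    col_stats = prod(shape[:-2]) * shape[-1]   # reduce over the second-to-last dim
    return row_stats + col_stats


if __name__ == "__main__":
    for name, shape in (("linear 4096x4096", (4096, 4096)),
                        ("conv 320x320x3x3", (320, 320, 3, 3))):
        full = full_second_moment(shape)
        fact = factored_second_moment(shape)
        print(f"{name}: full = {full:,}, factored = {fact:,}, ratio = {full / fact:.1f}x")
```

A large square linear layer shrinks by three orders of magnitude, while a 3×3 convolution kernel factored this way only shrinks by about 1.5×, which leaves much more room for 8-bit quantization to help.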
Getting Started:
pip install -U adafactor8bit
GitHub [Documentation and source code]:
8-bit Adafactor Optimizer with Fused CUDA Kernels

Feel free to try it out and let me know how it behaves in your training! If this helps your training, a GitHub Star would be hugely appreciated! :)

