
If VRAM is limiting your training, you might want to try this optimizer


Jun 15, 2026

(Updated: 3 hours ago)


I've been experimenting with different optimizer setups for training diffusion models over the last few days.

The original motivation was pretty simple: reduce optimizer memory usage enough to fit larger diffusion models on a single GPU.

The Problem with Existing Optimizers

While working on this, I ran into specific limitations with the usual suspects:

  • AdamW8bit: Works great, but the optimizer state memory is still a huge bottleneck for larger models.

  • Lion8bit: Lighter on memory, but still not light enough for my needs. In my experience with small-batch setups, training also tends to become unstable and eventually blows up the model.

  • Standard Adafactor: Very memory efficient, but the existing implementations (PyTorch / HuggingFace) have rigid scheduling behavior. PyTorch couples the internal LR schedule to the training step with no option to disable it. HuggingFace allows the LR to be decoupled, but the step-dependent second-moment decay still cannot be turned off. Because that decay (beta2) keeps moving toward 1 as the step count grows, long continuous training or fine-tuning runs see a "blunting" of the second-moment estimate: it reacts more and more slowly, which weakens the optimizer's ability to adapt to new gradient distributions.
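To make the "blunting" effect concrete, here is a minimal sketch of the step-dependent decay that standard Adafactor uses, beta2_t = 1 - t^(-0.8), where -0.8 is the usual default exponent. The helper names (`adafactor_beta2`, `effective_window`) are made up for illustration, and the "window" is only the usual 1 / (1 - beta2) rule of thumb for an exponential moving average, not something any library exposes.

```python
import math


def adafactor_beta2(step: int, decay_rate: float = -0.8) -> float:
    """Step-dependent second-moment decay: beta2_t = 1 - step**decay_rate."""
    return 1.0 - math.pow(step, decay_rate)


def effective_window(beta2: float) -> float:
    """Rule-of-thumb number of recent steps an EMA with this beta2 averages over."""
    return 1.0 / (1.0 - beta2)


# The later in training, the closer beta2 gets to 1 and the longer the
# averaging window becomes, so new gradient statistics are absorbed slowly.
for step in (1, 10, 1_000, 100_000):
    b2 = adafactor_beta2(step)
    print(f"step={step:>7}  beta2={b2:.6f}  window~{effective_window(b2):,.0f} steps")

# A fixed beta2 keeps the window constant no matter how long training runs.
print(f"fixed beta2=0.999 -> window~{effective_window(0.999):,.0f} steps")
```

With the schedule, the averaging window keeps growing with the step count, while a fixed beta2 holds it constant, which is the behavior a long fine-tuning run usually wants.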

The Solution

To explore these issues further, I built an 8-bit Adafactor variant with:

  • Fully configurable scheduling and optional fixed beta2 behavior.

  • 8-bit optimizer state representation.

  • CUDA-fused kernel implementation for performance.

  • Optional APOLLO-style low-rank subspace projection with a Fira limiter to speed up convergence and stabilize updates.
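For readers new to 8-bit optimizer states, the general idea can be sketched with plain block-wise absmax quantization. This is only an assumption-laden illustration in pure Python: the function names and the tiny `block_size` are hypothetical, and the actual package may use a different codebook, block size, or layout inside its fused CUDA kernels.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to int8-range codes, with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Per-block scale; fall back to 1.0 for an all-zero block.
        scale = max(abs(v) for v in block) or 1.0
        scales.append(scale)
        codes.extend(round(v / scale * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map codes back to floats using the scale of the block they belong to."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical optimizer state: one block of tiny values, one block of large ones.
state = [0.001, 0.002, -0.0015, 0.004, 10.0, 8.0, -3.0, 0.5]
codes, scales = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales)
for original, approx in zip(state, restored):
    print(f"{original:>8.4f} -> {approx:>8.4f}")
```

Each value then costs one byte plus a small per-block scale. Keeping the scale per block rather than per tensor is what lets the tiny values in the first block survive next to the large values in the second.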

Current Status & Experience

At the moment, this has largely become my default optimizer for diffusion model training, since it's much lighter than AdamW8bit and more stable than Lion8bit in my case.

A few interesting observations from my recent runs:

  • Faster Convergence: Enabling APOLLO-style projection seems to improve convergence speed and generalization compared to the standard Adafactor path.

  • Gradient Spikes: Switching projection subspaces occasionally introduces small gradient spikes, but the built-in Fira limiter on the APOLLO path handles them well, so I can relax the external gradient clipping.

About the Memory Usage

Here are the optimizer checkpoint sizes from my setup (not a full benchmark yet, but it gives a good idea):

~1.7B Diffusion UNet:

  • AdamW8bit (bitsandbytes): ~3.22 GB

  • Adafactor (FP32): ~734 MB

  • Adafactor8Bit: ~188 MB (approx. 17× smaller optimizer state than AdamW8bit).
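The ratios behind these UNet numbers can be checked with a few lines of arithmetic. The only assumption is the unit conversion (1 GB = 1024 MB here); using 1000 instead shifts the result only slightly.

```python
MB_PER_GB = 1024  # assumption: binary units

# Checkpoint sizes quoted above for the ~1.7B UNet, in MB.
sizes_mb = {
    "AdamW8bit": 3.22 * MB_PER_GB,
    "Adafactor (FP32)": 734.0,
    "Adafactor8Bit": 188.0,
}

baseline = sizes_mb["Adafactor8Bit"]
for name, size in sizes_mb.items():
    print(f"{name:<18} {size:8.0f} MB  {size / baseline:5.1f}x the Adafactor8Bit size")
```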

Text Encoder (Custom CLIP, mostly transformer weights):

  • AdamW8bit: ~1.04 GB

  • Adafactor (FP32): ~2.48 MB

  • Adafactor8Bit: ~1.75 MB (Standard Adafactor is already extremely memory efficient here, so quantization gains are smaller).
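The large gap on transformer weights comes from how Adafactor stores its second moment. For a 2-D weight it keeps one row vector and one column vector instead of a full matrix, and by default it keeps no first moment at all. A rough element count, using a hypothetical 1024 × 4096 linear layer (the shape is an assumption, not taken from the model above):

```python
def adam_state_elems(rows: int, cols: int) -> int:
    """Adam-style state: full first and second moment, two values per weight."""
    return 2 * rows * cols


def factored_state_elems(rows: int, cols: int) -> int:
    """Factored Adafactor state: one row statistic plus one column statistic."""
    return rows + cols


rows, cols = 1024, 4096  # hypothetical layer shape, for illustration only
adam = adam_state_elems(rows, cols)
factored = factored_state_elems(rows, cols)
print(f"Adam-style state:    {adam:,} elements")
print(f"Factored Adafactor:  {factored:,} elements")
print(f"Ratio:               {adam / factored:,.0f}x")
```

With so few elements left per matrix, there is little for 8-bit quantization to shrink, which matches the small gain in the text encoder numbers.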

Getting Started:

pip install -U adafactor8bit

GitHub [Documentation and source code]:

8-bit Adafactor Optimizer with Fused CUDA Kernels

[Image: Unet_Tenserboard.png (TensorBoard screenshot from the UNet training run)]

Feel free to try it out and let me know how it behaves in your training! If this helps your training, a GitHub Star would be hugely appreciated! :)
