If VRAM is limiting your training, you might want to try this optimizer

I've been experimenting with different optimizer setups for training diffusion models over the last few days.

The original motivation was simple: reduce optimizer memory usage enough to fit larger diffusion models on a single GPU.

The Problem with Existing Optimizers

While working on this, I ran into specific limitations with the usual suspects:

AdamW8bit: Works great, but the optimizer state memory is still a huge bottleneck for larger models.
Lion8bit: More memory-efficient, but still not enough for my use case. In my experience with small-batch setups, training also tends to become unstable and eventually diverges.
Standard Adafactor: Very memory efficient, but the existing implementations (PyTorch / HuggingFace) have rigid scheduling behavior. PyTorch couples the internal LR schedule to the training step with no option to disable it. HuggingFace allows LR decoupling, but the step-dependent second-moment decay (beta2 growing toward 1 as training progresses) still cannot be turned off. In long continuous training or fine-tuning, the second-moment estimate therefore averages over an ever longer window and gets "blunted", weakening the optimizer's ability to adapt to new gradient distributions.
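
To make the "blunting" concrete: the Adafactor paper defines the second-moment decay as beta2_t = 1 - t^(-0.8), and as far as I know both implementations default to that exponent. The small sketch below (plain Python, helper names are mine and not part of any library) compares the approximate averaging window 1 / (1 - beta2) of that schedule with a fixed beta2.

```python
def adafactor_beta2(step: int, decay_rate: float = -0.8) -> float:
    """Step-dependent second-moment decay from the Adafactor paper: 1 - t^decay_rate."""
    return 1.0 - step ** decay_rate


def ema_window(beta2: float) -> float:
    """Approximate number of recent steps an EMA with decay beta2 averages over."""
    return 1.0 / (1.0 - beta2)


for step in (100, 10_000, 1_000_000):
    b2 = adafactor_beta2(step)
    print(f"step {step:>9,}: beta2_t = {b2:.6f}, window ~ {ema_window(b2):,.0f} steps")

print(f"fixed beta2 = 0.999: window ~ {ema_window(0.999):,.0f} steps")
```

The window of the scheduled variant keeps growing with the step count (roughly t^0.8), while a fixed beta2 keeps it constant, which is the behavior I wanted for long or continual runs.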

The Solution

To explore these issues further, I built an 8-bit Adafactor variant with:

Fully configurable scheduling and optional fixed beta2 behavior.
8-bit optimizer state representation.
CUDA-fused kernel implementation for performance.
Optional APOLLO-style low-rank subspace projection with Fira limiter to speed up convergence and stabilize updates.

Current Status & Experience

At the moment, this has largely become my default optimizer for diffusion model training, since it's much lighter than AdamW8bit and more stable than Lion8bit in my case.

A few interesting observations from my recent runs:

Faster Convergence: Enabling APOLLO-style projection seems to improve convergence speed and generalization compared to the standard Adafactor path.
Spike Handling: Switching projection subspaces occasionally introduces small gradient spikes, but the built-in Fira Limiter on the APOLLO path handles them well, which lets me relax the external gradient clipping.

About the Memory Usage

Here are the optimizer checkpoint sizes from my setup (not a full benchmark yet, but it gives a good idea):

~1.7B Diffusion UNet:

AdamW8bit (bitsandbytes): ~3.22 GB
Adafactor (FP32): ~734 MB
Adafactor8Bit: ~188 MB (approx. 17× smaller optimizer state than AdamW8bit)

Text Encoder (Custom CLIP, mostly transformer weights):

AdamW8bit: ~1.04 GB
Adafactor (FP32): ~2.48 MB
Adafactor8Bit: ~1.75 MB (Standard Adafactor is already extremely memory efficient here, so quantization gains are smaller).
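
The size of the gap on transformer weights follows from how Adafactor factors the second moment: for a 2D weight it keeps one row vector and one column vector, while Adam-style optimizers keep two full-size tensors. Here is a back-of-the-envelope count of state values (not a measurement of my implementation; the 4096×4096 shape is a made-up example and momentum is assumed to be off):

```python
from math import prod


def adam_state_elems(shape):
    """Adam-style optimizers keep two full-size moment tensors per parameter."""
    return 2 * prod(shape)


def factored_state_elems(shape):
    """Factored second moment for a 2D weight: one row vector plus one column vector."""
    rows, cols = shape
    return rows + cols


shape = (4096, 4096)  # hypothetical Linear weight, for illustration only
adam = adam_state_elems(shape)
fact = factored_state_elems(shape)

print(f"Adam-style state: {adam:,} values")
print(f"Factored state:   {fact:,} values")
print(f"Ratio:            {adam // fact:,}x")
```

With so few values left per matrix, quantizing them to 8 bits saves little in absolute terms, which matches the small difference in the text encoder numbers.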

Getting Started

pip install -U adafactor8bit

Recommended Usage (v0.2.1): Hybrid Routing & Fira Limiter

For most diffusion training setups, I currently recommend a hybrid routing strategy:

Embeddings → Momentum-free Adam style (factored=False, scale_parameter=False, d=1e9)
Using element-wise variance scaling in log-space provides fine-grained, per-token updates. This avoids the “cold-start” over-scaling issue of standard Adafactor when an embedding row is activated for the first time in a while, without the overhead of APOLLO projection. Pair this with an Adam-style learning rate (e.g., 1e-4) for best results.
2D weights (Linear layers) → APOLLO
In my experience, APOLLO generally converged faster and showed better generalization than the standard Adafactor path, while keeping memory usage similarly low.
Convolutions and other >2D tensors → Full-Rank (factored=False)
For finer gradient scaling, we can disable row/column factorization to keep the native spatial structure intact, maintaining independent variance for each spatial position in the convolution kernel.
Norms and biases → FP32 Adafactor, no weight decay
The standard recipe for stable training.
New in v0.2.1: Optional 4-bit Packed Momentum
v0.2.1 adds optional 4-bit packed first-moment (beta1) support, allowing momentum to be enabled selectively for parameter groups where it provides the most benefit, while keeping the additional optimizer memory very small. In my current configuration, I enable beta1 only for dense weight matrices (Linear / Conv), while embeddings, norms, and biases remain momentum-free.
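
For readers wondering what "4-bit packed" means in general: two 4-bit codes share one byte, so the buffer is half the size of an 8-bit one. The sketch below is a generic illustration in plain Python with a simple shared absmax scale. The function names and the quantization scheme are assumptions for the example, not the storage format used inside the package.

```python
def quantize_4bit(values):
    """Map floats to 4-bit codes in 1..15 using a shared absmax scale."""
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    codes = [round(v / scale * 7) + 8 for v in values]
    return codes, scale


def pack_nibbles(codes):
    """Store two 4-bit codes per byte (pads odd-length input with a zero code)."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]


def dequantize_4bit(codes, scale):
    """Map 4-bit codes back to approximate float values."""
    return [(c - 8) / 7 * scale for c in codes]


momentum = [0.5, -1.0, 0.25, 0.0, 0.9]  # toy first-moment values
codes, scale = quantize_4bit(momentum)
packed = pack_nibbles(codes)
restored = dequantize_4bit(unpack_nibbles(packed, len(momentum)), scale)

print(len(packed), "bytes for", len(momentum), "values")
print([round(x, 3) for x in restored])
```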

You can adjust apollo_rank based on memory budget:

0 — Disable APOLLO and use the standard Adafactor path.
16 — Default used by LLaMA-Factory.
256 — Recommended by the official APOLLO repository for 1B–7B models.

Enabling the Fira Limiter on the Adafactor paths suppresses gradient spikes, often making external clip_grad_norm_() unnecessary.
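
As I understand the Fira paper, its norm-growth limiter caps how fast the update norm may grow between consecutive steps: if the ratio exceeds a threshold gamma (1.01 in the paper), the update is rescaled down to that limit. The sketch below shows the rule on plain Python lists. The function name and the choice to carry the limited norm forward are illustrative assumptions, not the fused CUDA code in the package.

```python
import math


def limit_norm_growth(grad, prev_norm, gamma=1.01):
    """Rescale grad so its norm is at most gamma * prev_norm.

    Returns the (possibly rescaled) gradient and the norm to remember
    for the next step. prev_norm=None means there is no history yet.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if prev_norm is not None and prev_norm > 0 and norm > gamma * prev_norm:
        scale = gamma * prev_norm / norm
        grad = [g * scale for g in grad]
        norm = gamma * prev_norm
    return grad, norm


# A sudden spike: the previous norm was 1.0, the new gradient has norm 5.0
limited, new_norm = limit_norm_growth([3.0, 4.0], prev_norm=1.0)
print(limited, new_norm)

# A normal step passes through unchanged
same, same_norm = limit_norm_growth([0.6, 0.8], prev_norm=1.0)
print(same, same_norm)
```

Unlike clip_grad_norm_(), which compares against a fixed threshold, this rule is relative to the previous step, so it needs no tuning when the overall gradient scale drifts during training.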

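# Assumed import path (not shown in the original snippet); check the project README if it differs
from adafactor8bit import Adafactor8Bit

# Define learning rates
lr = 1e-3
lr_emb = 1e-4  # Embedding layers use an Adam-style learning rate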

def get_param_groups(model, lr_emb, weight_decay, apollo_rank=256):
    group_1d, group_embed, group_2d, group_nd = [], [], [], []

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        lname = name.lower()  # Match all name patterns case-insensitively
        is_1d = param.ndim <= 1 or "bias" in lname or "norm" in lname
        # Match true token embeddings, excluding position and time embeddings
        is_embedding = ("embed" in lname
                        and "position" not in lname
                        and "pos_embed" not in lname
                        and "time" not in lname)
        
        if is_1d:
            group_1d.append(param)
        elif is_embedding:
            group_embed.append(param)
        elif param.ndim == 2:
            group_2d.append(param)
        else:
            group_nd.append(param)

    return [
        # 1. 1D / Sensitive: FP32, No Weight Decay
        {"params": group_1d, "weight_decay": 0.0, "quantize": False, "apollo_rank": 0},
        
        # 2. Embeddings: Recreating a momentum-free Adam
        {
            "params": group_embed, 
            "weight_decay": 0.0, 
            "quantize": False,
            "apollo_rank": 0,
            "factored": False,         # Enable element-wise variance
            "scale_parameter": False,  # Disable internal RMS scaling
            "d": 1e9,                  # Disable global Trust-Region clipping
            "lr": lr_emb               # Override global learning rate
        },
        
        # 3. 2D Weights: 8-bit quantization, Weight Decay, APOLLO low-rank projection
        {
            "params": group_2d, 
            "weight_decay": weight_decay, 
            "quantize": True, 
            "apollo_rank": apollo_rank,
            "beta1": 0.9,              # Remove if minimizing optimizer memory is the priority.
        },
        
        # 4. >2D Weights: 8-bit quantization, Weight Decay, Full-Rank
        {
            "params": group_nd, 
            "weight_decay": weight_decay, 
            "quantize": True, 
            "apollo_rank": 0,
            "beta1": 0.9,              # Remove if minimizing optimizer memory is the priority.
            "factored": False          # Disables factorization to preserve spatial structures, enabling finer gradient scaling.
                                       # Note: This increases state memory for >2D weights, depending on your model architecture.
                                       # If VRAM is constrained, reverting to factored=True is a safe alternative.
        },
    ]

model = MyModel().cuda()  # MyModel is a placeholder for your own nn.Module
optimizer = Adafactor8Bit(
    get_param_groups(model, lr_emb=lr_emb, weight_decay=1e-2, apollo_rank=256),
    lr=lr,
    # For continual learning or when using an external LR scheduler
    relative_step=False,              # Disable internal LR scheduling
    beta2=0.999,                      # Fix the EMA window to prevent "blunting" over steps
    enable_fira_for_adafactor=True    # Enable Fira Limiter globally; external grad clipping can usually be removed
)

# With Fira enabled, torch.nn.utils.clip_grad_norm_() in your training loop can usually be removed

This is the configuration I currently use for most of my own diffusion training. If you're already using adafactor8bit>=0.2.1, feel free to use it as a starting point and adapt the routing to your own model architecture and memory budget.
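
When adapting the routing, it helps to preview where each parameter lands before building the optimizer, since the matching is purely name-based. The sketch below restates the routing rules as a standalone function (lower-casing the name for every check) and runs it on a few made-up parameter names. The names and the `route` helper are hypothetical and only meant for checking your own naming scheme.

```python
def route(name, ndim):
    """Return the group label a parameter would get under the routing rules above."""
    lname = name.lower()
    if ndim <= 1 or "bias" in lname or "norm" in lname:
        return "1d"
    is_embedding = ("embed" in lname
                    and "position" not in lname
                    and "pos_embed" not in lname
                    and "time" not in lname)
    if is_embedding:
        return "embed"
    return "2d" if ndim == 2 else "nd"


# Hypothetical (name, ndim) pairs, for illustration only
samples = [
    ("text_model.embeddings.token_embedding.weight", 2),
    ("text_model.embeddings.position_embedding.weight", 2),
    ("unet.time_embedding.linear_1.weight", 2),
    ("unet.down_blocks.0.attn.to_q.weight", 2),
    ("unet.conv_in.weight", 4),
    ("unet.conv_in.bias", 1),
    ("unet.mid_block.norm1.weight", 1),
]

for name, ndim in samples:
    print(f"{route(name, ndim):>5}  {name}")
```

With a real model, the same check can be run over model.named_parameters() by passing param.ndim. Note that position and time embeddings fall through to the 2D group, so they get APOLLO and weight decay unless you add a rule for them.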

GitHub [Documentation and source code]:

8-bit Adafactor Optimizer with Fused CUDA Kernels

Feel free to try it out and let me know how it behaves in your training! If this helps your training, a GitHub Star would be hugely appreciated! :)