
Training Flux: Optimal Settings for an RTX 3060

UPDATE!

Though some of this guide is still relevant, most of the details have been revisited and revised using the newest version of the Flux training scripts. The updated version of this article is Managing VRAM to Optimize Performance for Flux Training.

When the first "low" VRAM Flux trainer came out, I immediately jumped on it and started training—but training a LoRA took close to 12 hours. That's a lot of time to tie up your GPU. So I started scouring the internet, forums, and this site because I wanted to know how much more I could get out of the 12 GB of VRAM on my RTX 3060 during Flux training.

I had played with variations on batch sizes, gradient accumulation steps (gas), training resolutions, and learning rates, but I didn't know the best settings. After days of searching and seeing my questions repeated without clear responses, I felt like no one else knew, so I started experimenting. I've done training runs at every training resolution between 256 and 1024, attempting to max out batch sizes and gas to test what my GPU could take and whether or not they made a difference. If you are tired of reading already, you can see all of my results here. Knowing the values on the chart helped me cut training time down to 20-25% of what it was before. For everyone else, read on.

First, if you are not aware, a batch is the number of samples processed in a single forward and backward pass through the network; after processing a batch, the gradients are computed and the weights are updated. In really simple terms, it is the number of images you shove in to be processed together in one step. The bigger the batch, the more VRAM you need. The higher your batch size, the quicker your training will go -- but larger batches come with their own drawbacks.
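To make "one forward and backward pass per batch" concrete, here is a toy sketch (not the Flux trainer's actual code) fitting a one-parameter linear model: the gradient is averaged over every sample in the batch, and only then is a single weight update applied.

```python
# Toy illustration of one training step on a batch: a 1-D linear model
# y = w * x fit with squared error. Pure Python, no ML framework.
def train_step(w, batch, lr):
    """One forward/backward pass over the whole batch, then one update."""
    grads = []
    for x, y in batch:
        pred = w * x                      # forward pass for this sample
        grads.append(2 * (pred - y) * x)  # backward: dLoss/dw for this sample
    grad = sum(grads) / len(batch)        # average gradient over the batch
    return w - lr * grad                  # single weight update per batch

w = 0.0
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
for _ in range(200):
    w = train_step(w, batch, lr=0.05)
# w converges toward the true slope, 2.0
```

The larger the batch, the more sample gradients are held in memory at once before the average -- which is the simplified version of why batch size drives VRAM use.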

As you increase the batch size, the noise in the gradient estimates decreases because the model sees more data before each update. This means you can safely use a higher learning rate, since each update is based on a more accurate estimate of the direction the model should move in. It's a balancing act; in my testing, a batch size of 2 or 4 works well and lets you raise the learning rate.
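Two common rules of thumb for raising the learning rate with batch size (general heuristics, not something prescribed by the Flux scripts) are the linear rule and the gentler square-root rule:

```python
import math

# Heuristic LR scaling with effective batch size. The function name and
# base values here are illustrative assumptions, not trainer defaults.
def scaled_lr(base_lr, base_batch, batch_size, rule="sqrt"):
    ratio = batch_size / base_batch
    return base_lr * (ratio if rule == "linear" else math.sqrt(ratio))

# Going from batch 1 to batch 4 with a base LR of 1e-4:
# linear rule -> 4e-4, sqrt rule -> 2e-4
```

The sqrt rule is usually the safer starting point; the exact value still depends on your dataset and training focus.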

Gradient accumulation is a way to simulate a larger batch size (it loops the forward and backward passes) with lower VRAM requirements, but it also takes more time. Time-wise it's roughly a wash, but it lets you raise the learning rate the same way a bigger batch does, so it's worth doing.
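The loop looks like this in sketch form (again illustrative, reusing the toy linear model rather than the trainer's code): gradients from several small forward/backward passes are summed, and the weight is only updated after the accumulation count is reached -- which gives the same update as one large batch.

```python
# Gradient accumulation sketch: `accum` micro-batches share one update.
def train_epoch(w, samples, micro_batch, accum, lr):
    accum_grad, count = 0.0, 0
    for i in range(0, len(samples), micro_batch):
        chunk = samples[i:i + micro_batch]
        # Forward + backward on the micro-batch; keep the running *sum*
        # so the average later matches one big batch of micro_batch*accum.
        accum_grad += sum(2 * (w * x - y) * x for x, y in chunk)
        count += len(chunk)
        if (i // micro_batch + 1) % accum == 0:
            w -= lr * accum_grad / count  # one update per `accum` chunks
            accum_grad, count = 0.0, 0
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
# micro-batch 2 with 2 accumulation steps == one batch of 4
w_accum = train_epoch(0.0, samples, micro_batch=2, accum=2, lr=0.01)
w_big   = train_epoch(0.0, samples, micro_batch=4, accum=1, lr=0.01)
```

Only one micro-batch's activations live in VRAM at a time, which is why accumulation trades speed for memory.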

This is important because I wanted to know the optimal settings for my GPU. I tried to decrease my training time by testing what batch size or gas I could use at different resolutions. The linked spreadsheet shows what I found. Everything was tested with 2080 steps, but the chart shows how the various settings can decrease the number of steps needed to get the same results. In addition to getting faster results, you can also increase the learning rate up to the amount listed -- but it's best to tune that setting to your training focus.
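The back-of-the-envelope arithmetic behind the step reduction (my reading of how the settings interact, not a formula from the chart) is simply that images seen per update equals batch size times gradient accumulation steps, so covering the same amount of data takes proportionally fewer steps:

```python
# Rough equivalence by images seen per update; the function name is
# illustrative, not from the training scripts.
def equivalent_steps(baseline_steps, batch_size, grad_accum):
    images_per_step = batch_size * grad_accum
    return baseline_steps // images_per_step

# 2080 steps at batch 1 / gas 1 covers roughly the same images as
# 520 steps at batch 2 / gas 2.
```

Real results won't track this exactly -- larger effective batches change the gradient noise, not just the data coverage -- but it's a useful first estimate of the time savings.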

Note: I'm not saying you should be training at any particular resolution -- only listing what's possible with a 3060.
