
High Speed, FP16 Quality Flux.1 Dev Generation With Batched/Compiled HF Diffusers

Unsatisfied with the speeds I was getting out of ComfyUI, I modified a pure-diffusers script for mass-generating FLUX images on 24GB GPUs.

It works like this: you set a static resolution, CFG, and so on in a Python script and run it. The script will auto-download Flux, or you can modify it to use local files.
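
For reference, the setup half looks roughly like this. This is a minimal sketch, not the attached script verbatim: the repo id is the standard FLUX.1 Dev one, and the resolution/guidance/step values are placeholder assumptions you'd edit at the top of the file.

```python
# Minimal setup sketch: fixed settings live at the top of the script.
import torch
from diffusers import FluxPipeline

MODEL_ID = "black-forest-labs/FLUX.1-dev"  # auto-downloads; swap for a local path
WIDTH, HEIGHT = 1024, 1024                 # static resolution
GUIDANCE = 3.5                             # Flux's distilled-guidance knob
STEPS = 28

pipe = FluxPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
pipe.to("cuda")
```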

Once it's initialized, the script will ask for a prompt while it's "warming up." It continuously queues the next prompt before the previous generation finishes, keeping your GPU busy even while you're thinking about what to prompt next. Generated images get dumped into the script's folder.
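
The overlap works something like the sketch below (hypothetical names, not the script's exact mechanism; `pipe` and the settings come from the setup sketch above). One daemon thread blocks on `input()` while the GPU runs, and finished images go to a small thread pool so saving never stalls the next batch:

```python
# Concurrency sketch: prompt intake and image saving overlap with generation.
import queue, threading, time
from concurrent.futures import ThreadPoolExecutor

prompts = queue.Queue()
saver = ThreadPoolExecutor(max_workers=2)  # background PNG writers

def read_prompts():
    while True:
        prompts.put(input("prompt> "))  # typed while the previous batch renders

threading.Thread(target=read_prompts, daemon=True).start()

while True:
    prompt = prompts.get()  # ideally already queued during the last generation
    images = pipe(prompt, width=WIDTH, height=HEIGHT,
                  guidance_scale=GUIDANCE, num_inference_steps=STEPS).images
    for i, img in enumerate(images):
        saver.submit(img.save, f"{int(time.time())}_{i}.png")  # async save
```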

What's the trick? Hugging Face's quanto library (to quantize to INT8 with high quality), torch.compile with max-autotune (to speed up diffusers), a few speed hacks lifted from Stable Diffusion UIs like voltaML, and generating images in batches (I can manage 2-5 on my 3090, depending on the resolution). Saving images asynchronously and queuing prompts also help.
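
Sketched out, the quantize-then-compile part looks something like this (assuming the optimum-quanto import path; applied to `pipe` from the setup sketch, before the generation loop). Only the transformer, the heavy component, gets quantized and compiled here:

```python
# Speed tricks sketch: INT8 weights via quanto, then torch.compile.
import torch
from optimum.quanto import freeze, qint8, quantize

quantize(pipe.transformer, weights=qint8)  # INT8 weights, near-FP16 quality
freeze(pipe.transformer)

# max-autotune is slow to warm up but fastest per step afterwards.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

# Batched generation: several prompts per call amortizes per-step overhead.
batch = ["a red fox in snow"] * 4  # 2-5 fits in 24GB, depending on resolution
images = pipe(batch, width=WIDTH, height=HEIGHT,
              guidance_scale=GUIDANCE, num_inference_steps=STEPS).images
```

The static resolution matters here: changing width/height would trigger recompilation, which is where the flexibility cost below comes from.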

The end result is way more it/s than a regular workflow, at the cost of flexibility.

I've attached the script. Nothing fancy, just thought I'd share it. shrug

Note" I've only tested this in 24GB GPUs. Unfortunately I have not figured out how to get quanto to play nice with auto offloading, but torch.compile does bring "sustained" vram usage down to 17GB, so it might work on 16GB GPUs with Window's GPU paging.
