fp8? checkpoint? NF4? GGUF? What should I use?

Hardware config:

CPU: 12th Gen Intel(R) Core(TM) i5-12400F
Total RAM 32628 MB
Device: cuda:0 NVIDIA GeForce RTX 3060
Total VRAM 8192 MB

System config:

Platform: Windows 11
NVIDIA Driver: 560.94
Cuda compilation tools (Cuda Toolkit), release 12.6, V12.6.20
Python version: 3.11.9
pytorch version: 2.3.1+cu121

ComfyUI config:

ComfyUI Revision: 2568 [b29b3b86]
Set vram state to: NORMAL_VRAM

Model to test:

black-forest-labs/FLUX.1-dev
Kijai/flux-fp8 (fp8_e4m3fn) (fp8 diffusion)
Comfy-Org/flux1-dev (fp8 checkpoint)
lllyasviel/flux1-dev-bnb-nf4 (v1 & v2)
city96/FLUX.1-dev-gguf (Q4 & Q5)

LoRA to test:

Creators train their LoRAs with different methods, tools, and parameters, so I picked three popular LoRAs to test both speed and compatibility.

Enhancement LoRA:
https://civitai.com/models/631986/xlabs-flux-realism-lora?modelVersionId=706528
Style LoRA:
https://civitai.com/models/128568/cyberpunk-anime-style?modelVersionId=747534
Celebrity LoRA:
https://civitai.com/models/121544/natalie-portman-sdxl-flux?modelVersionId=735449

* Attached is the workflow used for testing.

flux1-dev:

↓ First execution, including model loading time and CLIP text encoding.

100%|█████████████████████████████████████████████████████████| 20/20 [02:48<00:00,  8.45s/it]
Prompt executed in 287.57 seconds

↓ Execute again without reloading the model.

100%|█████████████████████████████████████████████████████████| 20/20 [02:28<00:00,  7.42s/it]
Prompt executed in 153.39 seconds

↓ flux1-dev + xlabs-flux-realism-lora.

100%|█████████████████████████████████████████████████████████| 20/20 [02:46<00:00,  8.35s/it]
Prompt executed in 187.76 seconds

↓ flux1-dev + cyberpunk-anime-style.

100%|█████████████████████████████████████████████████████████| 20/20 [03:17<00:00,  9.87s/it]
Prompt executed in 208.24 seconds

↓ flux1-dev + natalie-portman-sdxl-flux.

100%|█████████████████████████████████████████████████████████| 20/20 [03:18<00:00,  9.92s/it]
Prompt executed in 239.02 seconds

flux1-dev-fp8 diffusion model:

↓ First execution, including model loading time and CLIP text encoding.

100%|█████████████████████████████████████████████████████████| 20/20 [01:57<00:00,  5.88s/it]
Prompt executed in 186.65 seconds

↓ Execute again without reloading the model.

100%|█████████████████████████████████████████████████████████| 20/20 [02:00<00:00,  6.02s/it]
Prompt executed in 123.75 seconds

↓ flux1-dev-fp8 diffusion + xlabs-flux-realism-lora.

100%|█████████████████████████████████████████████████████████| 20/20 [01:59<00:00,  5.98s/it]
Prompt executed in 134.85 seconds

↓ flux1-dev-fp8 diffusion + cyberpunk-anime-style.

100%|█████████████████████████████████████████████████████████| 20/20 [02:23<00:00,  7.18s/it]
Prompt executed in 169.46 seconds

↓ flux1-dev-fp8 diffusion + natalie-portman-sdxl-flux.

100%|█████████████████████████████████████████████████████████| 20/20 [07:52<00:00, 23.60s/it]
Prompt executed in 498.94 seconds

flux1-dev-fp8 checkpoint model:

↓ First execution, including model loading time and CLIP text encoding.

100%|█████████████████████████████████████████████████████████| 20/20 [02:03<00:00,  6.17s/it]
Prompt executed in 195.66 seconds

↓ Execute again without reloading the model.

100%|█████████████████████████████████████████████████████████| 20/20 [01:58<00:00,  5.91s/it]
Prompt executed in 122.51 seconds

↓ flux1-dev-fp8 checkpoint model + xlabs-flux-realism-lora.

100%|█████████████████████████████████████████████████████████| 20/20 [02:07<00:00,  6.39s/it]
Prompt executed in 144.82 seconds

↓ flux1-dev-fp8 checkpoint model + cyberpunk-anime-style.

100%|█████████████████████████████████████████████████████████| 20/20 [04:12<00:00, 12.63s/it]
Prompt executed in 280.27 seconds

↓ flux1-dev-fp8 checkpoint model + natalie-portman-sdxl-flux.

100%|█████████████████████████████████████████████████████████| 20/20 [18:06<00:00, 54.35s/it]
Prompt executed in 1111.82 seconds

flux1-dev-bnb-nf4:

↓ First execution, including model loading time and CLIP text encoding.

100%|█████████████████████████████████████████████████████████| 20/20 [01:37<00:00,  4.89s/it]
Prompt executed in 342.23 seconds

↓ Execute again without reloading the model.

100%|█████████████████████████████████████████████████████████| 20/20 [01:37<00:00,  4.88s/it]
Prompt executed in 103.47 seconds

↓ flux1-dev-bnb-nf4 + xlabs-flux-realism-lora:

!!! Exception during processing !!! .to() does not accept copy argument

↓ flux1-dev-bnb-nf4 + cyberpunk-anime-style:

!!! Exception during processing !!! .to() does not accept copy argument

↓ flux1-dev-bnb-nf4 + natalie-portman-sdxl-flux:

!!! Exception during processing !!! .to() does not accept copy argument

flux1-dev-bnb-nf4-v2:

↓ First execution, including model loading time and CLIP text encoding.

100%|█████████████████████████████████████████████████████████| 20/20 [09:25<00:00, 28.27s/it]
Prompt executed in 611.85 seconds

↓ Execute again without reloading the model.

100%|█████████████████████████████████████████████████████████| 20/20 [05:41<00:00, 17.06s/it]
Prompt executed in 348.14 seconds

↓ flux1-dev-bnb-nf4-v2 + xlabs-flux-realism-lora:

!!! Exception during processing !!! .to() does not accept copy argument

↓ flux1-dev-bnb-nf4-v2 + cyberpunk-anime-style:

!!! Exception during processing !!! .to() does not accept copy argument

↓ flux1-dev-bnb-nf4-v2 + natalie-portman-sdxl-flux:

!!! Exception during processing !!! .to() does not accept copy argument

flux1-dev-Q4_0:

↓ First execution, including model loading time and CLIP text encoding.

100%|█████████████████████████████████████████████████████████| 20/20 [02:04<00:00,  6.22s/it]
Prompt executed in 157.69 seconds

↓ Execute again without reloading the model.

100%|█████████████████████████████████████████████████████████| 20/20 [02:03<00:00,  6.15s/it]
Prompt executed in 127.29 seconds

↓ flux1-dev-Q4_0 + xlabs-flux-realism-lora.

100%|█████████████████████████████████████████████████████████| 20/20 [02:05<00:00,  6.29s/it]
Prompt executed in 137.08 seconds

↓ flux1-dev-Q4_0 + cyberpunk-anime-style.

100%|█████████████████████████████████████████████████████████| 20/20 [02:46<00:00,  8.33s/it]
Prompt executed in 176.57 seconds

↓ flux1-dev-Q4_0 + natalie-portman-sdxl-flux.

100%|█████████████████████████████████████████████████████████| 20/20 [03:05<00:00,  9.27s/it]
Prompt executed in 194.95 seconds

flux1-dev-Q5_0:

↓ First execution, including model loading time and CLIP text encoding.

100%|█████████████████████████████████████████████████████████| 20/20 [02:45<00:00,  8.30s/it]
Prompt executed in 196.78 seconds

↓ Execute again without reloading the model.

100%|█████████████████████████████████████████████████████████| 20/20 [02:37<00:00,  7.87s/it]
Prompt executed in 162.54 seconds

↓ flux1-dev-Q5_0 + xlabs-flux-realism-lora.

100%|█████████████████████████████████████████████████████████| 20/20 [02:49<00:00,  8.47s/it]
Prompt executed in 182.82 seconds

↓ flux1-dev-Q5_0 + cyberpunk-anime-style.

100%|█████████████████████████████████████████████████████████| 20/20 [06:54<00:00, 20.73s/it]
Prompt executed in 425.56 seconds

↓ flux1-dev-Q5_0 + natalie-portman-sdxl-flux.

100%|████████████████████████████████████████████████████████| 20/20 [06:37<00:00, 19.86s/it]
Prompt executed in 407.55 seconds
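The two log lines quoted for every test (the progress bar and "Prompt executed in ...") can be turned into numbers with a small parser. This is only a sketch for reading logs in the shape shown above; the function name `parse_run` and the dictionary keys are my own, not part of ComfyUI, and it assumes the elapsed time is printed as MM:SS.

```python
import re

# Patterns for the two log lines quoted in each test above.
STEP_RE = re.compile(r"(\d+)/(\d+) \[(\d+):(\d+)<[\d:]+,\s*([\d.]+)s/it\]")
TOTAL_RE = re.compile(r"Prompt executed in ([\d.]+) seconds")


def parse_run(log_text):
    """Return step count, sampling seconds, s/it and total seconds for one run."""
    step = STEP_RE.search(log_text)
    total = TOTAL_RE.search(log_text)
    if step is None or total is None:
        raise ValueError("log does not contain both expected lines")
    _done, planned, minutes, seconds, sec_per_it = step.groups()
    return {
        "steps": int(planned),
        "sampling_s": int(minutes) * 60 + int(seconds),  # assumes MM:SS
        "s_per_it": float(sec_per_it),
        "total_s": float(total.group(1)),
    }


# Second execution of flux1-dev-Q4_0, copied from the log above
# (progress bar shortened).
sample = (
    "100%|██████████| 20/20 [02:03<00:00,  6.15s/it]\n"
    "Prompt executed in 127.29 seconds\n"
)
run = parse_run(sample)

# Time spent outside the sampling loop for this run.
overhead_s = run["total_s"] - run["sampling_s"]
print(run, round(overhead_s, 2))
```

The gap between the total time and the sampling time is the part of each run that the s/it figure does not capture.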

↓ Comparison of models:

↓ Models with xlabs-flux-realism-lora

↓ Models with cyberpunk-anime-style. Bad hands appear.

↓ Models with natalie-portman-sdxl-flux. Extra arms appear.

↓ The time required for each model under different circumstances.

↓ Charts for easier reading.

Fastest model for repeated generation (2nd execution, model already loaded):

flux1-dev-bnb-nf4: 103.47 seconds
fp8 diffusion, fp8 checkpoint, Q4_0: 120~129 seconds
flux1-dev: 153.39 seconds
Q5_0: 162.54 seconds
flux1-dev-bnb-nf4-v2: 348.14 seconds

Shortest loading time (1st execution - 2nd execution):

Q4_0, Q5_0: 30~50 seconds
fp8 diffusion, fp8 checkpoint: 51~100 seconds
flux1-dev: 101~200 seconds
flux1-dev-bnb-nf4, flux1-dev-bnb-nf4-v2: more than 200 seconds
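For reference, the loading-time ranges above come from subtracting the second execution from the first. A minimal sketch of that arithmetic, using the "Prompt executed in" totals from the logs (the short model labels are mine):

```python
# (first execution, second execution) total seconds, copied from the logs above.
runs = {
    "flux1-dev": (287.57, 153.39),
    "fp8 diffusion": (186.65, 123.75),
    "fp8 checkpoint": (195.66, 122.51),
    "flux1-dev-bnb-nf4": (342.23, 103.47),
    "flux1-dev-bnb-nf4-v2": (611.85, 348.14),
    "Q4_0": (157.69, 127.29),
    "Q5_0": (196.78, 162.54),
}

# Loading time as defined in this post: 1st execution minus 2nd execution.
load_times = {
    name: round(first - second, 2) for name, (first, second) in runs.items()
}

# Print from shortest to longest loading time.
for name, seconds in sorted(load_times.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.2f} s")
```

Note that this difference is only an estimate: it also absorbs any run-to-run variation in sampling speed between the two executions.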

LoRAs:

xlabs-flux-realism-lora works with every model tested except the two NF4 variants.
natalie-portman-sdxl-flux works best on flux1-dev and Q4_0.
cyberpunk-anime-style is suitable for flux1-dev, fp8 diffusion, and Q4_0.
If you use LoRAs often, flux1-dev and Q4_0 give the most consistent speed. fp8 diffusion can be much slower in some cases (498.94 seconds with natalie-portman-sdxl-flux).
The flux1-dev-fp8 checkpoint and flux1-dev-Q5_0 are generally slower when using LoRA.
The slowness of flux1-dev-Q5_0 may be caused by insufficient VRAM.
At the time of this test, ComfyUI did not support LoRA with flux1-dev-bnb-nf4 or flux1-dev-bnb-nf4-v2.
The celebrity LoRA may need to carry more detail, which could be why it consumes more resources.
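The observations above can be checked by dividing each LoRA run by the same model's second execution without a LoRA. The sketch below does that with the totals from the logs; the short labels ("realism", "cyberpunk", "portman") are mine. The LoRA runs may include LoRA loading time, so treat the ratios as rough.

```python
# Second execution without LoRA, total seconds, from the logs above.
baseline = {
    "flux1-dev": 153.39,
    "fp8 diffusion": 123.75,
    "fp8 checkpoint": 122.51,
    "Q4_0": 127.29,
    "Q5_0": 162.54,
}

# Total seconds with each LoRA, from the logs above.
with_lora = {
    "flux1-dev": {"realism": 187.76, "cyberpunk": 208.24, "portman": 239.02},
    "fp8 diffusion": {"realism": 134.85, "cyberpunk": 169.46, "portman": 498.94},
    "fp8 checkpoint": {"realism": 144.82, "cyberpunk": 280.27, "portman": 1111.82},
    "Q4_0": {"realism": 137.08, "cyberpunk": 176.57, "portman": 194.95},
    "Q5_0": {"realism": 182.82, "cyberpunk": 425.56, "portman": 407.55},
}

# Slowdown factor: time with LoRA divided by time without LoRA.
slowdown = {
    model: {lora: seconds / baseline[model] for lora, seconds in times.items()}
    for model, times in with_lora.items()
}

for model, factors in slowdown.items():
    line = ", ".join(f"{lora} x{factor:.2f}" for lora, factor in factors.items())
    print(f"{model}: {line}")
```

With these numbers, flux1-dev and Q4_0 stay below x1.6 for all three LoRAs, while the fp8 checkpoint reaches roughly x9 with the celebrity LoRA.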

ControlNet to test:

XLabs-AI/flux-controlnet-collections/flux-canny-controlnet_v2.safetensors

↓ flux1-dev, strength 0.7

Sampling: 100%|███████████████████████████████████████████████| 25/25 [05:04<00:00, 12.19s/it]
Prompt executed in 356.17 seconds

↓ flux1-dev-fp8, strength 0.7

Sampling: 100%|███████████████████████████████████████████████| 20/20 [10:59<00:00, 32.97s/it]
Prompt executed in 680.51 seconds

XLabs-AI/flux-controlnet-collections/flux-canny-controlnet.safetensors

↓ flux1-dev-fp8, strength 0.9

Sampling: 100%|███████████████████████████████████████████████| 20/20 [11:29<00:00, 34.49s/it] 
Prompt executed in 720.66 seconds

The first test uses XLabs-AI's preset workflow. I gave up on further testing with the XLabs Sampler after seeing how slow it was. I'll wait for other methods.
