This test evaluates how the NVIDIA RTX 4080 Super, with its 16GB of VRAM, handles the Flux.1 Dev and Schnell models in ComfyUI by comparing generation times in FP16 and FP8 modes. The question: is running Flux.1 in FP16 on 16GB of VRAM too slow to be practical?

Test Environment

Hardware Configuration

CPU: Intel® Core™ i7-14700K
Memory: 64GB (2 x 32GB), 6000MHz
GPU: MSI GeForce RTX™ 4080 SUPER VENTUS 3X OC 16GB GDDR6X

Software Configuration

Operating System: Windows 11
WebUI: ComfyUI
Model: Flux.1
Model Versions:
- Dev: Suitable for high-quality image generation tasks that require a higher number of iterative steps.
- Schnell: Designed for quickly generating images within 1–4 steps, ideal for scenarios where speed is a top priority.
Precision Modes: FP16 (official version) and FP8 (ComfyUI version)

ComfyUI Configuration and Prompt Example

In the ComfyUI-based Flux FP16 and FP8 workflows, only the Steps parameter was modified during testing; all other basic parameters remained unchanged.

Prompt: A cute, glowing version of SpongeBob SquarePants, designed with irresistible, large eyes. His body appears to be cracking like a molten volcano, with cracks glowing in bright, galaxy-like colors — shades of deep purple, blue, and bright orange. The glow emanates from within the cracks, giving an ethereal and mesmerizing effect, as if SpongeBob is infused with cosmic energy. The background is dark, making the glow and molten effect stand out, and small, floating star-like particles surround him, enhancing the galaxy and volcanic theme. The scene blends cuteness with a cosmic, otherworldly vibe.
Seed: 755017144359295
Image size: 1024 x 1024px
Sampler: euler
Scheduler: simple

Test Methodology

Under the specified hardware setup, I conducted performance tests using the Flux.1 Dev and Flux.1 Schnell model versions, combined with FP16 and FP8 precision modes. The generation times were measured for various step counts as follows:

Flux.1 Dev: 20, 30, 40, and 50 steps
Flux.1 Schnell: 1, 2, 3, and 4 steps

Note: Model loading time is excluded from every test. The reported generation times cover only the time spent inside the KSampler node.
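To make the methodology concrete, here is a minimal Python sketch of the test matrix and of timing only the sampling call. It is not the ComfyUI workflow I used: `build_test_matrix`, `time_sampling`, and `fake_sampler` are hypothetical names, and `fake_sampler` is a stand-in that just sleeps. The point is only that the timer starts after the model is already loaded.

```python
import time
from itertools import product

# Step counts and precision modes as listed in the methodology above.
STEPS = {"dev": [20, 30, 40, 50], "schnell": [1, 2, 3, 4]}
PRECISIONS = ["fp16", "fp8"]


def build_test_matrix():
    """Return every (model, precision, steps) combination to benchmark."""
    return [
        (model, precision, steps)
        for model, step_list in STEPS.items()
        for precision, steps in product(PRECISIONS, step_list)
    ]


def time_sampling(sample_fn, *args, **kwargs):
    """Time only the sampling call; model loading must happen before this."""
    start = time.perf_counter()
    result = sample_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_sampler(steps):
    """Hypothetical stand-in for the real sampler; it only sleeps briefly."""
    time.sleep(0.001 * steps)
    return steps


matrix = build_test_matrix()
print(len(matrix))  # 2 models x 2 precisions x 4 step counts = 16 runs
_, seconds = time_sampling(fake_sampler, 4)
print(f"sampling took {seconds:.4f} s")
```

In a real run, the stand-in would be replaced by whatever triggers sampling, with the model loaded (and ideally warmed up) beforehand so the measurement matches the "KSampler only" rule.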

Performance Data Analysis

Flux.1 Dev Model (Steps 20–50)

Average Speed Improvement (reduction in generation time vs. FP16): 38.83%
Average Time Saved: 24.51 seconds

Observations and Analysis:

Generation Time Increases Linearly with Steps: As the number of steps increases, the generation time shows a linear growth trend.
Significant Speed Boost in FP8 Mode: Compared to FP16 mode, FP8 mode cuts generation time by about 38.83% on average, with a maximum reduction of 42.12%.
More Steps, More Time Saved: At 50 steps, FP8 mode is nearly 40 seconds faster than FP16 mode.
Suitable for High-Quality Generation Tasks: The Flux.1 Dev model can produce higher-quality images at higher step counts, making it ideal for applications with high image quality requirements.
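The percentages above appear to measure the reduction in generation time relative to FP16, not a throughput multiplier. The short sketch below checks that reading against the reported figures. It assumes the 42.12% maximum was measured at 50 steps (the article does not say so explicitly), and the implied FP8 time is derived from that assumption, not a separately reported measurement. The function names are my own.

```python
def time_reduction_pct(fp16_seconds, fp8_seconds):
    """Percentage of FP16 generation time that FP8 saves."""
    return (fp16_seconds - fp8_seconds) / fp16_seconds * 100.0


def speedup_factor(reduction_pct):
    """Convert a time reduction in percent into a throughput multiplier."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


FP16_50_STEPS = 94.77      # seconds, reported for Dev at 50 steps
MAX_REDUCTION_PCT = 42.12  # reported maximum; ASSUMED to apply at 50 steps

saved = FP16_50_STEPS * MAX_REDUCTION_PCT / 100.0
fp8_implied = FP16_50_STEPS - saved  # derived, not a measured value

print(f"time saved: {saved:.2f} s")          # 39.92 s, i.e. "nearly 40 seconds"
print(f"implied FP8 time: {fp8_implied:.2f} s")
print(f"average speedup: {speedup_factor(38.83):.2f}x")
```

The result lines up with the "nearly 40 seconds" figure, which supports reading the percentages as time reductions. A 38.83% reduction in time corresponds to a throughput gain of roughly 1.6x.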

Flux.1 Schnell Model (Steps 1–4)

Average Speed Improvement (reduction in generation time vs. FP16): 37.60%
Average Time Saved: 2.07 seconds

Observations and Analysis:

Rapid Generation Capability: The Flux.1 Schnell model is designed to generate images within 1–4 steps, completing the process in a very short time.
FP8 Mode Has Significant Advantages Even at Lower Steps: Despite the low number of steps, FP8 mode still cuts generation time by about 37.60% on average.
Time Savings Increase with Steps: From 1 step to 4 steps, the time saved increases from 1.01 seconds to 3.36 seconds.
Well Suited to Fast-Turnaround Applications: The Schnell model fits scenarios that need quick responses, such as interactive generation or near-real-time image processing.
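As a rough illustration of how the Schnell savings scale with steps, the sketch below draws a straight line through the two reported endpoints (1.01 s saved at 1 step, 3.36 s at 4 steps). Only those endpoints and the 2.07 s average come from the test; the values for 2 and 3 steps are interpolated estimates, and the function name is hypothetical.

```python
REPORTED_MEAN_SAVED = 2.07  # seconds, reported average across 1-4 steps


def interpolate_savings(steps, saved_at_1=1.01, saved_at_4=3.36):
    """Linearly interpolate the time saved between the two reported endpoints."""
    slope = (saved_at_4 - saved_at_1) / (4 - 1)  # extra seconds saved per step
    return saved_at_1 + slope * (steps - 1)


estimates = [interpolate_savings(n) for n in (1, 2, 3, 4)]
linear_mean = sum(estimates) / len(estimates)

for n, est in zip((1, 2, 3, 4), estimates):
    print(f"{n} step(s): ~{est:.2f} s saved (interpolated)")
print(f"mean under a linear model: {linear_mean:.3f} s")
print(f"reported mean:             {REPORTED_MEAN_SAVED:.2f} s")
```

The linear model gives a mean slightly above the reported 2.07 s, which suggests the measured savings at 2 and 3 steps sit a little below the straight line. Each extra step adds a bit under 0.8 s of savings on this estimate.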

The images generated during the test

Conclusion

This test shows how the NVIDIA RTX 4080 Super performs with the Flux.1 model under different precision settings. The data show that FP8 mode saves a significant amount of time, cutting generation time by about 38% on average.

Key Findings

Dev Model Performance:

At 20 steps, images are relatively simple, but their complexity and richness improve significantly at 50 steps.
FP16 mode at 50 steps takes 94.77 seconds, and that excludes model loading time, which extends the total duration further.
Optimal Balance: In these tests, FP8 at 30 steps gave the best trade-off between speed and image quality, so it is the configuration I recommend for high-quality outputs.

Schnell Model Performance:

Designed for fast generation, the Schnell model exhibits negligible differences in image quality across FP16 and FP8 modes.
The time saved in FP8 mode is around 2 seconds per image, which may seem minor but adds up in workflows that need rapid or high-volume image generation.
FP8 Mode Ideal: Schnell users prioritize speed, making FP8 the clear choice for this model.
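To put the per-image savings in perspective for high-volume work, here is a small arithmetic sketch. The averages (2.07 s for Schnell, 24.51 s for Dev) are the reported figures; the batch size of 1,000 images is an arbitrary example, and using the averages assumes the batch mixes step counts the same way the test did.

```python
def batch_time_saved(images, saved_per_image_s):
    """Return total time saved for a batch, in seconds and in minutes."""
    total_seconds = images * saved_per_image_s
    return total_seconds, total_seconds / 60.0


BATCH = 1000  # hypothetical batch size, chosen only for illustration

for model, avg_saved in (("Schnell", 2.07), ("Dev", 24.51)):
    seconds, minutes = batch_time_saved(BATCH, avg_saved)
    print(f"{model}: {seconds:.0f} s saved over {BATCH} images (~{minutes:.1f} min)")
```

For Schnell this works out to about 34.5 minutes per 1,000 images; for Dev it is close to 7 hours.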

I also ran the Schnell model in FP16 and FP8 at 20–50 steps. At these higher step counts, its generation times differed from the Dev model's by only about 2 seconds.

Final Verdict

FP8 mode is designed for lower VRAM requirements, and on this 16GB card it was more efficient than FP16 in every scenario tested, with no noticeable loss of image quality. Users with limited VRAM can significantly improve performance, whether they focus on high-quality detailed outputs (Dev) or rapid generation (Schnell). This test also shows the importance of choosing the configuration that fits the specific Flux.1 model version.
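A back-of-the-envelope estimate helps explain why precision matters so much on a 16GB card. The sketch below assumes roughly 12 billion parameters for the Flux.1 transformer (the figure published by Black Forest Labs) and counts weights only. Text encoders, the VAE, activations, and any ComfyUI offloading behavior are deliberately ignored, so treat the result as a lower bound and not a measurement.

```python
GIB = 2 ** 30

PARAMS = 12e9        # ASSUMPTION: ~12B parameters for the Flux.1 transformer
VRAM_GIB = 16        # RTX 4080 Super
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_footprint_gib(params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return params * bytes_per_param / GIB


def fits_in_vram(params, bytes_per_param, vram_gib):
    """True if the weights alone fit; real usage needs extra headroom."""
    return weight_footprint_gib(params, bytes_per_param) <= vram_gib


for mode, width in BYTES_PER_PARAM.items():
    size = weight_footprint_gib(PARAMS, width)
    verdict = "fits" if fits_in_vram(PARAMS, width, VRAM_GIB) else "does not fit"
    print(f"{mode}: ~{size:.1f} GiB of weights, {verdict} in {VRAM_GIB} GiB")
```

Under these assumptions the FP16 weights alone exceed 16GB, so part of the model has to live outside VRAM, while the FP8 weights fit with some room to spare. That is consistent with the time gap measured above, although this test did not measure memory usage directly.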

Flux.1 FP16 vs FP8 Time Difference on RTX 4080 Super in ComfyUI