
Z-Image Turbo [TensorCoreFP8]


Updated: Nov 30, 2025

Type: Checkpoint Trained

Format: SafeTensor (verified)

Published: Nov 28, 2025

Base Model: ZImageTurbo

Hash (AutoV2): 576608297B

License:

Yes. 50% smaller, and 50% FASTER!

This is a quantized Z-Image Turbo that supports the latest ComfyUI features:

  • FP8 scaled: better quality than pure FP8, and 50% smaller than BF16.

  • Mixed precision: important layers are kept in BF16.

  • FP8 tensor core support: calculations run directly in FP8 instead of BF16. Much faster (up to +50% it/s) than BF16 and classic FP8 scaled models.
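For intuition, "FP8 scaled" means each tensor is stored with a per-tensor scale chosen so its values fit FP8's narrow range, then multiplied back at load or compute time. A minimal sketch of the idea (illustrative only, not ComfyUI's actual quantization code):

```python
# Per-tensor "scaled" quantization sketch (hypothetical, not ComfyUI's code).
# Pick a scale so the tensor's largest value maps to FP8's max representable
# value (448 for float8_e4m3fn), store quantized values plus the scale,
# and dequantize by multiplying the scale back in.

FP8_E4M3_MAX = 448.0  # largest finite value in float8_e4m3fn

def quantize_scaled(weights):
    """Return (quantized, scale): quantized values fit the FP8 e4m3 range."""
    scale = max(abs(w) for w in weights) / FP8_E4M3_MAX
    q = [w / scale for w in weights]  # a real kernel would also round to FP8 here
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -2.0, 3.5]
q, s = quantize_scaled(w)
print(dequantize(q, s))  # recovers the originals (up to FP8 rounding in a real kernel)
```

The scale is what "FP8 scaled" adds over pure FP8: without it, small-magnitude tensors would waste most of FP8's already-limited precision.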

ComfyUI v0.3.76 (not released yet, see below) is recommended to get the maximum optimizations.

About Z-image: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo


In short:

ComfyUI recently added many new features around FP8 scaled models. The most important one is FP8 tensor core support.

And they released a Flux.2 FP8 scaled model (named "fp8mixed", apparently from Nvidia) with those new features enabled, making Flux.2 smaller and faster.

There's no "fp8mixed" version for Z-Image so far. That's not fair! Fine, I'll do it myself.

I used the "TensorCoreFP8" suffix instead of "fp8mixed" because that's the name used in the ComfyUI code, and tensor core support matters more than mixed precision.


Mixed precision:

The early and final layers, plus some middle layers, are still BF16. That's why this model is about 1 GB larger than a classic FP8 model.
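A rough sketch of what such a layer-selection rule could look like (the layer names and block indices below are hypothetical placeholders, not the actual ones used in this model):

```python
# Illustrative mixed-precision rule (NOT ComfyUI's or this model's actual code):
# keep the input/output layers and a few sensitive middle blocks in BF16,
# quantize everything else to FP8.

BF16_PATTERNS = ("img_in", "txt_in", "final_layer")  # hypothetical layer names

def choose_dtype(name, block_index=None, bf16_blocks=(14, 15)):
    """Return the storage dtype for a layer: 'bf16' for sensitive layers, 'fp8' otherwise."""
    if any(p in name for p in BF16_PATTERNS):
        return "bf16"
    if block_index in bf16_blocks:  # hypothetical "important" middle blocks
        return "bf16"
    return "fp8"

print(choose_dtype("final_layer.linear"))    # bf16
print(choose_dtype("blocks.3.attn.qkv", 3))  # fp8
```

The BF16 layers are the ones whose quantization error would hurt image quality the most, which is why they're worth the extra ~1 GB.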


Post-training calibrated and FP8 tensor core support:

If you have a newer GPU (Nvidia: RTX 4xxx and later; AMD: gfx1200, gfx1201, gfx950):

Those GPUs have native hardware support for FP8 calculation. This model carries post-training-calibrated metadata, so ComfyUI will automatically use those fancy tensor cores and do the calculations directly in FP8 instead of BF16.
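If you want to check your NVIDIA GPU programmatically, compute capability is a reasonable proxy (assumption: FP8 tensor cores arrived with Ada Lovelace, compute capability 8.9). A small sketch, usable with the (major, minor) tuple from torch.cuda.get_device_capability():

```python
# Rough NVIDIA capability check (assumption: FP8 tensor cores start at
# compute capability 8.9, i.e. Ada Lovelace / RTX 4xxx; Hopper is 9.0).

def has_fp8_tensor_cores(compute_capability):
    """compute_capability: (major, minor), e.g. from torch.cuda.get_device_capability()."""
    return compute_capability >= (8, 9)

print(has_fp8_tensor_cores((8, 6)))  # RTX 3090 (Ampere) -> False
print(has_fp8_tensor_cores((8, 9)))  # RTX 4090 (Ada)    -> True
```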

On a 4090, compared to the BF16 model:

  • classic FP8 scaled model: -8% it/s (fp8 -> bf16 dequantization overhead)

  • classic FP8 scaled model + torch.compile: +11% it/s

  • this model: +31% it/s

  • this model + torch.compile: +60% it/s

On 5xxx GPUs it should be even faster thanks to newer tensor cores and better FP8 support. Not tested.

AMD GPUs: not tested.

Feel free to share your results in the comment section.

To use torch.compile, I recommend the torch.compile nodes from "ComfyUI-KJNodes".

However, a caveat about torch.compile: as of this writing (11/28/2025), ComfyUI v0.3.75 has a small bug and can't torch.compile FP8 models that use tensor cores. The bug has already been fixed, so either update to ComfyUI v0.3.76 once it's released and retry, or switch to the master branch for now.

If your GPU does not have FP8 tensor cores:

No worries. This model can still save you ~50% VRAM.
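The saving follows directly from the storage format: BF16 uses 2 bytes per weight while FP8 uses 1. A quick back-of-the-envelope calculation (the parameter count below is a placeholder for illustration, not an official figure):

```python
# Back-of-the-envelope size math: BF16 stores 2 bytes per weight, FP8 stores 1,
# so a pure FP8 model is ~50% smaller; the mixed-precision BF16 layers add the
# ~1 GB mentioned above back on top.

def model_size_gb(num_params, bytes_per_param):
    return num_params * bytes_per_param / 1e9

PARAMS = 6e9  # hypothetical parameter count, for illustration only
print(model_size_gb(PARAMS, 2))  # BF16: 12.0 GB
print(model_size_gb(PARAMS, 1))  # FP8:   6.0 GB
```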


FYI: This model (the way ComfyUI uses FP8 tensor cores for the linear layers) is compatible with all kinds of attention optimizations (sage attention etc.). But that's another topic.