Z+Z: Z-Image variability + Z-Image-Turbo quality/speed

Here is a ComfyUI workflow that combines the output variability of Z-Image (the undistilled model) with the generation speed and picture quality of Z-Image-Turbo (ZIT). This is done by replacing the first few ZIT steps with a couple of Z-Image steps, basically letting Z-Image provide the initial noise for ZIT to refine and finish the generation. This way you get most of the variability of Z-Image, but the image generates much faster than with a full Z-Image run (which would need 28-50 steps, per official recommendations). You also get the benefit of the additional finetuning for photorealistic output that went into ZIT, if you care about that.
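
In sampler terms, the trick is a two-pass run over a single latent: Z-Image denoises from pure noise down to some intermediate sigma, and ZIT continues from exactly that sigma. Here is a minimal Python sketch of the idea; the `model(latent, sigma)` x0-predictor interface and the bare-bones `sample` loop are simplified stand-ins, not ComfyUI's actual API:

```python
import torch

def sample(model, latent, sigmas):
    # Bare-bones Euler loop (stand-in for ComfyUI's sampling machinery):
    # steps the latent from sigmas[0] down to sigmas[-1].
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = model(latent, s_cur)                  # model's x0 prediction
        latent = latent + (s_next - s_cur) * (latent - denoised) / s_cur
    return latent

def z_plus_z(z_model, zit_model, z_sigmas, zit_tail_sigmas, shape, seed=0):
    # z_sigmas ends at the same sigma where zit_tail_sigmas begins,
    # so ZIT picks up seamlessly where Z-Image left off.
    torch.manual_seed(seed)
    latent = torch.randn(shape) * z_sigmas[0]            # start from pure noise
    latent = sample(z_model, latent, z_sigmas)           # Z-Image: composition
    return sample(zit_model, latent, zit_tail_sigmas)    # ZIT: refine and finish
```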

How to use the workflow:

  • If needed, adjust the CLIP and VAE loaders.

  • In the "Z-Image model" box, set the Z-Image (undistilled) model to load. The workflow is set up for a GGUF version, for reasons explained below. If you want to load a safetensors file instead, replace the "Unet Loader (GGUF)" node with a "Load Diffusion Model" node.

  • Likewise in the "Z-Image-Turbo model" box, set the ZIT model to load.

  • Optionally you can add LoRAs to the models. The workflow uses the convenient "Power Lora Loader" node from rgthree, but you can replace this with any Lora loader you like.

  • In the "Z+Z" widget, the number of steps is controlled as follows:

    • ZIT steps target is the number of steps that a plain ZIT run would take, normally 8 or so.

    • ZIT steps to replace is the number of initial ZIT steps that will be replaced by Z-Image steps. 1-2 is reasonable (you can go higher but it probably won't help).

    • Z-Image steps is the total number of Z-Image steps that are run to produce the initial noise. It must be at least as high as ZIT steps to replace; a reasonable upper value is 4 times ZIT steps to replace, and anything in between works (see the step-math sketch after this list).

  • width and height set the image dimensions.

  • The noise seed is controlled as usual.

  • At the top, set the positive and negative prompts. The negative prompt only affects the Z-Image phase, which ends before the image gets refined, so it probably doesn't matter much.
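
To make the step math from the "Z+Z" widget concrete, here is a tiny hypothetical Python helper (not a node from the workflow; in the workflow itself the equivalent addition is done with the "Simple Math" node):

```python
def plan_steps(zit_target: int, zit_replace: int, z_steps: int) -> dict:
    # Hypothetical helper mirroring the widget settings.
    assert 1 <= zit_replace < zit_target
    assert z_steps >= zit_replace, "Z-Image steps must be >= ZIT steps to replace"
    return {
        "z_image_steps": z_steps,                  # first pass
        "zit_steps": zit_target - zit_replace,     # second pass
        "total_steps": z_steps + zit_target - zit_replace,
    }

print(plan_steps(8, 2, 4))
# -> {'z_image_steps': 4, 'zit_steps': 6, 'total_steps': 10}
```

So an 8/2/4 run does 10 sampling steps in total, compared to 8 for plain ZIT and 28+ for plain Z-Image.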

Custom nodes required:

  • RES4LYF, for the "Sigmas Resample" node. This is essential for the workflow. The "Sigmas Preview" node is also used, but only for debugging.

  • ComfyUI-GGUF, for loading GGUF versions of the models. See note below.

  • ComfyUI_Essentials, for the "Simple Math" node. Needed to add two numbers.

  • rgthree-comfy, for the convenient "Power Lora Loader", though it can be replaced with native Lora loaders if you like.

Here is a comparison of images generated with plain ZIT (top row, 8 steps), then with Z+Z with ZIT steps to replace set to 1 (next 4 rows, where e.g. 8/1/3 means ZIT steps target = 8, ZIT steps to replace = 1, Z-Image steps = 3), and finally with plain Z-Image (bottom row, 32 steps). Prompt: "photo of an attractive middle-aged woman sitting in a cafe in tuscany", generated at 1024x1024 (but scaled down here). Average generation times are given in the labels (with an RTX 5060Ti 16GB).

[Image: zz-8-1.jpg]

As you can see, the plain ZIT run suffers from a lack of variability. The image composition is almost the same, and the person has the same face, regardless of seed. Replacing the first ZIT step with just one Z-Image step already provides much more varied image composition, though the faces still look similar. Doing more Z-Image steps increases variation of the faces as well, at the cost of generation time of course. The full Z-Image run takes much longer, and personally I feel the faces lack detail compared to ZIT and Z+Z, though perhaps this could be fixed by running it with 40-50 steps.

To increase variability even more, you can replace more than just the first ZIT step with Z-Image steps. Here's a comparison with ZIT steps to replace = 2.

[Image: zz-8-2.jpg]

I feel variability of composition and faces is on the same level as the full Z-Image output, even with Z-Image steps = 2. However, using such a low number of Z-Image steps has a side effect. This basically forces Z-Image to run with an aggressive denoising schedule, but it's not made for that. It's not a Turbo model! My vague theory is that the leftover noise that gets passed down to the ZIT phase is not quite right, and ZIT tries to make sense of it in its own way, which produces some overly complicated patterns on the person's clothing, and elevated visual noise in the background. (In a sense it acts like an "add detail" filter, though it's probably unwanted.) But this is easily fixed by upping the Z-Image steps just a bit, e.g. the 8/2/4 generations already look pretty clean again.
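
To see why a low Z-Image step count means an "aggressive" schedule: the Z-Image pass has to cover the same noise range (from pure noise down to the handover sigma) no matter how many steps it gets, so fewer steps means bigger jumps per step. A toy illustration with made-up sigma values (log-linear spacing, not the actual "simple" scheduler):

```python
import numpy as np

# Toy log-linear schedule; the real scheduler and Z-Image's sigma range
# differ, but the scaling argument is the same. sigma_max and the
# handover sigma are made-up illustrative values.
def toy_sigmas(n_steps, sigma_max=14.6, sigma_handover=2.0):
    return np.exp(np.linspace(np.log(sigma_max), np.log(sigma_handover),
                              n_steps + 1))

for n in (2, 4, 8):
    s = toy_sigmas(n)
    print(f"{n} Z-Image steps: sigma shrinks by {s[0] / s[1]:.2f}x per step")
# 2 Z-Image steps: sigma shrinks by 2.70x per step
# 4 Z-Image steps: sigma shrinks by 1.64x per step
# 8 Z-Image steps: sigma shrinks by 1.28x per step
```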

I would recommend setting ZIT steps to replace to 1 or 2, but just for the fun of it, here's what happens if you go higher. This is the outcome with ZIT steps to replace = 4.

[Image: zz-8-4.jpg]

The issue with the visual noise and overly intricate patterns is becoming very obvious now, and it takes quite a number of Z-Image steps to alleviate that. As there isn't really much added variability, this only makes sense if you like this side effect for artistic reasons. 😉

One drawback of this workflow is that it has to load the Z-Image and ZIT models in turn. If you don't have enough VRAM, this can add considerably to the image generation times. That's why the attached workflow is set up to use GGUFs: with 16GB of VRAM, both models can then mostly stay loaded on the GPU. If you have more VRAM, you can try the full BF16 models instead, which should reduce generation time somewhat - if both models can stay in VRAM.

Technical Note: It took some experimenting to get the noise schedules for the two passes to match up. The workflow is currently fixed to the Euler sampler with the "simple" scheduler; I haven't tested others. I suspect the sampler can be replaced, but changing the scheduler might break the handover between the Z-Image and ZIT passes.
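
For reference, here's my understanding of what the handover has to guarantee, sketched in Python. The log-space resample here is a generic stand-in for whatever the "Sigmas Resample" node actually computes, and the sigma values are toy numbers:

```python
import numpy as np

def split_schedules(zit_sigmas: np.ndarray, replace: int, z_steps: int):
    # zit_sigmas is the full ZIT schedule (zit_target + 1 values, ending at 0).
    handover = zit_sigmas[replace]             # sigma where ZIT takes over
    # Z-Image pass: z_steps log-spaced steps from full noise down to handover.
    z_sigmas = np.exp(np.linspace(np.log(zit_sigmas[0]), np.log(handover),
                                  z_steps + 1))
    zit_tail = zit_sigmas[replace:]            # ZIT continues from handover
    return z_sigmas, zit_tail

# Example: toy 8-step ZIT schedule, replace = 2, Z-Image steps = 4.
zit = np.array([14.6, 9.7, 6.3, 4.0, 2.4, 1.3, 0.6, 0.2, 0.0])
z_sig, zit_tail = split_schedules(zit, replace=2, z_steps=4)
assert np.isclose(z_sig[-1], zit_tail[0])      # noise levels line up at handover
```

The invariant is that the last Z-Image sigma equals the first sigma of the remaining ZIT schedule, so the leftover noise level is exactly what ZIT expects. A different scheduler changes the shape of the sigma curve, which is presumably why swapping it can break the handover.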
