
Z-Image-Turbo LoRA Training Setup – Full Precision + Adapter v2 (Massive Quality Jump)

Dec 19, 2025

After weeks of testing, hundreds of LoRAs, and one burnt PSU 😂, I've finally settled on the LoRA training setup that gives me the sharpest, most detailed, and most flexible results with Tongyi-MAI/Z-Image-Turbo.

This brings together everything from my previous posts:

  • Training at 512 pixels is overpowered and still delivers crisp 2K+ native outputs (512 refers to the bucket size, not the dataset resolution)

  • Running full precision (fp32 saves, no quantization on transformer or text encoder) eliminates hallucinations and hugely boosts quality – even at 5000+ steps

  • The ostris zimage_turbo_training_adapter_v2 is absolutely essential

Training time with 20–60 images:

  • ~15–22 min on RunPod with an RTX 5090 at $0.89/hr (you won't pay anywhere near the full hourly rate, since a run takes about 20 minutes or less)

RunPod template: “AI Toolkit - ostris - ui - official”

  • ~1 hour on an RTX 3090 (if you generate 1 sample image instead of 10 samples per 250 steps)

Key settings that made the biggest difference

  • ostris/zimage_turbo_training_adapter_v2

  • Full precision saves (dtype: fp32)

  • No quantization anywhere

  • LoRA rank/alpha 16 (linear + conv)

  • sigmoid timestep

  • Balanced content/style

  • AdamW8bit optimizer, LR 0.00025, weight decay 0.0001. Note: I'm currently testing the Prodigy optimizer as an alternative.

  • 3000 steps is the sweet spot; it can be pushed to 5000 if you're careful with the dataset and captions.

Full ai-toolkit config.yaml (copy the config file exactly for best results)
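For orientation, here is a minimal sketch of how the settings above might map onto an ai-toolkit-style config written out as YAML. The key names (network, save, train, model, etc.) follow ai-toolkit's usual layout, but treat them, and especially the adapter field, as assumptions to verify against the actual config file and the AI Toolkit UI – this is a sketch, not the exact file.

```python
# Minimal sketch: writes an ai-toolkit-style config reflecting the settings above.
# Key names are assumptions based on ai-toolkit's usual YAML layout -- verify them
# against the real config file / AI Toolkit UI before training.
import yaml  # pip install pyyaml

config = {
    "job": "extension",
    "config": {
        "name": "my_zimage_lora",  # hypothetical job name
        "process": [{
            "type": "sd_trainer",
            "training_folder": "output",
            "device": "cuda:0",
            "network": {
                "type": "lora",
                "linear": 16, "linear_alpha": 16,  # rank/alpha 16
                "conv": 16, "conv_alpha": 16,      # conv layers too
            },
            "save": {
                "dtype": "float32",          # full-precision saves
                "save_every": 250,
                "max_step_saves_to_keep": 4,
            },
            "datasets": [{
                "folder_path": "/path/to/dataset",  # replace with your folder
                "caption_ext": "txt",
                "resolution": [512],                # 512 bucket size
            }],
            "train": {
                "batch_size": 1,
                "steps": 3000,               # sweet spot; up to 5000 if careful
                "lr": 0.00025,
                "optimizer": "adamw8bit",
                "optimizer_params": {"weight_decay": 0.0001},
                "timestep_type": "sigmoid",  # sigmoid timestep sampling
                "content_or_style": "balanced",
            },
            "model": {
                "name_or_path": "Tongyi-MAI/Z-Image-Turbo",
                "quantize": False,     # no quantization on the transformer
                "quantize_te": False,  # ...or the text encoder
                # The exact key for the adapter is an assumption; point it at
                # ostris/zimage_turbo_training_adapter_v2 however your ai-toolkit
                # version exposes it (UI field or config entry).
                "assistant_lora_path": "ostris/zimage_turbo_training_adapter_v2",
            },
        }],
    },
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```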

ComfyUI workflow (use the exact settings for testing; bong_tangent also works decently)

  • flowmatch scheduler (the magic trick is here – you can also test bong_tangent)

  • RES4LYF

  • UltraFluxVAE (this is a must! It gives much better results than the regular VAE)

Pro tips

1. Always preprocess your dataset with SeedVR2 – it gets rid of hidden blur even in high-res images.

1A. SeedVR2 Nightly Workflow (please be mindful and install this in a separate ComfyUI instance, as it may cause dependency conflicts)

1B. Downscaling Python script (a simple script I created). I use it to downscale large photos that contain artifacts and blur, then upscale them with SeedVR2. For example, a 2316x3088 image with artifacts or blur is technically awkward to work with, but downscaling it to 60% and then upscaling it with SeedVR2 gives fantastic results – this works better for me than the regular resize node in ComfyUI. Note: it's a local script; you only need to replace the input and output folder paths. It handles bulk or individual resizing and finishes in a split second, even for bulk jobs. A rough sketch of this kind of script follows right after these tips.

2. Keep captions simple – don't overdo it! (See the captioning update further down.)
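For tip 1B, here's a rough sketch of what a bulk 60% downscaler can look like, using Pillow. It is not the exact script – the input and output folder paths below are placeholders you swap for your own.

```python
# Rough sketch of a bulk downscaler (not the exact script from tip 1B).
# Shrinks every image in INPUT_DIR to 60% of its size, ready for SeedVR2 upscaling.
from pathlib import Path

from PIL import Image  # pip install pillow

INPUT_DIR = Path("/path/to/input")    # replace with your source folder
OUTPUT_DIR = Path("/path/to/output")  # replace with your destination folder
SCALE = 0.6                           # 60% of the original size

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for path in INPUT_DIR.iterdir():
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    with Image.open(path) as img:
        new_size = (int(img.width * SCALE), int(img.height * SCALE))
        # LANCZOS keeps edges clean when shrinking
        img.resize(new_size, Image.LANCZOS).save(OUTPUT_DIR / path.name)
        print(f"{path.name}: {img.width}x{img.height} -> {new_size[0]}x{new_size[1]}")
```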

Previous posts for more context:

Try it out and show me what you get – excited to see your results! 🚀

PSA: this training method is guaranteed to maintain all the styles that come with the model. For example, you can literally have your character chilling at the Krusty Krab in the style of the SpongeBob show, with SpongeBob intact alongside your character, who will transform into the style of the show! Just thought to throw this out there. And no, this will not break a 6B-parameter model – and I'm talking about a LoRA at strength 1.00, too. Remember, you also have the ability to change the strength of your LoRA. Cheers!!

🚨 IMPORTANT UPDATE ⚡ Why Simple Captioning Is Essential

I’ve seen some users struggling with distorted features or “mushy” results. If your character isn’t coming out clean, you are likely over-captioning your dataset.

Z-Image handles training differently from SDXL and the other models you might be used to.

🧼 The “Clean Label” Method

My method relies on a minimalist caption.

If I am training a character who is a man, my caption is simply:

man
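If you want to apply that in bulk, a tiny helper along these lines (a sketch, not part of the original workflow) writes the same one-word caption next to every image in the dataset folder:

```python
# Sketch of a bulk "clean label" captioner: writes the same one-word caption
# as a .txt file next to every image in the dataset folder.
from pathlib import Path

DATASET_DIR = Path("/path/to/dataset")  # replace with your dataset folder
CAPTION = "man"                         # the single class word for your subject

for img in sorted(DATASET_DIR.iterdir()):
    if img.suffix.lower() in {".jpg", ".jpeg", ".png", ".webp"}:
        caption_file = img.with_suffix(".txt")
        caption_file.write_text(CAPTION, encoding="utf-8")
        print(f"wrote {caption_file.name}")
```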

🧠 Why This Works (The Science)

• The Sigmoid Factor

This training process utilizes a Sigmoid schedule with a high initial noise floor. This noise does not “settle” well when you try to cram long, descriptive prompts into the dataset.

• Avoiding Semantic Noise

Heavy captions introduce unnecessary noise into the training tokens. When the model tries to resolve that high initial noise against a wall of text, it often leads to:

  • Disfigured faces

  • Loss of fine detail

• Leveraging Latent Knowledge

You aren’t teaching the model what clothes or backgrounds are – it already knows. By keeping the caption to a single word, you focus 100% of the training energy on aligning your subject’s unique features with the model’s existing 6B-parameter intelligence.

• Style Versatility

This is how you keep the model flexible.

Because you haven’t “baked” specific descriptions into the character, you can drop them into any style, even a cartoon, and the model will adapt the character perfectly without breaking.

original post with discussion

Credits:

Tongyi-MAI – for the ABSOLUTE UNIT of a model

Ostris – for his absolute legend of a training tool and the adapter

ClownsharkBatwing – for the amazing RES4LYF samplers

erosDiffusion – for revealing the flowmatch scheduler
