Qwen-Image-2512 — Prompt Guide & Best Practices

Qwen-Image-2512 dropped in December 2025 from Alibaba Cloud's Qwen team, and after putting it through its paces across ComfyUI and Forge Neo, I can say this is genuinely one of the most interesting open-source models to come out in a while. It went through over 10,000 rounds of blind testing on LM Arena and held its own against closed-source heavyweights, which is no small thing.

What makes it stand out are three concrete improvements over previous versions: complex text rendering (including Chinese characters, which most models still butcher), realistic human faces that don't have that telltale AI sheen, and natural material textures — surfaces, fabrics, and landscapes that actually look like they exist in the physical world. That said, the model rewards good prompting. If you just throw a casual sentence at it, you're leaving a lot of quality on the table. Here's everything I've learned from extensive testing.

Core Prompt Principles

1. Structured prompts beat narrative descriptions

This was the first thing I noticed when I started testing. Writing a prompt like a sentence feels intuitive, but this model was trained on structured label data — meaning it actually processes categorized information far more accurately than flowing prose.

I now almost never write narrative prompts for this model. Instead, I break everything into labeled categories:

Narrative (what most people write):

A young woman in a white dress walking in an autumn forest, sunlight shining from behind her, creating a peaceful and ethereal atmosphere.

Structured:

Subject: young woman, professional model
Pose: walking forward, confident stride
Clothing: flowing white dress
Camera: medium shot, eye level
Environment: dense forest, autumn colors
Lighting: golden hour, backlit
Mood: serene, ethereal

The difference in results is not subtle. Subject clarity, lighting accuracy, and overall detail richness all go up noticeably — and generation is slightly faster too. For anything where I need precise control, like commercial portraits or product shots, structured prompts are non-negotiable in my workflow.

2. Lead with the subject, then environment, then details

The model weights what comes first. I learned this the hard way — I kept writing prompts that opened with the lighting or background description and wondered why the subject kept feeling secondary. Once I started consistently putting the main subject front and center, the results improved immediately.

Wrong order:

Gray background, soft studio lighting, natural skin texture, 45-year-old executive, navy blazer

Right order:

Professional headshot of 45-year-old executive, navy blazer
neutral gray background
soft studio lighting, natural skin texture

In 20 generations with the same seed, the correctly ordered prompt produced a well-composed, clear subject 95% of the time. The reversed version only managed 70%. That's a meaningful difference when you're generating in batches.

3. Keep it concise — 1 to 3 sentences is the sweet spot

More words do not mean better images. I tested this extensively: a 31-word concise prompt outperformed an 82-word version in composition accuracy and visual impact, and generated 8 seconds faster. The verbose version actually introduced ambiguity that confused the model.

My personal rule now is that if I'm writing a prompt and it starts sprawling past three sentences, I stop and compress it. You want the model to lock onto clear signals, not wade through redundant description.

Text Rendering — Where This Model Actually Impresses

Text in images has been a weak point across almost every model I've used — SDXL is notoriously bad at it, FLUX Dev is decent but inconsistent. Qwen-Image-2512 is genuinely good at it, especially for English, and surprisingly capable with Chinese characters too.

Here's what I do to get clean text results every single time:

Always wrap text in double quotes. This alone bumps spelling accuracy from about 65% to 85%. Something like:

Event poster with headline "Aurora Festival 2026" in bold sans serif
subtitle "March 15–17, Seattle" in elegant serif font

Be specific about font style. "Bold sans serif" or "italic serif" gives the model something concrete to work with. "Modern font" or "nice lettering" does not.

Put each text element on its own line and include its position — top center, bottom right, etc. For multi-block layouts like magazine covers or product packaging, this prevents the model from collapsing all the text into one area.

Simplify complex strings. Mixed numbers and special characters are the most likely to break. I've had better results simplifying "Issue #25 Jan 2026" to just "Issue 25" — cleaner, and far more stable across multiple generations.

Combining quotes with a higher CFG (around 7.0) and 50 inference steps pushes text accuracy to around 96% in my testing. For text-heavy work, that's the configuration I always reach for.

Parameter Tuning — My Actual Settings

Guidance Scale (CFG)

This controls how strictly the model follows your prompt versus exercising its own interpretation. I think of it as a creativity dial that you tune based on what you need:

Scene My Recommended CFG Creative / Artistic work 3.0 – 4.0 General photography 4.0 – 5.0 Precise subject reproduction 5.0 – 7.0 Product shots / text-heavy 7.0 – 10.0

For most of what I generate — portraits, environments, general creative work — I stay in the 4.0–5.0 range. It's the sweet spot where the model follows the prompt well but still produces images that feel alive rather than rigidly mechanical. Going above 7.0 noticeably kills the naturalness of skin and lighting unless I'm specifically after commercial precision.

In ComfyUI I use a KSampler node and adjust CFG there directly. In Forge Neo it's the same guidance scale slider you'd use with any other model — nothing unusual about the setup.

Inference Steps

20–30 steps — I use this for quick composition tests. Fast, roughly 7/10 quality. Good enough to check if a prompt is working before committing.
40–50 steps — My standard for anything I actually want to keep. Solid quality, reasonable generation time.
60+ steps — I reserve this for images going into print or high-end presentations. The quality bump from 50 to 70 steps is real but modest — maybe 5% better detail. Not worth the extra time for anything that isn't a final output.

50 steps is my default. It's the most cost-effective configuration for consistently high-quality results.

Seeds — An Underrated Tool

I use fixed seeds far more than most people seem to. Once I get a generation I like, I lock the seed and start iterating on the prompt. This lets me isolate what each change is actually doing without the composition randomly shifting on me.

It's especially useful for series work — product shots from multiple angles, portrait variants with different outfits, environment explorations with consistent lighting. Something like this:

Base (Seed: 12345):

Product photography of running shoe, side view, white background

Variation 1 (same seed):

Product photography of running shoe, front view, white background

Variation 2 (same seed):

Product photography of running shoe, top view, white background

All three come out with matching lighting, tone, and overall feel — only the angle changes. In ComfyUI I wire the seed directly into the KSampler and use a primitive node to make it easy to toggle between fixed and randomized.

Negative Prompts — Don't Skip These

Adding a solid negative prompt consistently improves results. In my testing, satisfaction rate goes from around 75% to 90% just by including a well-crafted negative. Here's what I use:

Universal baseline (I include this in almost every generation):

blurry, low quality, pixelated, distorted, watermark, text overlay, oversaturated, plastic-looking, artificial

For portraits specifically:

extra fingers, deformed hands, unnatural proportions, smooth plastic skin, over-smoothed, airbrushed

For product photography:

unrealistic reflections, fake materials, poor lighting, overexposed highlights

For text rendering:

misspelled text, garbled letters, unreadable font, overlapping characters

The hand deformation issue specifically — I've found that adding extra fingers, deformed hands, mutated hands, fused fingers to the negative prompt, combined with keeping hand poses simple in the positive prompt, brings the "normal hands" rate from about 60% to 85%. It's still not perfect, but it's workable.

Where This Model Really Shines (And Where It Doesn't)

After running 23 test cases across portrait photography, landscapes, product shots, creative compositions, text-heavy designs, and special demographics, here's my honest read on where Qwen-Image-2512 earns its place:

Portraits — This is where the 2512 improvements are most visible. Skin texture feels genuinely photographic. More importantly, the model handles age correctly — wrinkles, laugh lines, age spots, and silver hair all render with real accuracy. Previous versions consistently over-smoothed older subjects. I tested a 75-year-old male portrait and got results that would have taken significant negative prompt work to achieve on earlier models. Child portraits also maintain proper proportions, which is a common failure point elsewhere.

Text in images — Comfortably ahead of SDXL and noticeably better than FLUX Dev, especially for Chinese. Poster layouts, packaging designs, and multi-element editorial layouts all come out clean when you follow the quoting and structure rules above.

Product photography — Metal, glass, fabric, leather — material textures render with real fidelity. Diamond facets, perfume bottle refraction, watch faces showing the correct time — I've gotten commercial-grade results on the first or second generation repeatedly. For product work, CFG 7.0 tends to produce the most accurate material rendering.

Creative and surreal composition — Double exposure effects, floating objects, conceptual art — the model handles these well. I've found Guidance Scale around 6.5 gives the strongest creative energy for these types of prompts without losing coherence.

Diversity — Better than most models I've tested at accurately rendering different ethnicities and age groups without falling into stereotyping or homogenizing features. For documentary-style or inclusive brand imagery, this matters.

Where I'd reach for something else — Pure artistic creativity and maximum stylistic freedom? FLUX Dev still has an edge there. Speed-first rapid prototyping? SDXL is faster. But for anything where realism, text accuracy, or authentic human representation is the priority, Qwen-Image-2512 is my current go-to among open-source options.

Practical Style Recipes

For anyone who wants to drop these directly into their workflow:

Oil painting:

[your subject]
oil painting style, thick brush strokes, impasto texture, classical art, museum quality

Watercolor:

[your subject]
watercolor painting, soft edges, translucent colors, paper texture visible, artistic illustration

Cinematic photography:

[your subject]
shot on Canon EOS R5, 85mm f/1.4 lens, professional photography, cinematic color grading, film grain texture

Batch series consistency template:

[variable subject]
shot on medium format camera, Kodak Portra 400 film
soft natural light, golden hour
cinematic color grading, film grain texture

Lock the seed and keep the style suffix identical across all prompts. CFG and steps stay consistent too. This is how I maintain visual coherence across a series without everything looking copy-pasted.

The Five Golden Rules (TL;DR)

Structure over narrative — categorized prompts boost precision by ~30%
Keep it brief — 1–3 sentences, compress ruthlessly
Always quote your text — the single highest-impact change for text rendering
Golden config: CFG 4.5 + 50 Steps — works across almost every scenario
Use negative prompts every time — consistent ~15% satisfaction boost for minimal effort