Max quality Qwen Edit 2511 outputs: minimal workflows + lots of info

Updated: May 29, 2026

Type: Workflows

Published: May 28, 2026

Base Model: Qwen

Hash (AutoV2): A6C06840EF

Intro

Alright this has been a long time coming. I'm the dude who figured out Qwen Edit 2509 a while back, and I've been on-and-off trying to figure out the same for 2511. Results in Comfy have always been worse than the examples shown by the Qwen team, and worse than the official Qwen chat implementation online. Well, I finally cracked it and it only took 5 months lol.

Anyway, turns out Qwedit 2511 is fucking sick. IMO it particularly excels at making new shots of characters while maintaining their likeness.

As usual, I'll start off with all the setup stuff at the top and then give an explanation + tips & info below that. Also I'm gonna be calling Qwen Edit "Qwedit" most of the time.

The posted images are all raw outputs from Qwedit with no upscaling (even though I cover upscaling later in this post). They're also all done with only 20 steps instead of the 30 I'd use if I wasn't planning to upscale them. Read further for more on that too.

The ref images were all made with Z-image base (workflow here), except for the anime one which came from Anima (workflow here).

What is this

These are minimalistic workflows for Qwen Image Edit 2511 that give the highest quality outputs. All other Qwen Edit workflows are now bad, use this one until those get updated with the new info. Aside from generally improving output quality (by a LOT), these changes also enable high-res edits and have better prompt adherence.

As for why, basically ComfyUI has some serious issues with how it's implemented Qwen Edit and there aren't any workflows out there (that I've found) which have resolved them. These issues result in poor prompt adherence and low resolution/quality outputs. Thankfully the fix is fairly straightforward.

The configuration for this is 100% portable and can be migrated to existing workflows to make them better; it works by changing how the reference inputs are handled, and uses 100% native comfy nodes. Feel free to update other workflows without crediting me.

There are also a couple of (optional) related upscale workflows inside; read on for info on why those are here.

Workflows

Normal Workflows

These are separated into single / 2 image workflows. It's done this way because the setup for multi-image is complicated and I didn't want to force you to use a ton of custom nodes to make it useable all-in-one.

These do still use one custom node for quality-of-life. Minimal quality-of-life that is; I promise there's nothing unnecessary or pointless.

Dev Workflows

Linking these separately to avoid cluttering the post attachment.

These are the same as the normal workflows but without any quality-of-life nodes or 'helpful' stuff. Grab these if you want to copy the logic over to other workflows, or if you just want an easier view of how it works without any clutter.

I do not recommend using the dev workflows for actual gens because you will constantly forget to manually adjust stuff correctly.

Dev Single Image

Dev 2 Image

Models

Main Model

qwen_edit_2511_fp8

or GGUF versions

  • Important: the FP8 version of Qwedit is much higher quality than the Q8 GGUF, always use FP8 if you can. Only use the GGUFs if you need to use quants lower than Q8.

  • FP8 is 22GB, so you'll need a combined ~26GB of RAM + VRAM to run it

    • You don't need 24GB of VRAM to run it thanks to ComfyUI's blockswapping, but the less VRAM you have the slower it'll run

  • Only use Q6 & lower quants if you absolutely have to; the quality will noticeably go down

Goes in models/diffusion_models

Text Encoder

Use only the normal FP8 text encoder with Qwedit; abliterated/GGUF encoders will reduce your output quality.

qwen_2.5_vl_7b_fp8

Goes in models/text_encoders

VAE

qwen_image_vae

Goes in models/vae

Loras

You can use them as normal, just load them however you normally would. I left out lora loader nodes to avoid cluttering the workflow.

It's worth noting that many Qwen Image loras work with Qwen Edit too, but you'll need to test them individually to be sure.

Lightning Loras - BAD

All the lightning loras / distils for Qwedit (that I've tested) are shit and make your outputs look bad, so I'm not linking them here. The main issue is the same as with Klein Distilled: it makes people's skin look like plastic.

But you can technically use them. Don't do it tho. But you can if you want. But don't.

Alternative: if you want to cut your gen time down while testing prompts, just set it to 10 steps instead of 20, then go back to 20 once you're satisfied your prompt is correct. It'll still work fine, the quality just dips.

Custom Nodes

LayerStyle - A set of handy nodes that manipulate images. We're just using this for its image scaling node which allows you to scale by an image's long edge while maintaining divisibility by 16. You can skip this if you want to use a different scaling method, but you'll need to fix the workflow switch for scaling if you do.

SeedVR2 (OPTIONAL) - Only get this if you want to use the seedvr upscale workflow that's included.

How To Use

How To Use Part 1 - Basic Options

There are instructions in the workflow as well, but there's more detail here. Read parts 2 & 3 as well, they're important.

It works just like a normal Qwedit workflow, but has a couple of extra options available. This section just tells you what they are and how to use them.

Enhance with Double Ref

This is a switch that turns on double-ref mode. This feeds your input images in TWICE to the model, and generally produces much higher quality results. Downside? It takes about 50% longer to gen.

I recommend leaving this on 100% of the time for single-image prompts, unless you're just messing around and want speed. It is ALWAYS better for single image prompts, and will improve everything from prompt adherence to output clarity.

For multi-image prompts, it usually increases adherence but sometimes reduces it. So, if you're doing multi-image stuff I recommend switching this on/off as needed based on how it's going with your prompt.

Input Scale

When off, your image doesn't get scaled (it still gets cropped to be divisible by 16). When on, the long edge of your image gets scaled to the number you put in the box. For example, if you feed in a 2560x1440 image and set the scale to 1920 it will scale your image to 1920x1080. That will then get cropped to 1920x1072 so it's divisible by 16.
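
If you want to sanity-check the numbers before loading an image, the scale-then-crop arithmetic above can be sketched in a few lines of Python. This only illustrates the math, it is not the LayerStyle node's actual code; the function names and the choice to round the short edge to the nearest pixel are my assumptions.

```python
def scale_to_long_edge(width, height, long_edge):
    """Scale (width, height) so the longer side equals long_edge, keeping aspect ratio."""
    factor = long_edge / max(width, height)
    # Rounding to the nearest pixel is an assumption for this sketch.
    return round(width * factor), round(height * factor)


def crop_to_multiple(width, height, multiple=16):
    """Trim each side down to the nearest multiple (16 for Qwedit)."""
    return width - width % multiple, height - height % multiple


scaled = scale_to_long_edge(2560, 1440, 1920)
cropped = crop_to_multiple(*scaled)
print(scaled, cropped)  # (1920, 1080) (1920, 1072)
```

With the switch off, only the second step applies, so a 1920x1080 input would still end up at 1920x1072.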

Custom Output Size

When the switch is off, your output image will be the same size as your input image (after it's been scaled). If you turn this switch on, it will instead output an image with the dimensions you specify.

As a general rule, you should try to set your scales to be similar along at least one edge. For example, a 1920x1440 input image and a 1024x1440 input image are both suitable for a 1440x1440 output image. You can be more flexible with this if you know what you're doing.

How To Use Part 2 - Multi-image Prompting Requirement

This section is not a prompting guide (that's further below). This is about an actual requirement for prompting multi-image stuff. It is NOT required for single-image prompts.

You do multi-image prompts like normal, except you need to write a very basic description of your input images. Qwedit needs you to do this in order to know which image is which. I explain why in detail later.

You may find this slightly annoying, but I guarantee you it's dramatically better than using Qwedit the normal way that other workflows do - and it's pretty easy.

The format:

  • At the start of your prompt, write an extremely simple description for each of your input images; one sentence for each

  • Start each sentence with "Picture 1:", "Picture 2:", etc

  • You must write it this way because Qwedit was trained on this exact format

  • Afterwards, write your actual prompt as usual; you can refer to your input images as "Picture 1" and so on

The model uses these descriptions to understand which input picture is which, and it works better with SIMPLE descriptions. You only need to help it know which one is which, it doesn't need a full rundown.

Examples

Picture 1: a man wearing a t-shirt. Picture 2: a top hat. Make the man in Picture 1 wear the top hat from Picture 2.
Picture 1: a living room. Picture 2: a woman. Put the woman from Picture 2 into the living room in Picture 1.
Picture 1: a man wearing a professional suit. Picture 2: a man wearing a superhero outfit. Make the man in Picture 1 wear the outfit from Picture 2.
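
If you're batching prompts or building them from a script, the format is easy to generate. Here's a minimal Python sketch that assembles a prompt in the "Picture N:" layout described above; the helper name and its arguments are made up for illustration and aren't part of the workflow.

```python
def build_multi_image_prompt(descriptions, instruction):
    """Prefix an edit instruction with one 'Picture N:' sentence per input image."""
    labels = [
        # Strip any trailing period so each label ends with exactly one.
        f"Picture {i}: {text.strip().rstrip('.')}."
        for i, text in enumerate(descriptions, start=1)
    ]
    return " ".join(labels + [instruction.strip()])


prompt = build_multi_image_prompt(
    ["a man wearing a t-shirt", "a top hat"],
    "Make the man in Picture 1 wear the top hat from Picture 2.",
)
print(prompt)
```

The descriptions must be listed in the same order the images are wired into the workflow, since the numbering is the only thing tying a label to an image.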

How To Use Part 3 - Upscaling

Because the qwen VAE tends to put a subtle halftone pattern over images (see limitations just below this section), I recommend downscaling and then re-upscaling your images afterwards. A big benefit of being able to work at high res with the edit model is that you rarely lose any detail doing this.

This eliminates the halftone pattern if you're using something like seedvr, or at least reduces it if you're using other upscalers. I think seedvr is best for this, but it's very beefy and hard to run on older GPUs.

Note: the workflow is set to do 20 steps of inference. It actually gives sharper results at 30 steps, but I don't bother with that because it takes longer and I down-upscale them afterwards anyway. If you aren't planning on down-upscaling them, you might consider doing 30 steps for the extra sharpness.

Seedvr2 sometimes gives better output at 0.5x downscale, and other times 0.75, so the workflow is configured to run BOTH for you to pick which one turned out best.

Normal upscalers are a bit different; a relatively small downsize to something like 1920p -> 1600p is usually reasonable, before then running the upscaler. Play around with it. The non-seedvr workflow has a longest_edge scale option so you can tweak the number specifically.
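
To make the downscale sizes concrete, here's a rough Python sketch of the arithmetic for both approaches, using a 1920x1088 edit as the starting point. It only computes target dimensions; the function names are hypothetical, and reading "1600p" as a 1600-pixel long edge is my assumption based on the longest_edge option.

```python
def downscale_by_factor(width, height, factor):
    """Target size for a fixed-ratio downscale (the seedvr route)."""
    return round(width * factor), round(height * factor)


def downscale_to_long_edge(width, height, long_edge):
    """Target size when the longer side is set to long_edge (the regular upscaler route)."""
    factor = long_edge / max(width, height)
    return round(width * factor), round(height * factor)


source = (1920, 1088)

# The seedvr workflow runs both factors so you can pick the better result.
seedvr_candidates = {f: downscale_by_factor(*source, f) for f in (0.5, 0.75)}

# A milder downscale before a regular upscaler.
regular_input = downscale_to_long_edge(*source, 1600)

print(seedvr_candidates)
print(regular_input)
```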

My preferred regular upscaler is 4x Nomos2 HQ DAT2, but you can use whatever.

Examples of upscaling:

Here's the raw output of the robot-arm girl in a dress from the post: https://ibb.co/B5jhrsL9 (if you zoom in you'll see the qwen halftone pattern, it looks like a grid)

Here's the pic after it's been run through seedvr after a 0.75x downscale: https://ibb.co/hJcn2f5t

Here's the pic after it's been run through a regular Nomos2 upscale after a downscale to 1600p: https://ibb.co/Kc2YSbVc

Limitations of Qwen Edit

Limitation 1

The Qwen VAE will often put a subtle halftone grid pattern over your images. It's noticeable if you zoom in, and more noticeable at higher resolutions. This is a feature of pretty much every Qwen-based model, but it's particularly present with the Edit model.

You can easily resolve this by downscaling your image a bit, then re-upscaling it again to your desired resolution. The section above explains this in better detail.

It sounds like a big issue, but the downscale-upscale trick solves it easily and it's not always necessary either. The higher quality your input image, the less bad the halftone pattern will be.

Limitation 2

Qwedit struggles with complex multi-image stuff most of the time (it's just a limitation of the model). This workflow makes it much better, but it's still not great. You'll have to play around with it to know which things work and which things don't.

I recommend using a different model if you want to do anything complicated with multi-image.

Limitation 3

It takes a while to gen stuff if not using the lightning loras. Very similar to the time it takes with Klein 9B base. The double-ref trick increases it by roughly 50%, and multi-image edits take a lot longer.

For low res images (typical 1mpx size) it's pretty okay, around 50 seconds on a 5090 with the double-ref option turned on.

But then there's high-res stuff. Gen time scales non-linearly as you go higher. Going from 1024x1024 (1 mpx) to 1440x1440 (2 mpx) takes around 2.5x as long. Going from 1 mpx to 3 mpx is around 4x as long. 5 mpx is 9.5x as long. In conclusion, stick to 2-3 mpx unless you're cool with long-ass gen times. Stick around 1-2 mpx for multi-image gens, or turn off the double ref switch.

On the plus side, it's pretty reliable for single-image edits so you don't typically need to do many gens to get a good result.

Examples using a 5090:

  • Single-image edit @ 1024x1024 (1 mpx), double-ref OFF = 38 seconds

  • Single-image edit @ 1024x1024 (1 mpx), double-ref ON = 52 seconds

  • Single-image edit @ 1920x1088 (2 mpx), double-ref OFF = 91 seconds

  • Single-image edit @ 1920x1088 (2 mpx), double-ref ON = 131 seconds

  • Single-image edit @ 3072x1728 (5.3 mpx lol), double-ref ON = 550 seconds

  • Two-image edit @ 2560x1440 each, double-ref ON = serial killer behaviour
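
For anyone who wants to compare these against their own hardware, here's a small Python sketch that turns the 5090 numbers above into megapixels, relative gen time, and double-ref overhead. The timings are copied from the list; the helper names are mine, and a megapixel is counted as 1,000,000 pixels here.

```python
# (width, height, double_ref_on, seconds) copied from the 5090 list above
timings = [
    (1024, 1024, False, 38),
    (1024, 1024, True, 52),
    (1920, 1088, False, 91),
    (1920, 1088, True, 131),
    (3072, 1728, True, 550),
]


def megapixels(width, height):
    """Pixel count in millions."""
    return width * height / 1_000_000


def double_ref_overhead(off_seconds, on_seconds):
    """Extra time from double-ref as a fraction of the double-ref OFF time."""
    return on_seconds / off_seconds - 1


baseline = 52  # 1024x1024 with double-ref ON
for width, height, double_ref, seconds in timings:
    if double_ref:
        ratio = seconds / baseline
        print(f"{width}x{height}: {megapixels(width, height):.1f} mpx, {ratio:.1f}x the 1 mpx time")

print(f"double-ref overhead at 1 mpx: {double_ref_overhead(38, 52):.0%}")
print(f"double-ref overhead at 2 mpx: {double_ref_overhead(91, 131):.0%}")
```

On these particular runs the double-ref overhead works out a bit under the "roughly 50%" rule of thumb, so treat that figure as a ceiling rather than an exact number.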

That's it for how-to! Read on for more tips & info, as well as an explanation of what the workflow is doing & why.

Explanation - what is this garbage and why is it so good?

There are three important things this workflow is doing that other workflows do not do. I'm going to call these The Comfy Problem, The VL Problem, and The Double Ref Enhancement.

The Comfy Problem

Comfy's native "TextEncodeQwenImageEditPlus" node is what most people use in their workflows. It handles your prompt and image inputs for you. It's pretty handy, except for the small problem that it's SHIT.

Do you work at Comfy? If so: GET YOUR SHIT TOGETHER AND FIX THIS NODE, IT'S SO EASY. Much respect to u tho, thanks for making ComfyUI.

The first issue is that this node resizes your image down to 1 megapixel, and you can't stop it from doing that. The second issue is that it does this with the AREA downscale method, which is so incredibly bad that I want to slap whoever implemented this node. The area downscale is what makes all of your output images blurry. The third issue is that it rounds your dimensions to be divisible by 8, but they actually need to be divisible by 16.

Specifically, ComfyUI does this:

  1. Calculates 1 megapixel as 1024x1024, which is 1,048,576 pixels

  2. Calculates your new image dimensions to match that number of pixels, rounded to be divisible by 8

  3. Scales your image to those new dimensions using the AREA method
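
To see what those steps do to a real image size, here's a Python sketch of the calculation. It's written from the three steps as described here, not copied from ComfyUI's source; the function name is hypothetical and rounding to the nearest multiple of 8 is an assumption.

```python
import math


def comfy_style_target_size(width, height, total_pixels=1024 * 1024, multiple=8):
    """Resize target that keeps aspect ratio, lands near total_pixels, and snaps to a multiple."""
    scale = math.sqrt(total_pixels / (width * height))
    new_width = round(width * scale / multiple) * multiple
    new_height = round(height * scale / multiple) * multiple
    return new_width, new_height


w, h = comfy_style_target_size(1920, 1080)
print(w, h)            # 1368 768
print(w % 16, h % 16)  # 8 0
```

So a 1920x1080 input gets squeezed to about 1368x768, and the width isn't even divisible by 16 afterwards.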

Why is all this bad?

  1. It's completely unnecessary; Qwedit can easily handle images of varying size, all the way up to 3 megapixels (or even higher for simple edits)

  2. The area downscale method makes images extremely blurry, and this is the primary reason all ComfyUI qwen edits give blurry images out. Yes it's literally this dumb, this huge problem would easily be solved by changing the word "area" to "lanczos" in the code, it's a one-word fix.

  3. Not even MS paint uses area downscale, wtf is wrong with you Comfy devs (much respect)

The Comfy Problem Solution

This workflow bypasses the Comfy node entirely, allowing you to size your images however you want, and it uses chad lanczos scaling instead of loser area scaling. Magic.

Qwedit easily handles resolutions like 1440x1440 and 1600x1200. Every edit example in this post was done natively at 1920p, except for a few (which are labelled as such).

Really high resolutions (3mpx) sometimes have trouble with anatomy, but usually you can just do multiple gens and one of them will turn out fine.

If you're doing a simple in-place edit like changing an outfit, you can go VERY high. Here's an example edit done at 1728x3072, which is 5 megapixels: https://ibb.co/twCSWrjy (outfit change -> bikini top + short shorts)

The VL Problem

In the background, Qwedit 2511 uses a vision-language model (VL model) to describe your images, then gives those AI-generated descriptions to the edit model. It also re-interprets your instructions with these descriptions. Ostensibly this helps the model understand your input images better, leading to better results.

The problem? It doesn't lead to better results, it's bad. VL models aren't very good for this sort of thing because they don't know what to focus on. The VL describes your images in excruciating detail, totally overwhelming the edit model and leading to bad prompt adherence + weird outputs.

The Qwen team's official python code does this, and the ComfyUI "TextEncodeQwenImageEditPlus" node copies it exactly. No disrespect to the Comfy team on this one, they're doing what the Qwen team officially recommended.

The VL Problem Solution

Same solution as the previous problem: bypass the Comfy node entirely. This results in the VL step being completely ignored. No AI-generated descriptions get fed into the edit model.

For single-image edits, this is a 100% complete and total victory. The model performs way better without the crappy VL interpretation.

For multi-image edits, there's a small issue; this step is where the input images normally get labelled. Specifically, the VL outputs are fed into the model in the following exact format:

> Picture 1: <shitty VL description>
> Picture 2: <shitty VL description>

Look familiar? This is why we manually have to type the descriptions in for multi-image edits - otherwise the model doesn't actually know which image is which.

The upside is that the model works way better with simple descriptions, so cutting out the VL is still 100% the correct move. A 5 word description wins over whatever BS the VL model spews out, every time.

The Double Ref Enhancement

I really have no idea why this works so well, but basically if you feed in your reference images twice the model just works better. This was known back in 2509 days (hence the previous post linked at the top), and back then I didn't know why it worked either.

For single image edits it's ALWAYS better. And it's not just the quality, for some reason it even helps with prompt adherence. The interesting thing is that the difference is really, really significant. Here's the full list of stuff it improves:

  • Better prompt adherence

  • Sharper output images / more visual clarity

  • Improved consistency of objects & textures

  • Better resemblance of characters at different angles

  • More intelligent guesses, like what to add when outpainting or what's behind a removed object

For multi-image edits it can sometimes confuse the model a bit, but most of the time it confers all the same benefits listed above. I recommend switching it on & off randomly when you're doing multi-image stuff, just in case.

Note: there are a lot of different ways the input references can be handled. There are conditioning combine/concatenate nodes, you can pass the refs in a different order, you can change the negative conditioning input (read next section for that), etc. I A/B tested SIXTEEN different reference-handling combinations, and a bunch of smaller minor variations of those. Some of them worked, some of them didn't.

Of those sixteen combinations, two of them gave the best results; both of them are in this workflow, and you switch between them by turning the double ref method on & off.

So, don't fuck with the positive/negative conditioning & reference setup, it's very specific.

Extra Info: The "Conditioning Zero Out"

You may notice that the negative conditioning input is made from the first reference image(s) plus the positive prompt, fed through a "conditioning zero out" node.

Feeding the input images into the model's negative conditioning is required (it's just how Qwedit works). The only question is whether to feed in the positive prompt zeroed-out too, and whether the double ref should get fed in.

Through a lot of A/B testing, I can tell you that the way it's done here is the best. IDK why, it's just how it is. Some other combinations do technically work, but they degrade the output quality.

Prompting Advice

Other than just following the instructions in the workflow, here's some extra stuff.

Keep your prompts simple and direct

If you need to, point out details the model is missing or be more specific about stuff you do/don't want to change. For example, when doing a simple outfit swap it helps to specify you don't want their pose to change.

Using the robot arm girl, here's a prompt that doesn't follow this advice:

Change her outfit to a bikini top and short shorts.

While it sometimes does what we want, it tends to get confused by her robot arm and often changes her pose too: https://ibb.co/7dyKZttp (notice the human arm showing underneath the robot arm, and the pose change)

Here's a better prompt that gives a correct result 99% of the time:

Change her outfit to a bikini top and short shorts. Leave her robot arm and pose unchanged.

Now it does the right thing almost every time: https://ibb.co/DP9gZHVv

Avoid using fancy words or convoluted phrasing

Pretend you're talking to a child. The model will probably still understand you if you talk fancy, but why take the risk?

As an example, imagine you have a pic of a table with some plates on it.

Bad:

Place a red apple on the table, ensuring it's in the center and removing the plate that was in the same spot.

Good:

Replace the middle plate with a red apple.

Also good:

Remove the plate from the center. Put a red apple there instead.

If there's only one plate, this is even better:

Remove the plate, replace it with a red apple.

Adjusting Lighting

You may want or need to adjust the lighting in an image. Aside from being helpful in general, there are situations where Qwedit may simply not realise that something needs to be lit in a particular way (or re-lit when moved).

To do this, you need to know the magic word: relight

That is the actual magic word, you are 100% required to use it if you want to adjust lighting properly.

Specifically, follow this format:

Relight to <strength> <color> <direction>.

Strength - bright, dim, etc

Color - white, cool, warm, etc

Direction - diffuse, frontlit, backlit, etc

Tip: for basic lighting, use "white diffuse".

Examples:

Make a new shot of the man sitting in a chair in a kitchen. Relight to white diffuse.
Change the time of day to evening. Relight to warm backlit.

You don't actually need anything else in the prompt, you can just change the lighting of a pic like this:

Relight to bright cool frontlit.
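
If you generate prompts from a script, the relight format is simple to build. This is just a sketch of the "Relight to <strength> <color> <direction>." pattern above; the helper name is made up, and treating strength as optional is inferred from the "white diffuse" tip.

```python
def relight_prompt(color, direction, strength=None):
    """Build a 'Relight to <strength> <color> <direction>.' sentence; strength may be omitted."""
    parts = [strength, color, direction]
    return "Relight to " + " ".join(p for p in parts if p) + "."


print(relight_prompt("cool", "frontlit", strength="bright"))
print(relight_prompt("white", "diffuse"))
print("Change the time of day to evening. " + relight_prompt("warm", "backlit"))
```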

Other Stuff

Euler-simple and no ClownsharKSampler?

No Clownshark this time. It reduces output quality quite a bit and doesn't confer any benefits. I also didn't find any sampler/scheduler combos that were better than euler/simple.

So, this is just one of those classic times where the ol' euler-simple wins the day. Let me know if you happen to know a better combo.

Image Quality in->out

Qwedit is very sensitive to the quality of your input image. If you feed in a grainy or blurry image, it will usually make your output image blurry or grainy too - even if it's an 'entirely new' shot with nothing copied over 1:1.

So, make sure to use HQ images. You can optionally use the upscale workflows to bump up the sharpness/quality of poor input images before you feed them in.

What about the flux super duper double resolution special VAE trick?

Doesn't work for 2511, it destroys your image. TBH it never really worked for 2509 either, but I won't argue with you if you liked it for some reason.

Making character references

Tip 1 - Make a nude ref (even for sfw stuff)

Qwen is killer for making character references. Other than using similar prompts to the examples I posted, my advice is to make a nude reference shot instead of a clothed one like I did.

I only made a clothed ref for the sake of propriety here, but a nude ref (or near-nude, like wearing plain white underwear) will be much easier to prompt into different outfits, and also gives Qwedit the maximum info needed to correctly size your character and know what they look like in clothing or doing different actions.

You do not need any loras to do this if you're just using it as a reference; the 'sensitive' parts will lack detail but that doesn't matter for new shots you make. If you don't want them nude, just request plain white underwear and, if relevant, a strapless white bra.

Nude ref = best ref.

Tip 2 - Make multiple zoom levels, use the thighs-upwards one for most stuff

The example I showed was a little too zoomed out for normal reference stuff. I'd recommend making your reference slightly closer like this: https://ibb.co/Q33BJDLX

Start at whatever zoom level your initial character pic is at, then make more references at different zoom levels. If you're starting zoomed out, then prompt the model to zoom in. If you start zoomed in, prompt it to zoom out.

And, of course, different angles too.

Examples:

Zoom in on the person's upper body. The composition should frame their head and thighs.
Zoom out to show more of the character. The composition should frame their head and thighs.
Zoom out to a full body shot.
Zoom in for a close up portrait.

Once you've got references, you should usually use the head-to-thighs ref for making new images. Switch to the other refs as necessary; like if you want a close up, use the close up reference. Qwedit is really good at keeping likeness, so you can do 90% of your stuff with only a single input reference.

I don't think there's a better open-weight model out there right now than Qwedit for making new shots of characters without loras.

Enjoy!