Q2K > fp8: The Precision Trap

Jan 18, 2026

In my last article (https://civitai.com/articles/23761/attention-seeing-the-matrix) I mentioned that I had made a workflow using very low-precision quantized versions of Wan 2.2 I2V. The more I experiment with this idea, the more power I realize it has, and I want to share what I have learned. The idea is counterintuitive at first but obvious once understood. Let's get right into it.

About two months ago I was on the Hugging Face page for the Wan 2.2 quants, downloading Q5K_S out of sheer morbid curiosity, when I noticed there were Q2K files as well. I downloaded those too but didn't think I would get any real use out of them. My experience with Q4K_S had been good, but at the time I was fully on the fp16 bandwagon and felt firmly that since I could (barely) run the full fp16-precision model, settling for anything less was absurd. Additionally, simply out of standard convention and experience, the idea of using anything other than 4-, 8-, or 16-bit models seemed completely foreign.

The revelation came when I designed what I assumed would be the most cursed Wan 2.2 workflow the world had ever seen: a three-sampler setup with Q2K at high noise, Q5K_S at mid noise (using the high-noise model, but placed in the commitment and lock-in phase of the denoise), and fp16 at low noise. I expected an output looking like Minecraft with shaders cranked, or simply something totally cursed, melting into body horror. I assumed the Q2K model would generate a crude, simplistic early latent with rough, low-precision shapes and motion paths; that Q5K_S would then see the cursed aftermath and plaster its own off-kilter bit-batter onto it; and that fp16 would receive a chunky, bizarre, uncanny-valley-as-all-hell latent and faithfully do its best to layer high-precision refinement on top of it.

Boy, was I wrong.

All of my current most advanced videos are made with this exact model progression. For the videos I have posted recently with included workflows, I specifically used Q2K at high noise and Q5K_S at low noise to show that it works well even without fp16 base models. I had unknowingly implemented 'precision gating' in Wan 2.2, and it is incredibly powerful. Let's get into why this works so well and why you should try it too.

General assumptions dictate that you want to use as high-precision a base model as you can, as often as you are able, for the most detail, nuance, and realism. In Wan 2.2, this assumption does NOT hold true. The video diffusion process isn't simply [lots of noise -> refine -> refine more -> polish -> polish more -> final output]. At every step the model asks itself, 'Given this latent and the conditioning/prompt, what latent at the target noise level would best explain the data?' and then moves the latent in that direction. Early steps only bias the overall trajectory of the latent; they do not define it. As I stated in my previous article, the commitment phase doesn't (ideally) happen until the end of High Noise.
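To make the "re-evaluation, not refinement" framing concrete, here is a deliberately tiny toy, nothing like Wan's actual sampler; every name and number in it is my own illustrative assumption. The point it demonstrates: each step recomputes a fresh best-explanation prediction and moves partway toward it, so early steps only nudge the trajectory while late steps lock it in.

```python
# Toy sketch (NOT Wan's real sampler): each step re-predicts a "best
# explanation" of the current latent, then moves partway toward it.
# The growing `commitment` factor stands in for decreasing noise.

def toy_denoise(latent, target, steps=10):
    """1-D stand-in: `target` plays the role of the model's per-step
    best-explanation prediction; the schedule controls commitment."""
    history = []
    for step in range(steps):
        # Fresh re-evaluation every step: the prediction is recomputed
        # from scratch, not carried over from the previous step.
        prediction = target  # a real model would recompute this from `latent`
        # Commitment grows as noise drops: early steps nudge, late steps lock in.
        commitment = (step + 1) / steps
        latent = latent + commitment * 0.5 * (prediction - latent)
        history.append(latent)
    return history

trajectory = toy_denoise(latent=0.0, target=1.0)
```

Even in this cartoon version, the first step barely moves the latent while the final steps close most of the remaining gap, which is exactly why a wrong early bias is cheap to absorb but a wrong late commitment is not.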

This is where the counterintuitive part kicks in. In the highest-noise portions of the denoise process, too much model precision is harmful. All of the preserved nuance, gradients, potential options, motion paths, and hypothetical resolutions of identity scramble the model's brain.

Think of it this way: at the highest noise levels, using fp16 or even fp8 Wan 2.2 models is like handing the most imaginative and perceptive person who ever lived a 5-D, 100,000-page Rorschach test and expecting them to come up with something coherent to hand to the next person in line. Additionally, and this is a point that cannot be emphasized enough, if you are also using a low-precision text encoder like UMT5xxlQ4 or even UMT5xxlfp8, that Rorschach test comes with a poorly written, vague set of directions. This is where our general understanding of where precision belongs, and in which stage it should reside, needs to shift.

Diffusion is not progressive refinement, it's repeated re-evaluation under decreasing uncertainty.

Because each step is a fresh semantic evaluation, this means:

  • Early steps decide what kind of thing something is

  • Later steps decide how that thing looks

  • And nothing is guaranteed to persist unless the pipeline protects it

In diffusion video models, each frame at each step is re-predicted in the context of neighboring frames, motion priors, and conditioning. The model is not “updating pixels”; it is repeatedly solving a constrained inference problem in which coherence, identity, and motion compete for attention. Early errors, early overcommitments, and deferred decisions all pile up. The further into the denoise you go, the LESS FREEDOM the model has to choose correctly or to correct errors once commitments are made. Corrections become distortions, and errors accumulate.

  • Q2K (high noise)
    Decide what things are, using coarse semantic plausibility.

Q2K is amazing in this role because instead of a megamind playing 5-D chess, we have a person sitting in front of a piece of wood with a circle hole, a triangle hole, a square hole, and a star-shaped hole that's taped over. Next to him are a sphere, a pyramid, and a square-shaped plastic widget. The megamind had 100,000 pages to interpret and five dimensions to consider when plotting the hypothetical paths and rough initial stages of the video; Q2K only has three choices, and it knows which shape matches which hole. Just yesterday I was testing my custom samplers and latest workflow, using Q5K_S for both High and Mid noise just for fun. A video I was making featured an emo chick with a septum piercing; the shot started a good distance away, and the camera flew rapidly toward her face over 33 frames. The Q5K_S model in the highest-noise role consistently asserted the septum piercing as a water droplet, a snot droplet, a sweat bead, you name it, anything and everything but a septum piercing. As soon as I switched back to Q2K at high noise, the very first video correctly inferred a septum piercing matching the shape and color of the piercing visible in the image.

In effect what was happening on a simplified level is that Q5K_S was pushing its glasses up on its nose the instant it saw the blob of a shape under the nose and said, "Well AKSHUALLY I can infer from the apparent atmospheric elements and the location of this smear of a blob that due to the likelihood of condensation and or sweat this is most certainly AND I MEAN MOST CERTAINLY, based on my statistical analysis and model priors which I trust completely, AS SHOULD YOU, a water droplet/snot bubble."

Q2K, on the other hand, said, "Empty head, no thoughts. Emo chick is emo, probably septum piercing, I'll bookmark it as I work and let the others know." This is the GOOD kind of deferred action and confident commitment combined. The reason I say it was decided but not finalized by Q2K is that the final septum ring was very accurate and finely shaped to match the appearance and color of the distant visible piercing. That level of polish and complexity is finalized at the very tail end of Mid noise and polished in Low noise. However, identity, object relations, and ownership are all established in High noise and carried forward through the rest of the video.

  • Q5K_S (High noise model in the role of mid noise)
    Decide how those things are structured, without reinterpreting them

After receiving the latent from Q2K, all of the major relationships and motion pathing are hashed out and ready to go. Because Q2K is so low precision, there aren't 16 leftover potential motion paths and hypothetical camera movements left unresolved; whatever Q2K could put through the three simple holes in its wooden board has been fully committed and left unambiguous for the rest of the diffusion process. At this point Q5K_S's pedantic self-assurance is welcomed with open arms as it picks up where Q2K left off, but with a much greater level of precision, nuance, and gradient. Importantly, it has just enough precision to commit to the more complex topology of finalizing things like hands and fingers, facial features, spatial positioning, depth, volume, and distance, without leaving behind electron-cloud hands and ghosting limbs, unless you severely starve Mid noise of needed steps.

  • fp16 (low noise)
    Decide how they look, assuming the story is already correct

With the motion, camera movement, shape, topology, volume, relationships, etc. all figured out, the only thing left for fp16 to do is literally polish the turd. In Wan 2.2 you can very effectively polish a turd, provided it is well committed and finalized. However, this is NOT the best part. The best part is that Q2K and Q5K_S both leave behind very specific types of surfaces for fp16 to apply its exquisitely precise polish to. Q2K in particular, due to its simplistic rough shaping and biases, produces extremely good bases for texture and structure. A model like Q4K_S in the same role flattens and smooths things too much, or attempts to make them too complex and can't finalize them well enough, and they turn into what I call 'bonemeal jello'. This is most often seen on teeth and anywhere translucency and rough edges meet. It's as if Q4K_S simply threw up its hands after trying to whittle away at the teeth and left them a strange mix of dithering and soupy sludge that the low-noise model can't make heads or tails of. It also smooths things far too much, leaving many parts of the anatomy looking like plastic or compressed tubes of overly shiny meat to which the low-noise model can't apply any kind of texture or visual enhancement.

Q2K, on the other hand, doesn't have the capacity to over-commit and go overboard; it simply doesn't have the precision or the opportunity to do so when used only within the first few steps. This results in some of the absolute best teeth (you may have noticed I've been doing a number of zoom-to-mouth tests, and I think the results speak for themselves. I won't blame anyone for thinking I had suddenly posted one of my fetishes on main out of the blue, but now you know why XD). Additionally, many other parts of the anatomy are left with enough roughness to be seen as committed, requiring only a gentle amount of work from Q5K_S before being sent on to be enhanced and properly textured by fp16.

This only works because diffusion steps re-evaluate the entire frame at each stage. Precision gating controls which kinds of decisions are even possible at each noise level.
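To sketch what this three-phase division of labor might look like as a step schedule: the 40%/35%/25% split below is my own illustrative assumption, not the article's exact settings, and the role labels paraphrase the three bullets above. In ComfyUI this kind of split is typically wired up as chained KSamplerAdvanced nodes using their start/end step parameters.

```python
# Minimal sketch of a precision-gated three-sampler schedule.
# The split fractions are assumed for illustration, not tuned values.

def precision_gate_schedule(total_steps, splits=(0.4, 0.35, 0.25)):
    """Return (model_role, start_step, end_step) phases covering all steps."""
    assert abs(sum(splits) - 1.0) < 1e-9
    names = ("Q2K (high noise: decide WHAT things are)",
             "Q5K_S (mid noise: decide STRUCTURE)",
             "fp16 (low noise: decide LOOK)")
    phases, start = [], 0
    for name, frac in zip(names, splits):
        end = min(total_steps, start + round(total_steps * frac))
        phases.append((name, start, end))
        start = end
    # Force the last phase to run through the final step.
    name, s, _ = phases[-1]
    phases[-1] = (name, s, total_steps)
    return phases

for phase in precision_gate_schedule(20):
    print(phase)
```

The invariant worth preserving in any real workflow is the hand-off: each phase starts exactly where the previous one ended, with no overlap and no gap, so each model only makes the kinds of decisions its noise range allows.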

I am exhausted and it's 4:30am, so I am going to wrap this article up here. I want to close by recommending that all of you, if you are able, switch immediately to the fp16 version of the UMT5xxl text encoder if you are using a lower-precision version. Most people simply settle for fp8 or lower, and a lot of people who could run fp16 don't, because they don't consider it a big deal or an important part of the workflow. Many also think the fp16 version is too big to fit in their VRAM. The text encoder does NOT have to fit in VRAM. It is system-RAM heavy, yes, but many, many people have the system RAM necessary to run fp16 text encoding and simply don't. The text encoder is SO important to every aspect of video generation. The only reason my precision gating works so well is that the fp16 text encoder provides precise, handcrafted instructions for the model to use at every step of the way. fp8 text encoding halves the bits per value, and with them the specificity and gradient detail of the conditioning the model receives. Does precision gating work with a lower-precision text encoder? Absolutely, but the results are noticeably degraded in prompt adherence, refinement of fine edges, color gradients, and textures just going from fp16 to fp8. Going lower than fp8 is far worse. You do not need to match the text-encoder precision to the precision of your models. I implore everyone reading this to at least TRY running fp16 UMT5xxl in their workflow if they are not currently doing so.
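A quick toy to show why fewer bits means coarser conditioning. This is a sketch of mantissa resolution only: fp16 carries 10 mantissa bits and fp8 (E4M3) carries 3, so below I round a made-up embedding value to each mantissa width. It deliberately ignores exponent range and real rounding modes, so it is not an actual fp8 cast.

```python
import math

# Toy illustration of fp16 vs fp8 conditioning resolution: keep only
# k mantissa bits of a value (fp16 has 10, fp8 E4M3 has 3). This is a
# mantissa-precision sketch only, not a faithful fp8 conversion.

def round_mantissa(x, bits):
    """Round x to `bits` mantissa bits, keeping its exponent."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                     # x = m * 2**e, 0.5 <= |m| < 1
    q = round(m * (1 << bits)) / (1 << bits)
    return math.ldexp(q, e)

value = 0.7231                               # a made-up embedding component
err_fp16 = abs(value - round_mantissa(value, 10))
err_fp8 = abs(value - round_mantissa(value, 3))
print(err_fp16, err_fp8)                     # fp8-style rounding loses far more
```

With only eight representable mantissa steps per binade, fine gradations between nearby prompt embeddings collapse to the same value, which is consistent with the degraded edges, gradients, and adherence described above.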

Soon I will be returning to the gauntlet of higher education, and my articles will most likely be spaced much further apart in the coming months until summer rolls around. My next article will focus on samplers and how different types of samplers fit into this scheme. The days of just using Euler for every sampler need to die, and I plan to kill that habit once and for all. Also, fingers crossed, by then I will have my custom samplers, specifically tailored for each phase of the sampling process, fully finalized and ready for release. I am already using them for my most recent videos in the high, mid, and refinement stages, and the results are amazing so far. Until then, get crazy and experiment!

Addendum: I want to emphasize that the illustrative example of using only Q2K and Q5K_S in a two-sampler setup is far from optimal. While you can certainly make good videos with excellent camera motion, detailed reveals, etc., as evidenced by my example videos, the provided workflow is meant to be a beginner-friendly starting point and is explained as such in the accompanying note nodes. I understand that the vast majority of workflows and user experience center on two samplers, but the Q2K/Q5K_S setup is meant more as an example of how capable the model is at lower precision, and as a demonstration that you can make dynamic, interesting, good-quality videos without a massive supercomputer or fp16 models; it is not meant as an end solution. It is very much a Temu version of the full three-sampler setup.

The overarching point of the article is to propose a division of labor across the three main phases of the video diffusion process, focusing each sampling phase on its intended function based on how Wan 2.2 operates during sampling. This article is one in a continuing series and is intended to inform a larger understanding of how Wan and video diffusion function, not to say 'do it exactly this way'. All of this is highly experimental; we are in a truly wild-west era of diffusion modeling, and I encourage everyone to experiment and challenge any and all assumptions, both established and newly proposed.
