
Attention: Seeing the Matrix


Dec 31, 2025



I have been on an absolute whirlwind semester of college and have also been doing in-depth testing and research into diffusion models and Wan 2.1 and 2.2. I have gone from an almost mystical, transcendental vision of Wan and video diffusion to a bedrock understanding of the systems underlying the model. Now that I have the time and mental budget to write an article, I want to deliver something that, once you see it, you can't unsee. It will change the way you make video and take your videos to the next level. The best part? It doesn't require a supercomputer with a 5090Ti ultra mega edition, and it doesn't require the world's most optimized workflow. It only requires an open, alert mind, a pair of eyes, and the image processing workflow I have attached to this article as a starter tool for a much larger process of learning a fundamental system of Wan. First, let me explain.

For the longest time I was vexed by the seemingly unpredictable performance of LoRAs, motion, prompt adherence, and the presence or absence of quality and visual clarity. Some videos would come out looking absolutely amazing, and then the very next video, with the same theme, using the same LoRA, hell, even with the same prompt, would suddenly come to a screeching halt: locked-up motion, horribly degraded quality, and so on. There seemed to be no real rhyme or reason, and twiddling settings only went so far. Then I had a happy little accident that blew the whole thing wide open.

On a whim and as a joke, I downloaded the Q2K and Q5K_S quants of Wan 2.2 I2V. Both, being abominations that should never have existed, filled me with a grim excitement as I loaded up my 3-sampler Wan 2.2 I2V workflow: Q2K as my first high noise model, Q5K_S high noise as my second model for 'mid-noise' sampling, and the fp16 low noise model for the refinement sampler. Surely, I thought, this would produce the most cursed video man has ever seen. I cackled evilly as I pressed the Infer button.

The video came out incredible. I was dumbfounded. How could this possibly be?

*Before I go on, I want to make it clear that I supply maps, not keys. These concepts, while fundamental to understanding how diffusion works, are highly complex, and this article ventures into advanced territory. However, I believe that understanding these systems on a basic level, and grasping how they affect your video generation, is so important that I want to do my best to point you in the right direction to explore these advanced concepts on your own without stumbling blindly like I did.

Attention is a budget, not a switch:

Within Wan and pretty much all diffusion models there are attention (attn.) layers. They are organized in a 2x2 matrix:

Self Structural attention: attn. the model allocates to each individual frame of the latent itself

Self Temporal attention: attn. allocated to the internal consistency of the latent over time

Cross Structural attention: attn. allocated to the conditioning and prompt for each frame

Cross Temporal attention: attn. allocated towards maintaining consistency related to prompting and conditioning over time

The most accurate way, as far as I am concerned, to view the interaction of these attn. layers is as a system under pressure. Each of the four attention layers above applies a certain amount of pressure in the diffusion process. Where and how they apply this pressure can make the difference between a great video and a video that implodes in on itself and outputs total garbage.

The model has a limited amount of attention, just like us. Also just like us, things can capture the model's attention in ways that cause it to stall out, lock up, and funnel all of its available pressure into irrelevant parts of the latent space. When this happens, it pulls attn. resources that would otherwise be spread out more evenly and concentrates them in what I will refer to as Attention Sinks. When Wan funnels attention into an Attention Sink it gets absolutely obsessed with resolving and finalizing the 'attention trap'. This pulls attention away from surrounding regions like a magnet that grows stronger and stronger until, in a worst-case scenario, it starves everything else in the latent and absorbs all attention.

If the model pulls attention away from something that is not already "decided", that element loses momentum.
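The "budget" framing maps cleanly onto how attention weights are actually computed: they come out of a softmax, so they always sum to 1, and a region with an outsized score necessarily drains every other region. Here is a toy sketch in plain Python (my own illustration of the zero-sum math, not Wan's actual attention code):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalize so
    # the weights sum to exactly 1 -- the fixed "attention budget".
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Four regions competing for one query's attention.
balanced = softmax([1.0, 1.0, 1.0, 1.0])
# The same four regions, but one has a much higher score
# (a "shiny" high-salience patch acting as an attention sink).
sunk = softmax([6.0, 1.0, 1.0, 1.0])

print(balanced)  # every region gets an equal 0.25 share
print(sunk)      # the first region absorbs nearly the whole budget
```

Boosting one region's score from 1.0 to 6.0 pushes its share from 25% to roughly 98%, which is exactly the "magnet that grows stronger and stronger" behavior: the winner's gain is, by construction, everyone else's loss.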

A general overview of the three stages of Wan video generation:

High noise explores the 'latent world' and determines what exists

Mid noise decides which version is real (finalized)

Low noise decides how clean and finished the finalized version looks
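In a multi-sampler ComfyUI setup these three phases are just consecutive step ranges handed to successive samplers (in the style of KSamplerAdvanced's start_at_step / end_at_step). The split below is purely illustrative; the 40/30/30 fractions are my own placeholder numbers, not a recommended recipe:

```python
# Hypothetical split of a 30-step schedule across the three phases.
total_steps = 30

def phase_ranges(total, high_frac=0.4, mid_frac=0.3):
    # Returns (start, end) step ranges for the high / mid / low noise
    # samplers; each sampler picks up where the previous one stopped.
    high_end = round(total * high_frac)
    mid_end = high_end + round(total * mid_frac)
    return (0, high_end), (high_end, mid_end), (mid_end, total)

high, mid, low = phase_ranges(total_steps)
print(high, mid, low)  # (0, 12) (12, 21) (21, 30)
```

Starving the first two ranges to pad out the third is the imbalance this article argues against: decisions that never get steps in the Exploration or Decision ranges cannot be recovered by the refinement range.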

During the high noise phase of Wan 2.2 the model begins with an Exploration phase: establishing relationships between objects, deciding which objects are connected to and owned by which entities (think arms, legs, shoes, glasses, the wheels of a car, etc.), and establishing the spatial relationships between them. Additionally, the model explores multiple hypothetical large motion trajectories and camera movement paths, which are then collapsed, plotted, and finalized into one final path. Anything Wan is not able to determine, or is not given enough steps to determine, rolls over and piles up in the next phase: the Decision phase.

The second phase of high noise (what I refer to as mid-noise), the Decision phase, looks at what is left over from the first phase and also 'forms' the objects and entities and their projections into temporal layers (how they are positioned and what space they occupy over time). The model also collapses competing hypotheses of potential paths and movements of objects and entities into their final paths on a much finer, more granular level than in the Exploration phase.

To make this a bit more intuitive I want to supply an example image. After a person in your video exits the Exploration phase they are basically just a flat cardboard cutout with no distinguishing features.

[Image: UnresolvedBody1.png]

This is an example image of what happens if you force the model to commit only to the Exploration stage and then entirely skip the Decision stage. The person on the bed is like a half-formed lump of clay. You can tell they are a person, and you can tell some things about their body and positioning, but they have no defined features and no volume in space. If your final videos are coming out like this, you need more high noise steps.

In the Decision phase the topology and the 'I'm a real boy!' magic take place. This weird lump of clay is expanded into its finalized version: facial features, hair, body structure, etc. are structurally finalized. The tell-tale sign that you are starving the model of Decision phase steps is the 'electron cloud' phenomenon, and the usual place you see it is the hands.

[Image: vlcsnap-2025-12-30-17h26m15s758.png]

What you are seeing is not a lack of refinement. What you are seeing is multiple hypothetical motion paths that failed to collapse because the model did not have enough time to decide which one wins. The fingers and hands are usually the last to be decided. No amount of adding low noise steps will refine this away.

Wan is not a rendering engine, it is a probability collapse engine. It explores many plausible interpretations of what could exist, then repeatedly asks which interpretation is most probable under the current constraints. If any decisions are left over before entering low noise they are never resolved.
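"Probability collapse" can be pictured with a toy model: competing hypotheses start with similar probabilities, and each round of commitment sharpens the distribution until one wins. The sketch below is my own illustration (nothing from Wan's internals); it simply raises the probabilities to a power and renormalizes:

```python
def collapse(probs, sharpness):
    # Raise each probability to a power and renormalize: as sharpness
    # grows, mass concentrates on the most probable hypothesis -- a toy
    # stand-in for the model committing to one interpretation.
    powered = [p ** sharpness for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

hypotheses = [0.4, 0.35, 0.25]   # three competing motion paths
for s in (1, 2, 4, 8):
    print(s, [round(p, 3) for p in collapse(hypotheses, s)])
```

With sharpness 1 the leading hypothesis holds only 40% of the mass; by sharpness 8 it holds roughly 73% and the losers fade. The 'electron cloud' artifact is what it looks like when this sharpening is cut off before one hypothesis wins.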

The vast majority of artifacts and video issues are not a matter of 'needing more refinement', more low noise steps, a magical setting, or an ultra omega advanced cutting-edge workflow. Almost all issues stem from too few high noise/mid-noise steps and/or not allowing the model to finalize its Exploration and Decision stages due to an imbalanced sampling schedule. Every element of every video needs to switch from the extremely expensive, high attention, unfinalized analysis stage to the super cheap, low-to-no attention cost, finalized projection stage. Unfinalized elements carry into low noise, where the refining stage can do nothing to 'refine them away' but will happily continue to waste attention trying to analyze them. If there is anything you take away from this article, it should be this overarching concept: if Wan is not allowed to finalize something, or can't finalize something, it will argue about it internally forever.

This is where I want to focus on Attention Sinks as a first step in this series of articles: to bring some of this home and let you take some baby steps on this extremely complex path, while also handing you an easily applied and invaluable tool.

The Three Horsemen of the Wanpocalypse:

1) Ambiguity

When the model doesn’t know what you mean, it will choose something you didn’t intend and most likely waste a lot of time and resources thinking about it.

2) Improper Scheduling

When decisions are forced too early or deferred too late.

3) Attention Sinks

When one region steals all the budget.

The most common cause of attention sinks is high salience elements that the model is under a great deal of pressure to 'get right' but, due to any number of possible issues, simply can't (eyes being a key example). Another class of attention sinks is things the model thinks it 'almost kind of knows' but just can't pin down, so it fails to drop the element from high attention to a low attention resolved projection. The third class is high frequency patterns, which the tool I am providing is focused on mitigating. These attention sinks result in the model perpetually attempting to resolve the elements and pouring attention into them until the very end.

When the model drops an element from high attention analysis mode to low attention projection, it has 'figured it out' to the point where it no longer has to think about what the object or element looks like from any given angle or at any level of detail within the constraints of the video. It is literally just projecting the element's resolved identity into the video, like a light projector projects a movie. Now, this doesn't mean it's a perfect ultra HD representation; it means that, based on the dataset available, the model is confident its decided-upon projection is sufficiently detailed and accurate to represent the object or element in a plausible way given the constraints of the situation and the available data.

A short list of common offenders that simply won't let Wan chillax and decide:

1) Eyes and mouths/teeth

2) Shoe laces

3) Patterned or textured fabrics

4) Tattoos

5) Hair/beards

6) Pubic hair

7) Genitals in general

8) Specular highlights

9) Illegible and complex labels/writing on products

10) Many small objects on shelves (particularly with camera movement involved)

If you have spent any time making videos you have undoubtedly run into many, many attention sinks without having a term to slap on them. Attention sinks tend towards a crawling, twiddling-of-pixels appearance in the final output. This is the model 'pushing around the pixels', eternally trying to resolve the attention sink. Now, to be clear, you may often see this same effect on elements in motion or undergoing desired deformation. That is normal high attention behavior. It becomes pathological when the high attention is 'sunk' into something that has very little importance and very low relation to the overall video and prompt.

Sometimes you will even see attention sinks become resolved in real time. The biggest example of this is eyes. Often you will have that crawling, twiddling-pixels effect on the eye, and then the subject will blink, and when the eye opens it is resolved. The eye no longer crawls; it is 'relaxed' and falls in line with the other pixels around it. This is the model deciding, 'this attention sink is costing me way too much attention, so I am going to mask the probability collapse of the element with a blink and then reveal the low cost projection.' It's actually incredible to observe once you know what is going on. This also happens when subjects cover something with their hand or pass their arm in front of something, and then the object that was occluded is suddenly resolved and 'relaxed'. The model is performing a sleight-of-hand magic trick. This is not a metaphor; it is literally what is happening. However, sometimes the model cannot internally justify the cost, or the departure from the prompt, to make these blinks or movements, and the attention sink stays. This is the point of no return.

One true, high level, overarching Arch Enemy of Wan is high frequency patterns. This relates back to my previous articles about overrefined, super HD images causing stunted motion and bad outputs in Wan. Wan sees these highly refined, super high quality images as 'nothing for me to do here, it's already as perfect as it's going to get, I can't justify the cost of changing it'. Additionally, the model gets lost in the inevitable high frequency elements of these images and dumps attention into trying to resolve and preserve them during the video. High detail = high frequency by necessity. This is made even more frustrating by the fact that image generation models are tailored for, and revolve around, producing the most defined and detailed outputs possible.
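"High frequency" here is measurable: it is the energy a high-pass filter finds in an image. A minimal stand-in using a plain 3x3 Laplacian kernel (an assumption on my part for illustration; the attached workflow's nodes may compute their view differently) looks like this:

```python
import numpy as np

def high_freq_map(gray):
    # 3x3 Laplacian kernel: responds strongly to fine edges and
    # repeating texture, and stays silent on flat or smooth regions.
    k = np.array([[ 0, -1,  0],
                  [-1,  4, -1],
                  [ 0, -1,  0]], dtype=float)
    h, w = gray.shape
    out = np.zeros_like(gray, dtype=float)
    # Naive convolution over the interior (borders left at zero).
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y, x] = abs(np.sum(gray[y-1:y+2, x-1:x+2] * k))
    return out

# A flat patch vs. a fine checkerboard (a worst-case repeating pattern).
flat = np.full((8, 8), 0.5)
checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)

print(high_freq_map(flat).mean(), high_freq_map(checker).mean())
```

The flat patch scores zero while the checkerboard lights up everywhere. Fabric weave, tiny text, and dense tattoos behave like the checkerboard, which is why they glow in the analyzer view described below.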

Recently I dove into ZIT image generation and was immediately blasted by its tendency to generate large swaths of high frequency texture. The thing is, this is why the images generated by this model look so good; the fabric textures in ZIT, particularly fuzzy fabric, are way better than in many models, as an example. This is where the workflow snippet I am providing in this article comes into play. It allows you to use really detailed, great quality images from image generation models, and to analyze how adjustments to your image generation settings impact the high frequency content of the final outputs.

So what is the solution? Convince the model that these attention sinks aren't worth the effort by deemphasizing them and/or breaking up the patterns. There are two classes of attention sinks I always look for: high frequency patterns and high salience objects. Some of the sneakiest offenders are things like light switches, wall outlets, and smoke detectors: the kinds of objects you easily overlook, but that Wan, if given too much detail and focus, will happily waste a continuous stream of attention maintaining and resolving until the very end of the video generation.

The easiest and simplest way to deemphasize these attention sinks is to lightly blur them. Doing so breaks up the fine edges and repeating patterns that Wan would otherwise lock onto and endlessly commit attention to resolving. For objects like wall outlets and light switches, the goal isn't to render them as smeared blobs that no longer resemble a light switch, but to leave enough detail for the model to tell what the object is without enough detail for the model to obsess over reproducing it in the final resolved projection (which never comes, because the object is never resolved). If, through the use of this tool, you overblur something, you push it from an attention sink into the land of ambiguity, and we don't want to trade one evil for another.
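The masked-blur idea can be sketched in a few lines: blur the whole frame, then blend the blurred copy in only where the mask is painted. This toy version uses a box blur on a numpy array (the actual workflow uses dedicated blur and mask nodes; this is just the principle):

```python
import numpy as np

def box_blur(img, radius=1):
    # Naive box blur: average each pixel with its neighborhood.
    # A crude stand-in for the Gaussian blur node in the workflow.
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            out[y, x] = img[y0:y1, x0:x1].mean()
    return out

def masked_blur(img, mask, radius=1):
    # Keep the original pixels where mask == 0; use blurred pixels
    # where the mask was painted over an attention sink.
    return np.where(mask > 0, box_blur(img, radius), img)

# Fine checkerboard = a high-frequency pattern; paint its left half.
checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)
mask = np.zeros((8, 8))
mask[:, :4] = 1.0
out = masked_blur(checker, mask)
```

The masked half flattens toward mid-gray (its contrast drops) while the unmasked half is untouched; in the analyzer view this is the instant dimming you see after painting a region.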

The purpose of this tool is to shore up any potential attention sinks so that the model can properly distribute its attention and resolve video elements that are actually important. This results in better motion, better textures, and more resolved, 'settled' outputs. It is not a magical tool that will fix any image, but it can take an image from 'really dang hard to work with' to 'I can work with this' in the eyes of Wan.

This workflow snippet does require some custom nodes to be installed. I replaced everything I could with standard nodes, but for ease of use some nodes are from custom packs. They can be found below and can also be downloaded from the ComfyUI Manager's 'custom nodes manager'.

https://github.com/1038lab/ComfyUI-RMBG

https://github.com/aiaiaikkk/ComfyUI-Curve

https://github.com/WASasquatch/was-node-suite-comfyui

https://github.com/kijai/ComfyUI-KJNodes

When you load the snippet, use the Load and Resize Image node to load the image you wish to use as the starting image for your video. Set the resize dimensions to a resolution close to or matching the resolution of your final video. This is important! If you apply the blur to a much higher resolution image and then resize it before making the video, you can overblur or underblur as well as distort the image. As an example, I make my videos at the native latent resolution of 384x512 but set this Load and Resize Image node to 480x640.

Then hit the Infer button. In the two preview image nodes you will see the resized image on the left and the 'Wan Matrix Vision' output on the right.

[Image: exampleimage1.jpg]

In the above example we can see that the crumpled clothing on the floor next to the woman's shoe, the man's shoes and shoe laces, their teeth, the folds in the denim jeans, and the bedspread are all sources of potential attention sinks based on pattern frequency. Anything that glows pure white is almost always an attention sink, and anything with a very complex series of unbroken lines will also turn into an attention sink, because the model wants to preserve that pattern but will not be able to collapse it into a low cost projection. We can also see the light switches, the knobs on the dresser, and the lamp stand out. This view gives us an idea of what Wan 'sees' when it's trying to resolve elements of a latent, and allows us to dampen the 'shine' of things we don't want the model to focus on. This prevents Wan's 'OOOO SHINY' reaction to things we do not want undue attention directed towards. An important distinction: the patterns on the walls and floor are not high frequency. Additionally, Wan is incredibly good at dealing with uniform wall and floor textures that appear like this in the analyzer. You can safely leave those untouched.

Now, in the Load and Resize Image node, right click on the image and select Open in Mask Editor | Image Canvas. I won't give you exact settings for the brush or opacity; this is something you will have to play with to find the best settings for yourself and your images. There are no one-size-fits-all settings here. What we do here is blot a mask over the attention sink areas of the image.

[Image: exampleblur1resize.jpg]

For large patterns like the bed sheet we use irregular blots of mask to break up the pattern. For objects like shoes we just do the whole object. Light switches, knobs, etc. are more hit or miss, but less blur is better than too much. When you are satisfied with your blur placement, hit Save at the top, run the workflow, and watch the preview of the Matrix Mode image. You will see it instantly dim in the regions you blotted the blur over. It's something that is very hard to see in side-by-side comparisons between two outputs, but when you see it happen in real time in the workflow the difference is massive and immediately noticeable.

If the output image on the left looks too blurry, you can adjust the blur settings in the blur node, reopen the mask application window, hit Clear, and reapply the mask with a softer touch. This isn't about destroying detail, only weakening the fine edges. When you are satisfied that the 'edge has been knocked off' of the image (gauging this will take real usage of images in videos and seeing the effect on attention traps), right click on the preview image on the left that contains the finalized image and save it as your I2V starting frame.

Congratulations! You are now officially practicing Frequency Hygiene!

In the next article I will go in depth about the incident with the Q2K and Q5K_S models and how this brought into focus the balance between precision and decision in Wan video generation. I look forward to hearing from all of you and I am so glad to be able to get this article out to all of you after such a long pause. Happy experimenting!
