How to Reduce VRAM Usage with xFormers + MSLK 1.2.0

MSLK Blackwell Windows 1.2.0 Released

https://github.com/Kittensx/mslk-blackwell-windows

Version 1.2.0 has officially been released.

This release introduces runtime FMHA tuning through environment variables, allowing users to experiment with memory optimization settings without rebuilding MSLK.

In simple terms: you can now adjust some of the low-level CUDA kernel behavior used by attention operations directly from your startup batch file or command prompt.

I think this is one of the biggest improvements since the original Windows Blackwell compatibility work because it allows users to test different configurations on their own hardware instead of relying on a single hardcoded setting.

This is still an experimental build, though. Lower memory use does not by itself mean better images, especially if you lower the tuning values only so you can push to a larger image or upscale. In my testing, the best values were:

[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=32 num_warps=2 num_stages=1
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=32

Looking Back: Version 1.1.0

Version 1.1.0 introduced a hardcoded compatibility fix.

During testing, some Blackwell laptop GPUs encountered memory-related issues when performing large image generations and high-resolution upscales.

The solution at the time was to force a safer kernel configuration for affected hardware.

While this worked, the configuration was hardcoded, so any further experimentation required rebuilding MSLK.


What's New in 1.2.0

Version 1.2.0 replaces the hardcoded approach with a runtime tuning system.

Instead of modifying source files and rebuilding, users can now change behavior through environment variables.

Benefits include:

  • No rebuild required

  • Easier testing

  • Better hardware-specific tuning

  • Runtime debugging information

  • Future-proof foundation for architecture-specific optimizations


Basic Usage

MSLK is a dependency of xFormers. If you use A1111, Forge, or a similar front end, you enable it with this command-line argument: --xformers

If your project uses MSLK, you can benefit from this approach.

My recommendation for keeping default behavior while making testing easy:

Add all the variables to your batch file, and leave blank any that you want at the default.

Example webui-user.bat

@echo off

set PYTHON=
set "GIT=C:\Program Files\Git\cmd\git.exe"
set "VENV_DIR=venv"

rem MSLK FMHA tuning. Leave a value blank to keep the default.
rem Do not put spaces around "=": cmd stores them as part of the value.
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=16
set MSLK_FMHA_BLOCK_M=16
set MSLK_FMHA_NUM_WARPS=4
set MSLK_FMHA_NUM_STAGES=

set "COMMANDLINE_ARGS=--skip-torch-cuda-test --disable-safe-unpickle --xformers"
call webui.bat

Most users only need one setting:

set MSLK_FMHA_POLICY=auto

This enables automatic detection and tuning.

For troubleshooting:

set MSLK_FMHA_POLICY=auto
set MSLK_FMHA_DEBUG=1

This will print the selected tuning values in the console.

Example (abbreviated; the actual log prints these fields, and several more, on a single line):

[MSLK FMHA tuning]
policy=auto
BLOCK_N=32
BLOCK_M=16
num_warps=2
num_stages=1

What Are Environment Variables?

Environment variables are simply settings that are read when the application starts.

For example:

set MSLK_FMHA_POLICY=auto

tells MSLK:

"Automatically select the best tuning policy for my hardware."

You can place these directly in your launch batch file before launching Automatic1111.

Example:

set MSLK_FMHA_POLICY=auto
set MSLK_FMHA_DEBUG=1

call webui.bat
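To make the mechanism concrete, here is a minimal Python sketch of how a program might read one of these variables at startup. The helper `read_setting` is hypothetical and is not MSLK's actual code. It only illustrates two points: a blank value behaves like "not set", and a stray space after `=` in a batch file ends up inside the value, because cmd keeps it.

```python
import os

def read_setting(name, default=None):
    """Return the stripped value of an environment variable, or `default`
    when it is unset or blank. Hypothetical helper, not part of MSLK."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip()
    return value if value else default

# Simulate what `set MSLK_FMHA_POLICY= env` leaves in the environment:
# cmd stores the leading space as part of the value.
os.environ["MSLK_FMHA_POLICY"] = " env"

# `set MSLK_FMHA_NUM_STAGES=` (blank) removes the variable in cmd.
os.environ.pop("MSLK_FMHA_NUM_STAGES", None)

print(repr(os.environ["MSLK_FMHA_POLICY"]))              # raw value, space included
print(read_setting("MSLK_FMHA_POLICY"))                  # stripped value
print(read_setting("MSLK_FMHA_NUM_STAGES", "default"))   # falls back to the default
```

Writing `set NAME=value` with no spaces avoids relying on the reader of the variable to strip whitespace.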

Available Variables

MSLK_FMHA_POLICY

Controls overall tuning behavior.

Options:

default
auto
blackwell_safe
env
off
benchmark

Recommended:

auto

MSLK_FMHA_DEBUG

Enables debug output.

set MSLK_FMHA_DEBUG=1

Useful for verifying which settings MSLK selected.
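If you want to compare runs, the debug line can be turned into a dictionary. This is a small sketch of my own, not a tool that ships with MSLK; it assumes the log format stays `key=value` pairs after the `[MSLK FMHA tuning]` prefix, as in the log lines shown in this post. The function name `parse_tuning_line` is hypothetical.

```python
import re

# A value runs until the next " key=" or the end of the line, so values that
# contain spaces (gpu_name, capability) stay in one piece.
_PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")

def parse_tuning_line(line):
    """Parse one '[MSLK FMHA tuning] ...' log line into a dict of strings.
    Returns None for lines that do not start with the expected prefix."""
    prefix = "[MSLK FMHA tuning]"
    if not line.startswith(prefix):
        return None
    body = line[len(prefix):].strip()
    return {m.group(1): m.group(2) for m in _PAIR.finditer(body)}

line = ("[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) "
        "gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 "
        "B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False "
        "BLOCK_M=16 BLOCK_N=16 num_warps=4 num_stages=1")

info = parse_tuning_line(line)
print(info["BLOCK_N"], info["num_warps"], info["num_stages"])
print(info["gpu_name"])
```

All values come back as strings, so convert with `int(...)` before doing arithmetic on them.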


MSLK_FMHA_BLOCK_N

Controls the width of the key/value tile processed by the attention kernel.

Valid values:

16
32
64
128

This is currently the most important tuning parameter.

General rule:

Lower value = less memory usage, more compatibility, potentially slower

Higher value = more performance, more memory usage, potentially less stable

MSLK_FMHA_BLOCK_M

Controls query tile size.

Valid values:

16
32
64
128

Usually leave this at:

16

unless specifically benchmarking.


MSLK_FMHA_NUM_WARPS

Controls how many GPU warps participate in kernel execution.

Think of it like how many GPU workers you want to use at one time.

Valid values:

1
2
4
8

Default:

2

MSLK_FMHA_NUM_STAGES

Controls pipeline staging inside Triton.

Why Does This Matter?

GPUs spend a lot of time waiting for data to arrive from memory.

Pipelining allows Triton to overlap:

Memory loads
Computation
Data movement

so the GPU spends less time sitting idle.

Valid values:

1
2
3
4
5

Default:

1
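Since a typo in a batch file is easy to miss, a quick check against the valid values listed above can help before launching. The sketch below is a hypothetical helper (`check_settings` is not part of MSLK); the allowed values are copied from this post, and whether MSLK itself rejects, ignores, or clamps other values is not something this sketch knows.

```python
# Allowed values as listed in this post.
VALID_VALUES = {
    "MSLK_FMHA_POLICY": ("default", "auto", "blackwell_safe", "env", "off", "benchmark"),
    "MSLK_FMHA_BLOCK_N": ("16", "32", "64", "128"),
    "MSLK_FMHA_BLOCK_M": ("16", "32", "64", "128"),
    "MSLK_FMHA_NUM_WARPS": ("1", "2", "4", "8"),
    "MSLK_FMHA_NUM_STAGES": ("1", "2", "3", "4", "5"),
}

def check_settings(settings):
    """Return a list of human-readable problems for a {name: value} mapping.
    Blank values are accepted because they mean 'use the default'."""
    problems = []
    for name, raw in settings.items():
        value = raw.strip()
        if name == "MSLK_FMHA_DEBUG" or value == "":
            continue  # the debug flag is not range-checked in this sketch
        allowed = VALID_VALUES.get(name)
        if allowed is None:
            problems.append(f"{name}: not a variable listed in this post")
        elif value not in allowed:
            problems.append(f"{name}={value}: expected one of {', '.join(allowed)}")
    return problems

settings = {
    "MSLK_FMHA_POLICY": " env",      # stray space is stripped before checking
    "MSLK_FMHA_DEBUG": "1",
    "MSLK_FMHA_BLOCK_N": "48",       # not in the list of valid values
    "MSLK_FMHA_NUM_STAGES": "",      # blank: keep the default
}
for problem in check_settings(settings):
    print(problem)
```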

Suggested Settings

Most Users

set MSLK_FMHA_POLICY=auto

Done.

This is what I recommend for almost everyone.


RTX 5070 Laptop

Current testing has shown good results with:

set MSLK_FMHA_POLICY=auto

The auto policy automatically selects:

BLOCK_N=32

for known high-memory Blackwell attention paths.


RTX 5070 Desktop

Recommended:

set MSLK_FMHA_POLICY=auto

Additional testing is welcome.


RTX 5080

Recommended:

set MSLK_FMHA_POLICY=auto

If memory issues occur:

set MSLK_FMHA_POLICY=env
set MSLK_FMHA_BLOCK_N=32

RTX 5090

Recommended:

set MSLK_FMHA_POLICY=auto

Many users may find that larger values work fine, but additional testing is needed.


RTX 4090

Recommended:

set MSLK_FMHA_POLICY=auto

or

set MSLK_FMHA_POLICY=env
set MSLK_FMHA_BLOCK_N=64

if experimenting.


Advanced Users

The FMHA kernel operates by dividing attention work into tiles.

The most important tuning value is:

BLOCK_N

which controls how much key/value data is processed per tile.

Larger values:

64
128

can improve throughput but consume more shared memory.

Smaller values:

32
16

reduce memory pressure and can allow workloads that would otherwise fail.

This is especially important on:

Large resolutions
High-resolution upscaling
Large context lengths
High head dimensions

The tradeoff is straightforward:

Less memory = smaller tiles = potentially slower execution

More memory = larger tiles = potentially faster execution
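One way to picture why smaller tiles can be slower: the same keys and values have to be walked in more, smaller steps. The sketch below only counts tiles for the M=21600 workload that appears in my logs. It is a simplification of my own and ignores split_k and everything else the real kernel does, so treat the numbers as an illustration of the trend rather than a performance prediction.

```python
def kv_tiles(num_keys, block_n):
    """Number of key/value tiles of width block_n needed to cover num_keys
    (ceiling division). Simplified: ignores split_k and kernel details."""
    return -(-num_keys // block_n)

M = 21600  # key/value length taken from the debug log in this post

for block_n in (16, 32, 64, 128):
    print(f"BLOCK_N={block_n:<3} -> {kv_tiles(M, block_n)} tiles")
```

Halving BLOCK_N roughly doubles the number of tiles, while each tile needs about half the space.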

What If I Get Memory Errors?

Try lowering BLOCK_N. The examples in this post set it together with the env policy:

set MSLK_FMHA_POLICY=env
set MSLK_FMHA_BLOCK_N=32

If problems continue:

set MSLK_FMHA_BLOCK_N=16

Remember:

Lower settings generally reduce memory usage.

Lower settings may also increase generation time.

The benefit is that you may be able to generate or upscale larger images than before.

Current Limitations

This tuning system is new.

Not every GPU, driver version, CUDA version, model architecture, or workflow has been tested.

Some combinations may provide better performance.

Some combinations may provide better memory efficiency.

Community testing is encouraged.

If you discover a configuration that works particularly well on your hardware, please share your results so we can continue refining the tuning recommendations.


TESTS

settings:

set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=16
set MSLK_FMHA_BLOCK_M=
set MSLK_FMHA_NUM_WARPS=4
set MSLK_FMHA_NUM_STAGES=

confirmed via logs:

[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=16 num_warps=4 num_stages=1

[Image: 00014-149198116713.png]

Main takeaway:

Different FMHA kernel settings can still produce the same image output, but they change how much GPU memory the kernel needs and how fast or stable the run may be.

[Image: 00015-149198116713.png]

set MSLK_FMHA_POLICY=auto
set MSLK_FMHA_DEBUG=1

[MSLK FMHA tuning] policy=auto platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=32 num_warps=2 num_stages=1

This produced the same picture as these settings:

set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=16
set MSLK_FMHA_BLOCK_M=
set MSLK_FMHA_NUM_WARPS=4
set MSLK_FMHA_NUM_STAGES=2

[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=16 num_warps=4 num_stages=2

But...

If you try combinations beyond what your GPU can support, you will run into shared memory errors like:

    triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 214016, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.

with settings like:

set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=32
set MSLK_FMHA_BLOCK_M=
set MSLK_FMHA_NUM_WARPS=4
set MSLK_FMHA_NUM_STAGES=4

Key Takeaway:

This error means the selected kernel configuration asked for more shared memory than the GPU could provide for that kernel launch.

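Example:

Required: 214016 bytes (209 KiB)

Hardware limit: 101376 bytes (99 KiB)

That is roughly 2.1 times more shared memory than the GPU allows for one kernel launch.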

So the fix is not to reinstall everything. The fix is to reduce the resource-heavy tuning values.
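To get a feel for which values are "resource-heavy", here is a back-of-the-envelope estimate in Python. Everything in the formula is my own assumption, not MSLK's or Triton's actual allocation logic: I assume 2-byte (fp16) elements, I treat Kq/Kkv=512 from the log as the head dimension, and I assume the K and V tiles are buffered `num_stages - 1` times alongside a single Q tile. The function name `estimate_shared_bytes` is hypothetical, and num_warps is not modeled at all.

```python
HARDWARE_LIMIT = 101376     # bytes, from the error message in this post
REPORTED_REQUIRED = 214016  # bytes, from the same error message

def estimate_shared_bytes(block_m, block_n, num_stages, head_dim=512, elem_bytes=2):
    """Rough shared-memory estimate. ASSUMPTIONS (not confirmed by MSLK or Triton):
    fp16 elements, head_dim=512, K and V tiles kept in (num_stages - 1) buffers,
    plus one Q tile."""
    kv_tile = 2 * block_n * head_dim * elem_bytes  # one K tile plus one V tile
    q_tile = block_m * head_dim * elem_bytes
    buffers = max(1, num_stages - 1)               # assumed buffer count
    return buffers * kv_tile + q_tile

configs = {
    "failed  (BLOCK_N=32, num_stages=4)": (16, 32, 4),
    "worked  (BLOCK_N=32, num_stages=2)": (16, 32, 2),
    "worked  (BLOCK_N=16, num_stages=2)": (16, 16, 2),
}
for label, (block_m, block_n, num_stages) in configs.items():
    est = estimate_shared_bytes(block_m, block_n, num_stages)
    verdict = "fits" if est <= HARDWARE_LIMIT else "over the limit"
    print(f"{label}: ~{est} bytes, {verdict}")

print(f"reported / limit = {REPORTED_REQUIRED / HARDWARE_LIMIT:.2f}")
```

For the failing settings this rough model gives 212,992 bytes, close to the 214,016 bytes Triton reported, but that agreement may be a coincidence. The useful part is the trend: under these assumptions, BLOCK_N and num_stages multiply together, so lowering either one shrinks the request quickly.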

So what did I do? I kept lowering the values (here num_warps and num_stages) until a setting worked.

set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=32
set MSLK_FMHA_BLOCK_M=
set MSLK_FMHA_NUM_WARPS=2
set MSLK_FMHA_NUM_STAGES=2

[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=32 num_warps=2 num_stages=2
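That trial-and-error process can be written down as a list of candidates to try in order. The Python sketch below is only one possible ordering and is my own assumption: it lowers num_stages first and BLOCK_N second, and it leaves num_warps out to stay short. The helper names `step_down` and `as_batch_lines` are hypothetical; the printed `set` lines use the variable names from this post.

```python
VALID_BLOCK_N = [16, 32, 64, 128]    # valid values listed in this post
VALID_NUM_STAGES = [1, 2, 3, 4, 5]

def step_down(block_n, num_stages):
    """Yield (block_n, num_stages) pairs starting at the given values:
    num_stages is lowered first, then BLOCK_N. The order is an assumption."""
    for n in sorted((v for v in VALID_BLOCK_N if v <= block_n), reverse=True):
        for s in sorted((v for v in VALID_NUM_STAGES if v <= num_stages), reverse=True):
            yield n, s

def as_batch_lines(block_n, num_stages):
    """Render one candidate as lines for a launch batch file."""
    return [
        "set MSLK_FMHA_POLICY=env",
        "set MSLK_FMHA_DEBUG=1",
        f"set MSLK_FMHA_BLOCK_N={block_n}",
        f"set MSLK_FMHA_NUM_STAGES={num_stages}",
    ]

# Start from the settings that failed in my test and walk downward.
candidates = list(step_down(32, 4))
for block_n, num_stages in candidates[:3]:
    print("\n".join(as_batch_lines(block_n, num_stages)))
    print()
```

Try each candidate in turn and stop at the first one that runs without the OutOfResources error.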

Final Thoughts

Version 1.1.0 proved that reducing FMHA memory requirements could solve real-world Blackwell issues.

Version 1.2.0 turns that one-off fix into a flexible tuning system.

Instead of rebuilding MSLK every time we want to experiment, users can now adjust behavior at runtime and help identify the best settings for each GPU architecture.
