MSLK Blackwell Windows 1.2.0 Released
https://github.com/Kittensx/mslk-blackwell-windows
Version 1.2.0 has officially been released.
This release introduces runtime FMHA tuning through environment variables, allowing users to experiment with memory optimization settings without rebuilding MSLK.
In simple terms: you can now adjust some of the low-level CUDA kernel behavior used by attention operations directly from your startup batch file or command prompt.
I think this is one of the biggest improvements since the original Windows Blackwell compatibility work because it allows users to test different configurations on their own hardware instead of relying on a single hardcoded setting.
It is still an experimental build, though. Reduced memory use does not automatically mean better-quality images if you lower the tuning values only so that you can push to a higher level elsewhere. In my testing, the best values were:
[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=32 num_warps=2 num_stages=1
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=32
Looking Back: Version 1.1.0
Version 1.1.0 introduced a hardcoded compatibility fix.
During testing, some Blackwell laptop GPUs encountered memory-related issues when performing large image generations and high-resolution upscales.
The solution at the time was to force a safer kernel configuration for affected hardware.
While this worked, the configuration was hardcoded, so any further experimentation required rebuilding MSLK.
What's New in 1.2.0
Version 1.2.0 replaces the hardcoded approach with a runtime tuning system.
Instead of modifying source files and rebuilding, users can now change behavior through environment variables.
Benefits include:
No rebuild required
Easier testing
Better hardware-specific tuning
Runtime debugging information
Future-proof foundation for architecture-specific optimizations
Basic Usage
MSLK is a dependency of xFormers. If you use A1111, Forge, or a similar web UI, enable xFormers with this command-line argument: --xformers
If your project uses MSLK, you can benefit from this approach.
My recommendation for keeping default behavior while making testing easy:
Add all the variables to your batch file, and leave blank any that you want to stay at their defaults.
Example webui-user.bat
@echo off
set PYTHON=
set "GIT=C:\Program Files\Git\cmd\git.exe"
set "VENV_DIR=venv"
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=16
set MSLK_FMHA_BLOCK_M=16
set MSLK_FMHA_NUM_WARPS=4
set MSLK_FMHA_NUM_STAGES=
set "COMMANDLINE_ARGS= --skip-torch-cuda-test --disable-safe-unpickle --xformers"
call webui.bat
Most users only need one setting:
set MSLK_FMHA_POLICY=auto
This enables automatic detection and tuning.
For troubleshooting:
set MSLK_FMHA_POLICY=auto
set MSLK_FMHA_DEBUG=1
This will print the selected tuning values in the console.
Example:
[MSLK FMHA tuning]
policy=auto
BLOCK_N=32
BLOCK_M=16
num_warps=2
num_stages=1
What Are Environment Variables?
Environment variables are simply settings that are read when the application starts.
For example:
set MSLK_FMHA_POLICY=auto
tells MSLK:
"Automatically select the best tuning policy for my hardware."
You can place these directly in your launch batch file before launching Automatic1111.
Example:
set MSLK_FMHA_POLICY=auto
set MSLK_FMHA_DEBUG=1
call webui.bat
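One detail worth knowing when editing these lines: cmd.exe stores everything after the `=` as the value, so `set MSLK_FMHA_POLICY= env` produces ` env` with a leading space. The sketch below shows how a Python program might read and normalize such values. The helper name `read_fmha_env` is hypothetical and this is not MSLK's actual parsing code; only the variable names come from this post.

```python
import os

# Variable names taken from this post; the helper itself is hypothetical.
FMHA_VARS = (
    "MSLK_FMHA_POLICY",
    "MSLK_FMHA_DEBUG",
    "MSLK_FMHA_BLOCK_N",
    "MSLK_FMHA_BLOCK_M",
    "MSLK_FMHA_NUM_WARPS",
    "MSLK_FMHA_NUM_STAGES",
)


def read_fmha_env(environ=None):
    """Return {name: value} for the FMHA variables that are set and non-blank."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name in FMHA_VARS:
        raw = environ.get(name)
        if raw is None:
            continue  # variable is not set at all
        value = raw.strip()  # drop spaces left by "set NAME= value"
        if value:  # a blank value is treated here as "use the default"
            settings[name] = value
    return settings


if __name__ == "__main__":
    # Print whatever FMHA settings the current process inherited.
    print(read_fmha_env())
```

Running this from the same command prompt, after your `set` lines, shows what a child process would actually receive.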
Available Variables
MSLK_FMHA_POLICY
Controls overall tuning behavior.
Options:
default
auto
blackwell_safe
env
off
benchmark
Recommended:
auto
MSLK_FMHA_DEBUG
Enables debug output.
set MSLK_FMHA_DEBUG=1
Useful for verifying which settings MSLK selected.
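If you want to check the debug output with a script rather than by eye, the single-line form of the log (as it appears in the test logs in this post) can be split into key/value pairs. This is a minimal sketch that assumes the line keeps the `[MSLK FMHA tuning] key=value ...` layout; the function name `parse_tuning_line` is my own and not part of MSLK.

```python
import re

PREFIX = "[MSLK FMHA tuning]"
# A value runs until the next " key=" or the end of the line, so values that
# contain spaces (gpu_name, capability) stay in one piece.
PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_tuning_line(line):
    """Parse one '[MSLK FMHA tuning] key=value ...' line into a dict of strings."""
    line = line.strip()
    if not line.startswith(PREFIX):
        raise ValueError("not an MSLK FMHA tuning line")
    body = line[len(PREFIX):].strip()
    return {key: value.strip() for key, value in PAIR.findall(body)}


# Log line copied from this post.
sample = (
    "[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) "
    "gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 "
    "B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False "
    "BLOCK_M=16 BLOCK_N=32 num_warps=2 num_stages=1"
)
info = parse_tuning_line(sample)
print(info["BLOCK_N"], info["num_warps"], info["gpu_name"])
```

All values come back as strings, so convert with `int(...)` if you want to compare numbers.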
MSLK_FMHA_BLOCK_N
Controls the width of the key/value tile processed by the attention kernel.
Valid values:
16
32
64
128
This is currently the most important tuning parameter.
General rule:
Lower value = less memory usage, more compatibility, potentially slower
Higher value = more performance, more memory usage, potentially less stable
MSLK_FMHA_BLOCK_M
Controls query tile size.
Valid values:
16
32
64
128
Usually leave this at:
16
unless specifically benchmarking.
MSLK_FMHA_NUM_WARPS
Controls how many GPU warps participate in kernel execution.
Think of it like how many GPU workers you want to use at one time.
Valid values:
1
2
4
8
Default:
2
MSLK_FMHA_NUM_STAGES
Controls pipeline staging inside Triton.
Why Does This Matter?
GPUs spend a lot of time waiting for data to arrive from memory.
Pipelining allows Triton to overlap:
Memory loads
Computation
Data movement
so the GPU spends less time sitting idle.
Valid values:
1
2
3
4
5
Default:
1
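Since each variable has a short list of valid values, a typo is easy to catch before launching. The sketch below checks a set of values against the lists given above. It is a hypothetical helper of my own, not something MSLK ships, and it only knows the values listed in this post.

```python
# Allowed values as listed in this post; the checker is a hypothetical helper.
VALID = {
    "MSLK_FMHA_POLICY": ("default", "auto", "blackwell_safe", "env", "off", "benchmark"),
    "MSLK_FMHA_BLOCK_N": ("16", "32", "64", "128"),
    "MSLK_FMHA_BLOCK_M": ("16", "32", "64", "128"),
    "MSLK_FMHA_NUM_WARPS": ("1", "2", "4", "8"),
    "MSLK_FMHA_NUM_STAGES": ("1", "2", "3", "4", "5"),
}


def check_settings(settings):
    """Return a list of problems; an empty list means every value is in its list."""
    problems = []
    for name, value in settings.items():
        allowed = VALID.get(name)
        if allowed is None:
            continue  # no value list for this variable (e.g. MSLK_FMHA_DEBUG)
        cleaned = value.strip()
        if cleaned and cleaned not in allowed:  # blank means "leave at default"
            problems.append(f"{name}={cleaned!r} is not one of {', '.join(allowed)}")
    return problems


print(check_settings({"MSLK_FMHA_POLICY": "env", "MSLK_FMHA_BLOCK_N": "48"}))
```

A value that passes this check can still fail at runtime, since the lists say nothing about which combinations fit on a given GPU.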
Suggested Settings
Most Users
set MSLK_FMHA_POLICY=auto
Done.
This is what I recommend for almost everyone.
RTX 5070 Laptop
Current testing has shown good results with:
set MSLK_FMHA_POLICY=auto
The auto policy automatically selects:
BLOCK_N=32
for known high-memory Blackwell attention paths.
RTX 5070 Desktop
Recommended:
set MSLK_FMHA_POLICY=auto
Additional testing is welcome.
RTX 5080
Recommended:
set MSLK_FMHA_POLICY=auto
If memory issues occur:
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_BLOCK_N=32
RTX 5090
Recommended:
set MSLK_FMHA_POLICY=auto
Many users may find that larger values work fine, but additional testing is needed.
RTX 4090
Recommended:
set MSLK_FMHA_POLICY=auto
or
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_BLOCK_N=64
if experimenting.
Advanced Users
The FMHA kernel operates by dividing attention work into tiles.
The most important tuning value is:
BLOCK_N
which controls how much key/value data is processed per tile.
Larger values:
64
128
can improve throughput but consume more shared memory.
Smaller values:
32
16
reduce memory pressure and can allow workloads that would otherwise fail.
This is especially important on:
Large resolutions
High-resolution upscaling
Large context lengths
High head dimensions
The tradeoff is straightforward:
Less memory = smaller tiles = potentially slower execution
More memory = larger tiles = potentially faster execution
What If I Get Memory Errors?
Try lowering:
set MSLK_FMHA_BLOCK_N=32
If problems continue:
set MSLK_FMHA_BLOCK_N=16
Remember:
Lower settings generally reduce memory usage.
Lower settings may also increase generation time.
The benefit is that you may be able to generate or upscale larger images than before.
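The step-down approach above could be automated roughly as follows. This is only a sketch: `try_run` is a hypothetical callback standing in for however you launch a generation and detect a memory error, and nothing here is part of MSLK.

```python
BLOCK_N_VALUES = (128, 64, 32, 16)  # valid values from this post, largest first


def find_working_block_n(try_run, start=64):
    """Try BLOCK_N values from `start` downward; return the first that works.

    `try_run(block_n)` is a hypothetical callback that returns True when a
    generation with that BLOCK_N finishes without a memory error.
    Returns None if no value at or below `start` works.
    """
    for block_n in BLOCK_N_VALUES:
        if block_n > start:
            continue  # skip values above the starting point
        if try_run(block_n):
            return block_n
    return None


# Stand-in callback that pretends only values of 32 or lower fit in memory.
print(find_working_block_n(lambda n: n <= 32))
```

In practice each attempt means restarting the web UI with a new `set MSLK_FMHA_BLOCK_N=...` line, so the callback would wrap that restart.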
Current Limitations
This tuning system is new.
Not every GPU, driver version, CUDA version, model architecture, or workflow has been tested.
Some combinations may provide better performance.
Some combinations may provide better memory efficiency.
Community testing is encouraged.
If you discover a configuration that works particularly well on your hardware, please share your results so we can continue refining the tuning recommendations.
TESTS
Settings:
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=16
set MSLK_FMHA_BLOCK_M=
set MSLK_FMHA_NUM_WARPS=4
set MSLK_FMHA_NUM_STAGES=
Confirmed via logs:
[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=16 num_warps=4 num_stages=1
Main takeaway:
Different FMHA kernel settings can still produce the same image output, but they change how much GPU memory the kernel needs and how fast or stable the run may be.

set MSLK_FMHA_POLICY=auto
set MSLK_FMHA_DEBUG=1
[MSLK FMHA tuning] policy=auto platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=32 num_warps=2 num_stages=1
This produced the same picture as these settings:
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=16
set MSLK_FMHA_BLOCK_M=
set MSLK_FMHA_NUM_WARPS=4
set MSLK_FMHA_NUM_STAGES=2
[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=16 num_warps=4 num_stages=2
But...
If you try combinations beyond what your GPU can handle, you will run into memory errors like:
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 214016, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
with settings like:
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=32
set MSLK_FMHA_BLOCK_M=
set MSLK_FMHA_NUM_WARPS=4
set MSLK_FMHA_NUM_STAGES=4
Key Takeaway:
This error means the selected kernel configuration asked for more shared memory than the GPU could provide for that kernel launch.
Example:
Required: 214016
Hardware limit: 101376
So the fix is not to reinstall everything. The fix is to reduce the resource-heavy tuning values.
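To see how far over the limit a configuration is, the two numbers can be pulled out of the error text. The sketch below assumes the message keeps the `Required: ..., Hardware limit: ...` wording quoted above; the function name `shared_memory_overshoot` is my own.

```python
import re

# Pattern based on the wording of the Triton error quoted in this post.
PATTERN = re.compile(r"Required:\s*(\d+),\s*Hardware limit:\s*(\d+)")


def shared_memory_overshoot(message):
    """Return (required, limit, ratio) from the error text, or None if no match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    required, limit = int(match.group(1)), int(match.group(2))
    return required, limit, required / limit


msg = (
    "triton.runtime.errors.OutOfResources: out of resource: shared memory, "
    "Required: 214016, Hardware limit: 101376. Reducing block sizes or "
    "`num_stages` may help."
)
required, limit, ratio = shared_memory_overshoot(msg)
print(f"requested {ratio:.2f}x the shared memory limit")
```

For the error above the request is a little over twice the limit, which is why a small reduction was not enough and several values had to come down.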
So what did I do? I kept lowering the values (here num_warps and num_stages) until a configuration worked:
set MSLK_FMHA_POLICY=env
set MSLK_FMHA_DEBUG=1
set MSLK_FMHA_BLOCK_N=32
set MSLK_FMHA_BLOCK_M=
set MSLK_FMHA_NUM_WARPS=2
set MSLK_FMHA_NUM_STAGES=2
[MSLK FMHA tuning] policy=env platform=Windows capability=(12, 0) gpu_name=NVIDIA GeForce RTX 5070 Laptop GPU is_hip=False Kq=512 Kkv=512 B=1 M=21600 Mq=21600 split_k=64 is_paged=False use_fp8_path=False BLOCK_M=16 BLOCK_N=32 num_warps=2 num_stages=2
Final Thoughts
Version 1.1.0 proved that reducing FMHA memory requirements could solve real-world Blackwell issues.
Version 1.2.0 turns that one-off fix into a flexible tuning system.
Instead of rebuilding MSLK every time we want to experiment, users can now adjust behavior at runtime and help identify the best settings for each GPU architecture.

