The OP of this TensorRT article is mostly right.
A few headachey gotchas I ran into, since it's a little light on details; otherwise a nicely done how-to. Thanks OP.
If you google cuDNN and come across a lot of frustrated Reddit posts (NVIDIA being on-brand for being popular with frustrated people), here are the reasons why. I hope I'm helping fill in some of the gotchas and, omg, the gd headaches:
First things first: install cuDNN.
This can be done as:
pip install nvidia-cudnn-cu11
(as far as I can tell, the actual PyPI wheel is nvidia-cudnn-cu11 for CUDA 11 builds, not plain "cudnn"; adjust for your CUDA version)
Now you might think: cool, done.
Sigh.
Well, alas, no. NVIDIA is being on-brand. After a lot of deep hurting (kicked off by a popup error), I found two threads that help clear things up:
https://forums.developer.nvidia.com/t/cuda-error/163799
and:
https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT/issues/12
Suffice it to say, this error message doesn't directly help anyone with even half a brain. I probably have less than half a brain, but I did go to a highly regarded university, and blah blah, googled, and what fixed it was copying the files by hand:
from: Downloads\cudnn-windows-x86_64-8.6.0.163_cuda11-archive\bin
to: stable-diffusion-webui\venv\Lib\site-packages\nvidia\cudnn\bin
and also
from: Downloads\cudnn-windows-x86_64-8.6.0.163_cuda11-archive\lib\x64
to: stable-diffusion-webui\venv\Lib\site-packages\nvidia\cudnn\lib\x64
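If you'd rather let the shell do the copying, here's a minimal sketch of those two copies as Windows commands (assuming the archive landed in Downloads and you run this from the folder that contains stable-diffusion-webui; adjust paths to wherever your archive and venv actually live):

rem cuDNN DLLs into the venv's cudnn\bin
xcopy /y /i "%USERPROFILE%\Downloads\cudnn-windows-x86_64-8.6.0.163_cuda11-archive\bin" "stable-diffusion-webui\venv\Lib\site-packages\nvidia\cudnn\bin"
rem import libraries into the venv's cudnn\lib\x64
xcopy /y /i "%USERPROFILE%\Downloads\cudnn-windows-x86_64-8.6.0.163_cuda11-archive\lib\x64" "stable-diffusion-webui\venv\Lib\site-packages\nvidia\cudnn\lib\x64"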
Now TensorRT should work.
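A quick sanity check that the DLLs actually load (a sketch; assumes you run it with the webui's venv activated, so you get the same torch the webui uses):

rem should print True plus a version number like 8600 if cuDNN is found
python -c "import torch; print(torch.backends.cudnn.is_available(), torch.backends.cudnn.version())"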
Next: installing TensorRT itself (a.k.a. severely breaking A1111 and making you want to break something).
Go to:
https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT
Copy and paste ^^^ into the "Install from URL" field under the Extensions tab... yes, that's right. No, I have no idea why it's not a normal one-click install from the extension index.
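If Install from URL just hangs on you, the manual equivalent should work too (my assumption, based on A1111 extensions being plain git repos inside the extensions folder):

rem run from inside the stable-diffusion-webui folder, then restart the webui
git clone https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT extensions\Stable-Diffusion-WebUI-TensorRT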
You might want to go pull up Goat Simulator; it took me about 10 minutes. Please, oh lords of the AI art, make that go faster for others.
OK, so you have it installed, nothing works, and you've just summoned a demon to fix it. What now? Well, honestly, the existing bug reports aren't being super helpful, so otherwise:
Sacrifice something to Beelzebub
Threaten your computer
Google: [*&*!@**!@*!((!(@(!!!! this time!!!]
Pull up Goat Simulator
OK, still here? Why?
OK, well, what I did to fully hork things up and then get it working was... exactly what I did at the start (the cuDNN file copy above).
But just in case, related reading:
Having issues trying to get tensorrt working with frigate : r/unRAID (reddit.com)
the new NVIDIA TensorRT extension breaks my automatic1111 : r/StableDiffusion (reddit.com)
https://github.com/NVIDIA/TensorRT/issues/851
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5427
Next:
OK, so for some reason you're still reading this. First: cool, hope it's helpful. Next:
UGH.
OK, so now: this one really tripped me up: you aren't recompiling a full model. You're just, uh, giving it a faster car. The export builds an optimized TensorRT engine for the model's UNet; your checkpoint itself is untouched.
So next up, how to speak Nvidia-ese. You'll want to go to Settings ---> User Interface ---> Quicksettings list and add sd_unet so that the SD Unet dropdown shows up. I had to just close my browser window and the console and restart, because the UI just hung there, no refresh.
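For reference, that quicksettings change lands in the webui's config.json. On recent A1111 builds the relevant key looks roughly like the line below (the key name has changed across webui versions, so treat this as a sketch, not gospel):

"quicksettings_list": ["sd_model_checkpoint", "sd_unet"]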
ALSO: building the Unets, as far as I can tell, doesn't show up in the main window's progress bar, and sometimes doesn't even get started. I have no [redacted] clue why. Is my setup cursed by Loki? Impending doom and destruction? Signs of a zombie apocalypse? Or just a quirk or bug someplace? Probably just how it works at the moment.
Suffice it to say, the console output will look like a small red cassette/VHS tape of a progress bar claiming 928938123 or 20000000000 seconds to go, and might read kind of like this:
[W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[I] Loading tactic timing cache from C:\Users\gorks\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\timing_caches\timing_cache_win_cc89.cache
[I] Building engine with configuration:
Flags | [FP16, REFIT, TF32]
Engine Capability | EngineCapability.DEFAULT
Memory Pools | [WORKSPACE: 12281.50 MiB, TACTIC_DRAM: 12281.50 MiB]
Tactic Sources | [CUBLAS, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
Profiling Verbosity | ProfilingVerbosity.LAYER_NAMES_ONLY
Preview Features | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
Building engine: 100%|███████████████████████████████████████████████████████████████████| 6/6 [07:27<00:00, 74.62s/it]
[I] Finished engine building in 453.073 seconds
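About that lazy-loading warning at the top of the log: per the linked CUDA docs, you can enable it via the CUDA_MODULE_LOADING environment variable, which can cut device memory use and speed up TensorRT initialization. A minimal sketch, assuming the stock webui-user.bat launcher:

rem either add this line near the top of webui-user.bat, or run it in the console before launching
set CUDA_MODULE_LOADING=LAZY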
Or this:
Exporting xenoengineAcrossThe_5th1 to TensorRT
{'sample': [(1, 4, 64, 64), (2, 4, 64, 64), (8, 4, 96, 96)], 'timesteps': [(1,), (2,), (8,)], 'encoder_hidden_states': [(1, 77, 768), (2, 77, 768), (8, 154, 768)]}
TensorRT engine found. Skipping build. You can enable Force Export in the Advanced Settings to force a rebuild if needed.
Exporting xenoengineAcrossThe_5th1 to TensorRT
{'sample': [(1, 4, 64, 64), (2, 4, 64, 64), (8, 4, 96, 96)], 'timesteps': [(1,), (2,), (8,)], 'encoder_hidden_states': [(1, 77, 768), (2, 77, 768), (8, 154, 768)]}
TensorRT engine found. Skipping build. You can enable Force Export in the Advanced Settings to force a rebuild if needed.
Exporting xenoengineAcrossThe_5th1 to TensorRT
{'sample': [(1, 4, 64, 64), (2, 4, 64, 64), (8, 4, 96, 96)], 'timesteps': [(1,), (2,), (8,)], 'encoder_hidden_states': [(1, 77, 768), (2, 77, 768), (8, 154, 768)]}
TensorRT engine found. Skipping build. You can enable Force Export in the Advanced Settings to force a rebuild if needed.
Exporting xenoengineAcrossThe_5th1 to TensorRT
See what I mean? I didn't change anything, but something it liked on the third try? I have no idea what, though. (Note the "TensorRT engine found. Skipping build" lines: if you actually want a rebuild, that's what Force Export in the Advanced Settings is for.)
And yes! The gods liked me this time!
Your output might look like this:
Exporting icbinpICantBelieveIts_afterburn to TensorRT
{'sample': [(1, 4, 64, 64), (2, 4, 64, 64), (8, 4, 96, 96)], 'timesteps': [(1,), (2,), (8,)], 'encoder_hidden_states': [(1, 77, 768), (2, 77, 768), (8, 154, 768)]}
No ONNX file found. Exporting ONNX...
Disabling attention optimization
[I] Folding Constants | Pass 1
[I] Total Nodes | Original: 8992, After Folding: 6216 | 2776 Nodes Folded
[I] Folding Constants | Pass 2
[I] Total Nodes | Original: 6216, After Folding: 4968 | 1248 Nodes Folded
[I] Folding Constants | Pass 3
[I] Total Nodes | Original: 4968, After Folding: 4968 | 0 Nodes Folded
Exported to ONNX.
Building TensorRT engine... This can take a while, please check the progress in the terminal.
[I] Building engine with configuration:
Flags | [FP16, REFIT, TF32]
Engine Capability | EngineCapability.DEFAULT
Memory Pools | [WORKSPACE: 12281.50 MiB, TACTIC_DRAM: 12281.50 MiB]
Tactic Sources | [CUBLAS, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
Profiling Verbosity | ProfilingVerbosity.LAYER_NAMES_ONLY
Preview Features | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
Building engine: 50%|█████████████████████████████████▌ | 3/6 [00:00<00:00, 9.57it/s][W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 22215426048 detected for tactic 0x0000000000000000. costs: 0%| | 0/5 [00:00<?, ?it/s]
Building engine: 100%|███████████████████████████████████████████████████████████████████| 6/6 [01:09<00:00, 11.65s/it]
[I] Finished engine building in 70.837 seconds
stable-diffusion-webui\models\Unet-trt\icbinpICantBelieveIts_afterburn_4e7a3dfd_cc89_sample=1x4x64x64+2x4x64x64+8x4x96x96-timesteps=1+2+8-encoder_hidden_states=1x77x768+2x77x768+8x154x768.trt
TensorRT engines has been saved to disk.
Now, the SDXL models: their output looks different, and they make your GPU very "happy" (read: prone to crashing). For reference, that long .trt filename encodes the shape profile you exported: sample=1x4x64x64 means batch 1 at a 64x64 latent, i.e. a 512x512 image (latents are 1/8 the pixel size), and the 8x4x96x96 end of the range is batch 8 at 768x768.
Or they might do this:
Model loaded in 9.7s (create model: 0.6s, apply weights to model: 8.4s, move model to device: 0.2s, calculate empty prompt: 0.2s).
Exporting starlightXLAnimated_v2 to TensorRT
{'sample': [(1, 4, 96, 96), (2, 4, 128, 128), (8, 4, 128, 128)], 'timesteps': [(1,), (2,), (8,)], 'encoder_hidden_states': [(1, 77, 2048), (2, 77, 2048), (8, 154, 2048)], 'y': [(1, 2816), (2, 2816), (8, 2816)]}
No ONNX file found. Exporting ONNX...
Disabling attention optimization
That's Nvidia-speak for: "we're temporarily turning off be-nice-to-your-computer mode so you get your renders of anime chicks and weird landscapes faster."
But not too much else is different:
[a few hours later]
If the console looks like this, that's its way of saying: no way in hell is this going to work, please hang up and try again:
[I] Building engine with configuration:
Flags | [FP16, REFIT, TF32]
Engine Capability | EngineCapability.DEFAULT
Memory Pools | [WORKSPACE: 12281.50 MiB, TACTIC_DRAM: 12281.50 MiB]
Tactic Sources | [CUBLAS, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
Profiling Verbosity | ProfilingVerbosity.LAYER_NAMES_ONLY
Preview Features | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
Building engine: 50%|██████████████████████████████████ | 3/6 [00:02<00:00, 10.97it/s]
Building engine from subgraph: 0%| | 0/1 [00:00<?, ?it/s]
Computing profile costs: 0%| | 0/5 [00:00<?, ?it/s]
Timing tactics: 0%| | 0/3 [00:00<?, ?it/s]
Timing grap[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[W] Requested amount of GPU memory (12006195200 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 12006195200 detected for tactic 0x0000000000000000.
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)des: 20%|████████████▍ | 51/254 [00:44<03:15, 1.04it/s]
[W] Requested amount of GPU memory (11922309120 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 11922309120 detected for tactic 0x0000000000000000.
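Those OutOfMemory lines mean the engine build wanted roughly 12 GB of VRAM that wasn't free. Before sacrificing anything else to Beelzebub, check what's actually sitting on the GPU and close it; a quick look from any console:

rem lists total/used VRAM and which processes are holding it
nvidia-smi

If it still won't fit, the other lever (in my experience; your mileage may vary) is shrinking the export profile in the extension's Advanced Settings, e.g. a smaller max batch size or max resolution, since the biggest shapes in the profile drive how much memory the build asks for.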
Anyway, hope this little addendum is helpful.