NVIDIA released a TensorRT toolkit that speeds up model inference by around 30%. It works great on SD3.
https://blogs.nvidia.com/blog/ai-decoded-flux-one/
I want this for Flux, but I don't have enough local VRAM to convert the model myself. Can a hero get this done? Yes, I'm aware of the tradeoffs with LoRAs etc.; I just want to speed up raw-doggin' Flux dev. To my knowledge this hasn't been done yet, possibly because of a float16 issue in the ONNX export (rough conversion sketch after these links). From a HF comment:
Could this somehow help? https://github.com/microsoft/onnxscript/pull/1492
See also: https://github.com/microsoft/onnxruntime/issues/13001
and: https://github.com/microsoft/onnxscript/issues/1462
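For whoever attempts this, here's a minimal sketch of the ONNX export step. Big caveats: it assumes diffusers' FluxTransformer2DModel API, and the input shapes are my guesses for a static 1024x1024 render (4096 packed image tokens, 512 T5 tokens) — double-check everything. The float16 export bug from the issues above may still bite; exporting in fp32 and letting TensorRT cast down at engine-build time is a common workaround.

```python
# Hedged sketch: export the Flux dev transformer to ONNX. Untested.
# Assumptions: diffusers' FluxTransformer2DModel, static 1024x1024 shapes.
import torch
from diffusers import FluxTransformer2DModel

DTYPE = torch.float16  # drop to torch.float32 if the fp16 export bug hits

model = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=DTYPE
).eval()
# .to("cuda") if you have the VRAM; fp16 ops on CPU are slow/spotty

class Wrapper(torch.nn.Module):
    """Flatten the kwargs-based forward() into positional args for export."""
    def __init__(self, m):
        super().__init__()
        self.m = m

    def forward(self, hidden_states, encoder_hidden_states, pooled_projections,
                timestep, img_ids, txt_ids, guidance):
        return self.m(
            hidden_states=hidden_states,
            encoder_hidden_states=encoder_hidden_states,
            pooled_projections=pooled_projections,
            timestep=timestep,
            img_ids=img_ids,
            txt_ids=txt_ids,
            guidance=guidance,
            return_dict=False,
        )[0]

# Dummy inputs; shapes are my guess from the diffusers docs.
B, IMG_SEQ, TXT_SEQ = 1, 4096, 512
dummy = (
    torch.randn(B, IMG_SEQ, 64, dtype=DTYPE),    # hidden_states (packed latents)
    torch.randn(B, TXT_SEQ, 4096, dtype=DTYPE),  # encoder_hidden_states (T5)
    torch.randn(B, 768, dtype=DTYPE),            # pooled_projections (CLIP)
    torch.ones(B, dtype=DTYPE),                  # timestep
    torch.zeros(IMG_SEQ, 3, dtype=DTYPE),        # img_ids (older diffusers want a batch dim)
    torch.zeros(TXT_SEQ, 3, dtype=DTYPE),        # txt_ids
    torch.full((B,), 3.5, dtype=DTYPE),          # guidance (dev is guidance-distilled)
)

# The 12B weights exceed protobuf's 2 GB limit, so recent torch versions
# spill them to external data files next to the .onnx automatically.
torch.onnx.export(
    Wrapper(model), dummy, "flux_dev_transformer.onnx",
    input_names=["hidden_states", "encoder_hidden_states", "pooled_projections",
                 "timestep", "img_ids", "txt_ids", "guidance"],
    output_names=["latent"],
    opset_version=17,
)
```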
To win this bounty:
- Convert the model and upload the FP16 Flux dev TensorRT engine to Civitai (engine-build sketch after this list)
- Post render comparisons using the TensorRT workflow
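For the engine step, a sketch of building the FP16 plan with the TensorRT Python API. File names are placeholders and a dynamic-shape optimization profile is omitted for brevity, so this only covers static shapes:

```python
# Hedged sketch: build an FP16 TensorRT engine from the exported ONNX.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# parse_from_file also picks up the external weight files next to the .onnx
if not parser.parse_from_file("flux_dev_transformer.onnx"):
    for i in range(parser.num_errors):
        print(parser.get_error(i))
    raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # let TRT pick fp16 kernels
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 16 << 30)  # 16 GB, adjust

plan = builder.build_serialized_network(network, config)
if plan is None:
    raise RuntimeError("engine build failed")
with open("flux_dev_fp16.plan", "wb") as f:
    f.write(plan)
```

Roughly the same thing via `trtexec --onnx=flux_dev_transformer.onnx --fp16 --saveEngine=flux_dev_fp16.plan` if you'd rather not touch Python.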
Comfy workflow: