Creating a DIY fp16 AuraFlow checkpoint
I love the open source AuraFlow model, both for its philosophy and for its impressive prompt adherence. I was especially excited to see the release of fp16 diffusers models on HuggingFace (https://huggingface.co/fal/AuraFlow), as I have a Framework 16 with an RX 7700S 8 GB VRAM GPU (Arch Linux btw). Before the fp16 model, I could only just barely get the fp32 model running: I'd have to run up to the end of sampling, save the sampled latent to a file, flush my GPU VRAM (restart ComfyUI), load the sampled latent, and only then decode it into my image.
Though the fp16 diffuser is awesome, it's a little annoying to get running in ComfyUI, owing to the fact that the transformer is split across two files. So here are the steps I took to get my own DIY fp16 AuraFlow checkpoint.
Preview of the results
Before I get too deep in the weeds, I wanted to show upfront how the official fp32 checkpoint compares to my hacked together fp16 checkpoint.
I stole the following prompt from here: https://civitai.com/images/4231908
Positive Prompt:
zrpgstyle, from_side ornate royal robes embroidered runes (royal elderly male dwarf:1.2) dwarven king magnificent palace snowy mountains (granite marble:1.1) thick braided beard bright morning light (masterpiece:1.1) (best quality) (detailed) (intricate) (8k) (HDR) (cinematic lighting) (sharp focus)
Negative Prompt:
plastic shiny tattoo nude bare chest jewelry (photo photography photograph) (bad hands) (disfigured) (grain) (Deformed) (poorly drawn) (mutilated) (lowres) (dark) (lowpoly) (CG) (3d) (blurry) (out-of-focus) (depth_of_field) (duplicate) (watermark) (label) (signature) (text) (cropped)
Sampler: euler
Scheduler: sgm_uniform
Steps: 20
CFG: 3.5
Seed: 242305700926066
FP32:
FP16:
I will leave judging the difference in quality to you, but the fp16 image was created in a single ComfyUI run, whereas the fp32 image required separating KSampling and VAE decoding into two separate jobs (saving the latent in between).
See this post (https://civitai.com/posts/4777005) for more examples.
Step by Step guide
Though ComfyUI already has a DiffusersLoader node, it doesn't work with the AuraFlow fp16 diffusers model, because the transformer is split into two files. While searching for a solution, I found ComfyUI-DiffusersLoader (https://github.com/Scorpinaus/ComfyUI-DiffusersLoader), which could load multi-part CLIP models, but not multi-part UNets or transformers. So I forked the repo and modified it to work with the AuraFlow setup (https://github.com/OmegaLambda1998/ComfyUI-DiffusersLoader). Huge thanks to Scorpinaus for the initial nodes!!!
Here are the steps needed to build your own checkpoint:
1. Git clone my DiffusersLoader fork (https://github.com/OmegaLambda1998/ComfyUI-DiffusersLoader) into your ComfyUI/custom_nodes directory.
2. In ComfyUI/models/diffusers, create a folder called AuraFlow or some such, and create the following directories inside it:
- vae/ (optional, see below)
- transformer/
- text_encoder/
3. From the AuraFlow HuggingFace repo (https://huggingface.co/fal/AuraFlow), download the following files into the matching folders (or script the download; see the sketch after this list):
- model_index.json -> AuraFlow/model_index.json
- vae/diffusion_pytorch_model.fp16.safetensors -> AuraFlow/vae/diffusion_pytorch_model.fp16.safetensors (optional, see VAE section)
- text_encoder/model.fp16.safetensors -> AuraFlow/text_encoder/model.fp16.safetensors
- transformer/diffusion_pytorch_model-00001-of-00002.fp16.safetensors -> AuraFlow/transformer/diffusion_pytorch_model-00001-of-00002.fp16.safetensors
- transformer/diffusion_pytorch_model-00002-of-00002.fp16.safetensors -> AuraFlow/transformer/diffusion_pytorch_model-00002-of-00002.fp16.safetensors
The only really important file names are the transformer shards, which must contain 00001-of-00002 and 00002-of-00002.
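If you'd rather script the download than click through HuggingFace, here's a minimal sketch using the huggingface_hub Python package (this is just my own convenience suggestion, not something the workflow requires; the repo and file names are exactly the ones listed above, and local_dir should be wherever you made your AuraFlow folder):

from huggingface_hub import hf_hub_download

# Files from https://huggingface.co/fal/AuraFlow, mirroring the layout above.
files = [
    "model_index.json",
    "vae/diffusion_pytorch_model.fp16.safetensors",  # optional, see VAE section
    "text_encoder/model.fp16.safetensors",
    "transformer/diffusion_pytorch_model-00001-of-00002.fp16.safetensors",
    "transformer/diffusion_pytorch_model-00002-of-00002.fp16.safetensors",
]

for f in files:
    # local_dir preserves the vae/, text_encoder/ and transformer/ subfolders
    hf_hub_download(
        repo_id="fal/AuraFlow",
        filename=f,
        local_dir="ComfyUI/models/diffusers/AuraFlow",  # adjust to your setup
    )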
Actually combining the diffuser files into a single checkpoint is quite easy, though I've included a workflow json in this article just in case.
All you need is:
- Diffusers UNET Loader: loads in and concatenates the transformer shards. Make sure diffuser points to your AuraFlow folder, and file_parts is all.
- Diffusers CLIP Loader: loads in the text_encoder. Make sure diffuser points to your AuraFlow folder, clip_type is stable_diffusion (I haven't tried other options, but this works), and file_parts is none.
- Diffusers VAE Loader: loads in the vae. Make sure diffuser points to your AuraFlow folder.
- Save Checkpoint: combines all the different diffusers parts into a single checkpoint and saves it.
Connect all the loader nodes to the Save Checkpoint node and run it, then move your new checkpoint from ComfyUI/output/checkpoints/ into ComfyUI/models/checkpoints/ (or wherever you normally keep checkpoints).
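For the curious, the "concatenation" the UNET loader performs is conceptually just merging the tensors from both shards into a single state dict before it gets handed to ComfyUI. A rough illustrative sketch with the safetensors package (not the actual node code; the paths are just the layout from above):

from safetensors.torch import load_file, save_file

shards = [
    "ComfyUI/models/diffusers/AuraFlow/transformer/diffusion_pytorch_model-00001-of-00002.fp16.safetensors",
    "ComfyUI/models/diffusers/AuraFlow/transformer/diffusion_pytorch_model-00002-of-00002.fp16.safetensors",
]

# Each shard holds a disjoint subset of the transformer's tensors,
# so "concatenating" them is just a dict merge.
state_dict = {}
for shard in shards:
    state_dict.update(load_file(shard))

print(f"merged {len(state_dict)} tensors")
# save_file(state_dict, "auraflow_transformer.fp16.safetensors")  # if you want the raw merged file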
Voilà, one fp16 AuraFlow checkpoint, all for yourself. On my system this reduced the size of the model from 16 GB to 8 GB (what do you know, half the size), and more importantly made using an amazing model much more convenient.
Additional Tips and Tricks:
I have an RX 7700S with 8 GB of VRAM, running on Arch Linux. I have the following environment variables and ComfyUI CLI args enabled, which got everything working nicely.
Environment Variables:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0
export ROCR_VISIBLE_DEVICES=0
export PYTORCH_ROCM_ARCH=gfx1102
export HCC_AMDGPU_TARGET=gfx1102
export AMDGPU_TARGETS=gfx1102
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:512
export TRITON_USE_ROCM=ON
export USE_CUDA=0
export HSA_ENABLE_SDMA=0
No idea how important all of these are; I think only HSA_OVERRIDE_GFX_VERSION and PYTORCH_ROCM_ARCH are strictly necessary, with the correct values depending on your GPU.
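If you want a quick sanity check that PyTorch actually sees the GPU with these variables set, a small snippet like this is enough (it only uses standard torch calls; on ROCm builds the GPU is still exposed through the "cuda" API):

import torch

print("HIP/ROCm version:", torch.version.hip)   # None on CUDA-only builds
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))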
Command Line Arguments:
--verbose
--disable-auto-launch
--disable-xformers
--disable-cuda-malloc
--force-fp16
--fp8_e4m3fn-unet # --fp16-unet works but is much slower
--fp16-vae # see vae info
--fp8_e4m3fn-text-enc # --fp16-text-enc works but is much slower
--preview-method none
--use-quad-cross-attention # see vae info
--force-upcast-attention # see vae info
# With the fp16 checkpoint, I don't need lowvram or disable-smart-memory!!!
#--lowvram
#--disable-smart-memory
Getting --fp16-vae up and running:
To get --fp16-vae working I needed to:
- Use torch==2.3.1+rocm5.7; on rocm6.0 I get lots of annoying, weird errors, and the rocm6.1 nightly is even worse.
- Use quad-cross-attention in addition to force-upcast-attention. This was annoying because I had just gotten https://github.com/Beinsezii/comfyui-amd-go-fast working, which enables flash-attention for AMD ROCm, i.e. lets me use pytorch-cross-attention, which is significantly faster.
- Use https://huggingface.co/madebyollin/sdxl-vae-fp16-fix instead of the default AuraFlow fp16 VAE model. This just means changing the Diffusers VAE Loader into a normal VAE Loader, and loading the sdxl-vae-fp16-fix model into the checkpoint instead.
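If you want to convince yourself the swapped-in VAE really holds up in fp16, here's a small smoke-test sketch using the diffusers package (my own addition, not part of the workflow: it decodes a random latent at a realistic scale and checks for the NaN/overflow behaviour that usually shows up as black images):

import torch
from diffusers import AutoencoderKL

# madebyollin/sdxl-vae-fp16-fix is a diffusers-format repo, so it loads directly.
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
).to("cuda")

# Decode a random 128x128 latent (roughly a 1024x1024 image) and look for fp16 overflow.
latent = torch.randn(1, 4, 128, 128, dtype=torch.float16, device="cuda")
with torch.no_grad():
    image = vae.decode(latent / vae.config.scaling_factor).sample

print("NaNs in decoded image:", torch.isnan(image).any().item())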
Closing Remarks
I won't be uploading the checkpoint, because I fully expect fal to release a much higher quality version eventually, and the model is still 8 GB, which, whilst much smaller, is still fairly significant for my poor Aussie wifi. I expect this setup to become pretty much obsolete as fal continues developing and optimising their awesome model, but in the meantime I hope this is useful for us low-VRAM folk.