Every BF16 Model Is “Fake” — And Here’s Why

Even with a standard EPS of 1e-8 and gradient clipping, the likelihood of not finding a single near-zero value (on the order of 1e-10) among billions of values points to FP16 rounding, or a fallback, at some level.

Any CPU call that does not explicitly use FP32 (or INT64 in some cases, which is horrible because you lose precision) rounds BF16 to FP16 values. This is not the fault of the safetensors format; it might be caused by a host of issues, including:

Autocasting
NumPy calls by transformers (ANY NumPy call that is not explicitly FP32)
Custom attention builds of FlashAttention or xFormers

Cast a BF16 model like Qwen, or even an FP32 model like T5, and try to find a value near 1e-10 or any of the 50k values that would overflow FP16. I could not find a single one. Granted, that is less than 1% of the FP32 values, but they should still exist.
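The search described above can be sketched without any framework. This is a minimal sketch, assuming the weights have already been dumped to a plain list of Python floats; `fp16_fate` and `count_fp16_casualties` are hypothetical helper names, not part of PyTorch or safetensors. It relies on the standard-library `struct` module, whose `e` format packs IEEE half precision and raises `OverflowError` for values too large for FP16.

```python
import struct

def fp16_fate(x: float) -> str:
    """Classify what an FP16 cast would do to a finite value x."""
    try:
        # 'e' is IEEE 754 half precision (FP16) in the struct module.
        y = struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return "overflow"   # |x| is too large for FP16 (max 65504)
    if x != 0.0 and y == 0.0:
        return "underflow"  # too small for FP16, flushed to zero
    return "ok"

def count_fp16_casualties(values):
    """Count values that could not have survived a pass through FP16."""
    counts = {"ok": 0, "overflow": 0, "underflow": 0}
    for v in values:
        counts[fp16_fate(v)] += 1
    return counts

# Hypothetical sample standing in for real model weights.
sample = [0.5, -3.25, 1e-10, 70000.0, 6.1e-5]
print(count_fp16_casualties(sample))
```

If a checkpoint with billions of values reports zero in both the `overflow` and `underflow` buckets, that is the pattern described above.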

Another possibility is that TF32 is completely ignored unless CUTLASS and FlashAttention are installed correctly, which is a major pain on Windows.

Every indication is that PyTorch does not call mma.h directly and may not use the WMMA/Tensor Core kernels without CUTLASS.

~~The tensor is quietly converted to FP16~~. (Safetensors and PyTorch are not at fault; the Rich Flecher math is fine for saving the format.)
When moved to CPU, it is converted to FP32.

Why?

PyTorch has no BF16 support on 99% of CPUs. (Autocast will pretend your CPU can by using FP32.)
Most CPUs cannot handle BF16 arithmetic (except some Xeon CPUs).
If you have a Xeon, the math for AVX-512 is here
x64 systems (Linux, Windows, etc.) need the FP32 variant.

So even though your GPU computations use BF16, the saved values are not truly BF16 if at any point they pass through a kernel call that is not full float.
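To make the difference between the two paths concrete, here is a minimal sketch that emulates BF16 in plain Python by treating it as the upper 16 bits of an FP32 value. The helper names are hypothetical, and this models only the number formats, not what any PyTorch kernel actually does. Under that assumption, a trip through FP32 returns the identical bit pattern, while a trip through FP16 does not for values outside FP16's range.

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    """Round a non-NaN FP32-range float to BF16 (nearest-even), return 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the dropped half
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen BF16 bits to FP32 by appending 16 zero bits (always exact)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def via_fp32(b: int) -> int:
    """BF16 -> FP32 -> BF16."""
    return float_to_bf16_bits(bf16_bits_to_float(b))

def via_fp16(b: int) -> int:
    """BF16 -> FP16 -> BF16; assumes out-of-range values saturate to infinity."""
    x = bf16_bits_to_float(b)
    try:
        y = struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        y = float("inf") if x > 0 else float("-inf")
    return float_to_bf16_bits(y)

tiny = float_to_bf16_bits(1e-10)
print(via_fp32(tiny) == tiny)  # FP32 path keeps the bits
print(via_fp16(tiny) == tiny)  # FP16 path flushes the value to zero
```

In this model, values inside FP16's normal range survive the FP16 path unchanged, because FP16 carries more mantissa bits than BF16. Only the very small and very large values are affected, which is why those are the ones worth searching for.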

Due to the sheer amount of code, it is extremely hard to track down why. I spent hours looking at the C source for PyTorch and still have not found any CUDA math for BF16; I only see FP16 imports. However, this would not be the cause of the issue: not using cuda_bf16.h would only slow down rounding.

Again, with millions of lines of code, it is not humanly possible to go through it line by line, and GPT and Claude can only point you to where the file might be.

How to save BF16 properly without it being cast to FP16

Do not use AMP; the attention call seems to go through a kernel call that sends the tensor to FP16 and back to BF16?

Make sure that when transformers imports NumPy, NumPy stays in FP32?
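One way to check whether a save really preserved the values is to compare bit patterns before and after. This is a minimal sketch under stated assumptions: the values have been read back as FP32 floats into plain Python lists, and `fp32_bits`, `is_exact_bf16`, and `changed_indices` are hypothetical helpers, not functions from PyTorch, safetensors, or NumPy.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x stored as FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def is_exact_bf16(x: float) -> bool:
    """True if x fits BF16 exactly: the low 16 bits of its FP32 pattern are zero."""
    return fp32_bits(x) & 0xFFFF == 0

def changed_indices(before, after):
    """Indices where the FP32 bit patterns differ between two value lists."""
    return [i for i, (a, b) in enumerate(zip(before, after))
            if fp32_bits(a) != fp32_bits(b)]

# Hypothetical data: values before saving, and the same values after reloading.
before = [1.0, 0.15625, -2.5]
after = [1.0, 0.15625, -2.5]
print(all(is_exact_bf16(v) for v in after))
print(changed_indices(before, after))
```

Any index returned by `changed_indices` marks a value that was altered somewhere between the save and the reload, which narrows down where a cast could have happened.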

The bottom line

~~Currently, there is no straightforward way to save BF16 tensors in PyTorch while keeping them truly BF16~~.

Corrected: Safetensors has sound math for saving. AMP seems to be what causes the kernel call, or possibly a fallback when the WMMA/Tensor Core kernel cannot be called while using TF32.