If this has been covered elsewhere, I haven't run across it. Compared to Adam 8bit, I was training 3x-5x faster, even while training the text encoder (TE), not U-Net only.
I have found I can train with Adapt Lion at Dim 32 or 64 with only 8GB of VRAM (a minimal optimizer sketch follows the settings list below):
Full FP8
1024x1024
Batch Size 1
FP16 or BF16 via Accelerate (I have found no speed difference between FP16 and BF16 when using FP8)
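In case anyone wants to try the optimizer outside a trainer GUI, here is a minimal sketch, assuming "Adapt Lion" refers to the DAdaptLion optimizer from the dadaptation package; the tiny model, loss, and hyperparameters are placeholders, not my actual setup:

```python
# Minimal sketch: swapping DAdaptLion in where Adam 8bit would go.
# Assumes: pip install torch dadaptation
# The Linear model and dummy loss are placeholders, not an SD training setup.
import torch
import torch.nn as nn
from dadaptation import DAdaptLion

model = nn.Linear(128, 128).cuda()

# D-Adaptation optimizers estimate the step size themselves;
# by convention lr stays at 1.0 and the optimizer scales it internally.
optimizer = DAdaptLion(model.parameters(), lr=1.0, weight_decay=0.01)

for step in range(100):
    x = torch.randn(1, 128, device="cuda")  # batch size 1, as above
    loss = model(x).pow(2).mean()           # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```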
Note this uses:
20GB of RAM
50GB of Virtual Memory
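If you want to see what your own run is eating, here is a quick sketch using the psutil package (my suggestion, not part of the training setup) to poll RAM and swap/pagefile from a second terminal:

```python
# Quick sketch: report system RAM and swap/pagefile usage while training.
# Assumes: pip install psutil. Run in a second terminal alongside the trainer.
import time
import psutil

GIB = 1024 ** 3

while True:
    mem = psutil.virtual_memory()   # physical RAM
    swap = psutil.swap_memory()     # swap / Windows pagefile ("virtual memory")
    print(f"RAM used: {mem.used / GIB:.1f} / {mem.total / GIB:.1f} GiB | "
          f"Swap used: {swap.used / GIB:.1f} GiB")
    time.sleep(5)
```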
I am not using Dynamo or DeepSpeed.
I compiled DeepSpeed (it took around 15 minutes), but I could never get it to work correctly, and it was slowing my training down. It did load successfully, and I can even get it to load into ComfyUI, but I don't think it is actually being used.
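If anyone wants to check the same thing on their end, here is a rough sketch for an Accelerate-based script: whether prepare() hands back a DeepSpeedEngine wrapper tells you if DeepSpeed is really active. The Linear layer is just a stand-in for the real model:

```python
# Sketch: verify whether DeepSpeed is actually in use after accelerator.prepare().
# Assumes: pip install deepspeed accelerate torch; run with your accelerate config.
import torch.nn as nn
import deepspeed
from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(nn.Linear(8, 8))  # stand-in for the real model

# If DeepSpeed is active, the prepared model is a DeepSpeedEngine wrapper
# and the distributed type reports DEEPSPEED; otherwise it silently isn't used.
print("DeepSpeed engine?", isinstance(model, deepspeed.DeepSpeedEngine))
print("Distributed type:", accelerator.distributed_type)
```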
I do have a custom-compiled CUDA 12.5 with cuDNN and cuBLAS. Honestly, the cuBLAS might just be taking up extra VRAM.
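One rough way to test that suspicion: compare what the driver says is used against what PyTorch's allocator has reserved. The gap covers the CUDA context plus workspaces held by libraries like cuBLAS/cuDNN, but it also includes anything else running on the GPU, so treat it as a ballpark:

```python
# Rough sketch: estimate VRAM held outside PyTorch's caching allocator
# (CUDA context, cuBLAS/cuDNN workspaces, other processes on the GPU).
import torch

torch.cuda.init()
free, total = torch.cuda.mem_get_info()  # what the driver reports
reserved = torch.cuda.memory_reserved()  # what PyTorch's allocator holds

GIB = 1024 ** 3
outside = (total - free) - reserved      # ballpark; includes other processes
print(f"Total VRAM:          {total / GIB:.2f} GiB")
print(f"PyTorch reserved:    {reserved / GIB:.2f} GiB")
print(f"Held outside torch:  {outside / GIB:.2f} GiB")
```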