
Train 3x-5x Times faster on XL or Pony with limited VRAM

If this has been covered elsewhere, I hadn't run across it. Compared to Adam 8-bit, I was training 3x-5x faster, even while training the text encoder (not U-Net only).

I have found I can train with Adapt Lion (DAdaptLion) at dim 32 or 64 with only 8GB of VRAM, using:

  1. Full FP8

  2. 1024x1024

  3. Batch Size 1

  4. FP16 or BF16 mixed precision in Accelerate (I have found no speed difference between FP16 and BF16 when using FP8)
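As a sketch, the settings above roughly correspond to a kohya-ss sd-scripts LoRA run like the following. The model path is a placeholder, and the exact flag spellings should be verified against your sd-scripts version; the `--learning_rate 1.0` value reflects the usual D-Adaptation recommendation and is my assumption, not stated in the post.

```shell
# Hypothetical sd-scripts invocation matching the settings above.
# Verify flag names against your installed version of sd-scripts.
accelerate launch sdxl_train_network.py \
  --pretrained_model_name_or_path /path/to/pony_or_sdxl_model.safetensors \
  --network_module networks.lora \
  --network_dim 32 \
  --optimizer_type DAdaptLion \
  --learning_rate 1.0 \
  --fp8_base \
  --resolution 1024,1024 \
  --train_batch_size 1 \
  --mixed_precision fp16
```

Note the absence of `--network_train_unet_only`, since the text encoder is being trained as well.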

Note this uses:

  1. 20GB of RAM

  2. 50GB of Virtual Memory

I am not using TorchDynamo or DeepSpeed.

I compiled DeepSpeed (it took around 15 minutes), but I could never get it to work correctly, and it was slowing my training down. It did load successfully, and I can even get it to load into ComfyUI, but I don't think it is actually being used.

I do have a custom-compiled CUDA 12.5 build with cuDNN and cuBLAS. Honestly, the cuBLAS might just be taking up extra VRAM.
