Today I faced the most dreaded words when training an LLM: CUDA out of memory. But don't worry: I've discovered three powerful solutions you can try before considering more expensive hardware upgrades. Wishing you CUDAn't run out of memory again.
00:16 Method 1: Reduce the batch size
00:42 Gradient accumulation
01:04 Method 2: Mixed precision training
01:28 FP32 vs FP16
02:55 Method 3: Gradient checkpointing
If you are a geek like me, you can play with the code here lol: colab.research...
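A quick preview of Method 1's companion trick, gradient accumulation: a minimal sketch assuming a standard PyTorch training loop, where model, loss_fn, optimizer, and loader are placeholder names (not taken from the notebook). Each micro-batch contributes scaled gradients, and the optimizer steps only every few batches, so the effective batch size stays large while peak memory stays small.

accumulation_steps = 4  # effective batch size = micro-batch size * 4

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()  # scale so gradients average over the effective batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # one weight update per effective batch
        optimizer.zero_grad()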
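Method 2 in code: a minimal sketch of PyTorch automatic mixed precision (see reference 1), again with placeholder names model, loss_fn, optimizer, loader. autocast runs the forward pass in FP16 where it is safe, and GradScaler scales the loss so small FP16 gradients don't underflow.

import torch

scaler = torch.cuda.amp.GradScaler()
for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass in FP16 where safe, FP32 elsewhere
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()     # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)            # unscale gradients, then update the weights
    scaler.update()                   # adjust the scale factor for the next step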
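Method 3 in code: a minimal sketch of gradient checkpointing with torch.utils.checkpoint (see reference 2). The two Linear+ReLU blocks are toy stand-ins for transformer layers; their intermediate activations are not kept after the forward pass and are recomputed during backward, trading extra compute for lower memory.

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
        self.block2 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())

    def forward(self, x):
        # Activations inside each block are recomputed in backward instead of being stored
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return x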
References
1. Automatic mixed precision training in PyTorch: pytorch.org/do...
2. Gradient checkpointing in PyTorch: pytorch.org/do...