Faster GELU forward & backward using MUFU.TANH for SM7.5+ #721
…y defining PRECISE_GELU_TANH)
Oops, the /dev/cuda/ numbers are for the non-MUFU.TANH part of the changes (e.g. the faster way to do the derivative for backward) because I compiled without specifying the arch, so it's not SM7.5 😅 The val loss numbers for llm.c are correct though. Can't test it until I'm back on Tuesday, but I'm expecting it to be way faster.
…increase batch sizes for GELU fwd/bwd to hit closer to peak
I think there's an error in both /dev/cuda/common.h and test_gpt2.cu: they use an epsilon of 0.079 instead of 0.0079 for BF16, which makes the error threshold too high (missing some fairly large errors) - but even after fixing that, this seems to pass all the tests :) The bandwidth calculations for gelu_backward were broken, and the batch size was way too small so it couldn't saturate the H100 on either kernel. This is the correct performance (compiling for SM9.0):
===> +6% for forward and +27% for backward.
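For anyone reading along, this is roughly the kind of effective-bandwidth bookkeeping the /dev/cuda benchmarks are supposed to do for gelu_backward; the sizes and timing below are made up for illustration and are not the repo's actual values:

```c
#include <stdio.h>

// Hypothetical effective-bandwidth arithmetic for a gelu_backward benchmark.
// Per element the kernel reads inp and dout and writes dinp, so in BF16 it moves
// 3 tensors * 2 bytes per element; if bytes_moved under-counts the tensors, or the
// batch (B*T*C) is too small to hide launch latency, the reported GB/s stays far
// below the GPU's peak DRAM bandwidth.
int main(void) {
    long long N = 8LL * 1024 * 768;                      // example B * T * C (made up)
    double elapsed_ms = 0.05;                            // example measured kernel time
    double bytes_moved = 3.0 * (double)N * 2.0;          // inp + dout + dinp, 2 bytes each
    double gb_per_s = bytes_moved / (elapsed_ms * 1e6);  // bytes per ms -> GB/s
    printf("effective bandwidth: %.1f GB/s\n", gb_per_s);
    return 0;
}
```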
This sounds cool. I guess you only tried the little Shakespeare training run; I wonder if the slight accuracy decrease could cause training instabilities. Probably worth trying a bigger run?
…s is negligible for BF16, proven with gelu_precision_test branch)
Haven't tried a bigger run, but I made very quick & dirty precision tests (the "gelu_precision_test" branch) to see if it made any difference after rounding to BF16 (i.e. with only 7 mantissa bits, is the maximum error basically a single rounding error?).

It is indeed negligible for forward! It's trickier to test for backward (with 2x BF16 inputs there are 2^32 possible input combinations, while forward has only 2^16), but the error seems extremely small except with insanely large dout in the millions/billions, which would point to a much bigger problem anyway (and even then the error isn't that bad relative to the magnitude of the inputs).

For forward, only 60 out of 65536 inputs result in any difference (the other 65476 inputs have the exact same BF16 outputs bit-for-bit). The worst error is with inputs -4.875 to -5.15625, where the output gets rounded down from a very small number to zero, but that's basically nothing compared to what happens when we use FP8 (which couldn't represent anything near those tiny outputs anyway). So I'm pretty sure it's fine :) But I edited my PR so that it would never be used in FP32 mode, to make sure that remains a good reference point.
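For what it's worth, here is a rough sketch of what such an exhaustive forward check amounts to; this is my own illustrative reconstruction, not the actual code on the gelu_precision_test branch:

```cuda
#include <cuda_bf16.h>
#include <cmath>

#define GELU_SCALING_FACTOR 0.7978845608028654f  // sqrt(2/pi)

// MUFU.TANH via PTX (requires sm_75 or newer)
__device__ __forceinline__ float tanh_approx(float x) {
    float y;
    asm("tanh.approx.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}

// Enumerate all 2^16 BF16 bit patterns, compute GELU once with precise tanhf and once
// with the approximate tanh, round both back to BF16 and count bit-level mismatches.
// Launch with e.g. <<<256, 256>>> so every bit pattern is covered exactly once.
__global__ void gelu_forward_exhaustive_check(unsigned int* mismatch_count) {
    unsigned int bits = blockIdx.x * blockDim.x + threadIdx.x;
    if (bits >= 65536) return;
    float x = __bfloat162float(__ushort_as_bfloat16((unsigned short)bits));
    if (isnan(x) || isinf(x)) return;  // skip non-finite encodings
    float u = GELU_SCALING_FACTOR * (x + 0.044715f * x * x * x);
    __nv_bfloat16 ref  = __float2bfloat16(0.5f * x * (1.0f + tanhf(u)));
    __nv_bfloat16 fast = __float2bfloat16(0.5f * x * (1.0f + tanh_approx(u)));
    if (__bfloat16_as_ushort(ref) != __bfloat16_as_ushort(fast)) {
        atomicAdd(mismatch_count, 1u);
    }
}
```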
These are faster GELU kernels using the hardware tanh instruction NVIDIA introduced in Turing (SM7.5) but, as far as I can tell, never exposed outside of PTX - possibly because it's slightly less accurate. But based on the val loss we get, which is slightly better for the backward pass (and within noise for forward), I am pretty sure it's fine for our purposes!
This is only somewhat faster on the H100 PCIe, but it should be much faster on H200/Blackwell as they have more DRAM bandwidth relative to compute, and also much faster with FP8 (this was originally done in the context of the FP8 branch, where it was >50% faster!)
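For reference, the tanh-form GELU here is gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))), and MUFU.TANH is only reachable from CUDA via the PTX instruction tanh.approx.f32. A minimal scalar sketch of the forward pass (illustrative only - the PR's actual kernels are vectorized and keep a precise path behind PRECISE_GELU_TANH):

```cuda
#include <cuda_bf16.h>

#define GELU_SCALING_FACTOR 0.7978845608028654f  // sqrt(2/pi)

// MUFU.TANH: hardware tanh, exposed only as the PTX instruction tanh.approx.f32 (sm_75+)
__device__ __forceinline__ float tanh_approx(float x) {
    float y;
    asm("tanh.approx.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}

// Scalar GELU forward using the approximate tanh; one BF16 element per thread.
__global__ void gelu_forward_approx(__nv_bfloat16* out, const __nv_bfloat16* inp, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float x = __bfloat162float(inp[i]);
        float u = GELU_SCALING_FACTOR * (x + 0.044715f * x * x * x);
        out[i] = __float2bfloat16(0.5f * x * (1.0f + tanh_approx(u)));
    }
}
```

Presumably the real kernel guards this behind a __CUDA_ARCH__ >= 750 check (or PRECISE_GELU_TANH) and falls back to tanhf otherwise, which is why compiling without an SM7.5+ arch (as in the earlier /dev/cuda numbers) doesn't pick up MUFU.TANH.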
This also includes the change for backward suggested in #307.
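I can't restate #307 from this thread alone, but if I'm reading the earlier comment right, the "faster way to do the derivative" reuses the tanh result via sech^2(u) = 1 - tanh^2(u), so a single (approximate) tanh covers both terms and no coshf is needed. A hedged sketch of that local gradient (illustrative, not the PR's exact code; dinp[i] would then be dout[i] times this value):

```cuda
#define GELU_SCALING_FACTOR 0.7978845608028654f  // sqrt(2/pi)

// MUFU.TANH via PTX (requires sm_75 or newer)
__device__ __forceinline__ float tanh_approx(float x) {
    float y;
    asm("tanh.approx.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}

// Local gradient of gelu(x) = 0.5*x*(1 + tanh(u)) with u = sqrt(2/pi)*(x + 0.044715*x^3):
//   d gelu/dx = 0.5*(1 + tanh(u)) + 0.5*x*sech^2(u)*du/dx
// computing sech^2(u) as 1 - tanh^2(u) so only one tanh is needed.
__device__ __forceinline__ float gelu_grad_approx(float x) {
    float u = GELU_SCALING_FACTOR * (x + 0.044715f * x * x * x);
    float t = tanh_approx(u);
    float sech2 = 1.0f - t * t;
    float du_dx = GELU_SCALING_FACTOR * (1.0f + 3.0f * 0.044715f * x * x);
    return 0.5f * (1.0f + t) + 0.5f * x * sech2 * du_dx;
}
```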
Included the changes in /dev/cuda/ as well.
In terms of loss for a few steps of Tiny Shakespeare, 'make train_gpt2cu && ./train_gpt2cu -r 0 -ge 0 -e "d12"' gives:
So, actually slightly better (but potentially noise)! Using the new backward but the old forward pass gives a val loss of 5.942228, but again, it might be noise. Either way it looks to be good enough as far as I can tell!
I believe this NVIDIA forum thread (and a few others) talks a little bit about this HW instruction: https://forums.developer.nvidia.com/t/hardware-accelerated-tanh-on-turing/173291