[SPARSE] Improve sparse performance on ROCM #7935
Conversation
The current sparse dense GPU kernel uses warp-level storage to handle caching of data. Warp-level storage uses shuffle intrinsics, which are slow on ROCm (because they actually read from and write to shared memory). ROCm does provide intrinsics to do the correct memory management, but they are not available through TVM. Instead, this PR switches to using shared memory on ROCm devices. Performance is about 2x faster.
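For illustration, a minimal sketch of the dispatch idea: pick the caching storage scope based on the compilation target, falling back to plain shared memory on ROCm. The helper `pick_storage_scope` is hypothetical and not part of this PR or of TVM's API; it only illustrates the decision described above.

```python
# Hedged sketch of the idea described above, not the actual PR diff.
# `pick_storage_scope` is a hypothetical helper, not a TVM API.
import tvm


def pick_storage_scope(target: tvm.target.Target) -> str:
    """Choose where the kernel caches its working data.

    Warp-level ("warp") storage is lowered to shuffle intrinsics, which on
    ROCm reportedly go through shared memory anyway, so explicit shared
    memory ("shared") is preferred there instead.
    """
    if target.kind.name == "rocm":
        return "shared"
    return "warp"


# Example: the scope would be "shared" when compiling for ROCm.
# scope = pick_storage_scope(tvm.target.Target("rocm"))
```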
This post says: "They (…)". I wonder: if both approaches use shared memory, why is the explicit way, as in this PR, faster?
@masahi Lower down it says: "All active lanes write data to a temporary buffer. All active lanes read data from the temporary buffer...".
I'm planning to work on improving our GPU scan kernel using warp shuffle instructions, so I want to investigate this issue when I get there. Warp shuffle on AMD being slower than shared memory sounds surprising and counterintuitive. In the PR that introduced warp shuffle support to TVM rocm, #5727, @t-vi mentioned that he got a good speedup on softmax reduction #5727 (comment). So I was under the impression that warp shuffle is generally a good thing on AMD too.
I don't think the descriptions are entirely accurate, but the Vega ISA manual says
so I would expect that the performance lies somewhere between using LDS and registers. I can imagine that doing a lot less writing might save time in this specific case, but it is probably best to check with AMD before drawing global conclusions.
* [SPARSE] Improve sparse performance on ROCM

  The current sparse dense GPU kernel uses warp-level storage to handle caching of data. Warp-level storage uses shuffle intrinsics, which are slow on ROCm (because they actually read from and write to shared memory). ROCm does provide intrinsics to do the correct memory management, but they are not available through TVM. Instead, this PR switches to using shared memory on ROCm devices. Performance is about 2x faster.

* default to shared mem
* formatting
* formatting
@tmoreau89 @jwfromm