-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PushPX: GPU kernel optimization #3402
Conversation
This version works for HIP and nvcc 17. Hopefully it will pass the CIs. |
45c9930
to
dc66277
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is excellent. Awesome work-around for nvcc 🎉
@maikel showed us a very cool trick. https://cuda.godbolt.org/z/edxEMY7YG |
Oh awesome, that's the Cartesian product we need 🚀 ✨ |
Yes, we could wait till the functionality is in amrex. |
If I can comment here, can this be done with templating instead of using the more obscure |
Yes, the lambda is big enough. So we do not want to write more than once. Also it captures so many variables, it will also be error prone if we use a non-lambda function because we might mess up the order of the variables in a function's parameters. |
Here is a N-dimensional version of that, it even compiles with gcc7.5. I am still unsure about that NVCC/NVHPC redefinition problem, however. |
What's the redefinition problem with nvcc? The compiler explorer link compiles with nvcc. |
Oh that. I have no ideas. |
@WeiqunZhang we merged in the update of AMReX-Codes/amrex#2954 to WarpX now. Feel free to ping me when this PR is rebased and ready to go 🚀 |
Yes. Thanks for reminding me! |
7d7cc3a
to
345e679
Compare
@ax3l It's ready for review. |
1e79d0c
to
834f123
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR!
This looks almost ready to merge. But it looks like there are some remaining commented lines that still need to be converted to a debug-only code path; is that correct?
@WeiqunZhang There seems to be a remaining compilation error with |
I will try to rerun the job. If it still fails, I will merge development into this to see if that will fix it. |
The GatherAndPush kernel in the PushPX function has a very low occupancy due to register pressure. There are a number of reasons. By default, we compile with QED module on, even if we do not use it at run time. Another culprit is the GetExternalEB functor that contains 7 Parsers. Again, we have to pay a high runtime cost, even if we do not use it. In this PR, we move some runtime logic out of the GPU kernel to eleminate the unnecessary cost if QED and GetExternalEB are not used at run time. Here are some performance results before this PR. | QED | GetExternalEB | Time | |-----+---------------+------| | On | On | 2.17 | | Off | On | 1.79 | | Off | Commented out | 1.34 | Note that in the tests neither QED nor GetExternalEB is actually used at run time. But the extra cost is very high. With this PR, the kernel time is the same as that when both QED and GetExternalEB are disabled at compile time, even though both options are disabled at run time. More information on the kernels compiled for MI250X. The most expensive variant with both QED and GetExternalEB on has NumSgprs: 108 NumVgprs: 256 NumAgprs: 40 TotalNumVgprs: 296 ScratchSize: 264 Occupancy: 1 The cheapest variant with both QED and GetExternalEB disabled has NumSgprs: 104 NumVgprs: 249 NumAgprs: 0 TotalNumVgprs: 249 ScratchSize: 144 Occupancy: 2
834f123
to
38ac0c7
Compare
Oh, I think I know why. It needs a more recent version of amrex due to clang is not happy with an amrex function. So I just merged development into this branch. |
All checks have passed. |
ping @lucafedeli88 FYI, as discussed :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, great hackathon success 🎉
* PushPX: GPU kernel optimization The GatherAndPush kernel in the PushPX function has a very low occupancy due to register pressure. There are a number of reasons. By default, we compile with QED module on, even if we do not use it at run time. Another culprit is the GetExternalEB functor that contains 7 Parsers. Again, we have to pay a high runtime cost, even if we do not use it. In this PR, we move some runtime logic out of the GPU kernel to eleminate the unnecessary cost if QED and GetExternalEB are not used at run time. Here are some performance results before this PR. | QED | GetExternalEB | Time | |-----+---------------+------| | On | On | 2.17 | | Off | On | 1.79 | | Off | Commented out | 1.34 | Note that in the tests neither QED nor GetExternalEB is actually used at run time. But the extra cost is very high. With this PR, the kernel time is the same as that when both QED and GetExternalEB are disabled at compile time, even though both options are disabled at run time. More information on the kernels compiled for MI250X. The most expensive variant with both QED and GetExternalEB on has NumSgprs: 108 NumVgprs: 256 NumAgprs: 40 TotalNumVgprs: 296 ScratchSize: 264 Occupancy: 1 The cheapest variant with both QED and GetExternalEB disabled has NumSgprs: 104 NumVgprs: 249 NumAgprs: 0 TotalNumVgprs: 249 ScratchSize: 144 Occupancy: 2 * Fix Comments Co-authored-by: Axel Huebl <[email protected]>
The GatherAndPush kernel in the PushPX function has a very low occupancy due to register pressure. There are a number of reasons. By default, we compile with QED module on, even if we do not use it at run time. Another culprit is the GetExternalEB functor that contains 7 Parsers. Again, we have to pay a high runtime cost, even if we do not use it. In this PR, we move some runtime logic out of the GPU kernel to eleminate the unnecessary cost if QED and GetExternalEB are not used at run time.
Here are some performance results before this PR.
Note that in the tests neither QED nor GetExternalEB is actually used at run time. But the extra cost is very high. With this PR, the kernel time is the same as that when both QED and GetExternalEB are disabled at compile time, even though both options are disabled at run time.
More information on the kernels compiled for MI250X. The most expensive variant with both QED and GetExternalEB on has
The cheapest variant with both QED and GetExternalEB disabled has