
PushPX: GPU kernel optimization #3402

Merged · 2 commits · Nov 18, 2022

Conversation

WeiqunZhang
Member

The GatherAndPush kernel in the PushPX function has very low occupancy due to register pressure. There are several reasons. By default, we compile with the QED module on, even if we do not use it at run time. Another culprit is the GetExternalEB functor, which contains 7 Parsers. Again, we pay a high runtime cost even if we do not use it. In this PR, we move some runtime logic out of the GPU kernel to eliminate the unnecessary cost when QED and GetExternalEB are not used at run time.

Here are some performance results before this PR.

| QED | GetExternalEB | Time |
|-----|---------------|------|
| On  | On            | 2.17 |
| Off | On            | 1.79 |
| Off | Commented out | 1.34 |

Note that in these tests neither QED nor GetExternalEB is actually used at run time, yet the extra cost is very high. With this PR, the kernel time matches the build in which both QED and GetExternalEB are disabled at compile time, even when they are merely disabled at run time.

More information on the kernels compiled for MI250X. The most expensive variant with both QED and GetExternalEB on has

    NumSgprs: 108
    NumVgprs: 256
    NumAgprs: 40
    TotalNumVgprs: 296
    ScratchSize: 264
    Occupancy: 1

The cheapest variant with both QED and GetExternalEB disabled has

    NumSgprs: 104
    NumVgprs: 249
    NumAgprs: 0
    TotalNumVgprs: 249
    ScratchSize: 144
    Occupancy: 2

@WeiqunZhang
Member Author

This version works for HIP and nvcc 17. Hopefully it will pass the CIs.

@ax3l ax3l added Performance optimization backend: cuda Specific to CUDA execution (GPUs) backend: hip Specific to ROCm execution (GPUs) component: core Core WarpX functionality component: interpolation Interpolation functions labels Sep 15, 2022
@ax3l ax3l self-assigned this Sep 15, 2022
@ax3l ax3l added the hackathon Let's address this topic during the GPU hackathon label Sep 15, 2022
Member

@ax3l ax3l left a comment

That is excellent. Awesome work-around for nvcc 🎉

@WeiqunZhang
Member Author

@maikel showed us a very cool trick. https://cuda.godbolt.org/z/edxEMY7YG

@ax3l
Member

ax3l commented Sep 16, 2022

Oh awesome, that's the Cartesian product we need 🚀 ✨
Shall we wait for this to be finished? We can use this in many places :)

@WeiqunZhang
Member Author

Yes, we could wait till the functionality is in amrex.

@ax3l ax3l changed the title PushPX: GPU kernel optimization [WIP] PushPX: GPU kernel optimization Sep 16, 2022
@dpgrote
Member

dpgrote commented Sep 16, 2022

If I can comment here, can this be done with templating instead of using the more obscure `std::is_same<decltype...`, similar to the templating for doParticlePush? A few comments in the code would be helpful, saying that this is being done to reduce register pressure by avoiding calls to the external fields and QED stuff when they are not used. Also, the lambda is probably big enough to be a separate routine. Otherwise, this is great, with a very nice speed-up!

@WeiqunZhang
Member Author

Yes, the lambda is big, so we do not want to write it more than once. It also captures so many variables that using a non-lambda function would be error prone, because we might mess up the order of the arguments in the function's parameter list.

@AlexanderSinn
Member

> @maikel showed us a very cool trick. https://cuda.godbolt.org/z/edxEMY7YG

Here is an N-dimensional version of that; it even compiles with gcc 7.5. I am still unsure about that NVCC/NVHPC redefinition problem, however.

https://cuda.godbolt.org/z/xP9nKMYdM

@WeiqunZhang
Member Author

What's the redefinition problem with nvcc? The compiler explorer link compiles with nvcc.

@AlexanderSinn
Member

#3399 (comment)

@WeiqunZhang
Member Author

Oh, that. I have no idea.

@WeiqunZhang
Member Author

AMReX-Codes/amrex#2954

@ax3l
Member

ax3l commented Oct 25, 2022

@WeiqunZhang we merged in the update of AMReX-Codes/amrex#2954 to WarpX now. Feel free to ping me when this PR is rebased and ready to go 🚀

@ax3l ax3l self-requested a review October 25, 2022 22:40
@WeiqunZhang
Member Author

Yes. Thanks for reminding me!

@WeiqunZhang WeiqunZhang changed the title [WIP] PushPX: GPU kernel optimization PushPX: GPU kernel optimization Oct 29, 2022
@WeiqunZhang
Member Author

@ax3l It's ready for review.

@WeiqunZhang WeiqunZhang force-pushed the pushpx_v3 branch 2 times, most recently from 1e79d0c to 834f123 on November 2, 2022 15:41
Member

@RemiLehe RemiLehe left a comment

Thanks for this PR!
This looks almost ready to merge. But it looks like there are some remaining commented lines that still need to be converted to a debug-only code path; is that correct?

Source/Particles/Pusher/PushSelector.H
@RemiLehe
Member

@WeiqunZhang There seems to be a remaining compilation error with clang in the CI; is that correct?

@WeiqunZhang
Member Author

> Failed to build pywarpx

I will try to rerun the job. If it still fails, I will merge development into this to see if that will fix it.

@WeiqunZhang
Member Author

Oh, I think I know why. It needs a more recent version of amrex, because clang is not happy with an amrex function. So I just merged development into this branch.

@WeiqunZhang
Member Author

All checks have passed.

@ax3l
Member

ax3l commented Nov 18, 2022

ping @lucafedeli88 FYI, as discussed :)

Member

@ax3l ax3l left a comment

Awesome, great hackathon success 🎉

@ax3l ax3l enabled auto-merge (squash) November 18, 2022 16:54
@ax3l ax3l merged commit 2775ac1 into ECP-WarpX:development Nov 18, 2022
dpgrote pushed a commit to dpgrote/WarpX that referenced this pull request Nov 22, 2022
* PushPX: GPU kernel optimization

* Fix Comments

Co-authored-by: Axel Huebl <[email protected]>
@ax3l ax3l mentioned this pull request Sep 4, 2024