
PushPX: GPU kernel optimization #3402

Merged · 2 commits · Nov 18, 2022

Conversation

WeiqunZhang
Member

The GatherAndPush kernel in the PushPX function has very low occupancy due to register pressure. There are several reasons. By default, we compile with the QED module on, even if we do not use it at run time. Another culprit is the GetExternalEB functor, which contains 7 Parsers. Again, we pay a high runtime cost even if we do not use it. In this PR, we move some runtime logic out of the GPU kernel to eliminate the unnecessary cost when QED and GetExternalEB are not used at run time.

Here are some performance results before this PR.

| QED | GetExternalEB | Time |
|-----|---------------|------|
| On  | On            | 2.17 |
| Off | On            | 1.79 |
| Off | Commented out | 1.34 |

Note that in these tests neither QED nor GetExternalEB is actually used at run time, yet the extra cost is very high. With this PR, the kernel time matches the build in which both QED and GetExternalEB are disabled at compile time, even when they are merely disabled at run time.

More information on the kernels compiled for MI250X. The most expensive variant with both QED and GetExternalEB on has

    NumSgprs: 108
    NumVgprs: 256
    NumAgprs: 40
    TotalNumVgprs: 296
    ScratchSize: 264
    Occupancy: 1

The cheapest variant with both QED and GetExternalEB disabled has

    NumSgprs: 104
    NumVgprs: 249
    NumAgprs: 0
    TotalNumVgprs: 249
    ScratchSize: 144
    Occupancy: 2

@WeiqunZhang
Member Author

This version works for HIP and nvcc 17. Hopefully it will pass the CIs.

@ax3l ax3l added Performance optimization backend: cuda Specific to CUDA execution (GPUs) backend: hip Specific to ROCm execution (GPUs) component: core Core WarpX functionality component: interpolation Interpolation functions labels Sep 15, 2022
@ax3l ax3l self-assigned this Sep 15, 2022
@ax3l ax3l added the hackathon Let's address this topic during the GPU hackathon label Sep 15, 2022
Member

@ax3l ax3l left a comment

That is excellent. Awesome work-around for nvcc 🎉

@WeiqunZhang
Member Author

@maikel showed us a very cool trick. https://cuda.godbolt.org/z/edxEMY7YG

@ax3l
Member

ax3l commented Sep 16, 2022

Oh awesome, that's the Cartesian product we need 🚀 ✨
Shall we wait for this to be finished? We can use this in many places :)

@WeiqunZhang
Member Author

Yes, we could wait till the functionality is in amrex.

@ax3l ax3l changed the title PushPX: GPU kernel optimization [WIP] PushPX: GPU kernel optimization Sep 16, 2022
@dpgrote
Member

dpgrote commented Sep 16, 2022

If I can comment here, can this be done with templating instead of using the more obscure `std::is_same<decltype...`, similar to the templating for doParticlePush? A few comments in the code would be helpful, saying that this is being done to reduce register pressure by avoiding calls to the external fields and QED stuff when they are not used. Also, the lambda is probably big enough to be a separate routine. Otherwise, this is great, with a very nice speed-up!

@WeiqunZhang
Member Author

Yes, the lambda is big, so we do not want to write it more than once. It also captures so many variables that using a non-lambda function would be error prone, because we might mess up the order of the arguments in the function's parameter list.

@AlexanderSinn
Member

> @maikel showed us a very cool trick. https://cuda.godbolt.org/z/edxEMY7YG

Here is an N-dimensional version of that; it even compiles with gcc 7.5. I am still unsure about that NVCC/NVHPC redefinition problem, however.

https://cuda.godbolt.org/z/xP9nKMYdM

@WeiqunZhang
Member Author

What's the redefinition problem with nvcc? The compiler explorer link compiles with nvcc.

@AlexanderSinn
Member

#3399 (comment)

@WeiqunZhang
Member Author

Oh, that. I have no idea.

@WeiqunZhang
Member Author

AMReX-Codes/amrex#2954

@ax3l
Member

ax3l commented Oct 25, 2022

@WeiqunZhang we merged in the update of AMReX-Codes/amrex#2954 to WarpX now. Feel free to ping me when this PR is rebased and ready to go 🚀

@ax3l ax3l self-requested a review October 25, 2022 22:40
@WeiqunZhang
Member Author

Yes. Thanks for reminding me!

@WeiqunZhang WeiqunZhang changed the title [WIP] PushPX: GPU kernel optimization PushPX: GPU kernel optimization Oct 29, 2022
@WeiqunZhang
Member Author

@ax3l It's ready for review.

@WeiqunZhang WeiqunZhang force-pushed the pushpx_v3 branch 2 times, most recently from 1e79d0c to 834f123 on November 2, 2022 15:41
Member

@RemiLehe RemiLehe left a comment

Thanks for this PR!
This looks almost ready to merge. But it looks like there are some remaining commented lines that still need to be converted to a debug-only code path; is that correct?

Source/Particles/Pusher/PushSelector.H
@RemiLehe
Member

@WeiqunZhang There seems to be a remaining compilation error with clang in the CI; is that correct?

@WeiqunZhang
Member Author

> Failed to build pywarpx

I will try to rerun the job. If it still fails, I will merge development into this to see if that will fix it.

@WeiqunZhang
Member Author

Oh, I think I know why. It needs a more recent version of amrex, because clang is not happy with an amrex function. So I just merged development into this branch.

@WeiqunZhang
Member Author

All checks have passed.

@ax3l
Member

ax3l commented Nov 18, 2022

ping @lucafedeli88 FYI, as discussed :)

Member

@ax3l ax3l left a comment

Awesome, great hackathon success 🎉

@ax3l ax3l enabled auto-merge (squash) November 18, 2022 16:54
@ax3l ax3l merged commit 2775ac1 into ECP-WarpX:development Nov 18, 2022
dpgrote pushed a commit to dpgrote/WarpX that referenced this pull request Nov 22, 2022
* PushPX: GPU kernel optimization

* Fix Comments

Co-authored-by: Axel Huebl <[email protected]>
@ax3l ax3l mentioned this pull request Sep 4, 2024