-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized PushPX #5199
Comments
Hi @liuyuchuncn, Thank you for profiling this kernel and for aiming to improve its performance, this is super welcome! 🚀 ✨ Let me also loop in my colleague @WeiqunZhang, who last year reduced the register pressure somewhat by doing compile-time splitting of this kernel in #3402 and #3696. There can be more done for sure and we are happy to support you in optimizing this! Thanks a lot for your interest already! |
@liuyuchuncn Could you try this PR? #5217 |
@liuyuchuncn can you share the inputs file you used for benchmarking please? :) We just want to make sure we look at the same kernels. |
Hello @liuyuchuncn, thanks for reporting this. I also agree that push/deposition has a lot of potential for improvement. Which GPU are you referring to with M200, do you mean AMD MI210 or something else? The main issue I see is the computation of the shape factor. The code WarpX/Source/Particles/Gather/FieldGather.H Lines 162 to 194 in 5b34b84
amrex::Real sz_node[depos_order + 1]; ) and swapping them if necessary. However, on GPU, these local arrays in some ways don't exist like on CPU. When they are indexed with a dynamic index (not known at compile time) the full array has to be placed in thread local memory (which is in the GPU main memory, so much slower than registers) and then the indexed value can be retrieved. Ideally, the shape factor would be fully rewritten to not include local arrays (see https://github.com/Hi-PACE/hipace/blob/40fed7b01b1b985ecd6b417c37ec99d76ef2bc7a/src/particles/particles_utils/ShapeFactors.H#L123).
Another issue is that in the PushPx kernel, the shape factor nox and galerkin_interpolation is a runtime parameter, meaning that all possible shape factors have to be compiled in the same kernel. This can be fixed by extending the CTOs to nox and galerkin_interpolation. Finally, when looking at the assembly, keep in mind that instructions like add and move are much cheaper compared to loading and storing global (and thread local) memory. |
Yes that is correct. What I mean with a dynamic index is the index when the array is used (not defined) in WarpX/Source/Particles/Gather/FieldGather.H Lines 369 to 377 in 5b34b84
for example, ix in sx_ex[ix] . This is a special case where the loop bounds of the for loop are known at compile time, and in case the loop is fully unrolled by the compiler, ix is also known at compile time and using local memory can be avoided. However, unrolling the loop requires much more registers. But I suspect because the array can be swapped when its initialized ((ex_type[zdir] == NODE) ? sz_node : sz_cell ); it has to use local memory anyway.
|
Hi @AlexanderSinn I agree with you WarpX/Source/Particles/Gather/FieldGather.H Lines 162 to 195 in 5b34b84
In addition, if the compiler knows the number of for loops, it will unroll the loop, which will increase the use of registers and cause greater register pressure. WarpX/Source/Particles/Gather/FieldGather.H Lines 369 to 377 in 5b34b84
|
the input file i uesd
Reduce Local memory access issues by putting arrays into shared memory, The pseudocode is modified as follows
Performance improvement is only 3% |
Shared memory is useful for me in reducing local memory , close this issue |
Kernel : PushPX general local memory , We want to analyze the causes of local memory and how to optimize local memory
As shown in the figure below, the array produces local memory
amrex::Real sx_node_galerkin[depos_order + 1 - galerkin_interpolation] = {0._rt}; ``amrex::Real sx_cell_galerkin[depos_order + 1 - galerkin_interpolation] = {0._rt};
and
amrex::Real sx_node_galerkin[depos_order + 1 - galerkin_interpolation] = {0._rt}; ``amrex::Real sx_cell_galerkin[depos_order + 1 - galerkin_interpolation] = {0._rt};
used to compute ShapeFactorWarpX/Source/Particles/ShapeFactors.H
Lines 55 to 70 in 5b34b84
The text was updated successfully, but these errors were encountered: