
Optimized PushPX #5199

Closed · liuyuchuncn opened this issue Aug 31, 2024 · 8 comments

Labels: backend: cuda (Specific to CUDA execution (GPUs)), Performance optimization

Comments


liuyuchuncn commented Aug 31, 2024

Kernel: PushPX generates local memory. We want to analyze the causes of the local memory and how to optimize it.
As shown in the profiler screenshot (figure omitted here), the local memory is produced by these arrays:

    amrex::Real sx_node_galerkin[depos_order + 1 - galerkin_interpolation] = {0._rt};
    amrex::Real sx_cell_galerkin[depos_order + 1 - galerkin_interpolation] = {0._rt};

which are used to compute the shape factor:

        return j-1;
    }
    else if constexpr (depos_order == 3){
        const auto j = static_cast<int>(xmid);
        const T xint = xmid - T(j);
        sx[0] = (T(1.0))/(T(6.0))*(T(1.0) - xint)*(T(1.0) - xint)*(T(1.0) - xint);
        sx[1] = (T(2.0))/(T(3.0)) - xint*xint*(T(1.0) - xint/(T(2.0)));
        sx[2] = (T(2.0))/(T(3.0)) - (T(1.0) - xint)*(T(1.0) - xint)*(T(1.0) - T(0.5)*(T(1.0) - xint));
        sx[3] = (T(1.0))/(T(6.0))*xint*xint*xint;
        // index of the leftmost cell where particle deposits
        return j-1;
    }
    else if constexpr (depos_order == 4){
        const auto j = static_cast<int>(xmid + T(0.5));
        const T xint = xmid - T(j);
        sx[0] = (T(1.0))/(T(24.0))*(T(0.5) - xint)*(T(0.5) - xint)*(T(0.5) - xint)*(T(0.5) - xint);

@ax3l added the backend: cuda (Specific to CUDA execution (GPUs)) and Performance optimization labels on Sep 4, 2024
ax3l (Member) commented Sep 4, 2024

Hi @liuyuchuncn,

Thank you for profiling this kernel and for aiming to improve its performance; this is super welcome! 🚀 ✨

Let me also loop in my colleague @WeiqunZhang, who last year reduced the register pressure somewhat by doing compile-time splitting of this kernel in #3402 and #3696.

More can certainly be done, and we are happy to support you in optimizing this! Thanks a lot for your interest!

WeiqunZhang (Member) commented:

@liuyuchuncn Could you try this PR? #5217

ax3l (Member) commented Sep 5, 2024

@liuyuchuncn Can you share the inputs file you used for benchmarking, please? :) We just want to make sure we look at the same kernels.

@liuyuchuncn liuyuchuncn changed the title doGatherShapeN 50% slower than A100 Kernel : PushPX 50% slower than A100 Sep 5, 2024
AlexanderSinn (Member) commented:

Hello @liuyuchuncn, thanks for reporting this. I also agree that push/deposition has a lot of potential for improvement. Which GPU are you referring to with M200? Do you mean the AMD MI210 or something else?

The main issue I see is the computation of the shape factor, in the following code:

    const amrex::Real z = (zp-xyzmin.z)*dinv.z;
    amrex::Real sz_node[depos_order + 1];
    amrex::Real sz_cell[depos_order + 1];
    amrex::Real sz_node_v[depos_order + 1 - galerkin_interpolation];
    amrex::Real sz_cell_v[depos_order + 1 - galerkin_interpolation];
    int l_node = 0;
    int l_cell = 0;
    int l_node_v = 0;
    int l_cell_v = 0;
    if ((ex_type[zdir] == NODE) || (ey_type[zdir] == NODE) || (bz_type[zdir] == NODE)) {
        l_node = compute_shape_factor(sz_node, z);
    }
    if ((ex_type[zdir] == CELL) || (ey_type[zdir] == CELL) || (bz_type[zdir] == CELL)) {
        l_cell = compute_shape_factor(sz_cell, z - 0.5_rt);
    }
    if ((ez_type[zdir] == NODE) || (bx_type[zdir] == NODE) || (by_type[zdir] == NODE)) {
        l_node_v = compute_shape_factor_galerkin(sz_node_v, z);
    }
    if ((ez_type[zdir] == CELL) || (bx_type[zdir] == CELL) || (by_type[zdir] == CELL)) {
        l_cell_v = compute_shape_factor_galerkin(sz_cell_v, z - 0.5_rt);
    }
    const amrex::Real (&sz_ex)[depos_order + 1                          ] = ((ex_type[zdir] == NODE) ? sz_node   : sz_cell  );
    const amrex::Real (&sz_ey)[depos_order + 1                          ] = ((ey_type[zdir] == NODE) ? sz_node   : sz_cell  );
    const amrex::Real (&sz_ez)[depos_order + 1 - galerkin_interpolation] = ((ez_type[zdir] == NODE) ? sz_node_v : sz_cell_v);
    const amrex::Real (&sz_bx)[depos_order + 1 - galerkin_interpolation] = ((bx_type[zdir] == NODE) ? sz_node_v : sz_cell_v);
    const amrex::Real (&sz_by)[depos_order + 1 - galerkin_interpolation] = ((by_type[zdir] == NODE) ? sz_node_v : sz_cell_v);
    const amrex::Real (&sz_bz)[depos_order + 1                          ] = ((bz_type[zdir] == NODE) ? sz_node   : sz_cell  );
    int const l_ex = ((ex_type[zdir] == NODE) ? l_node   : l_cell  );
    int const l_ey = ((ey_type[zdir] == NODE) ? l_node   : l_cell  );
    int const l_ez = ((ez_type[zdir] == NODE) ? l_node_v : l_cell_v);
    int const l_bx = ((bx_type[zdir] == NODE) ? l_node_v : l_cell_v);
    int const l_by = ((by_type[zdir] == NODE) ? l_node_v : l_cell_v);
    int const l_bz = ((bz_type[zdir] == NODE) ? l_node   : l_cell  );
This code relies heavily on local arrays (amrex::Real sz_node[depos_order + 1];) and on swapping them where necessary. However, on GPU these local arrays don't really exist the way they do on CPU. When such an array is indexed with a dynamic index (one not known at compile time), the full array has to be placed in thread-local memory (which resides in GPU main memory, so it is much slower than registers) before the indexed value can be retrieved. Ideally, the shape factor would be fully rewritten to not use local arrays (see https://github.com/Hi-PACE/hipace/blob/40fed7b01b1b985ecd6b417c37ec99d76ef2bc7a/src/particles/particles_utils/ShapeFactors.H#L123).
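To make the dynamic-index point concrete, here is a minimal CUDA sketch (a hypothetical demo kernel, not WarpX code): the same small per-thread array can stay in registers as long as every access uses a compile-time index, but a single runtime-indexed access typically forces it into thread-local memory.

    #include <cuda_runtime.h>

    // Hypothetical demo: after full unrolling, every index into s[] in the loop
    // is a compile-time constant, so s[] can be held in registers. The final
    // access with the runtime value k forces the compiler to back s[] with
    // thread-local memory instead.
    __global__ void dynamic_index_demo (const double* in, double* out, int k)
    {
        double s[4];
    #pragma unroll
        for (int i = 0; i < 4; ++i) {
            s[i] = in[4*threadIdx.x + i];   // i is known at compile time once unrolled
        }
        out[threadIdx.x] = s[k & 3];        // k is known only at run time -> local memory
    }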

Another issue is that in the PushPX kernel the shape order nox and galerkin_interpolation are runtime parameters, meaning that all possible shape factors have to be compiled into the same kernel. This can be fixed by extending the CTOs (compile-time options) to nox and galerkin_interpolation.
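A hedged sketch of what such compile-time splitting could look like (push_px and push_px_impl are made-up names; WarpX's actual mechanism is its compile-time-option infrastructure): dispatch once on the runtime values and instantiate a separate kernel per combination, so each instantiation compiles only its own shape-factor path.

    // Hypothetical dispatch sketch: depos_order and galerkin_interpolation become
    // template parameters, so the compiler sees a single shape order per kernel.
    template <int depos_order, int galerkin_interpolation>
    void push_px_impl (/* particle and field arguments */);

    void push_px (int nox, bool galerkin /* , ... */)
    {
        switch (nox) {
        case 1: galerkin ? push_px_impl<1,1>() : push_px_impl<1,0>(); break;
        case 2: galerkin ? push_px_impl<2,1>() : push_px_impl<2,0>(); break;
        case 3: galerkin ? push_px_impl<3,1>() : push_px_impl<3,0>(); break;
        default: break; // unsupported shape order
        }
    }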

Finally, when looking at the assembly, keep in mind that instructions like add and move are much cheaper than loads and stores to global (and thread-local) memory.

AlexanderSinn (Member) commented:

Yes, that is correct. By a dynamic index I mean the index used when the array is accessed (not when it is defined), as in

    // Gather field on particle Exp from field on grid ex_arr
    for (int iz=0; iz<=depos_order; iz++){
        for (int iy=0; iy<=depos_order; iy++){
            for (int ix=0; ix<= depos_order - galerkin_interpolation; ix++){
                Exp += sx_ex[ix]*sy_ex[iy]*sz_ex[iz]*
                       ex_arr(lo.x+j_ex+ix, lo.y+k_ex+iy, lo.z+l_ex+iz);
            }
        }
    }

for example, ix in sx_ex[ix]. This is a special case where the loop bounds are known at compile time; if the loop is fully unrolled by the compiler, ix is also known at compile time and local memory can be avoided. However, unrolling the loop requires many more registers. And I suspect that because the array can be swapped when it is initialized ((ex_type[zdir] == NODE) ? sz_node : sz_cell;), it has to use local memory anyway.
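One hedged sketch of a possible workaround (my illustration, not the WarpX implementation): select the weight value per element with a ternary instead of binding a reference to one of the two arrays, so both candidate arrays are only ever indexed by the unrollable loop variables; sx_ex and sy_ex would need the same treatment.

    // Illustration only: the gather loop from above without the reference swap.
    // sz_node and sz_cell are now indexed purely by the loop variable iz, which
    // the compiler can keep in registers once the loops are unrolled.
    const bool ex_node = (ex_type[zdir] == NODE);
    for (int iz=0; iz<=depos_order; iz++){
        const amrex::Real sz = ex_node ? sz_node[iz] : sz_cell[iz];
        for (int iy=0; iy<=depos_order; iy++){
            for (int ix=0; ix<= depos_order - galerkin_interpolation; ix++){
                Exp += sx_ex[ix]*sy_ex[iy]*sz*
                       ex_arr(lo.x+j_ex+ix, lo.y+k_ex+iy, lo.z+l_ex+iz);
            }
        }
    }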

@liuyuchuncn liuyuchuncn changed the title Kernel : PushPX 50% slower than A100 Optimized PushPX Sep 6, 2024
liuyuchuncn (Author) commented Sep 6, 2024

Hi @AlexanderSinn, I agree with you.
At run time only two of the arrays are actually used, for example sz_node and sz_cell_v. But at compile time the compiler does not know which ones, so it needs to allocate space for all four arrays (sz_node, sz_cell, sz_node_v, sz_cell_v), and with insufficient registers some of them can only be placed in local memory.

(Same shape-factor code as quoted in the comment above.)

In addition, if the compiler knows the loop trip counts, it will unroll the loops, which increases register use and causes greater register pressure (this applies to the gather loop quoted above).
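A hedged, generic CUDA illustration of that trade-off (a hypothetical kernel, not WarpX code): capping unrolling keeps register use down at the cost of compile-time indices, and __launch_bounds__ caps the register budget the compiler may spend per thread.

    // Generic illustration of steering the unrolling/register trade-off.
    __global__ void __launch_bounds__(256) gather_demo (const double* w,
                                                        const double* f,
                                                        double* out)
    {
        double acc = 0.0;
    #pragma unroll 1                      // keep the loop rolled: fewer registers,
        for (int i = 0; i < 4; ++i) {     // but indices are no longer compile-time
            acc += w[i] * f[4*threadIdx.x + i];
        }
        out[threadIdx.x] = acc;
    }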

liuyuchuncn (Author) commented Sep 6, 2024

The inputs file I used is automated_test_1_uniform_rest_32ppc, with the following cell configuration added:

    max_step = 100
    amr.n_cell = 128 128 192
    amr.max_grid_size = 256
    amr.blocking_factor = 32

Local-memory access issues can be reduced by putting the arrays into shared memory. The pseudocode is modified as follows:

    // dynamic shared memory: 4 arrays of (depos_order + 1) reals per thread,
    // i.e. (depos_order + 1) * 4 * blockDim.x reals per block
    extern __shared__ amrex::Real sshared[];
    amrex::Real * sz_node   = &sshared[threadIdx.x * (depos_order + 1) * 4 + 0*(depos_order + 1)];
    amrex::Real * sz_cell   = &sshared[threadIdx.x * (depos_order + 1) * 4 + 1*(depos_order + 1)];
    amrex::Real * sz_node_v = &sshared[threadIdx.x * (depos_order + 1) * 4 + 2*(depos_order + 1)];
    amrex::Real * sz_cell_v = &sshared[threadIdx.x * (depos_order + 1) * 4 + 3*(depos_order + 1)];

The performance improvement is only 3%.
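For reference, a hedged sketch (plain CUDA; num_particles, push_px_kernel, and the kernel arguments are made-up names) of how the dynamic shared-memory size for this layout would be passed at launch:

    // Hypothetical launch: 4 arrays of (depos_order + 1) reals per thread.
    constexpr int depos_order = 3;
    constexpr int threads = 256;
    const int blocks = (num_particles + threads - 1) / threads;
    const std::size_t shmem_bytes =
        std::size_t(threads) * 4 * (depos_order + 1) * sizeof(amrex::Real);
    push_px_kernel<<<blocks, threads, shmem_bytes>>>(/* kernel arguments */);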

liuyuchuncn (Author) commented:

Shared memory was useful for reducing local memory, so I will close this issue.
