Redesigned "caar loop pre-boundary exchange", tuned for Frontier #6522

Open
wants to merge 6 commits into master

Conversation

trey-ornl
Contributor

This pull request attempts to provide Frontier optimizations like those used in the 2023 Gordon Bell Climate runs, but with a software architecture that meets the requirements of https://acme-climate.atlassian.net/wiki/x/oICd6. Here is a summary of the changes.

  • Create multiple new structs in SphereOperators.hpp to use for the new Caar pre-boundary exchange.
  • Add new file CaarFunctorImpl.cpp that implements the new caar_compute function and template functions with Kokkos loops.
  • Update CaarFunctorImpl.hpp with new functions, a different number of buffers, and an #if to turn the new caar_compute on or off. If we adopt the new implementation permanently, significant code can be eliminated from this file.
  • Add new viewAsReal functions in ViewUtils.hpp.
  • Remove LaunchBounds<512,1> calls, which I think are incorrect for AMD GPUs, where the Kokkos teams sometimes use 1024 threads.
  • Update ExecSpaceDefs.cpp to use newer Kokkos command-line argument for MPI.
  • Move to Cray Programming Environment cpe/23.12 and ROCm rocm/5.7.1.
  • Add frontier.cmake for standalone Homme builds.
  • Delete old Crusher files for CMake.

I confirmed that the modified code passes the caar_ut unit test using frontier-bfb.cmake, and I ran a single-node performance test of stand-alone Homme provided to me by @oksana, building with the new frontier.cmake. Here are the walltotal times recorded in HommeTime_stats for "caar compute".

Existing Code (#if 0): 153.6221 s
New Code (#if 1): 4.777188 s

I feel like this improvement is "too much." I confirmed that the outputs of the performance runs matched within roundoff, and I confirmed the performance difference using rocprof.

I then compiled CaarFunctor with -Rpass-analysis=kernel-resource-usage. While all the kernels of the new code had a grand total of just one vector-register spill, the monolithic existing kernel had 156115 scalar-register spills and 2820 vector-register spills using rocm/5.7.1. Is this enough to explain the performance gulf? I wonder if current E3SM runs are seeing this slowdown.

Here are a few caveats about this pull request.

  • I don't know how it performs on NVIDIA GPUs. I am hopeful that it increases performance.
  • I don't know if it even works on CPUs. It uses standard Kokkos, but it relies on active team parallelism (instead of TeamThreadRange loops).

ambrad removed their request for review on Jul 23, 2024, 16:20
@ambrad
Member

ambrad commented Jul 23, 2024

This PR looks like it will be quite a bit of effort to merge in its current state. Perhaps it's premature to post it. Perhaps it would be better to discuss the branch on a Confluence page before finalizing a PR.

Trey, when you have a chance, would you post full timer output for the two runs in which you see the 32x performance discrepancy?

@trey-ornl
Copy link
Contributor Author

> This PR looks like it will be quite a bit of effort to merge in its current state. Perhaps it's premature to post it. Perhaps it would be better to discuss the branch on a Confluence page before finalizing a PR.
>
> Trey, when you have a chance, would you post full timer output for the two runs in which you see the 32x performance discrepancy?

I created the following Confluence page for discussion. It includes the full timer output.
https://acme-climate.atlassian.net/wiki/x/JwDvCgE

Contributor

bartgol left a comment

I have some preliminary comments. As Andrew said, this PR is quite big, and it may be worth seeing if/how we can chunk it. But before then, we should find a way to make things work for the Serial/OpenMP/Cuda backends.

du += dvvy[j] * t0.sv(x,j);
dv += dvvx[j] * t1.sv(j,y);
}
if (rrdmd == 0) rrdmd = (1.0 / metdet) * scale_factor_inv;
Contributor

This seems to always be true. Maybe you could just do it in the constructor?

Contributor Author

Divisions are expensive on AMD GPUs. The attempt here is to do the division only once per kernel, even if the kernel calls div or vort (or both) more than once, and to do it zero times if the kernel never calls div or vort (but calls only grad, for example).
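
For readers skimming this thread, here is a minimal standalone sketch of the lazy-division pattern being described; the member names mirror the snippet, but the struct and the div/vort bodies are illustrative stand-ins, not the PR's code.

```cpp
// Minimal sketch of the lazy-reciprocal pattern (illustrative names/bodies).
struct BlockOpsSketch {
  double metdet = 2.0;            // would come from the geometry views
  double scale_factor_inv = 0.5;
  mutable double rrdmd = 0.0;     // 0 means "not yet computed"

  // At most one division per kernel, and only if div/vort is actually called.
  double reciprocal() const {
    if (rrdmd == 0) rrdmd = (1.0 / metdet) * scale_factor_inv;
    return rrdmd;
  }
  double div(double du, double dv) const  { return (du + dv) * reciprocal(); }
  double vort(double du, double dv) const { return (du - dv) * reciprocal(); }
  // grad-only kernels never call reciprocal(), so they pay for no division
};
```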

Contributor

I see. The F90 data structures do store rmetdet=1/metdet, but the C++ ones don't. I don't recall why we chose to do that. Perhaps we should do as in F90, since 1/metdet does not change across the run.

Contributor Author

Want me to try converting to a stored rmetdet and see how performance compares?

Contributor

No need for that. To avoid doing the division at every kernel call, we'd have to precompute the rmetdet view during init and store it in the ElementsGeometry structure. I don't think that's a change for this PR. I was just thinking out loud :)
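
For context, a rough sketch of what such an init-time precomputation could look like, written as a hypothetical free function; NP, Real, and the view layout just mirror the snippets in this review, and nothing here touches ElementsGeometry.

```cpp
#include <Kokkos_Core.hpp>

constexpr int NP = 4;
using Real = double;

// Hypothetical helper: compute rmetdet = 1/metdet once at init so kernels
// never pay for the division at run time.
Kokkos::View<Real *[NP][NP]> make_rmetdet(
    const Kokkos::View<const Real *[NP][NP]> &metdet) {
  const int num_elems = static_cast<int>(metdet.extent(0));
  Kokkos::View<Real *[NP][NP]> rmetdet("rmetdet", num_elems);
  Kokkos::parallel_for(
      "compute_rmetdet", Kokkos::RangePolicy<>(0, num_elems * NP * NP),
      KOKKOS_LAMBDA(const int idx) {
        const int e = idx / (NP * NP);
        const int x = (idx / NP) % NP;
        const int y = idx % NP;
        rmetdet(e, x, y) = 1.0 / metdet(e, x, y);
      });
  return rmetdet;
}
```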


KOKKOS_INLINE_FUNCTION void vortInit(SphereBlockScratch &t0, SphereBlockScratch &t1, const Real v0, const Real v1) const
{
t0.sv(x,y) = g.d(e,0,0,x,y) * v0 + g.d(e,0,1,x,y) * v1;
Contributor

I noticed for dinv you store a copy of the point-local 2x2 matrix in the class, while for d you don't. Would it benefit to store d locally too?

Contributor Author

Part of the tuning process was trying different combinations of local copies. A local copy of dinv happened to show speedup, while a local copy of d did not. I hypothesize that the tiny L1 cache happens to be big enough to reuse the required d values without making an explicit local copy.

const ExecViewManaged<const Real *[NP][NP]> metdet;
const Real scale_factor_inv;

SphereGlobal(const SphereOperators &op):
Contributor

Since SphereGlobal stores stuff that is already in SphereOperators, what is the point of SphereGlobal objects? One could simply pass around a SphereOperator object, no?

Contributor Author

When a lambda captures an object, it makes a full copy of that object, regardless of what members the lambda actually uses. The SphereGlobal struct attempts to minimize the number of SphereOperators members that get copied into the kernel. (When a Kokkos::View gets copied, it copies only the array descriptor, not all the values of the array. But SphereOperators still has more of them than are actually used by each kernel.)
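
A rough illustration of the capture-size argument; the types and members below are stand-ins, not the PR's actual SphereOperators or SphereGlobal.

```cpp
#include <Kokkos_Core.hpp>

constexpr int NP = 4;
using Real = double;

// A trimmed-down struct holding only what one kernel needs. Capturing it by
// value copies a couple of view descriptors plus one Real, instead of every
// member a full operators class would carry.
struct SphereGlobalSketch {
  Kokkos::View<const Real *[NP][NP]> metdet;
  Real scale_factor_inv;
};

void scale_metdet(const SphereGlobalSketch g,
                  Kokkos::View<Real *[NP][NP]> out) {
  const int n = static_cast<int>(out.extent(0)) * NP * NP;
  Kokkos::parallel_for(
      "use_small_capture", Kokkos::RangePolicy<>(0, n),
      KOKKOS_LAMBDA(const int idx) {
        const int e = idx / (NP * NP);
        const int x = (idx / NP) % NP;
        const int y = idx % NP;
        // the lambda copied g (small) rather than a large operators object
        out(e, x, y) = g.metdet(e, x, y) * g.scale_factor_inv;
      });
}
```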

struct SphereBlockOps;

struct SphereBlockScratch {
SphereBlockScratchView v;
Contributor

I don't see the v part of this type ever used. So maybe we only need (NP,NP) scratch views?

Contributor Author

It provides the actual space for the sv subviews.

sv(Kokkos::subview(v, b.z % SPHERE_BLOCK_LEV, Kokkos::ALL, Kokkos::ALL))

Each vertical level uses a separate (NP,NP) piece of the shared array.
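
To make the layout explicit, here is a minimal sketch of how a backing scratch view and the per-level (NP,NP) slices could relate; the sizes and type names are assumptions modeled on this thread, not the PR's exact definitions.

```cpp
#include <Kokkos_Core.hpp>

constexpr int NP = 4;
constexpr int SPHERE_BLOCK_LEV = 8;   // illustrative block size
using Real = double;
using ScratchSpace = Kokkos::DefaultExecutionSpace::scratch_memory_space;

// Backing storage: one (NP,NP) tile per vertical level in the block.
using SphereBlockScratchViewSketch =
    Kokkos::View<Real[SPHERE_BLOCK_LEV][NP][NP], ScratchSpace,
                 Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

// Each level z gets its own (NP,NP) slice of the shared backing array.
KOKKOS_INLINE_FUNCTION
auto level_tile(const SphereBlockScratchViewSketch &v, const int z) {
  return Kokkos::subview(v, z % SPHERE_BLOCK_LEV, Kokkos::ALL, Kokkos::ALL);
}
```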

const SphereGlobal &g;
const Team &t;
Real scale_factor_inv;
Real metdet;
Contributor

Why do you think you need to store these local values (like dvvx, dvvy, metdet, ...)? Does it save on indexing time? I didn't expect this to matter much, tbh.

Contributor Author

I don't know the particular reason each local copy wins. It was all empirical. If a value was reused, I tried making a local copy. If the result was faster, I kept it.
Possible reasons are that it keeps the compiler from computing indices or loading values more than once. Integer calculation for array indices can be surprisingly expensive on AMD GPUs.

metdet = g.metdet(e,x,y);
for (int i = 0; i < 2; i++) for (int j = 0; j < 2; j++) dinv[i][j] = g.dinv(e,i,j,x,y);
for (int j = 0; j < NP; j++) {
dvvx[j] = g.dvv(x,j);
Contributor

Since you subview the first index, maybe you could just make dvvx/dvvy two Kokkos subviews of g.dvv, to avoid a copy?

Contributor Author

Because SphereBlockOps objects are on the kernel stack, dvvx and dvvy only exist in vector registers. Each "copy" is a single load that is required regardless. And Kokkos subviews can be register hogs.


const Real v0 = state_v(b.e,data_n0,0,b.x,b.y,b.z) * state_dp3d(b.e,data_n0,b.x,b.y,b.z);
const Real v1 = state_v(b.e,data_n0,1,b.x,b.y,b.z) * state_dp3d(b.e,data_n0,b.x,b.y,b.z);
buffers_vdp(b.e,0,b.x,b.y,b.z) = v0;
Contributor

The buffers are allocated based on the number of concurrent slots needed, but with SphereBlockOps (as well as with SphereColOps and SphereScanOps) we access them with the element index. We need to iron this out, so that we can use "team_idx" instead of the element index for the buffers.

Contributor Author

I use buffers to store values across kernels, so I need separate values for every element. I think you are saying that the buffers are actually allocated based on the number of teams, and the number of teams being the same as the number of local elements is a coincidence that I should not rely upon?

Contributor

Ah, I did not notice that the epochs are now separate kernels (rather than separate parts of the same kernel). That's a big change. It reminds me of what we did in SCREAM for some physics parametrizations, where an implementation with many smaller kernels was better than a monolithic one. But we should definitely see if this impl is also good for CUDA.

That said, yes, the buffers are allocated based on the maximum number of concurrent teams. If the number of (rank-local) elements is large enough, even on GPU we can get num_buffer_slots<num_elements, which means those buffers may be smaller than num_elems. This is pretty much always the case on CPU, btw, so this impl needs to be adjusted to be portable.
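
Restating the sizing distinction as a generic sketch; the concurrency estimate below is illustrative and is not Homme's actual buffer-manager logic.

```cpp
#include <Kokkos_Core.hpp>
#include <algorithm>

// Buffers that must carry values across separate kernel launches need one
// slot per rank-local element; transient buffers reused within one launch
// only need as many slots as teams that can run concurrently. On CPU the
// concurrent-team count is typically far smaller than num_elems.
void allocate_buffer_sketch(const int num_elems, const int team_size) {
  const int max_concurrent_teams =
      std::max(1, Kokkos::DefaultExecutionSpace().concurrency() / team_size);
  const int num_slots = std::min(num_elems, max_concurrent_teams);

  Kokkos::View<double **> cross_kernel("cross_kernel", num_elems, 128);
  Kokkos::View<double **> within_kernel("within_kernel", num_slots, 128);
  (void)cross_kernel;
  (void)within_kernel;
}
```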

Contributor Author

How about the following to-do for me: convert to using a separate set of buffers, created with the required sizes.

Contributor

You mean make the new impl use different buffers from the original impl? That would work. But related to that, I am getting a bit concerned about how implementations for different backends may start to dramatically diverge. So yes, do the fix you suggest, but we need to start verifying other backends.

using Team = TeamPolicy::member_type;

static constexpr int NPNP = NP * NP;
static constexpr int WARP_SIZE = warpSize;
Contributor

I think warpSize is a HIP specific variable?

Contributor Author

I think CUDA has it, but it's not a compile-time constant. I need a compile-time workaround. #ifdef? Other suggestions?

Contributor

It's also not defined for serial/openmp, which means we do need some sort of CPP guard. I guess #ifdef would do.
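
A minimal sketch of such a guard, assuming the usual Kokkos backend macros; the fallback constants are assumptions, not values taken from this PR.

```cpp
// Pick a compile-time warp/wavefront size per backend.
#if defined(KOKKOS_ENABLE_HIP)
static constexpr int WARP_SIZE = 64;   // AMD wavefront
#elif defined(KOKKOS_ENABLE_CUDA)
static constexpr int WARP_SIZE = 32;   // NVIDIA warp
#else
static constexpr int WARP_SIZE = 1;    // Serial/OpenMP: no warps
#endif
```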

const int ixy = tr / SPHERE_BLOCK_LEV;
x = ixy / NP;
y = ixy % NP;
const int dz = tr % SPHERE_BLOCK_LEV;
Contributor

This snippet confuses me. I don't think team_rank() uniquely identifies a thread in a warp; rather, it identifies the warp as a whole. So I'm confused as to why this code can work correctly for all levels.

In Kokkos, team_rank() returns threadIdx.y, but a block (i.e., a team) is 2d.

Contributor Author

For SphereBlockOps::policy, the team_rank() does uniquely identify a thread, because the Kokkos "vector" dimension is one (no third argument in the constructor):

return TeamPolicy(num_elems * SPHERE_BLOCKS_PER_COL, SPHERE_BLOCK).

static constexpr int SPHERE_BLOCK = NPNP * SPHERE_BLOCK_LEV;

Other team policies do have vector lengths greater than one, so their team_rank() values do not uniquely identify a thread. This is the case for the existing code and for SphereScanOps::policy in the new code (three arguments in the constructor):
return TeamPolicy(num_elems, NPNP, WARP_SIZE);
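
A self-contained sketch of the contrast; the constants below are illustrative and may not match the PR's actual values.

```cpp
#include <Kokkos_Core.hpp>

using TeamPolicy = Kokkos::TeamPolicy<>;
using Team = TeamPolicy::member_type;

void policy_contrast(const int num_elems) {
  constexpr int NP = 4, NPNP = NP * NP;
  constexpr int SPHERE_BLOCK_LEV = 8, SPHERE_BLOCK = NPNP * SPHERE_BLOCK_LEV;
  constexpr int SPHERE_BLOCKS_PER_COL = 1, WARP_SIZE = 64;

  // SphereBlockOps-style: no vector length, so team_rank() alone is a unique
  // thread id in [0, SPHERE_BLOCK), and (x, y, dz) can be decoded from it.
  TeamPolicy block_policy(num_elems * SPHERE_BLOCKS_PER_COL, SPHERE_BLOCK);

  // SphereScanOps-style: each of the NPNP team ranks owns WARP_SIZE vector
  // lanes, so team_rank() does not identify a single thread by itself.
  TeamPolicy scan_policy(num_elems, NPNP, WARP_SIZE);

  Kokkos::parallel_for("decode_rank", block_policy,
      KOKKOS_LAMBDA(const Team &t) {
        const int tr  = t.team_rank();
        const int ixy = tr / SPHERE_BLOCK_LEV;
        const int x   = ixy / NP;
        const int y   = ixy % NP;
        const int dz  = tr % SPHERE_BLOCK_LEV;
        (void)x; (void)y; (void)dz;
      });
  (void)scan_policy;
}
```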

@rljacob
Member

rljacob commented Aug 15, 2024

Is this ready?

@trey-ornl
Contributor Author

> Is this ready?

I doubt it. It has only been tested on Frontier.

@rljacob
Member

rljacob commented Sep 5, 2024

How do you want to test this on other machines? @trey-ornl do you have perlmutter access?

@trey-ornl
Contributor Author

trey-ornl commented Sep 5, 2024

> How do you want to test this on other machines? @trey-ornl do you have perlmutter access?

Sorry, no, I don't have Perlmutter access.

Once I have E3SM-Project/scream#2969 through on scream, I want to retest performance on Frontier.

trey-ornl added a commit to trey-ornl/scream that referenced this pull request Sep 24, 2024
trey-ornl added a commit to trey-ornl/scream that referenced this pull request Oct 4, 2024
@rljacob
Member

rljacob commented Oct 31, 2024

@trey-ornl did you get a chance to try on perlmutter?

@rljacob
Member

rljacob commented Oct 31, 2024

Notes: need to check CPU performance. Look for a less-invasive way to do it.

@trey-ornl
Contributor Author

> @trey-ornl did you get a chance to try on perlmutter?

Yes. It also sped up on pm-gpu, but it slowed down pm-cpu too much. I'm working on improvements for pm-cpu.
