
[WIP] A new shared memory collectives component #10470

Draft: wants to merge 21 commits into base: main

Conversation
@devreal (Contributor) commented Jun 14, 2022

This PR adds coll/smdirect, a clone of coll/sm that relies on the cross-process memory mapping provided by XPMEM. In contrast to coll/sm, data is not copied into an intermediate buffer; instead, buffers are registered with XPMEM and the access keys are exchanged, after which processes copy data directly from the source buffer to the target buffer. A synchronization mechanism similar to coll/sm's, based on atomic flags, is used to wait for data availability. We have currently implemented broadcast, barrier, reduce, and allreduce (as a combination of reduce and bcast). Eventually, this component should replace the (apparently unmaintained) coll/sm component.
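
For context, below is a minimal sketch of the raw XPMEM export/attach/copy flow described above. It is illustrative only: the function names are made up, the real component goes through the opal/smsc framework rather than calling XPMEM directly, and it additionally has to handle page alignment, attachment caching, and error paths.

```c
/* Conceptual sketch of XPMEM single-copy data movement.
 * Illustrative only -- not the coll/smdirect code, which uses opal/smsc. */
#include <string.h>
#include <xpmem.h>

/* Root: expose the source buffer so local peers can attach to it.
 * Real code must round the address/size to page boundaries. */
xpmem_segid_t expose_buffer(void *buf, size_t size)
{
    /* XPMEM_PERMIT_MODE with 0666 lets any local process attach. */
    return xpmem_make(buf, size, XPMEM_PERMIT_MODE, (void *) 0666);
}

/* Peer: attach the root's buffer into the local address space (using the
 * exchanged segment id) and copy directly from it -- no intermediate
 * copy buffer involved. */
int pull_from_root(xpmem_segid_t segid, size_t size, void *dst)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE,
                                  (void *) 0666);
    if (-1 == apid) return -1;

    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    void *src = xpmem_attach(addr, size, NULL);
    if ((void *) -1 == src) { xpmem_release(apid); return -1; }

    memcpy(dst, src, size);   /* single copy: source -> destination */

    xpmem_detach(src);
    xpmem_release(apid);
    return 0;
}
```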

Below are some performance measurements on Hawk (a 2x64-core AMD EPYC system installed at HLRS), with min/max/avg taken from the OSU benchmarks: the new component shows good bandwidth for larger messages. For small messages, however, both coll/tuned and coll/sm show significantly lower minimum times (and thus lower averages) due to buffering (the intermediate buffer in coll/sm, eager messages in coll/tuned). This will be addressed in coll/smdirect in future work by buffering small messages in a pre-registered buffer. The maximum times (the longest time any process spends in the collective) are interesting: there coll/smdirect is competitive for reduce operations and provides significant improvements for large broadcasts. The current implementation of coll/allreduce

(Attached plots: OSU results for reduce, bcast, and allreduce on Hawk with coll/smdirect.)

In the OSU benchmarks, the barrier implementation in coll/sm is faster (2.5 µs) than coll/smdirect (5.3 µs), since coll/sm is set up to partially overlap the execution of two consecutive barriers. Both are faster than coll/tuned (6.3 µs), though.
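
To illustrate the kind of synchronization being compared, here is a minimal sketch of a flag-based shared-memory barrier using a shared counter with sense reversal. This is not the actual coll/sm or coll/smdirect code; it only shows the atomic-flag style of handshake under discussion.

```c
/* Minimal sketch of a flag-based shared-memory barrier (sense reversal).
 * Illustrative only -- not the coll/sm or coll/smdirect implementation. */
#include <stdatomic.h>

typedef struct {
    atomic_int count;  /* processes still to arrive in the current round */
    atomic_int sense;  /* flips every round so consecutive barriers do
                          not interfere with each other */
    int        nprocs;
} shm_barrier_t;       /* lives in a shared-memory segment; initialize
                          with count = nprocs, sense = 0 */

void shm_barrier(shm_barrier_t *b, int *local_sense /* starts at 0 */)
{
    int my_sense = !*local_sense;   /* sense value for this round */
    *local_sense = my_sense;

    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last process to arrive: reset the counter for the next round,
         * then flip the shared flag to release everyone else. */
        atomic_store(&b->count, b->nprocs);
        atomic_store(&b->sense, my_sense);
    } else {
        /* Everyone else spins on the shared flag until it flips. */
        while (atomic_load(&b->sense) != my_sense) {
            /* busy-wait; a real implementation would back off */
        }
    }
}
```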

This PR is work in progress. Things to do:

  • Add NUMA awareness.
  • Add more collective operations and refine the allreduce implementation.
  • Add small-data buffering, similar to coll/sm, to improve latency.
  • Investigate the use of smsc components that do not provide mapping capabilities (cma, knem).
  • Investigate interaction with coll/han (for the node-local portion).
  • Clean up code and commits.

This PR includes the fixes to smsc/xpmem in #10127 and needs to be rebased once that is merged.

This is a copy of coll/sm leveraging the opal/smsc component
for single-copy collectives.


Signed-off-by: Joseph Schuchart <[email protected]>
Children wait for all their segments to be consumed; parents wait
at the start of an op for their children to complete.

Signed-off-by: Joseph Schuchart <[email protected]>
@devreal devreal requested a review from bosilca June 14, 2022 14:43
@devreal devreal marked this pull request as draft June 22, 2022 14:50
@gpaulsen (Member) commented:
@devreal What's the status of this? I assume this is a post-v5.0 feature?
