
[WIP] A new shared memory collectives component #10470

Draft: wants to merge 21 commits into base: main

Conversation
@devreal (Contributor) commented Jun 14, 2022

This PR adds coll/smdirect, a clone of coll/sm that relies on the cross-process memory mapping provided by XPMEM. In contrast to coll/sm, data is not copied into an intermediate buffer; instead, buffers are registered with XPMEM and the access keys are exchanged, after which processes copy data directly from the source buffer to the target buffer. A synchronization mechanism similar to coll/sm's, based on atomic flags, is used to wait for data availability. We have currently implemented broadcast, barrier, reduce, and allreduce (as a combination of reduce and bcast). Eventually, this component should replace the (apparently unmaintained) coll/sm component.
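
For context, below is a minimal sketch of the raw XPMEM export/attach/copy flow described above. It is illustrative only: the function names are made up, the real component goes through the opal/smsc framework rather than calling XPMEM directly, and it additionally has to handle page alignment, attachment caching, and error paths.

```c
/* Conceptual sketch of XPMEM single-copy data movement.
 * Illustrative only -- not the coll/smdirect code, which uses opal/smsc. */
#include <string.h>
#include <xpmem.h>

/* Root: expose the source buffer so local peers can attach to it.
 * Real code must round the address/size to page boundaries. */
xpmem_segid_t expose_buffer(void *buf, size_t size)
{
    /* XPMEM_PERMIT_MODE with 0666 lets any local process attach. */
    return xpmem_make(buf, size, XPMEM_PERMIT_MODE, (void *) 0666);
}

/* Peer: attach the root's buffer into the local address space (using the
 * exchanged segment id) and copy directly from it -- no intermediate
 * copy buffer involved. */
int pull_from_root(xpmem_segid_t segid, size_t size, void *dst)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE,
                                  (void *) 0666);
    if (-1 == apid) return -1;

    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    void *src = xpmem_attach(addr, size, NULL);
    if ((void *) -1 == src) { xpmem_release(apid); return -1; }

    memcpy(dst, src, size);   /* single copy: source -> destination */

    xpmem_detach(src);
    xpmem_release(apid);
    return 0;
}
```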

Below are some performance measurements on Hawk (a 2x64-core AMD EPYC system installed at HLRS), with min/max/avg taken from the OSU benchmarks: the new component shows good bandwidth for larger messages. For small messages, however, both coll/tuned and coll/sm show significantly lower minimum times (and thus lower averages) due to buffering (the intermediate buffer in coll/sm, eager messages in coll/tuned). This will be addressed in coll/smdirect in future work by buffering small messages in a pre-registered buffer. The maximum times (the longest time any process spends in the collective) are interesting: there coll/smdirect is competitive for reduce operations and provides significant improvements for large broadcasts. The current implementation of coll/allreduce

(Attached plots: OSU results for reduce, bcast, and allreduce on Hawk with coll/smdirect.)

In the OSU benchmarks, the barrier implementation in coll/sm is faster (2.5 µs) than coll/smdirect (5.3 µs), since coll/sm is set up to partially overlap the execution of two consecutive barriers. Both are faster than coll/tuned (6.3 µs), though.
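
To illustrate the kind of synchronization being compared, here is a minimal sketch of a flag-based shared-memory barrier using a shared counter with sense reversal. This is not the actual coll/sm or coll/smdirect code; it only shows the atomic-flag style of handshake under discussion.

```c
/* Minimal sketch of a flag-based shared-memory barrier (sense reversal).
 * Illustrative only -- not the coll/sm or coll/smdirect implementation. */
#include <stdatomic.h>

typedef struct {
    atomic_int count;  /* processes still to arrive in the current round */
    atomic_int sense;  /* flips every round so consecutive barriers do
                          not interfere with each other */
    int        nprocs;
} shm_barrier_t;       /* lives in a shared-memory segment; initialize
                          with count = nprocs, sense = 0 */

void shm_barrier(shm_barrier_t *b, int *local_sense /* starts at 0 */)
{
    int my_sense = !*local_sense;   /* sense value for this round */
    *local_sense = my_sense;

    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last process to arrive: reset the counter for the next round,
         * then flip the shared flag to release everyone else. */
        atomic_store(&b->count, b->nprocs);
        atomic_store(&b->sense, my_sense);
    } else {
        /* Everyone else spins on the shared flag until it flips. */
        while (atomic_load(&b->sense) != my_sense) {
            /* busy-wait; a real implementation would back off */
        }
    }
}
```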

This PR is work in progress. Things to do:

  • Add NUMA awareness.
  • Add more collective operations and refine the allreduce implementation.
  • Add small-data buffering, similar to coll/sm, to improve latency.
  • Investigate the use of smsc components that do not provide mapping capabilities (cma, knem).
  • Investigate interaction with coll/han (for the node-local portion).
  • Clean up code and commits.

This PR includes the fixes to smsc/xpmem in #10127 and needs to be rebased once that is merged.

This is a copy of coll/sm leveraging the opal/smsc component
for single-copy collectives.


Signed-off-by: Joseph Schuchart <[email protected]>
Children wait for all their segments to be consumed; parents wait
at the start of an op for their children to complete.

Signed-off-by: Joseph Schuchart <[email protected]>
@devreal devreal requested a review from bosilca June 14, 2022 14:43
@devreal devreal marked this pull request as draft June 22, 2022 14:50
@gpaulsen (Member) commented:
@devreal What's the status of this? I assume this is a post-v5.0 feature?
