Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osc/rdma: Handle remote completion behavior #9975

Merged
merged 13 commits into from
Feb 16, 2022

Conversation

bwbarrett
Copy link
Member

This patch series started to address data corruption issues due to BTLs not providing the expected remote completion. Fixing this required two changes: 1) "accelerated" BTLs (the fast case, OFI, GNI, etc.) require REMOTE_COMPLETION or are disabled and 2) the AM RDMA compatibility interface was refactored to only use the underlying BTL RDMA functions if the BTL provides remote completion.

To implement the alternate BTL AM RDMA behavior, we use the explicit AM RDMA interface (instead of asking AM RDMA to extend the BTL), which required refactoring most of the BTL initialization code. To use the explicit AM RDMA interface, we ended up refactoring the communication calls a bit to avoid lots of extra if statements through the communication code.

While rewriting the alternate btl initialization, the btl selection bug in #7830 and #9630 should be fixed, including the btl selection impedance mismatch. It is no longer possible to specify alternate BTLs in osc rdma, as it now follows the active BTL list (similar to the OB1 PML).

@bwbarrett
Copy link
Member Author

Rebased on top of the recently added SM fixes, and added a priority adjustment patch now that the OSC SM component works again.

@wzamazon
Copy link
Contributor

wzamazon commented Feb 8, 2022

Did a quick look, and all looks good to me. Thank you!

I will checkout the branch to read again.

@bwbarrett bwbarrett force-pushed the bugfix/rdma-osc branch 2 times, most recently from a925b33 to 4281d4a Compare February 9, 2022 17:38
@bwbarrett
Copy link
Member Author

Addressed Wei's feedback about the leftover comment around query_accelerated_btls() and a NULL module pointer, fixed the rebase error that somehow crept in (I think it was a bad manual resolution of a rebase) that caused the rdma component to never work, and fixed an error in the am-rdma wrapper code that happened in further testing with BTLs that support get() operations but not remote completion of puts.

Copy link
Contributor

@wzamazon wzamazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! overall looks good to me.

I left two comments. One is minor. The other is about a bug in use_cpu_atomics in the original code, which I think can be addressed in this PR because you are working on this part of code. or we can fix it by a separate PR.

ompi/mca/osc/rdma/osc_rdma_component.c Outdated Show resolved Hide resolved
ompi/mca/osc/rdma/osc_rdma_component.c Outdated Show resolved Hide resolved
The put over get implementation passed the wrong pointer to the
cbdata of the get call, resulting in all kinds of badness when
the get completed.  This patch passes through the expected pointer.

Signed-off-by: Brian Barrett <[email protected]>
Clean up language around btl selection phases to match rest of the
code (accelerated/alternate).

Also switch component initialization code to use opal_output_verbose
rather than OSC_RDMA_VERBOSE, so that the debugging prints can
be enabled on a non-debug build.

Signed-off-by: Brian Barrett <[email protected]>
See lengthy comment in the change, but this patch removes the
ability of users to specify a subset of available btls for use
by the osc rdma component.  The BTL interface was never designed
for such usage (which is why there is no similar option for the
OB1 PML) and it had clear places where it broke, so remove it.

Signed-off-by: Brian Barrett <[email protected]>
With the change to load all btls in the alternate case and the
removal of the logic to change priority based on alternate or
accelerated btls, there's no point in running the full btl
query in osc_rdma_component_query().  Instead, just verify
that there is some number of BTLs available, which is the
extent of testing in the alternate case.

Signed-off-by: Brian Barrett <[email protected]>
Refactor wrapper calls around the btl atomic code into their own
header, removing them from the lock interface header.  Clean up
some resulting header dependency changes.

Signed-off-by: Brian Barrett <[email protected]>
Add wrappers to call all BTL RDMA functions.  This commit just adds
another layer of indirection to the communication calls.  However,
a follow-on patch will add logic to either call the BTL RDMA
functions directly (for accelerated mode) or call the BTL AM RDMA
compatibility layer for alternate mode.

Signed-off-by: Brian Barrett <[email protected]>
Split the storage of btls based on whether the window is using
accelerated or alternate btls.  This makes it more obvious when
the code has made assumptions about mode that may not be true
(such as the memory registration calls throughout the code that
assumed selected_btl[0] was the one true BTL).

Signed-off-by: Brian Barrett <[email protected]>
ompi_osc_rdma_btl_op() already includes a check to emulate a remote
op with a remote fetch-and-op, so there's no need for a second check
in ompi_osc_rdma_acc_single_atomic().

Signed-off-by: Brian Barrett <[email protected]>
Use the opal_min() function instead of a custom macro.

Signed-off-by: Brian Barrett <[email protected]>
The use_cpu_atomics member of the rdma module was only used during
initialization, so there was no reason for it to be a module
member.

Also clean up the initialization logic for the field (and therefore,
whether or not the shared state optimization is used).  Previously,
it appeared to be possible to use with alternate btls across multiple
nodes (ie, we tested the GLOB value on the btls), but the
use_cpu_atomics field was initialized to false in the multi-node case
and so they would never actually be enabled.  Make that more obvious,
rather than try to enable the optimization, to reduce the risk of
introducing new datapath bugs.

Signed-off-by: Brian Barrett <[email protected]>
Switch from using the implicit BTL interface (where the am-rdma
interface just extends missing functionality in the BTL) to the
new explicit interface (where the OSC RDMA interface is the
only maintainer of the BTL list.

With this change, alternate BTLs do not have to support
REMOTE_COMPLETION to be selected (because the AM RDMA interface
always provides remote completion when we request it, as this
patch does).  Any BTL that supports Active Messages (ie, all of
them) should be able to support the OSC RDMA required semantics,
eliminating the problem of creating windows with no servicable
BTLs.

Signed-off-by: Brian Barrett <[email protected]>
The RDMA OSC component is the "general" component, similar to OB1.
The OSC subsystem has gone through some priority inflation over
the years, so try to clean that up. This patch lowers the RDMA
OSC component priority from 101 to 20, leaving the priorities as:

  sm      100
  ucx     60 (if API version > 1.5.0)
  portals 20
  rdma    20

Signed-off-by: Brian Barrett <[email protected]>
Fix an assignment of request onto itself, which was a harmless typo,
but was responsible for Coverity issues CID 1429865 and CID 1429864.

Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett
Copy link
Member Author

rebased and addressed Wei's comments.

@bwbarrett bwbarrett merged commit d94e0bb into open-mpi:master Feb 16, 2022
@bwbarrett bwbarrett deleted the bugfix/rdma-osc branch February 16, 2022 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants