-
Notifications
You must be signed in to change notification settings - Fork 866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osc/rdma: Handle remote completion behavior #9975
Conversation
81a1934
to
86868df
Compare
Rebased on top of the recently added SM fixes, and added a priority adjustment patch now that the OSC SM component works again. |
Did a quick look, and all looks good to me. Thank you! I will checkout the branch to read again. |
a925b33
to
4281d4a
Compare
Addressed Wei's feedback about the leftover comment around |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! overall looks good to me.
I left two comments. One is minor. The other is about a bug in use_cpu_atomics
in the original code, which I think can be addressed in this PR because you are working on this part of code. or we can fix it by a separate PR.
The put over get implementation passed the wrong pointer to the cbdata of the get call, resulting in all kinds of badness when the get completed. This patch passes through the expected pointer. Signed-off-by: Brian Barrett <[email protected]>
Clean up language around btl selection phases to match rest of the code (accelerated/alternate). Also switch component initialization code to use opal_output_verbose rather than OSC_RDMA_VERBOSE, so that the debugging prints can be enabled on a non-debug build. Signed-off-by: Brian Barrett <[email protected]>
See lengthy comment in the change, but this patch removes the ability of users to specify a subset of available btls for use by the osc rdma component. The BTL interface was never designed for such usage (which is why there is no similar option for the OB1 PML) and it had clear places where it broke, so remove it. Signed-off-by: Brian Barrett <[email protected]>
With the change to load all btls in the alternate case and the removal of the logic to change priority based on alternate or accelerated btls, there's no point in running the full btl query in osc_rdma_component_query(). Instead, just verify that there is some number of BTLs available, which is the extent of testing in the alternate case. Signed-off-by: Brian Barrett <[email protected]>
Refactor wrapper calls around the btl atomic code into their own header, removing them from the lock interface header. Clean up some resulting header dependency changes. Signed-off-by: Brian Barrett <[email protected]>
Add wrappers to call all BTL RDMA functions. This commit just adds another layer of indirection to the communication calls. However, a follow-on patch will add logic to either call the BTL RDMA functions directly (for accelerated mode) or call the BTL AM RDMA compatibility layer for alternate mode. Signed-off-by: Brian Barrett <[email protected]>
Split the storage of btls based on whether the window is using accelerated or alternate btls. This makes it more obvious when the code has made assumptions about mode that may not be true (such as the memory registration calls throughout the code that assumed selected_btl[0] was the one true BTL). Signed-off-by: Brian Barrett <[email protected]>
ompi_osc_rdma_btl_op() already includes a check to emulate a remote op with a remote fetch-and-op, so there's no need for a second check in ompi_osc_rdma_acc_single_atomic(). Signed-off-by: Brian Barrett <[email protected]>
Use the opal_min() function instead of a custom macro. Signed-off-by: Brian Barrett <[email protected]>
The use_cpu_atomics member of the rdma module was only used during initialization, so there was no reason for it to be a module member. Also clean up the initialization logic for the field (and therefore, whether or not the shared state optimization is used). Previously, it appeared to be possible to use with alternate btls across multiple nodes (ie, we tested the GLOB value on the btls), but the use_cpu_atomics field was initialized to false in the multi-node case and so they would never actually be enabled. Make that more obvious, rather than try to enable the optimization, to reduce the risk of introducing new datapath bugs. Signed-off-by: Brian Barrett <[email protected]>
Switch from using the implicit BTL interface (where the am-rdma interface just extends missing functionality in the BTL) to the new explicit interface (where the OSC RDMA interface is the only maintainer of the BTL list. With this change, alternate BTLs do not have to support REMOTE_COMPLETION to be selected (because the AM RDMA interface always provides remote completion when we request it, as this patch does). Any BTL that supports Active Messages (ie, all of them) should be able to support the OSC RDMA required semantics, eliminating the problem of creating windows with no servicable BTLs. Signed-off-by: Brian Barrett <[email protected]>
The RDMA OSC component is the "general" component, similar to OB1. The OSC subsystem has gone through some priority inflation over the years, so try to clean that up. This patch lowers the RDMA OSC component priority from 101 to 20, leaving the priorities as: sm 100 ucx 60 (if API version > 1.5.0) portals 20 rdma 20 Signed-off-by: Brian Barrett <[email protected]>
Fix an assignment of request onto itself, which was a harmless typo, but was responsible for Coverity issues CID 1429865 and CID 1429864. Signed-off-by: Brian Barrett <[email protected]>
030e0b5
to
ad9fae9
Compare
rebased and addressed Wei's comments. |
This patch series started to address data corruption issues due to BTLs not providing the expected remote completion. Fixing this required two changes: 1) "accelerated" BTLs (the fast case, OFI, GNI, etc.) require REMOTE_COMPLETION or are disabled and 2) the AM RDMA compatibility interface was refactored to only use the underlying BTL RDMA functions if the BTL provides remote completion.
To implement the alternate BTL AM RDMA behavior, we use the explicit AM RDMA interface (instead of asking AM RDMA to extend the BTL), which required refactoring most of the BTL initialization code. To use the explicit AM RDMA interface, we ended up refactoring the communication calls a bit to avoid lots of extra if statements through the communication code.
While rewriting the alternate btl initialization, the btl selection bug in #7830 and #9630 should be fixed, including the btl selection impedance mismatch. It is no longer possible to specify alternate BTLs in osc rdma, as it now follows the active BTL list (similar to the OB1 PML).