Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

btl/base: add subsystem to support for active-message based RDMA/atomic emulation #8493

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Feb 17, 2021

No description provided.

@hjelmn
Copy link
Member Author

hjelmn commented Feb 17, 2021

Needed to provide working RMA when no RDMA-capable device is present.

@awlauria
Copy link
Contributor

bot:aws:retest

@awlauria
Copy link
Contributor

Is this series of commits intended to fix #7830 ?

Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, except for a few nit-picks.

ompi/mca/osc/rdma/osc_rdma.h Outdated Show resolved Hide resolved
opal/mca/btl/base/btl_base_am_rdma.c Outdated Show resolved Hide resolved
opal/mca/btl/base/btl_base_am_rdma.c Outdated Show resolved Hide resolved
if (NULL == operation) {
/* couldn't even allocate a small amount of memory. not much else can be done. */
BTL_ERROR(("could not allocate memory to queue active-message RDMA operation"));
abort();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more graceful way then to call abort here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please do not call abort() -- that can lead to terrabytes of corefiles. ☹️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is either abort or SEGV. IMHO abort is much more graceful. If we get here things are bad and we will abort or SEGV somewhere if we can't even allocate 64 bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abort() drops a corefile -- that's kinda anti-social.

If you need to exit, please call the btl error function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already called. In that case abort is not even reached.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait. BTL_ERROR doesn't exit so we still want it to exit here. It will SEGV which will dump core files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I can remove the aborts() but the semantics of this error will not change. It will print the error message then crash when operation is dereferenced.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree with @jsquyres on the abort thing; I think this is precisely when we want a core file (and, I believe, we call abort elsewhere). This is a should "never, ever happen" type thing, not a "we got our configuration wrong" type thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should have an option for dropping corefiles. It is very anti-social to have an MTT run fill up filesystems with terrabytes of corefiles.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't happen ever in MTT until you are expecting to completely run out of memory. But, again, in that case we really can't control whether you core or not. Up to your shell to set ulimit -c 0.

opal/mca/btl/base/btl_base_am_rdma.c Outdated Show resolved Hide resolved
ompi/mca/osc/rdma/osc_rdma.h Outdated Show resolved Hide resolved
@hjelmn hjelmn force-pushed the provide_support_for_atomics_in_all_btls_because_this_support_is_needed_to_truely_eliminate_the_awful_osc_pt2pt_component_which_has_gone_away_on_the_master_branch branch 3 times, most recently from 1dbe39d to 1fa152d Compare March 2, 2021 17:59
@awlauria
Copy link
Contributor

awlauria commented Mar 3, 2021

I resolved the conflict in the copyright. Let's move to get this in/verified.

Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes, LGTM

@rhc54 rhc54 requested a review from bwbarrett March 3, 2021 17:08
Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should rebase the branch and fix up the conflicts, not create another merge commit. @awlauria please undo this.

@awlauria
Copy link
Contributor

awlauria commented Mar 3, 2021

@bwbarrett I agree. However this pr has been open for weeks, I did it for expediency.

If we want this rebased, please ping @hjelmn to do so in a timely manner.

@hjelmn hjelmn force-pushed the provide_support_for_atomics_in_all_btls_because_this_support_is_needed_to_truely_eliminate_the_awful_osc_pt2pt_component_which_has_gone_away_on_the_master_branch branch from 6010c00 to 59de860 Compare March 3, 2021 17:34
Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving, although the header issue makes me really nervous.

opal/mca/btl/base/btl_base_am_rdma.c Show resolved Hide resolved
opal/mca/btl/base/btl_base_am_rdma.c Show resolved Hide resolved
opal/mca/btl/base/btl_base_am_rdma.c Outdated Show resolved Hide resolved
opal/mca/btl/sm/btl_sm_get.c Outdated Show resolved Hide resolved
opal/mca/btl/tcp/btl_tcp_endpoint.c Show resolved Hide resolved
opal/mca/btl/base/btl_base_am_rdma.c Outdated Show resolved Hide resolved
opal/mca/btl/base/btl_base_am_rdma.c Outdated Show resolved Hide resolved
@hjelmn hjelmn force-pushed the provide_support_for_atomics_in_all_btls_because_this_support_is_needed_to_truely_eliminate_the_awful_osc_pt2pt_component_which_has_gone_away_on_the_master_branch branch 2 times, most recently from f0100b1 to d318916 Compare March 3, 2021 20:03
This commit adds a new subsystem to the btl base: active-message
RDMA. The subsystem can be activated on any non-RDMA or partial-
RDMA btl (for example tcp) by calling:

```c
mca_btl_base_am_rdma_init(mca_btl_base_module_t *btl).
```

Once called the btl module supplied with be augmented to have
btl_get, btl_put, btl_fop, and btl_cswap functions. Flags will
be set to indicate that the BTL is using active-message
versions of these functions.

Calling this function on a btl will register a progress
function for handling retries of intiator or target-side
operations. There may be some overhead once the progress
function is registered.

Active-message "RDMA" support is not activated on any BTL
module by default.

Signed-off-by: Nathan Hjelm <[email protected]>
This commit updates osc/rdma to support using alternate BTLs when
a primary BTL is not available. There may be at most two
alternate BTLs in use at any time. The default is selected to
cover shared memory (sm) and off-node (tcp).

The priority of osc/rdma is a bit lower when using a set of
alternate btls. This will allow another osc component to win if
there is an alternative.

Signed-off-by: Nathan Hjelm <[email protected]>
This removes the emulated RDMA code provided as a fallback so btl/sm
could be used for osc/rdma. This functionality is now provided by
```btl_base_am_rdma_init ()``` so the dedicated version in btl/sm
is no longer needed.

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn hjelmn force-pushed the provide_support_for_atomics_in_all_btls_because_this_support_is_needed_to_truely_eliminate_the_awful_osc_pt2pt_component_which_has_gone_away_on_the_master_branch branch from d318916 to 214f6a6 Compare March 3, 2021 20:08
@awlauria
Copy link
Contributor

awlauria commented Mar 3, 2021

bot:aws:retest

@gpaulsen
Copy link
Member

gpaulsen commented Mar 4, 2021

ubuntu executor died with a java exception.

@gpaulsen
Copy link
Member

gpaulsen commented Mar 4, 2021

bot:aws:retest

@jsquyres
Copy link
Member

jsquyres commented Mar 4, 2021

I still see 6 uses of abort() on this PR, and I don't know if I agree with all of them. It looks like some of the uses of abort() in this PR are really assert-like checks -- i.e., "if we got here, that's an OMPI programmer error". Other abort()s are set for run-time errors outside the scope of Open MPI, such as network packet corruption. But a corefile on just the receiver (for example) without a corresponding corefile on the sender -- where the assert-like-check won't fail -- may not reveal anything useful. They'll just take up space on the filesystem.

Outside the scope of this PR, I wonder if we should make opal_assert() that would behave exactly like assert(), but it has the option to drop a corefile or not (e.g., via MCA param or somesuch).

I'm really sensitive to this whole corefile issue because there have been many times over the years where MTT caused the shared filesystem I use for Cisco MTT to fill up with terrabytes/petabytes/whateverbytes of corefiles. My colleagues get (rightfully) pissed at me when this happens.

It could be helpful if one could optionally turn on corefiles from opal_assert() because this would let me leave ulimit -c unlimited to catch the random failures where a corefile can definitely be helpful, but are far harder to isolate and reproduce. If we get large-scale opal_assert() failures and a corefile is needed for further analysis, it would seem statistically probable that running again with core dumps enabled would yield what is needed.

@bwbarrett
Copy link
Member

Nathan's right; there's not a great way for OMPI to decide whether to drop a core file or not. That's really a system setting (via ulimit). We use abort() or assert() or similar so liberally in the OMPI code base, that this seems like a really weird place to hold the line.

@rhc54
Copy link
Contributor

rhc54 commented Mar 6, 2021

Given that Cisco isn't running MTT and the folks who are running MTT approve this PR, let's separate the question of what to do with abort (dealing with it in #8557) and get this in so MTT quits failing for those who are running it.

@rhc54 rhc54 merged commit 1049307 into open-mpi:master Mar 6, 2021
@jsquyres
Copy link
Member

jsquyres commented Mar 8, 2021

Given that Cisco isn't running MTT and the folks who are running MTT approve this PR, let's separate the question of what to do with abort (dealing with it in #8557) and get this in so MTT quits failing for those who are running it.

Point of fact: Cisco is running MTT, but it's currently in "trial" mode (i.e., results are not visible by default) because I had some issues with networking and equipment being powered off. Also, I objected to the use of abort() on this PR for two reasons, not just the "filling up of filesystems" reason. I'll reply more on #8557.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants