Add OPAL_ASSERT() macro to wrap assert(). #8557

awlauria · 2021-03-05T21:44:51Z

This will prevent the generation of core files on assert() when run
with '--mca opal_enable_assert_core 1'

Signed-off-by: Austen Lauria [email protected]

awlauria · 2021-03-05T22:41:51Z

Done; I made the default 'off'.

Also added the same functionality for abort().
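
For readers without the diff at hand, here is a minimal sketch of what a runtime-switchable assert wrapper could look like. It is not the code from this PR: the names `SKETCH_ASSERT`, `sketch_suppress_assert_core`, and `sketch_exit_without_core` are hypothetical, and a plain global variable stands in for the MCA parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an MCA parameter; 0 (off) leaves assert() untouched. */
static int sketch_suppress_assert_core = 0;

/* Report the failed check and leave through exit(), so no SIGABRT is raised. */
static void sketch_exit_without_core(const char *expr, const char *file, int line)
{
    fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    fprintf(stderr, "core file suppressed; turn the switch off and re-run to get one\n");
    exit(EXIT_FAILURE);
}

/* Wrap assert(): when the switch is on, a failed check exits cleanly instead. */
#define SKETCH_ASSERT(cond)                                          \
    do {                                                             \
        if (sketch_suppress_assert_core && !(cond)) {                \
            sketch_exit_without_core(#cond, __FILE__, __LINE__);     \
        }                                                            \
        assert(cond);                                                \
    } while (0)
```

With the switch off, a failed check falls through to the ordinary assert() and its SIGABRT. With the switch on, the condition is evaluated twice, so in this sketch it must be free of side effects.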

ibm-ompi · 2021-03-05T22:48:04Z

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/23ba4f0fad1b6c9b525d91b68d4ab97e

awlauria · 2021-03-05T23:12:01Z

I think that should about do it.

awlauria · 2021-03-05T23:12:11Z

bot:ibm:retest

rhc54

I confess I have to agree with @bosilca's comment: #8556 (comment) for two reasons:

1. This change doesn't fully resolve the problem raised by @jsquyres. Specifically, we will still generate terabytes of core files if the application or daemon segfaults, which has been the most common source of the problem. How does this band-aid actually resolve the full problem?
2. I don't understand why simply setting the ulimit doesn't suffice. It is a trivial solution that was created by the Linux community precisely for this problem, and it would (unlike this PR) fully resolve the problem. Why not utilize it?

Until we really have good answers to those two questions, I can't see my way to approving this PR.

awlauria · 2021-03-06T21:10:57Z

bot:aws:retest

awlauria · 2021-03-06T21:14:55Z

@bosilca @rhc54 this defaults to off, so 99.9% of users will never wonder why a core was not generated unless they explicitly run with this MCA parameter. And if they did run with it and an assert or abort is hit, the output tells them to re-run without the parameter to generate a core (so they shouldn't be confused, unless that output is somehow lost).

@jsquyres can you comment on why you can't set ulimit to 0? Perhaps this is a shared system with others who presumably don't want this to be set to 0?

I did this as a result of the discussion on #8493.

So if we don't want this, we should merge 8493 immediately. Though honestly I don't see any downside to this change.

awlauria · 2021-03-06T21:19:32Z

@rhc54 you are correct in that it won't solve this problem for a segv, but this does fix the abort() and assert() case, which I believe @jsquyres was hitting in his MTT (please correct me if I am wrong, however). If we also want to cover the case of an internal segmentation fault, more work would have to be done.

rhc54 · 2021-03-06T21:41:49Z

> shared system with others who presumably don't want this to be set to 0?

ulimit is set on a per-user basis. All @jsquyres has to do is create an "mttuser" whose default login sets ulimit to block core files.

> the abort() and assert() case, which I believe @jsquyres was hitting in his MTT (please correct me if I am wrong, however)

The problem Jeff has had in the past is with segfaults, not abort/assert. He shut off his MTT runs because the one-sided tests were segfaulting and generating tons of core files. This PR does nothing to solve that problem.

> honestly I don't see any downside to this change.

Not a downside per se, but it does introduce a lot of extra abstraction that serves no discernible purpose. It would be better if we could get my two questions addressed and then see if this makes any sense.

> So if we don't want this, we should merge 8493 immediately.

I fail to understand the linkage here. If this discussion is the only thing holding #8493, then we should merge it immediately. @jsquyres isn't running any MTT at this time, so the entire core file question is moot right now - and will be until he decides to restart his MTT runs.

bwbarrett · 2021-03-08T03:26:30Z

If we're worried about environment configurations and dropping core files, it seems way easier to just call struct rlimit nocore = {0, 0}; setrlimit(RLIMIT_CORE, &nocore); somewhere early in MPI_INIT based on an MCA parameter?
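
A minimal sketch of that idea, assuming the MCA parameter has already been read into a plain integer flag. The function name `sketch_apply_core_policy` and its `disable_core` argument are hypothetical; `setrlimit()`, `struct rlimit`, and `RLIMIT_CORE` are the standard POSIX interfaces named above.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Set the core file size limit to zero when disable_core is non-zero.
 * Returns 0 on success or when nothing was requested, -1 if setrlimit() fails.
 */
static int sketch_apply_core_policy(int disable_core)
{
    struct rlimit nocore = {0, 0};   /* soft limit 0, hard limit 0 */

    if (!disable_core) {
        return 0;                    /* leave the inherited limit alone */
    }
    return setrlimit(RLIMIT_CORE, &nocore);
}
```

Because the limit applies to the whole process, this covers segfaults as well as abort() and assert(). Lowering the hard limit to zero cannot be undone later by an unprivileged process, so a variant that lowers only the soft limit may be preferable if the limit should be restorable.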

I agree with Ralph and George; this is not the way to solve the core file problem and having yet another wrapper for simple system functions is long term maintenance pain.

awlauria · 2021-03-08T14:45:26Z

I'm not sure I understand the complexity. It's very simple and easily accessible/searchable in the code.

Understood that this may not completely solve the core file issue, and there are other approaches to the problem.
That said, this is still a worthwhile change in that it allows us to more easily optimize the assert() call. It opens up several possible optimizations that are fairly easy, and I can append them to this PR if desired:

- Compile out assert in optimized builds (hard hammer approach).
- Compile out asserts in release tarball builds.
- Have a configure option to optimize out asserts (such as --no-assert), but leave them in all builds by default.
- Replace assert with __builtin_expect + __builtin_trap if they are available (this combo generates less assembly while still generating a core). We can combine this with any of the above.
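
The last option can be sketched as follows, assuming a GCC- or Clang-compatible compiler. The macro name `SKETCH_ASSERT` and the helper `sketch_checked_div` are hypothetical and exist only for illustration; `__builtin_expect` and `__builtin_trap` are the compiler built-ins named in the list.

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
/* Tell the compiler the failure branch is unlikely, and trap if it is taken. */
#define SKETCH_ASSERT(cond)                      \
    do {                                         \
        if (__builtin_expect(!(cond), 0)) {      \
            __builtin_trap();                    \
        }                                        \
    } while (0)
#else
/* Fall back to the standard macro where the built-ins are unavailable. */
#define SKETCH_ASSERT(cond) assert(cond)
#endif

/* Example use on a hot path: guard a divisor without a call to a failure handler. */
static int sketch_checked_div(int num, int den)
{
    SKETCH_ASSERT(den != 0);
    return num / den;
}
```

The trade-off is that `__builtin_trap()` terminates the program abnormally without printing the expression, file, or line that assert() would report. Whether a core file results still depends on the process's core limit.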

Yeah it's nit-picky, but it is very easy for someone to overlook an assert in a critical code path as being a no-op. Yes it's 'just an assert', but there is overhead there.

Unless this or similar is already done in optimized builds - I would need to check (it may be already).

rhc54 · 2021-03-08T15:02:14Z

> Unless this or similar is already done in optimized builds - I would need to check (it may be already).

Last I checked, assert is defined out when optimizing. Like I said, this seems like a lot of extra code that doesn't really add anything. Doesn't harm anything - just hard to understand why we are doing this.
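
The standard C mechanism for this is the `NDEBUG` macro: when it is defined before `<assert.h>` is included, `assert(expr)` expands to nothing and `expr` is never evaluated. The sketch below shows how to observe that; the helper names are hypothetical, and whether Open MPI's optimized builds define `NDEBUG` is taken from the comment above rather than verified here.

```c
#include <assert.h>

static int sketch_calls = 0;

/* Counts how often it is evaluated; always returns a positive value. */
static int sketch_touch(void)
{
    return ++sketch_calls;
}

/* Returns 1 if assert() is live in this build, 0 if NDEBUG compiled it out. */
static int sketch_assert_is_active(void)
{
    sketch_calls = 0;
    assert(sketch_touch() > 0);   /* evaluated only when NDEBUG is not defined */
    return sketch_calls;
}
```

Building the same file with and without `-DNDEBUG` flips the result, which is one quick way to check what a given build configuration actually does.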

awlauria · 2021-03-08T15:19:36Z

Good to know on it being optimized out.

The main point of the PR was to address @jsquyres's concerns/suggestions about optionally generating a core file on assert via an MCA parameter. I just went ahead and implemented it since it was easy enough to do.

bwbarrett · 2021-03-08T15:52:01Z

In proof that we really have built everything already, we already have an MCA parameter to avoid dropping a core file. --mca opal_set_max_sys_limits core:0 will set the core size limit to 0 and prevent dropping core files in an abort, assert, or any of the other errors that may drop a core file.

It's likely @jsquyres forgot about the sys_limits when he brought up this proposal. I think we should close this PR and encourage Jeff to use the sys_limits MCA parameter to accomplish the same thing, without all the unexpected side effects.

jsquyres · 2021-03-08T16:53:22Z

No, I didn't forget about ulimit / sys limits. 😄

There are multiple issues at play here.

This PR came up because of my comments on #8493 (comment). I was objecting to the use of abort() on that PR, particularly for run-time issues (e.g., running out of memory). I.e., legit run-time errors that are not Open MPI coding errors can end up dropping core, and I think that's anti-social.

Specifically, I objected to these uses of abort() on that PR for 2 reasons:

1. Such behavior can lead to lots of corefiles, even in MTT scenarios, and even for end users.
2. For at least some of the run-time errors on #8493, they didn't necessarily indicate a coding error in Open MPI -- so we should return an MPI error (not just abort the job), and let the user decide what to do. I.e., I don't think we should be calling abort() at all in at least some of the cases on #8493.
   - I'm surprised the ULFM folks didn't chime in here (@bosilca @abouteiller)

For MTT types of scenarios, there are two kinds of corefile-inducing errors that generally show up:

1. Something that is very repeatable -- it happens every run. MPI one-sided functionality had a segv for months, and caused me to shut down Cisco MTT until it was fixed.
   - Note: this particular error was a segv, not an assert() fail or an explicit abort().
2. Something that only happens periodically -- it may even be difficult to reproduce.

For the first kind, it would be great to be able to disable corefiles in the abort() or assert() failure cases. I don't need my filesystem filling up with corefiles for something that is easily repeatable. Also, consider users who run at scale -- what if they drop a corefile into a shared filesystem for every MPI process?

Yes, I hear you saying "but just use ulimit -c 0!". 😄

For the second kind, I have Cisco's MTT set to ulimit -c unlimited. There have definitely been times when it has been useful to have a corefile to go back and look at after the fact.

One of the main values of Cisco's MTT has been scale: it runs tens of thousands of tests. If you run tests enough times, even rare things happen. Meaning: I don't want repeatable corefiles. If it's repeatable, a dev can run the job again and observe the error. But I do want corefiles for the difficult-to-find errors.

My suggestion about being able to disable abort() or assert() failure corefiles was to support this worldview: keep ulimit -c unlimited so that I can still get corefiles for the difficult cases, but not get corefiles for the simple/easy/repeatable cases.

I still object to the uses of abort() on #8493, but that wasn't the entire reason for @awlauria making this PR.

rhc54 · 2021-03-08T16:58:53Z

But Jeff, you cannot know when an error is going to be easily repeatable and when it is going to be something you want to keep. All you can do is:

1. Default to not having corefiles dumped, to protect yourself in case something is going to blow up every test.
2. Once you have identified a specific situation that you want to investigate, run that specific test with corefiles enabled.

The second step is always going to be a manual one as you cannot predict when it is going to be needed. So this whole discussion doesn't make any sense in practice. What am I missing? How are you going to know when to enable corefiles in advance of running MTT, and how are you going to control it on a test-by-test basis (and why couldn't you do it with an envar for that test)?

bosilca · 2021-03-08T17:21:24Z

@jsquyres the solution proposed by @bwbarrett, i.e. mpirun --mca opal_set_max_sys_limits core:0, seems to be exactly what you need. Or are we missing something?

awlauria · 2021-03-08T20:52:00Z

Closed per discussion on RM call.

awlauria force-pushed the opal_assert branch 5 times, most recently from 76c43ca to 0fcf2a4 Compare March 5, 2021 22:40

awlauria requested a review from jsquyres March 5, 2021 22:41

awlauria force-pushed the opal_assert branch from 0fcf2a4 to 8b52946 Compare March 5, 2021 22:44

awlauria force-pushed the opal_assert branch from 8b52946 to 2583f4a Compare March 5, 2021 23:05

awlauria requested a review from rhc54 March 5, 2021 23:11

awlauria force-pushed the opal_assert branch 2 times, most recently from 0740b72 to 7906e9c Compare March 6, 2021 00:33

rhc54 requested changes Mar 6, 2021

awlauria added 2 commits March 6, 2021 16:26

Add OPAL_ASSERT() macro to wrap assert().

b6a55ac

This will prevent the generation of core files on assert() when run with '--mca opal_enable_assert_core 1' Signed-off-by: Austen Lauria <[email protected]>

Add OPAL_ABORT() macro to wrap abort().

bd7b32a

This will prevent the generation of core files on abort() when run with '--mca opal_enable_assert_core 1' Signed-off-by: Austen Lauria <[email protected]>

awlauria force-pushed the opal_assert branch from 7906e9c to bd7b32a Compare March 6, 2021 21:27

rhc54 mentioned this pull request Mar 6, 2021

btl/base: add subsystem to support for active-message based RDMA/atomic emulation #8493

Merged

awlauria closed this Mar 8, 2021

awlauria deleted the opal_assert branch March 16, 2021 19:58
