Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OPAL_ASSERT() macro to wrap assert(). #8557

Closed
wants to merge 2 commits into from

Conversation

awlauria
Copy link
Contributor

@awlauria awlauria commented Mar 5, 2021

This will prevent the generation of core files on assert() when run
with '--mca opal_enable_assert_core 1'

Signed-off-by: Austen Lauria [email protected]

@awlauria awlauria force-pushed the opal_assert branch 5 times, most recently from 76c43ca to 0fcf2a4 Compare March 5, 2021 22:40
@awlauria awlauria requested a review from jsquyres March 5, 2021 22:41
@awlauria
Copy link
Contributor Author

awlauria commented Mar 5, 2021

Done, I made the default to 'off.'

Also added the same functionality for abort().

@ibm-ompi
Copy link

ibm-ompi commented Mar 5, 2021

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/23ba4f0fad1b6c9b525d91b68d4ab97e

@awlauria
Copy link
Contributor Author

awlauria commented Mar 5, 2021

I think that should about do it.

@awlauria
Copy link
Contributor Author

awlauria commented Mar 5, 2021

bot:ibm:retest

@awlauria awlauria force-pushed the opal_assert branch 2 times, most recently from 0740b72 to 7906e9c Compare March 6, 2021 00:33
Copy link
Contributor

@rhc54 rhc54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confess I have to agree with @bosilca comment: #8556 (comment) for two reasons:

  1. this change doesn't fully resolve the problem raised by @jsquyres. Specifically, we will still generate terabytes of core files if the application or daemon segfaults, which has been the most common source of the problem. How does this bandaid actually resolve the full problem?
  2. I don't understand why simply setting the ulimit doesn't suffice. It is a trivial solution that was created by the Linux community precisely for this problem, and it would (unlike this PR) fully resolve the problem Why not utilize it?

Until we really have good answers to those two questions, I can't see my way to approving this PR.

@awlauria
Copy link
Contributor Author

awlauria commented Mar 6, 2021

bot:aws:retest

@awlauria
Copy link
Contributor Author

awlauria commented Mar 6, 2021

@bosilca @rhc54 this is default to off. So 99.9% of users will not be wondering why a core is not generated, unless they explicitly run with this mca parameter. Also, in the case where they did run with this --mca, if an assert or abort is hit, it tells the users to re-run without this mca to generate a core (so they shouldn't be confused, unless that output is somehow lost).

@jsquyres can you comment why you can't set ulimit to 0? Perhaps this is a shared system with others who presumably don't want this to be set to 0?

I did this as a result of discussion of #8493

So if we don't want this, we should merge 8493 immediately. Though honestly I don't see any downside to this change.

@awlauria
Copy link
Contributor Author

awlauria commented Mar 6, 2021

@rhc54 you are correct in that it won't solve this problem for a segv, but this does fix the abort() and assert() case, which I believe @jsquyres was hitting in his MTT (please correct me if I am wrong, however). If we also want to cover the case of an internal segmentation fault, more work would have to be done.

This will prevent the generation of core files on assert() when run
with '--mca opal_enable_assert_core 1'

Signed-off-by: Austen Lauria <[email protected]>
This will prevent the generation of core files on abort() when run
with '--mca opal_enable_assert_core 1'

Signed-off-by: Austen Lauria <[email protected]>
@rhc54
Copy link
Contributor

rhc54 commented Mar 6, 2021

shared system with others who presumably don't want this to be set to 0?

ulimit is set on a per-user basis. All @jsquyres has to do is create an "mttuser" whose default login sets ulimit to block core files

he abort() and assert() case, which I believe @jsquyres was hitting in his MTT (please correct me if I am wrong, however)

The problem Jeff has had in the past is with segfaults, not abort/assert. He shut off his MTT runs because the one-sided tests were segfaulting and generating tons of core files. This PR does nothing to solve that problem.

honestly I don't see any downside to this change.

Not a downside per-se, but it does introduce a lot of extra abstraction that serves no discernible purpose. It would be better if we could get my two questions addressed and then see if this makes any sense.

So if we don't want this, we should merge 8493 immediately.

I fail to understand the linkage here. If this discussion is the only thing holding #8493, then we should merge it immediately. @jsquyres isn't running any MTT at this time, so the entire core file question is moot right now - and will be until he decides to restart his MTT runs.

@bwbarrett
Copy link
Member

bwbarrett commented Mar 8, 2021

If we're worried about environment configurations and dropping core files, it seems way easier to just call struct rlimit nocore = {0, 0}; setrlimit(RLIMIT_CORE, &nocore); somewhere early in MPI_INIT based on an MCA parameter?

I agree with Ralph and George; this is not the way to solve the core file problem and having yet another wrapper for simple system functions is long term maintenance pain.

@awlauria
Copy link
Contributor Author

awlauria commented Mar 8, 2021

I'm not sure I understand the complexity. It's very simple and easily accessible/searchable in the code.

Understood that this may not completely solve the core file issue, and there are other approaches to the problem.
However, that said, this is still a worth-while change in that this allows us to more easily optimize the assert() call. It opens up several possible optimizations that are fairly easy, and I can append them to this PR if desired:

  • Compile out assert in optimized builds (hard hammer approach)
  • Compile out asserts in release tarball builds.
  • Have a configure option to optimize out asserts (such as --no-assert), but leave in all builds by default.
  • Replace assert with __builtin_expect + __builtin_trap if they are available (this combo generates fewer assembly while still generating a core). We can combine this with any of the above.

Yeah it's nit-picky, but it is very easy for someone to overlook an assert in a critical code path as being a no-op. Yes it's 'just an assert', but there is overhead there.

Unless this or similar is already done in optimized builds - I would need to check (it may be already).

@rhc54
Copy link
Contributor

rhc54 commented Mar 8, 2021

Unless this or similar is already done in optimized builds - I would need to check (it may be already).

Last I checked, assert is defined out when optimizing. Like I said, this seems like a lot of extra code that doesn't really add anything. Doesn't harm anything - just hard to understand why we are doing this.

@awlauria
Copy link
Contributor Author

awlauria commented Mar 8, 2021

Good to know on it being optimized out.

The main point of the PR was to address @jsquyres concerns/suggestions to optionally generate an assert using an mca parameter. I just went ahead and implemented it since it was easy enough to do.

@bwbarrett
Copy link
Member

In proof that we really have built everything already, we already have an MCA parameter to avoid dropping a core file. --mca opal_set_max_sys_limits core:0 will set the core size limit to 0 and prevent dropping core files in an abort, assert, or any of the other errors that may drop a core file.

It's likely @jsquyres forgot about the sys_limits when he brought up this proposal. I think we should close this PR and encourage Jeff to use the sys_limits MCA parameter to accomplish the same thing, without all the unexpected side effects.

@jsquyres
Copy link
Member

jsquyres commented Mar 8, 2021

No, I didn't forget about ulimit / sys limits. 😄

There's multiple issues at play here.

This PR came up because of my comments on #8493 (comment). I was objecting to the use of abort() on that PR, particularly for run-time issues (e.g., running out of memory). I.e., legit run-time errors that are not Open MPI coding errors can end up dropping core, and I think that's anti-social.

Specifically, I objected to these uses of abort() on that PR for 2 reasons:

  1. Such behavior can lead to lots of corefiles, even in MTT scenarios, and even for end users.
  2. For at least some of the run-time errors on btl/base: add subsystem to support for active-message based RDMA/atomic emulation #8493, they didn't necessarily indicate a coding error in Open MPI -- so we should return an MPI error (not just abort the job), and let the user decide what to do. I.e., I don't think we should be calling abort() at all in at least some of the cases on btl/base: add subsystem to support for active-message based RDMA/atomic emulation #8493.

For MTT types of scenarios, there's two kinds of corefile-inducing errors that generally show up:

  1. Something that is very repeatable -- it happens every run. MPI one-sided functionality had a segv for months, and caused me to shut down Cisco MTT until it was fixed.
    • Note: this particular error was a segv, not an assert() fail or an explicit abort().
  2. Something that only happens periodically -- it may even be difficult to reproduce.

For the first reason, it would be great to be able to disable corefiles in the abort() or assert() failure cases. I don't need my filesystem filling up with corefiles for something that is easily repeatable. Also, consider users who run at scale -- what if they drop a corefile into a shared filesystem for every MPI process?

Yes, I hear you saying "but just use ulimit -c 0!". 😄

For the second reason, I have Cisco's MTT set to ulimit -c unlimited. There have definitely been times when it has been useful to have a corefile to go back and look at after the fact.

One of the main values of Cisco's MTT has been scale: it runs 10s of thousands of tests. If you run tests enough times, even rare things happen. Meaning: I don't want repeatable corefiles. If it's repeatable, a dev can run the job again and observe the error. But I do want corefiles for the difficult-to-find errors.

My suggestion about being able to disable abort() or assert() failure corefiles was to support this worldview: keep ulimit -c unlimited so that I can still get corefiles for the difficult cases, but not get corefiles for the simple/easy/repeatable cases.

I still object to the uses of abort() on #8493, but that wasn't the entire reason for @awlauria making this PR.

@rhc54
Copy link
Contributor

rhc54 commented Mar 8, 2021

But Jeff, you cannot know when an error is going to be easily repeatable and when it is going to be something you want to keep. All you can do is:

  • default to not having corefiles dumped to protect yourself in case something is going to blow up every test
  • once you have identified a specific situation that you want to investigate, then run that specific test with corefiles enabled.

The second step is always going to be a manual one as you cannot predict when it is going to be needed. So this whole discussion doesn't make any sense in practice. What am I missing? How are you going to know when to enable corefiles in advance of running MTT, and how are you going to control it on a test-by-test basis (and why couldn't you do it with an envar for that test)?

@bosilca
Copy link
Member

bosilca commented Mar 8, 2021

@jsquyres the solution proposed by @bwbarrett, aka. mpirun --mca opal_set_max_sys_limits core:0, seem to be exactly what you need. Or are we missing something?

@awlauria awlauria closed this Mar 8, 2021
@awlauria
Copy link
Contributor Author

awlauria commented Mar 8, 2021

Closed per discussion on RM call.

@awlauria awlauria deleted the opal_assert branch March 16, 2021 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants