Add OPAL_ASSERT() macro to wrap assert(). #8557
Done, I made the default 'off'. I also added the same functionality for abort().
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/23ba4f0fad1b6c9b525d91b68d4ab97e
I think that should about do it.
bot:ibm:retest
I confess I have to agree with @bosilca's comment: #8556 (comment) for two reasons:
- this change doesn't fully resolve the problem raised by @jsquyres. Specifically, we will still generate terabytes of core files if the application or daemon segfaults, which has been the most common source of the problem. How does this bandaid actually resolve the full problem?
- I don't understand why simply setting the ulimit doesn't suffice. It is a trivial solution that was created by the Linux community precisely for this problem, and it would (unlike this PR) fully resolve the problem. Why not utilize it?
Until we really have good answers to those two questions, I can't see my way to approving this PR.
bot:aws:retest
@bosilca @rhc54 this defaults to off, so 99.9% of users will not be wondering why a core is not generated unless they explicitly run with this MCA parameter. Also, if they did run with this --mca option and an assert or abort is hit, the output tells them to re-run without it to generate a core (so they shouldn't be confused, unless that output is somehow lost). @jsquyres can you comment on why you can't set ulimit to 0? Perhaps this is a shared system with others who presumably don't want this to be set to 0? I did this as a result of the discussion on #8493. So if we don't want this, we should merge 8493 immediately. Though honestly I don't see any downside to this change.
@rhc54 you are correct in that it won't solve this problem for a segv, but this does fix the abort() and assert() case, which I believe @jsquyres was hitting in his MTT (please correct me if I am wrong, however). If we also want to cover the case of an internal segmentation fault, more work would have to be done. |
This will prevent the generation of core files on assert() when run with '--mca opal_enable_assert_core 1' Signed-off-by: Austen Lauria <[email protected]>
This will prevent the generation of core files on abort() when run with '--mca opal_enable_assert_core 1' Signed-off-by: Austen Lauria <[email protected]>
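For readers who have not opened the diff, the idea behind these two commits can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: `SKETCH_ASSERT`, `sketch_assert_fail`, and `sketch_core_on_failure` are hypothetical names, and the flag stands in for whatever the MCA parameter would set at run time.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical runtime switch; in the PR this role is played by an MCA
 * parameter, whose registration is not shown here. */
static bool sketch_core_on_failure = true;

/* Report a failed check, then either abort() (raises SIGABRT, which may
 * dump core) or leave via _Exit() (no signal, so no core file). */
static void sketch_assert_fail(const char *expr, const char *file, int line)
{
    fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    fflush(stderr);
    if (sketch_core_on_failure) {
        abort();
    }
    _Exit(EXIT_FAILURE);
}

/* Evaluates expr exactly once. Unlike assert(), this sketch is not
 * removed when NDEBUG is defined. */
#define SKETCH_ASSERT(expr) \
    ((expr) ? (void) 0 : sketch_assert_fail(#expr, __FILE__, __LINE__))
```

A wrapper of this kind only changes what happens on a failed assertion or an explicit abort. A crash that arrives as SIGSEGV never passes through it.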
ulimit is set on a per-user basis. All @jsquyres has to do is create an "mttuser" whose default login sets ulimit to block core files.
The problem Jeff has had in the past is with segfaults, not abort/assert. He shut off his MTT runs because the one-sided tests were segfaulting and generating tons of core files. This PR does nothing to solve that problem.
Not a downside per se, but it does introduce a lot of extra abstraction that serves no discernible purpose. It would be better if we could get my two questions addressed and then see if this makes any sense.
I fail to understand the linkage here. If this discussion is the only thing holding #8493, then we should merge it immediately. @jsquyres isn't running any MTT at this time, so the entire core file question is moot right now - and will be until he decides to restart his MTT runs. |
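The ulimit approach discussed above also has a programmatic counterpart in POSIX: a process can lower its own core-file size limit with `setrlimit(RLIMIT_CORE, ...)`, which is what the shell's `ulimit -c 0` does. The sketch below is an illustration under that assumption, and `sketch_disable_core_files` is a hypothetical helper name, not something from the Open MPI code base.

```c
#include <sys/resource.h>

/* Lower this process's soft core-file size limit to zero.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
static int sketch_disable_core_files(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        return -1;
    }
    rl.rlim_cur = 0; /* soft limit only; the hard limit is left untouched */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Because resource limits are inherited across fork() and preserved across exec(), a launcher that calls this before starting children would pass the limit on to them. It applies regardless of whether the crash comes from abort(), assert(), or a segfault.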
If we're worried about environment configurations and dropping core files, it seems way easier to just call I agree with Ralph and George; this is not the way to solve the core file problem and having yet another wrapper for simple system functions is long term maintenance pain. |
I'm not sure I understand the complexity. It's very simple and easily accessible/searchable in the code. Understood that this may not completely solve the core file issue, and there are other approaches to the problem.
Yeah it's nit-picky, but it is very easy for someone to overlook an assert in a critical code path as being a no-op. Yes it's 'just an assert', but there is overhead there. Unless this or similar is already done in optimized builds - I would need to check (it may be already). |
Last I checked, |
Good to know on it being optimized out. The main point of the PR was to address @jsquyres concerns/suggestions to optionally generate an assert using an mca parameter. I just went ahead and implemented it since it was easy enough to do. |
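On the point about asserts being optimized out: in standard C, when `NDEBUG` is defined at the point where `<assert.h>` is included, `assert(expr)` expands to `((void)0)` and the expression is not evaluated at all. Whether a given Open MPI build defines `NDEBUG` depends on how it was configured, so treat the sketch below as a demonstration of the language rule only. The `sketch_` names are made up for illustration.

```c
#define NDEBUG /* often passed as -DNDEBUG in optimized builds */
#include <assert.h>

static int sketch_calls = 0;

int sketch_check(void)
{
    sketch_calls++;
    return 0; /* would fail the assertion if it were evaluated */
}

/* Returns how many times the asserted expression was evaluated. */
static int sketch_run(void)
{
    assert(sketch_check()); /* expands to ((void)0) while NDEBUG is defined */
    return sketch_calls;
}

/* Re-including <assert.h> without NDEBUG restores the checking assert(). */
#undef NDEBUG
#include <assert.h>
```

With `NDEBUG` in effect, `sketch_run()` returns without ever calling `sketch_check()`, so the assert costs nothing at run time.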
In proof that we really have built everything already, we already have an MCA parameter to avoid dropping a core file. It's likely @jsquyres forgot about the sys_limits when he brought up this proposal. I think we should close this PR and encourage Jeff to use the sys_limits MCA parameter to accomplish the same thing, without all the unexpected side effects. |
No, I didn't forget about ulimit / sys limits. 😄 There's multiple issues at play here. This PR came up because of my comments on #8493 (comment). I was objecting to the use of Specifically, I objected to these uses of
For MTT types of scenarios, there's two kinds of corefile-inducing errors that generally show up:
For the first reason, it would be great to be able to disable corefiles in the Yes, I hear you saying "but just use For the second reason, I have Cisco's MTT set to One of the main values of Cisco's MTT has been scale: it runs 10s of thousands of tests. If you run tests enough times, even rare things happen. Meaning: I don't want repeatable corefiles. If it's repeatable, a dev can run the job again and observe the error. But I do want corefiles for the difficult-to-find errors. My suggestion about being able to disable I still object to the uses of |
But Jeff, you cannot know when an error is going to be easily repeatable and when it is going to be something you want to keep. All you can do is:
The second step is always going to be a manual one as you cannot predict when it is going to be needed. So this whole discussion doesn't make any sense in practice. What am I missing? How are you going to know when to enable corefiles in advance of running MTT, and how are you going to control it on a test-by-test basis (and why couldn't you do it with an envar for that test)? |
@jsquyres the solution proposed by @bwbarrett, aka. |
Closed per discussion on RM call.