flux-job: add totalview_jobid support and misc. fixes #3130
This PR should be ready for your review. It has been tested with LaunchMON (LLNL/LaunchMON#51), STAT and TotalView, and these changes should improve support for those debugging tools. This should be merged before flux-framework/flux-docs#55.
@Mergifyio rebase
@dongahn, if you added any new tests, you may have a broken &&-chain. Run
the test with --chain-lint in case that is the problem
…On Fri, Aug 14, 2020, 11:12 PM Dong H. Ahn ***@***.***> wrote:
Any idea why this fails in the CI?
[image: Screen Shot 2020-08-14 at 11 11 05 PM]
<https://user-images.githubusercontent.com/862123/90306577-857f0e80-de83-11ea-982f-40e5145e9055.png>
Codecov Report
@@ Coverage Diff @@
## master #3130 +/- ##
==========================================
- Coverage 81.18% 81.15% -0.03%
==========================================
Files 286 286
Lines 44538 44544 +6
==========================================
- Hits 36159 36151 -8
- Misses 8379 8393 +14
Finally everything is green.
LGTM! A couple very minor suggestions inline, but feel free to take or leave them.
One question: do we still need both --debug and the hidden --debug-emulate option because --debug-emulate also sends SIGCONT to tasks after the sync event?
src/cmd/flux-job.c
Outdated
if (optparse_hasopt (ctx.p, "debug-emulate"))
    MPIR_being_debugged = 1;
if (MPIR_being_debugged)
if (optparse_hasopt (ctx.p, "debug"))
    MPIR_being_debugged = 1;
if (MPIR_being_debugged) {
Very minor suggestion: would it be clearer if these 3 separate conditionals were collapsed into one?
E.g.:
if (optparse_hasopt (ctx.p, "debug")
|| optparse_hasopt (ctx.p, "debug-emulate")) {
MPIR_being_debugged = 1;
// Rest of MPIR_being_debugged block...
}
Yes good suggestion for better readability. I will change it.
BTW, the "// Rest of MPIR_being_debugged block..." part shouldn't go inside the conditional, because in the real use case MPIR_being_debugged is set by the debugger when no --debug* option is given.
Ah, that is a subtlety I hadn't considered. Thanks!
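For anyone following this thread, here is a minimal sketch of the structure being discussed. It is not the code in src/cmd/flux-job.c: the helper name mpir_should_setup and its boolean parameters are hypothetical stand-ins for the optparse_hasopt checks. The point is only that the option test and the MPIR_being_debugged test stay separate, so the MPIR block still runs when a debugger sets the variable itself.

```c
#include <stdbool.h>

/* MPIR_being_debugged comes from the MPIR interface; a debugger may
 * write to it directly in the starter process's address space. */
volatile int MPIR_being_debugged = 0;

/* Hypothetical helper (not from flux-job.c): the two booleans stand in
 * for optparse_hasopt (ctx.p, "debug") and
 * optparse_hasopt (ctx.p, "debug-emulate").
 * Returns 1 if the MPIR setup block should run, 0 otherwise. */
int mpir_should_setup (bool opt_debug, bool opt_debug_emulate)
{
    /* Either option turns the flag on ... */
    if (opt_debug || opt_debug_emulate)
        MPIR_being_debugged = 1;

    /* ... but the flag is tested on its own, so a value written by an
     * attached debugger (with no --debug* option) is also honored. */
    if (MPIR_being_debugged)
        return 1;
    return 0;
}
```

With neither option given, the helper returns 1 only if something else (in practice, the debugger) has already set MPIR_being_debugged.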
t/t2611-debug-emulate.t
Outdated
parse_totalview_jobid() {
outfile=$1 &&
jobid=$(cat ${outfile} | grep totalview_jobid | \
awk '{ print $2 }' | awk -F= '{ print $2 }') &&
Missing tab and space here?
I typically start the wrapped line with no tab. But yes if this is a convention used in flux-core, I will change it.
t/t2611-debug-emulate.t
Outdated
flux_job_attach() {
flux job attach -vv --debug ${1} 2> ${2} &
${waitfile} -v -t 2.5 --pattern="totalview_jobid" ${2}
}
The rest of the test file uses 8-space indents, might as well be consistent here.
Exactly!
Problem: Debuggers such as TotalView and STAT require a symbol named totalview_jobid in the address space of the starter process, which they read and use as the target jobid for bulk tool daemon launching. Add totalview_jobid of type char* and fill it with the jobid as a C string when MPIR_being_debugged is set.
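As a rough illustration of the change described above (not the actual patch), the sketch below fills a global totalview_jobid with a string form of the jobid only when MPIR_being_debugged is set. The helper name set_totalview_jobid, the use of uint64_t for the jobid, and the decimal formatting are all assumptions made for this example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

volatile int MPIR_being_debugged = 0;

/* Global symbol that tools look up in the starter process. */
char *totalview_jobid = NULL;

/* Hypothetical helper: store the jobid as a NUL-terminated C string.
 * Assumes a 64-bit unsigned jobid printed in decimal.
 * Returns 0 on success (or when not being debugged), -1 on failure. */
int set_totalview_jobid (uint64_t id)
{
    int len;
    char *s;

    if (!MPIR_being_debugged)
        return 0;                       /* nothing to publish */

    /* First pass computes the length, second pass writes the string. */
    len = snprintf (NULL, 0, "%" PRIu64, id);
    if (len < 0 || !(s = malloc ((size_t)len + 1)))
        return -1;
    snprintf (s, (size_t)len + 1, "%" PRIu64, id);

    free (totalview_jobid);             /* drop any previous value */
    totalview_jobid = s;
    return 0;
}
```

Because totalview_jobid is an ordinary global with external linkage, a tool that attaches to the starter process can find it by symbol name and read the string.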
Problem: When a debugger fails to attach to a job that is not running, flux-job currently prints "flux-job: Invalid job state (INACTIVE) for debugging". The message is not end-user friendly. Use "cannot debug job that isn't running" instead. Don't include errno since it is not informative here.
Problem: Debuggers behave much better when they attach to a running job by attaching to a starter process. Currently, this can be achieved by starting flux job attach with the --debug-emulate switch so that it becomes the starter process that debuggers interface with, but it is a hidden option and, perhaps more importantly, it has a side effect: it sends an additional SIGCONT to the job to facilitate our testing. Add a --debug flag as a user-visible option. Make it enable MPIR in the same way as --debug-emulate, but without sending the additional SIGCONT to the job.
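The difference between the two options can be summarized in a small sketch. This is only an illustration under assumed names (struct debug_actions and debug_actions_for are hypothetical and do not appear in flux-job.c): both options enable MPIR, and only --debug-emulate adds the extra SIGCONT.

```c
#include <stdbool.h>

/* Hypothetical summary of what each option turns on. */
struct debug_actions {
    bool enable_mpir;    /* act as the MPIR starter process */
    bool send_sigcont;   /* extra SIGCONT to the job (testing aid) */
};

/* The parameters stand in for whether --debug and --debug-emulate
 * were given on the flux job attach command line. */
struct debug_actions debug_actions_for (bool opt_debug,
                                        bool opt_debug_emulate)
{
    struct debug_actions a;

    a.enable_mpir = opt_debug || opt_debug_emulate;
    a.send_sigcont = opt_debug_emulate;   /* never set by --debug alone */
    return a;
}
```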
Make sure flux job attach --debug generates totalview_jobid in the same way as --debug-emulate option but does not continue the target parallel program.
@grondo: Addressed all the comments and took the liberty of squashing them.
Ok, feel free to set MWP @dongahn!
Add support for totalview_jobid and the new --debug option to flux-job, and other misc. fixes. Fixes #3108, #3110 and #2331.