Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-job: add totalview_jobid support and misc. fixes #3130

Merged
merged 5 commits into from
Aug 21, 2020

Conversation

dongahn
Copy link
Member

@dongahn dongahn commented Aug 11, 2020

Add support for totalview_jobid and the new --debug option to flux-job and other misc. fixes.

Fixes #3108, #3110 and #2331.

@dongahn
Copy link
Member Author

dongahn commented Aug 15, 2020

This PR should be ready for your review. This has been tested with LaunchMON (LLNL/LaunchMON#51), STAT and totalview. These changes should improve those debugging tools.

This should go in first before flux-framework/flux-docs#55.

@dongahn
Copy link
Member Author

dongahn commented Aug 15, 2020

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Aug 15, 2020

Command rebase: success

Branch has been successfully rebased

@dongahn
Copy link
Member Author

dongahn commented Aug 15, 2020

Any idea why this fails in the CI?

Screen Shot 2020-08-14 at 11 11 05 PM

@grondo
Copy link
Contributor

grondo commented Aug 15, 2020 via email

@codecov-commenter
Copy link

Codecov Report

Merging #3130 into master will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3130      +/-   ##
==========================================
- Coverage   81.18%   81.15%   -0.03%     
==========================================
  Files         286      286              
  Lines       44538    44544       +6     
==========================================
- Hits        36159    36151       -8     
- Misses       8379     8393      +14     
Impacted Files Coverage Δ
src/cmd/flux-job.c 85.31% <100.00%> (+0.07%) ⬆️
src/common/libpmi/simple_client.c 86.33% <0.00%> (-3.28%) ⬇️
src/common/libpmi/pmi.c 91.50% <0.00%> (-1.89%) ⬇️
src/broker/module.c 75.39% <0.00%> (-1.59%) ⬇️
src/broker/modservice.c 70.67% <0.00%> (-0.76%) ⬇️
src/modules/resource/acquire.c 65.06% <0.00%> (-0.69%) ⬇️
src/broker/broker.c 74.00% <0.00%> (-0.11%) ⬇️
src/common/libsubprocess/subprocess.c 87.47% <0.00%> (+0.32%) ⬆️
src/common/libsubprocess/local.c 79.93% <0.00%> (+0.34%) ⬆️
src/broker/runat.c 84.58% <0.00%> (+0.39%) ⬆️

@dongahn
Copy link
Member Author

dongahn commented Aug 20, 2020

Finally everything is green.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! A couple very minor suggestions inline, but feel free to take or leave them.

One question, we still need both --debug and hidden --debug-emulate options because debug-emulate also sends SIGCONT to tasks after sync event?

Comment on lines 2070 to 2074
if (optparse_hasopt (ctx.p, "debug-emulate"))
MPIR_being_debugged = 1;
if (MPIR_being_debugged)
if (optparse_hasopt (ctx.p, "debug"))
MPIR_being_debugged = 1;
if (MPIR_being_debugged) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor suggestion: would it be clearer if these 3 separate conditionals were collapsed into one?
E.g.:

    if (optparse_hasopt (ctx.p, "debug")
        || optparse_hasopt (ctx.p, "debug-emulate")) {
        MPIR_being_debugged = 1;
        // Rest of MPIR_being_debugged block...
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good suggestion for better readability. I will change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW,

// Rest of MPIR_being_debugged block...

This shouldn't go into the conditional. Because a real case is when this is set by the debugger when --debug* is not given.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that is a subtlety I hadn't considered. Thanks!

parse_totalview_jobid() {
outfile=$1 &&
jobid=$(cat ${outfile} | grep totalview_jobid | \
awk '{ print $2 }' | awk -F= '{ print $2 }') &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing tab and space here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I typically start the wrapped line with no tab. But yes if this is a convention used in flux-core, I will change it.

Comment on lines 89 to 92
flux_job_attach() {
flux job attach -vv --debug ${1} 2> ${2} &
${waitfile} -v -t 2.5 --pattern="totalview_jobid" ${2}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the test file uses 8-space indents, might as well be consistent here.

@dongahn
Copy link
Member Author

dongahn commented Aug 21, 2020

One question, we still need both --debug and hidden --debug-emulate options because debug-emulate also sends SIGCONT to tasks after sync event?

Exactly!

Problem: Debuggers such as TotalView and STAT require
a symbol named totalview_jobid in the process address
space of the starter process to fetch and to use this
target jobid for bulk tool daemon launching.

Add totalview_jobid of char* type and fill
it with the jobid of C-string when MPIR_being_debug
is set.
Problem: When a debugger fails to attach to a job
that is not running, flux-job currently prints
"flux-job: Invalid job state (INACTIVE) for debugging".
The message is not end user friendly.

Use "cannot debug job that isn't running."

Don't use errno since it is informative.
Problem: Debuggers behave much better when they attach to a
running job by attaching to a starter process.
Currently, this can be achieved by starting flux job attach
with the --debug-emulate switch to become the starter
process that debuggers interface,
but it is a hidden option and perhaps more importatnly
this also has a side effect:
sending an additional SIGCONT to the job
to faciliate our testing.

Add --debug flag as user visible option.

Make it enable MPIR in the same way as --debug-emulate
but does not send the additional SIGCONT to
the job.
Make sure flux job attach --debug generates
totalview_jobid in the same way as --debug-emulate
option but does not continue the target
parallel program.
@dongahn
Copy link
Member Author

dongahn commented Aug 21, 2020

@grondo: Addressed all the comments and took liberty squashing them.

@grondo
Copy link
Contributor

grondo commented Aug 21, 2020

Ok, feel free to set MWP @dongahn!

@mergify mergify bot merged commit 96b3edc into flux-framework:master Aug 21, 2020
@garlick garlick mentioned this pull request Mar 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

True attach mode for debugging a running job
3 participants