Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: add support for parallel debuggers with wreckrun #16

Merged
merged 11 commits into from
Jun 3, 2019

Conversation

grondo
Copy link
Contributor

@grondo grondo commented May 22, 2019

Still a work in progress (needs testing), but I thought it might be useful to open a PR since I'll be out the next few days.

This PR is in support of #12. It adds the --jobid option to wreckrun as well as simplified support for FLUX_WRECKRUN_JOBID_FD as needed by the proposed flux job-debug.

@dongahn
Copy link
Member

dongahn commented May 22, 2019

This is useful @grondo for coordination! Do you plan to backport PID support in your branch? If not, I can also do that as well.

@codecov-io
Copy link

codecov-io commented May 22, 2019

Codecov Report

Merging #16 into master will decrease coverage by 1.68%.
The diff coverage is 74.28%.

@@            Coverage Diff             @@
##           master      #16      +/-   ##
==========================================
- Coverage   81.87%   80.18%   -1.69%     
==========================================
  Files         325      196     -129     
  Lines       52209    35131   -17078     
==========================================
- Hits        42745    28170   -14575     
+ Misses       9464     6961    -2503
Impacted Files Coverage Δ
src/modules/wreck/wrexecd.c 77.61% <74.28%> (+0.69%) ⬆️
src/modules/barrier/barrier.c 76.55% <0%> (-2.07%) ⬇️
src/common/libflux/mrpc.c 87.55% <0%> (-1.21%) ⬇️
src/common/libflux/message.c 81.27% <0%> (-0.37%) ⬇️
src/broker/module.c 82.47% <0%> (-0.27%) ⬇️
src/modules/kvs/kvs.c 66.56% <0%> (-0.15%) ⬇️
src/bindings/lua/tests/zmsg-test.c
t/kvs/watch.c
t/kvs/issue1876.c
src/common/libzio/test/zio.c
... and 129 more

@grondo grondo force-pushed the wreckrun-job-debug-support branch 2 times, most recently from 01dc877 to d3e8aff Compare May 28, 2019 22:45
grondo added 6 commits May 30, 2019 10:44
If FLUX_WRECKRUN_JOBID_FD is set in environment, write jobid to
this file descriptor as soon as the jobid is known and close the fd.

This allows synchronization with an external command executing
wreckrun as a child process.
Add a wreck:setopt() method to allow setting options manually
in wreckrun or submit.
Add a new --jobid option to flux-wreckrun to allow running a
job across the same ranks as a previous job. By default, ntasks
is set such that one task per rank is executed.
Problem: The --detach option did not exit wreckrun when a job
scheduler is used (or without --immediate), however it still resulted
in disabling I/O, which is very confusing.

When jobs are being scheduled, treat --detach just like
--wait-until=runrequest

Fixes flux-framework#10
When stop-children-in-exec option is used, write a "proctable"
entry in KVS for each task for parallel debugger support.
@grondo grondo force-pushed the wreckrun-job-debug-support branch 2 times, most recently from 9302bf7 to 6c29c6a Compare May 31, 2019 22:59
@grondo
Copy link
Contributor Author

grondo commented May 31, 2019

I've push a couple more commits here to add basic testing and documentation of the new wreckrun --jobid option.

Given our dwindling emphasis on this older version, I'm not sure our time is best spent creating exhaustive tests for this feature, and perhaps we can think about merging this PR as is. What do you think @dongahn?

@dongahn
Copy link
Member

dongahn commented May 31, 2019

Yes we want to minimize our effort to this branch. I just used job-debug on the OpenMPI problem and that worked seamlessly! So I will minimize my CI testing effort there too. (BTW we pretty much tracked this down to a bug in OMPI).

@grondo
Copy link
Contributor Author

grondo commented May 31, 2019

You may want to first rebase your PR on top of this branch and ensure something I've done here doesn't inadvertently cause you trouble. If that goes well, then let's merge this PR and then you can rebase the flux job-debug PR on the new master.

@grondo
Copy link
Contributor Author

grondo commented May 31, 2019

I just used job-debug on the OpenMPI problem and that worked seamlessly!

Nice work! 👍

@grondo
Copy link
Contributor Author

grondo commented Jun 1, 2019

Hm, I forgot to also test procdesc functionality. That shouldn't take too long so I'll add that here quickly

@dongahn
Copy link
Member

dongahn commented Jun 1, 2019

Ok. But procdesc should be exercised and tested with job-debug tests. So even if you don't have tests for them, that will have some coverage.

@grondo
Copy link
Contributor Author

grondo commented Jun 1, 2019

Ok I found a couple of other fixes needed in wreck, and added just 2 sanity checks for proctable support. Later I'll expand on the commit messages and do a double check, but I think this PR will be ready to go (assuming it even passes travis)

@grondo grondo force-pushed the wreckrun-job-debug-support branch from 2375c93 to 722a32d Compare June 1, 2019 20:15
doc/man1/flux-wreckrun.adoc Outdated Show resolved Hide resolved
@dongahn
Copy link
Member

dongahn commented Jun 3, 2019

@grondo: this looks great! Extremely helpful to enable debugging. I checked one minor issue as part of my review. When that's changed (and the "WIP" is dropped from the title), I will push the merge button. And then, I will rebase PR #17 to that.

grondo added 5 commits June 3, 2019 20:11
Add short description of the new flux-wreckrun --jobid option.
Add basic sanity check for flux wreckrun --jobid.
Problem: To release jobs from the "sync" state a SIGCONT signal
needs to be sent, but flux-wreck kill will only kill a job in the
running state.

Allow jobs in "sync" state as well as "running" to be sent a signal.
Track when a job has been stopped in exec in the "sync" state and
transition to "running" when SIGCONT is sent.
Add a very simple set of tests to ensure that wreck jobs emit
per-task procdesc to kvs on '-o stop-children-in-exec' and the
wreck.<id>.proctable event.
@grondo grondo force-pushed the wreckrun-job-debug-support branch from 722a32d to 4290ddf Compare June 3, 2019 20:12
@grondo grondo changed the title WIP: wreck: add support for parallel debuggers with wreckrun wreck: add support for parallel debuggers with wreckrun Jun 3, 2019
@grondo
Copy link
Contributor Author

grondo commented Jun 3, 2019

Thanks @dongahn! I just force-pushed a fix for the doc bug you found, and removed "WIP" from the title.

@dongahn dongahn merged commit d9f514d into flux-framework:master Jun 3, 2019
@grondo
Copy link
Contributor Author

grondo commented Jun 3, 2019

Thanks!

@grondo grondo deleted the wreckrun-job-debug-support branch June 3, 2019 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants