submit/job/jsc: propagate gpu request information #1480
Codecov Report
@@ Coverage Diff @@
## master #1480 +/- ##
==========================================
- Coverage 78.77% 78.72% -0.06%
==========================================
Files 164 164
Lines 30328 30355 +27
==========================================
+ Hits 23891 23896 +5
- Misses 6437 6459 +22
Just a couple necessary changes. I'm surprised the issue with the pack/unpack
format wasn't caught by any tests.
If you'd like I'll throw together a basic sanity test for the --ngpus
option and push it onto this PR?
src/bindings/lua/wreck.lua
@@ -289,6 +290,11 @@ function wreck:parse_cmdline (arg)
self.ncores = self.ntasks
end

self.ngpus = 0
I don't think this line is necessary due to `ngpus = self.ngpus or 0` below. However, it doesn't hurt so it can stay for now.
@@ -180,6 +180,7 @@ static flux_future_t *send_create_event (flux_t *h, struct wreck_job *job)
"ntasks", job->ntasks,
"ncores", job->ncores,
"nnodes", job->nnodes,
"ngpus", job->ngpus,
The extra `s:i` parameter for the `"ngpus", job->ngpus` pair is missing in the `flux_msg_pack()` format here.
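To make the mismatch concrete: in a jansson-style format string, each `s:i` token consumes one key argument and one integer argument, so passing an extra `"ngpus", job->ngpus` pair without a matching token leaves the format and the argument list out of step. The sketch below is only an illustration under a simplified grammar (it recognizes `s:i`, `s?i` and `s?:i` and nothing else). `count_int_pairs` and `format_matches_args` are hypothetical helpers, not part of flux-core or jansson.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a flux-core or jansson function): count the
 * key/integer pairs in a simplified jansson-style format string such as
 * "{s:i s:i s?:i}".  Only the tokens s:i, s?i and s?:i are recognized. */
static int count_int_pairs (const char *fmt)
{
    int pairs = 0;

    assert (fmt != NULL);
    for (const char *p = fmt; *p != '\0'; p++) {
        if (*p != 's')
            continue;
        const char *q = p + 1;
        if (*q == '?')          /* optional-key marker, used when unpacking */
            q++;
        if (*q == ':')          /* ':' is a cosmetic separator */
            q++;
        if (*q == 'i') {        /* integer value completes the pair */
            pairs++;
            p = q;
        }
    }
    return pairs;
}

/* Returns 1 if the format describes exactly nkeys key/integer pairs,
 * 0 otherwise. */
static int format_matches_args (const char *fmt, int nkeys)
{
    return count_int_pairs (fmt) == nkeys;
}
```

With four key/value pairs passed by the caller, `format_matches_args ("{s:i s:i s:i}", 4)` reports a mismatch, while `format_matches_args ("{s:i s:i s:i s:i}", 4)` does not. A check like this is only a debugging aid; the real fix is the one requested above, adding the missing token to the format.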
@@ -399,6 +400,7 @@ static void job_create_cb (flux_t *h, flux_msg_handler_t *w,
"ntasks", &job->ntasks,
"nnodes", &job->nnodes,
"ncores", &job->ncores,
"ngpus", &job->ngpus,
An extra `s?i` parameter (or `s?:i` for consistency) is required for the `"ngpus", &job->ngpus` pair in the format string for `flux_request_unpack()` here, as above.
Heh, I think this explains the weird sched failure I was getting yesterday.
I have to ask though: why `s?` and not `s`?
In the past these request messages only included members that were actually set by the user. If not set they were assumed zero.
It may make sense now to have these members of this message be required.
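The difference between an optional (`s?`) and a required (`s`) member can be sketched with a toy lookup. This is not how jansson or `flux_request_unpack()` is implemented; `struct kv` and `unpack_int` are hypothetical names used only to show the behavior described above, where an unset member is left at its zero default instead of failing the request.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one member of a decoded request payload. */
struct kv {
    const char *key;
    int value;
};

/* Toy model of unpacking a single integer member.
 * optional != 0 mimics "s?i": a missing key is not an error and *out
 * keeps whatever value the caller initialized it with.
 * optional == 0 mimics "s:i": a missing key makes the unpack fail.
 * Returns 0 on success, -1 on failure. */
static int unpack_int (const struct kv *msg, size_t len,
                       const char *key, int optional, int *out)
{
    assert (msg != NULL && key != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (strcmp (msg[i].key, key) == 0) {
            *out = msg[i].value;
            return 0;
        }
    }
    return optional ? 0 : -1;
}
```

For a payload that carries only ntasks and ncores, an optional lookup of "ngpus" succeeds and leaves a zero-initialized field at 0, whereas a required lookup returns -1. Making the members required, as suggested above, would turn a forgotten field into an explicit error.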
src/bindings/lua/wreck.lua
@@ -43,6 +43,7 @@ local default_opts = {
['help'] = { char = 'h' },
['verbose'] = { char = 'v' },
['ntasks'] = { char = 'n', arg = "N" },
['ngpus'] = { char = 'g', arg = "g" },
Unfortunately the options structure in the Lua code is not auto-documenting like optparse. Documentation of the `--ngpus` option should also be added to `wreck:usage()`.
Also consider adding it to the wreckrun and submit manpages, though unfortunately I think those need a full editing pass, so we can save that for later if you'd like.
Thanks @grondo for the review. The pack/unpack is my bad -- copy & paste error... Will get to those suggestions.
@dongahn, it may be helpful to add these sanity-check tests to t/t2000-wreck.t:
diff --git a/t/t2000-wreck.t b/t/t2000-wreck.t
index bac86ac..a5c780d 100755
--- a/t/t2000-wreck.t
+++ b/t/t2000-wreck.t
@@ -209,6 +209,18 @@ test_expect_success 'wreckrun: -t2 -N${SIZE} sets correct ntasks in kvs' '
n=$(flux kvs get --json ${LWJ}.ntasks) &&
test "$n" = $((${SIZE}*2))
'
+test_expect_success 'wreckrun: ngpus is 0 by default' '
+ flux wreckrun -n 2 /bin/true &&
+ LWJ=$(last_job_path) &&
+ n=$(flux kvs get --json ${LWJ}.ngpus) &&
+ test "$n" = "0"
+'
+test_expect_success 'wreckrun: -g, --ngpus sets ngpus in kvs' '
+ flux wreckrun -n 2 -g 4 /bin/true &&
+ LWJ=$(last_job_path) &&
+ n=$(flux kvs get --json ${LWJ}.ngpus) &&
+ test "$n" = "4"
+'
test_expect_success 'wreckrun: fallback to old rank.N.cores format works' '
flux wreckrun -N2 -n2 \
No problem, it was unfortunate we didn't get your PR merged before the job module was rewritten. I also noticed that
My bad for not following through. As we just discussed, let's just add this here and get this merged, then I'll follow up with a PR to move the pack/unpack to
I will take care of it. I actually have GPU scheduling tests in sched with butte hwloc XML + RDL, so this sanity check should do it.
I had some issues with the old PR, plus time constraints, that prevented it from going in. Not your fault at all!
@TWRS: is
Or maybe N GPUs per task?
Hmmm. One CI configuration failed with a timeout. Kick it again.
I also took the liberty of implementing -g as N GPUs per task.
Ok, looks like you addressed all my comments. Thanks!
Merged now in hopes it helps you make progress on your flux-sched PR. If we need to change the behavior of -g, --gpus-per-task we can do that later.
Great! Thank you, this is helpful!
I would appreciate it if this could be reviewed and merged soon. I'm having issues with some of the tests in my GPU support PR at flux-sched (flux-framework/flux-sched#313), and having this merged into master should help narrow that down.
This is a slightly modified version of @TWRS' change.
Add -g to wreck to allow for requesting GPUs.
Propagate that request information to the scheduler
through the job module and jsc.