submit/job/jsc: propagate gpu request information #1480

Merged: 2 commits from dongahn:gpu_support into flux-framework:master on Apr 20, 2018
Conversation

@dongahn (Member) commented Apr 20, 2018

I would appreciate it if this could be reviewed and merged soon. I'm having issues with some of the tests in my GPU-support flux-sched PR (flux-framework/flux-sched#313), and having this merged into master should help narrow that down.

This is a slightly modified version of @TWRS' change.

Add -g to wreck to allow for requesting GPUs.
Propagate that request information to the scheduler through the job module and jsc.

@coveralls commented Apr 20, 2018

Coverage Status

Coverage decreased (-0.05%) to 79.023% when pulling 07d6f2a on dongahn:gpu_support into f8aae6c on flux-framework:master.

@codecov-io commented Apr 20, 2018

Codecov Report

Merging #1480 into master will decrease coverage by 0.05%.
The diff coverage is 81.25%.

@@            Coverage Diff             @@
##           master    #1480      +/-   ##
==========================================
- Coverage   78.77%   78.72%   -0.06%     
==========================================
  Files         164      164              
  Lines       30328    30355      +27     
==========================================
+ Hits        23891    23896       +5     
- Misses       6437     6459      +22
Impacted Files                   Coverage Δ
src/modules/wreck/job.c          73.19% <50%> (ø) ⬆️
src/common/libjsc/jstatctl.c     74.26% <83.33%> (+0.25%) ⬆️
src/common/libflux/content.c     86.66% <0%> (-3.34%) ⬇️
src/common/libflux/ev_flux.c     97.56% <0%> (-2.44%) ⬇️
src/common/libflux/attr.c        90% <0%> (-2%) ⬇️
src/common/libflux/rpc.c         92.56% <0%> (-1.66%) ⬇️
src/cmd/flux-event.c             67.74% <0%> (-1.08%) ⬇️
src/connectors/local/local.c     87.4% <0%> (-0.75%) ⬇️
src/bindings/lua/flux-lua.c      81.58% <0%> (-0.69%) ⬇️
src/common/libflux/future.c      88.31% <0%> (-0.47%) ⬇️
... and 11 more

@grondo (Contributor) left a comment

Just a couple of necessary changes. I'm surprised the issue with the pack/unpack format wasn't caught by any tests.

If you'd like, I'll throw together a basic sanity test for the --ngpus option and push it onto this PR?

@@ -289,6 +290,11 @@ function wreck:parse_cmdline (arg)
self.ncores = self.ntasks
end

self.ngpus = 0
@grondo (Contributor):

I don't think this line is necessary due to

ngpus = self.ngpus or 0

below. However, it doesn't hurt, so it can stay for now.

@@ -180,6 +180,7 @@ static flux_future_t *send_create_event (flux_t *h, struct wreck_job *job)
"ntasks", job->ntasks,
"ncores", job->ncores,
"nnodes", job->nnodes,
"ngpus", job->ngpus,
@grondo (Contributor):

The extra s:i specifier for the "ngpus", job->ngpus pair is missing from the flux_msg_pack() format string here.
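
For context, here is a minimal standalone sketch of why the extra specifier matters, written against plain jansson (the same pack grammar that flux_msg_pack() uses); the field values are invented for the example. With only three s:i tokens in the format, the trailing "ngpus", value arguments are never consumed and the field is silently dropped from the packed object.

#include <stdio.h>
#include <stdlib.h>
#include <jansson.h>

int main (void)
{
    /* One s:i token per "key", value pair: four pairs, four tokens. */
    json_t *o = json_pack ("{s:i s:i s:i s:i}",
                           "ntasks", 2,
                           "ncores", 4,
                           "nnodes", 1,
                           "ngpus", 4);
    if (!o) {
        fprintf (stderr, "json_pack failed\n");
        return 1;
    }
    char *s = json_dumps (o, JSON_COMPACT);
    printf ("%s\n", s);    /* all four members, including ngpus */
    free (s);
    json_decref (o);
    return 0;
}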

@@ -399,6 +400,7 @@ static void job_create_cb (flux_t *h, flux_msg_handler_t *w,
"ntasks", &job->ntasks,
"nnodes", &job->nnodes,
"ncores", &job->ncores,
"ngpus", &job->ngpus,
@grondo (Contributor):

An extra s?i specifier (or s?:i for consistency) is required for the "ngpus", &job->ngpus pair in the flux_request_unpack() format string here, as above.

@dongahn (Member, Author):

Heh... I think this explains the weird sched failure I was getting yesterday.

I have to ask, though: why "s?" and not "s"?

@grondo (Contributor):

In the past, these request messages only included members that were actually set by the user; if not set, they were assumed to be zero.

It may make sense now to make these members of the message required.
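
To illustrate the difference, a small self-contained example using plain jansson (the same unpack grammar that flux_request_unpack() uses; the "?" unpack modifier needs jansson 2.8 or newer); the request payload is invented. With s:i the key is required and unpacking fails when it is missing, while with s?i the key is optional and the target variable is left untouched when absent, which is why the variable is initialized to zero first.

#include <stdio.h>
#include <jansson.h>

int main (void)
{
    /* Invented request payload from an older client that predates -g,
     * so it has no "ngpus" member.
     */
    json_t *req = json_pack ("{s:i s:i s:i}",
                             "ntasks", 2, "nnodes", 1, "ncores", 4);
    int ntasks = 0, nnodes = 0, ncores = 0;
    int ngpus = 0;    /* s?i leaves this untouched if the key is absent */

    if (!req)
        return 1;
    /* "s:i" marks a required key; "s?i" marks an optional one, so the
     * old request still unpacks cleanly and ngpus keeps its default.
     */
    if (json_unpack (req, "{s:i s:i s:i s?i}",
                     "ntasks", &ntasks,
                     "nnodes", &nnodes,
                     "ncores", &ncores,
                     "ngpus", &ngpus) < 0) {
        fprintf (stderr, "unpack failed\n");
        json_decref (req);
        return 1;
    }
    printf ("ntasks=%d nnodes=%d ncores=%d ngpus=%d\n",
            ntasks, nnodes, ncores, ngpus);
    json_decref (req);
    return 0;
}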

@@ -43,6 +43,7 @@ local default_opts = {
['help'] = { char = 'h' },
['verbose'] = { char = 'v' },
['ntasks'] = { char = 'n', arg = "N" },
['ngpus'] = { char = 'g', arg = "g" },
@grondo (Contributor):

Unfortunately the options structure in the Lua code is not auto-documenting like optparse, so documentation of the --ngpus option should also be added to wreck:usage().

Also consider adding it to the wreckrun and submit manpages, but unfortunately those need a full editing pass, so we can save that for later if you'd like.

@dongahn (Member, Author) commented Apr 20, 2018

Thanks @grondo for the review. The pack/unpack is my bad -- a copy & paste error. I will get to those suggestions.

@grondo (Contributor) commented Apr 20, 2018

@dongahn, it may be helpful to add these sanity-check tests to t2000-wreck.t along with this PR. If you want, I can push this commit directly on top of your branch here:

diff --git a/t/t2000-wreck.t b/t/t2000-wreck.t
index bac86ac..a5c780d 100755
--- a/t/t2000-wreck.t
+++ b/t/t2000-wreck.t
@@ -209,6 +209,18 @@ test_expect_success 'wreckrun: -t2 -N${SIZE} sets correct ntasks in kvs' '
        n=$(flux kvs get --json ${LWJ}.ntasks) &&
        test "$n" = $((${SIZE}*2))
 '
+test_expect_success 'wreckrun: ngpus is 0 by default' '
+       flux wreckrun -n 2 /bin/true &&
+       LWJ=$(last_job_path) &&
+       n=$(flux kvs get --json ${LWJ}.ngpus) &&
+       test "$n" = "0"
+'
+test_expect_success 'wreckrun: -g, --ngpus sets ngpus in kvs' '
+       flux wreckrun -n 2 -g 4 /bin/true &&
+       LWJ=$(last_job_path) &&
+       n=$(flux kvs get --json ${LWJ}.ngpus) &&
+       test "$n" = "4"
+'
 
 test_expect_success 'wreckrun: fallback to old rank.N.cores format works' '
        flux wreckrun -N2 -n2 \

@grondo (Contributor) commented Apr 20, 2018

> Thanks @grondo for the review. The pack/unpack is my bad -- a copy & paste error. I will get to those suggestions.

No problem; it was unfortunate we didn't get your PR merged before the job module was rewritten.

I also noticed that the ngpus field isn't pulled from the request message by flux_request_unpack() in job_submit_only(). This is why I had suggested moving the unpack calls into a single place in the wreck_job interface. Sorry about that.

@garlick (Member) commented Apr 20, 2018

> I also noticed that the ngpus field isn't pulled from the request message by flux_request_unpack() in job_submit_only(). This is why I had suggested moving the unpack calls into a single place in the wreck_job interface. Sorry about that.

My bad for not following through. As we just discussed, let's add this here and get it merged; then I'll follow up with a PR to move the pack/unpack into wreck_job as originally proposed, so we don't have this code duplication going forward.
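
As a rough illustration of that follow-up, here is a hypothetical shape for the consolidation: a single helper owns the unpack format and field list, and both job_create_cb() and job_submit_only() would call it, so a new field like ngpus only has to be added once. The struct layout, field types, helper name, the required-versus-optional split, and the use of plain jansson instead of flux_request_unpack() are all assumptions made for the sketch, not the actual wreck_job interface.

#include <jansson.h>

/* Assumed field types; the real struct wreck_job lives in the wreck
 * job module sources.
 */
struct wreck_job {
    int ntasks;
    int nnodes;
    int ncores;
    int ngpus;
};

/* Hypothetical helper: unpack a job request payload in one place so the
 * create and submit-only paths cannot drift apart.
 * Returns 0 on success, -1 on failure.
 */
int wreck_job_unpack_request (struct wreck_job *job, json_t *request)
{
    /* Optional members default to zero when absent from the request. */
    job->nnodes = 0;
    job->ncores = 0;
    job->ngpus = 0;
    return json_unpack (request, "{s:i s?i s?i s?i}",
                        "ntasks", &job->ntasks,
                        "nnodes", &job->nnodes,
                        "ncores", &job->ncores,
                        "ngpus", &job->ngpus);
}

In the real module the unpacking would of course go through flux_request_unpack() on the request message rather than a bare json_t, but the required/optional semantics of the format string are the same.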

@dongahn (Member, Author) commented Apr 20, 2018

> @dongahn, it may be helpful to add these sanity-check tests to t2000-wreck.t along with this PR. If you want, I can push this commit directly on top of your branch here.

I will take care of it. I actually have GPU scheduling tests in sched using the butte hwloc XML + RDL, so this sanity check should do it.

@dongahn (Member, Author) commented Apr 20, 2018

> it was unfortunate we didn't get your PR merged before the job module was rewritten.

> My bad for not following through.

I had some issues with the old PR, plus time constraints, that prevented it from going in. Not your fault at all!

@dongahn (Member, Author) commented Apr 20, 2018

@TWRS: is the -g submit option supposed to request N GPUs per node or a total of N GPUs? From the code, it looks to me like it requests N GPUs per node. Should the long option be named --gpus-per-node?

@dongahn (Member, Author) commented Apr 20, 2018

Or maybe N GPUs per task?

dongahn added 2 commits April 20, 2018 14:52

Augmented version of @TWRS' change.

Add -g to wreck to allow for requesting GPUs.
Propagate that request information to the scheduler through the job module and jsc.
@dongahn (Member, Author) commented Apr 20, 2018

Hmmm, one CI configuration failed with a timeout. Kicking it again.

@dongahn (Member, Author) commented Apr 20, 2018

I also took the liberty of implementing -g as N GPUs per task.

@grondo (Contributor) commented Apr 20, 2018

Ok, looks like you addressed all my comments. Thanks!

@grondo merged commit 28de35f into flux-framework:master on Apr 20, 2018
@grondo (Contributor) commented Apr 20, 2018

Merged now in hopes that it helps you make progress on your flux-sched PR. If we need to change the behavior of -g, --gpus-per-task, we can do that later.

@dongahn (Member, Author) commented Apr 20, 2018

Great! Thank you, this is helpful!

@grondo mentioned this pull request on May 10, 2018