
GPU scheduling support #313

Merged (5 commits) on Apr 22, 2018

Conversation

@dongahn (Member) commented Apr 13, 2018

Note that this will fail in Travis because it needs @trws's mod in flux-core. I will submit a PR to flux-core soon for that. This addresses the overscheduling issue (#311) that arises when running multiple ranks on a node.

@dongahn (Member Author) commented Apr 13, 2018

Just posted an experimental PR to flux-core: flux-framework/flux-core#1465

@dongahn requested a review from SteVwonder on April 13, 2018 23:49
@SteVwonder (Member) commented Apr 14, 2018

Just to confirm that I understand the impact of 17a7129: is the new default behavior when launching multiple brokers per node to oversubscribe the resources on the node (with the over-subscription factor == the number of brokers on the node)?

f542e78 LGTM!

@dongahn (Member Author) commented Apr 14, 2018

Thanks @SteVwonder for the quick review.

Just to confirm that I understand the impact of 17a7129: is the new default behavior when launching multiple brokers per node to oversubscribe the resources on the node (with the over-subscription factor == the number of brokers on the node)?

Depends.

  1. If we use hwloc reader mode only, yes, that will be the behavior. From the scheduler's point of view, it behaves as if it has F times the resources, where F is the oversubscription factor. This is good for testing (see the illustration below).

  2. If we use rdl reader mode, even if you have multiple ranks on the matching node, the scheduler will see only 1x the resources, but execution requests will be round-robined across those equivalent ranks.

  3. If we use rdl reader mode but the rdl isn't consistent with the hwloc data, resource loading will revert to hwloc reader mode, so the behavior will be the same as in 1.

Admittedly complex... but we have introduced lots of different modes over time...
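For example (illustrative numbers, not taken from this PR): with F = 4 brokers on a 16-core node, hwloc-reader-only mode presents the scheduler with 4 x 16 = 64 schedulable cores, while rdl reader mode still presents 16 cores and round-robins execution requests across the 4 equivalent ranks.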

@dongahn (Member Author) commented Apr 14, 2018

Thanks to @grondo, I now have a working butte rdl with 4 GPUs. So I took the liberty of adding his file and other tests around it. I believe this is ready for a full review.

Note this shouldn't go in until the experimental flux-core PR (flux-framework/flux-core#1465) lands.

@SteVwonder (Member) left a comment

Thanks @dongahn! This is a huge boost in functionality.

I just have two minor nits. They are totally optional given the importance of merging this PR.

sched/sched.c Outdated
Jadd_int64 (child_gpu, "req_qty", job->req->ngpus);
/* setting size == 1 devotes (all of) the gpu to the job */
Jadd_int64 (child_gpu, "req_size", 1);
/* setting exclusive to true prevents multiple jobs per core */
@SteVwonder (Member):

"core" -> "gpu"

@dongahn (Member Author):

Thanks. I will fix it.

@@ -276,10 +280,15 @@ int rsreader_hwloc_load (resrc_api_ctx_t *rsapi, const char *buf, size_t len,
const char *s = rs2rank_get_digest (sig);
if (!resrc_generate_hwloc_resources (rsapi, topo, s, err_str))
goto err;

free (aux);
@SteVwonder (Member):

I believe this free is redundant given the free at the end of the function.

@dongahn (Member Author):

Agreed. Will fix! Thanks.
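To spell out the nit above, here is a hypothetical sketch of the pattern (invented code, not the actual rsreader.c): an early free on the success path plus the common cleanup at the end of the function means the same pointer is freed twice, so dropping the early free is the right fix.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch, not the real rsreader_hwloc_load: aux is freed on
     * the success path and again at the common cleanup label, so the success
     * path double-frees it.  Dropping the early free fixes this. */
    static int load_with_aux (const char *suffix)
    {
        int rc = -1;
        char *aux = strdup (suffix);
        if (!aux)
            goto done;
        /* ... use aux while generating resources ... */
        free (aux);        /* redundant early free */
        rc = 0;
    done:
        free (aux);        /* common cleanup also frees aux: double free */
        return rc;
    }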


@dongahn (Member Author) commented Apr 20, 2018

@SteVwonder: I updated this PR against the new flux-core PR: flux-framework/flux-core#1480

But somehow t2002-easy.t is failing again. If you have a moment, can you take a brief look at this and see what might have gone wrong?

FAIL: t2002-easy.t 3 - jobs scheduled in correct order 

@dongahn (Member Author) commented Apr 21, 2018

FYI the flux-core PR has been merged, so looking into the test failure on easy backfill should be a bit easier...

@coveralls commented Apr 21, 2018

Coverage Status

Coverage increased (+0.2%) to 75.956% when pulling 28164da on dongahn:gpu_support into 511dbe0 on flux-framework:master.

@SteVwonder (Member):

I spent a few minutes on this before bed. I tried running make check to reproduce the error (using flux-core/master and flux-sched/gpu-support) and got:

ERROR: t0002-waitjob.t - missing test plan
ERROR: t0002-waitjob.t - exited with status 1
ERROR: t0003-basic-install.t - missing test plan
ERROR: t0003-basic-install.t - exited with status 1
ERROR: t0004-rdltool.t - missing test plan
ERROR: t0004-rdltool.t - exited with status 1
ERROR: t1000-jsc.t - missing test plan
ERROR: t1000-jsc.t - exited with status 1
ERROR: t1001-rs2rank-basic.t - missing test plan
ERROR: t1001-rs2rank-basic.t - exited with status 1
ERROR: t1002-rs2rank-64ranks.t - missing test plan
ERROR: t1002-rs2rank-64ranks.t - exited with status 1

I will dig into this further tomorrow morning.

@dongahn (Member Author) commented Apr 21, 2018

Hmmm. I haven't seen these errors. I checked the output from a CI test, and it seemed the only tests that failed are the emulator tests...

PASS: t2001-fcfs-aware.t 1 - sim: started successfully
FAIL: t2001-fcfs-aware.t 2 - sim: scheduled and ran all jobs
FAIL: t2001-fcfs-aware.t 3 - jobs scheduled in correct order
PASS: t2001-fcfs-aware.t 4 - sim: unloaded
ERROR: t2001-fcfs-aware.t - exited with status 1
PASS: t2002-easy.t 1 - sim: started successfully
PASS: t2002-easy.t 2 - sim: scheduled and ran all jobs
FAIL: t2002-easy.t 3 - jobs scheduled in correct order
PASS: t2002-easy.t 4 - sim: unloaded
ERROR: t2002-easy.t - exited with status 1
PASS: t2003-fcfs-inorder.t 1 - sim: started successfully with queue-depth=1
FAIL: t2003-fcfs-inorder.t 2 - sim: scheduled and ran all jobs with queue-depth=1
PASS: t2003-fcfs-inorder.t 3 - jobs scheduled in correct order with queue-depth=1
PASS: t2003-fcfs-inorder.t 4 - sim: unloaded
ERROR: t2003-fcfs-inorder.t - exited with status 1

@SteVwonder (Member):

The errors that I posted were my bad. I didn't have my luarocks module loaded, so lua.posix couldn't be imported.

I think I found out why the easy-backfill test is failing. It seems that the walltime for jobs is always 0:

JscEvent being queued - JSON: {"jobid": 1, "state-pair": {"ostate": 1, "nstate": 3}, "rdesc": {"nnodes": 1, "ntasks": 16, "ncores": 16, "ngpus": 0, "walltime": 0}}, errnum: 0
[cut]
JscEvent being queued - JSON: {"jobid": 2, "state-pair": {"ostate": 1, "nstate": 3}, "rdesc": {"nnodes": 100, "ntasks": 1600, "ncores": 1600, "ngpus": 0, "walltime": 0}}, errnum: 0
[cut]
JscEvent being queued - JSON: {"jobid": 3, "state-pair": {"ostate": 1, "nstate": 3}, "rdesc": {"nnodes": 100, "ntasks": 1600, "ncores": 1600, "ngpus": 0, "walltime": 0}}, errnum: 0

My guess is a problem in the simulator with JSON unpacking. If so, I should have this fixed shortly.

@dongahn (Member Author) commented Apr 21, 2018

@SteVwonder: oh great! Thank you for looking into this (and on a Saturday)!

@SteVwonder (Member):

Ok. Problem resolved with a small fix in JSC. There was a missing s:i in the unpack format string in the get_submit_jcb function. I created a PR in flux-core: flux-framework/flux-core#1482
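To illustrate the class of bug (hypothetical code below, not the literal get_submit_jcb from flux-core; the JSON keys mirror the rdesc fields in the log above): with Jansson-style unpacking, a key that is missing from the format string is simply never filled in, so the destination variable keeps its initial value of 0.

    #include <jansson.h>
    #include <stdio.h>

    int main (void)
    {
        /* a toy rdesc object, similar in shape to the log lines above */
        json_t *rdesc = json_pack ("{s:i s:i s:i}",
                                   "nnodes", 1, "ncores", 16, "walltime", 3600);
        int nnodes = 0, ncores = 0, walltime = 0;

        /* buggy: the "s:i" pair for walltime is missing from the format string */
        json_unpack (rdesc, "{s:i s:i}", "nnodes", &nnodes, "ncores", &ncores);
        printf ("buggy: walltime = %d\n", walltime);   /* prints 0 */

        /* fixed: add the missing "s:i" pair */
        json_unpack (rdesc, "{s:i s:i s:i}", "nnodes", &nnodes,
                     "ncores", &ncores, "walltime", &walltime);
        printf ("fixed: walltime = %d\n", walltime);   /* prints 3600 */

        json_decref (rdesc);
        return 0;
    }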

@@ -46,6 +46,7 @@ typedef struct {
double time_limit;
int nnodes;
int ncpus;
int ngpus;
@SteVwonder (Member):

One small thing that I noticed while looking into the easy backfilling failure: this variable is not initialized in the blank_job function in simulator.c (I should have called the function new_job; my apologies for the poor naming there). Do you mind adding/amending a commit to initialize this to 0 in blank_job?

@dongahn (Member Author):

No problem. Will force a push.

@dongahn (Member Author) commented Apr 21, 2018

Ok. Problem resolved with a small fix in JSC. There was a missing s:i in the unpack format string in the get_submit_jcb function. I created a PR in flux-core: flux-framework/flux-core#1482

Oops. Thank you for catching this so quickly.

@codecov-io commented Apr 21, 2018

Codecov Report

Merging #313 into master will increase coverage by 0.11%.
The diff coverage is 97.22%.


@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
+ Coverage   74.14%   74.25%   +0.11%     
==========================================
  Files          49       49              
  Lines        9511     9540      +29     
==========================================
+ Hits         7052     7084      +32     
+ Misses       2459     2456       -3
Impacted Files Coverage Δ
simulator/submitsrv.c 79% <ø> (ø) ⬆️
sched/sched.c 73.63% <100%> (+1.14%) ⬆️
sched/rs2rank.c 93.27% <100%> (-1.56%) ⬇️
simulator/simulator.c 90.79% <100%> (+0.03%) ⬆️
sched/rsreader.c 96.02% <87.5%> (-0.51%) ⬇️
sched/flux-waitjob.c 84.42% <0%> (-1.64%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 511dbe0...28164da.

@dongahn (Member Author) commented Apr 21, 2018

OK. pushed.

@@ -137,6 +137,7 @@ job_t *blank_job ()
job->time_limit = 0;
job->nnodes = 0;
job->ncpus = 0;
job->ngpu = 0;
@SteVwonder (Member) commented Apr 21, 2018

  CC       libflux_sim_la-simulator.lo
../../../simulator/simulator.c: In function ‘blank_job’:
../../../simulator/simulator.c:140:8: error: ‘job_t’ has no member named ‘ngpu’
     job->ngpu = 0;

Looks like Travis failed to compile on this line. This should be job->ngpus.
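For reference, here is how the corrected hunk would read once the typo is fixed (using the ngpus field added to the job_t struct earlier in this PR):

    job->time_limit = 0;
    job->nnodes = 0;
    job->ncpus = 0;
    job->ngpus = 0;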

@dongahn (Member Author) commented Apr 21, 2018

Working while fishing is never good... I will do this after I come home. Sorry.

dongahn added 4 commits April 21, 2018 21:27
In hwloc reader mode under multiple ranks on a node,
the hwloc data reported from these ranks are exactly identical.
In this configuration, rs2rank groups those ranks
into an equivalent set and round-robins across them
for execution requests.

This is the correct semantics when we use the rdl reader,
since we use this hwloc data only to link the resrc data
with those equivalent ranks. However, when we use
hwloc-reader-only mode, it is incorrect. We should rather
treat each rank as a distinct resource set
to facilitate testing.

Fix this issue by introducing an auxiliary field
to the signature input.

Allow each reported, identical hwloc data to
generate a different signature using this field.
Merge @trws's change and augment it.

Propagate the gpu request information received
from flux submit to the request input object
for scheduling.

gpu becomes a constraint for resrc's resource
type matching logic.
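To make the first commit message above concrete, here is a hypothetical sketch (make_signature and its arguments are invented for illustration; this is not the actual rs2rank code) of how mixing an auxiliary per-rank field into the signature input lets byte-identical hwloc reports yield distinct signatures in hwloc-reader-only mode while staying identical under the rdl reader:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: append a per-rank auxiliary string to the hwloc
     * digest before it is used as a signature.  With aux == NULL (rdl reader
     * mode) identical digests stay identical, so equivalent ranks are grouped
     * and round-robined; with a per-rank aux (hwloc-reader-only mode) each
     * rank gets a distinct signature and is scheduled as a distinct set. */
    static char *make_signature (const char *hwloc_digest, const char *aux)
    {
        size_t len = strlen (hwloc_digest) + (aux ? strlen (aux) + 1 : 0) + 1;
        char *sig = malloc (len);
        if (sig) {
            if (aux)
                snprintf (sig, len, "%s:%s", hwloc_digest, aux);
            else
                snprintf (sig, len, "%s", hwloc_digest);
        }
        return sig;
    }

    int main (void)
    {
        char *a = make_signature ("digestX", "0");   /* rank 0, hwloc reader */
        char *b = make_signature ("digestX", "1");   /* rank 1, hwloc reader */
        char *c = make_signature ("digestX", NULL);  /* rdl reader */
        printf ("%s %s %s\n", a, b, c);  /* digestX:0 digestX:1 digestX */
        free (a);
        free (b);
        free (c);
        return 0;
    }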
@dongahn (Member Author) commented Apr 22, 2018

@SteVwonder: Should be ready to go in. Thank you for all the help. I know you are busy.

@SteVwonder merged commit 3ed8516 into flux-framework:master on Apr 22, 2018
@trws (Member) commented Apr 23, 2018

This looks pretty good. Am I reading correctly that this works on the sched end, but is currently not hooked up to gpu requests from core wreck/submit?

@dongahn (Member Author) commented Apr 23, 2018

@trws: It should be hooked up to the gpu request from core wreck/submit. Try the current core master.

flux-framework/flux-core#1480
flux-framework/flux-core#1482

@trws (Member) commented Apr 23, 2018

Got it. Glad to be out of date for once, thanks for all the hard work on this @dongahn and @SteVwonder!

@dongahn (Member Author) commented Apr 23, 2018

Thank you for your initial code as well @trws!

@grondo mentioned this pull request May 11, 2018
@dongahn deleted the gpu_support branch July 13, 2019 21:02