
GPU scheduling support #313

Merged (5 commits) on Apr 22, 2018

Conversation

@dongahn (Member) commented Apr 13, 2018

Note that this will fail in Travis because it needs @trws's mod in flux-core. I will submit a PR to flux-core soon for that. This addresses the overscheduling issue (#311) that arises when running multiple ranks on a node.

@dongahn (Member Author) commented Apr 13, 2018

Just posted an experimental PR to flux-core: flux-framework/flux-core#1465

@dongahn requested a review from SteVwonder on April 13, 2018 23:49
@SteVwonder (Member) commented Apr 14, 2018

Just to confirm that I understand the impact of 17a7129: is the new default behavior when launching multiple brokers per node to oversubscribe the resources on the node (with the over-subscription factor == the number of brokers on the node)?

f542e78 LGTM!

@dongahn (Member Author) commented Apr 14, 2018

Thanks @SteVwonder for the quick review.

Just to confirm that I understand the impact of 17a7129: is the new default behavior when launching multiple brokers per node to oversubscribe the resources on the node (with the over-subscription factor == the number of brokers on the node)?

Depends.

  1. If we use hwloc reader mode only, yes, that will be the behavior. From the scheduler's point of view, it behaves as if it has F times the resources, where F is the oversubscription factor. This is good for testing (see the illustration below).

  2. If we use rdl reader mode, even if you have multiple ranks on the matching node, the scheduler will see only 1x the resources, but execution requests will be round-robined across those equivalent ranks.

  3. If we use rdl reader mode but the rdl isn't consistent with the hwloc data, resource loading will revert to hwloc reader mode, so the behavior will be the same as in 1.

Admittedly complex... but we have introduced lots of different modes over time...
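For example (illustrative numbers, not taken from this PR): with F = 4 brokers on a 16-core node, hwloc-reader-only mode presents the scheduler with 4 x 16 = 64 schedulable cores, while rdl reader mode still presents 16 cores and round-robins execution requests across the 4 equivalent ranks.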

@dongahn (Member Author) commented Apr 14, 2018

Thanks to @grondo, I now have a working butte rdl with 4 GPUs. So I took the liberty of adding his file and other tests around it. I believe this is ready for a full review.

Note this shouldn't go in until the experimental flux-core PR (flux-framework/flux-core#1465) lands.

@SteVwonder (Member) left a comment

Thanks @dongahn! This is a huge boost in functionality.

I just have two minor nits. They are totally optional given the importance of merging this PR.

sched/sched.c Outdated
Jadd_int64 (child_gpu, "req_qty", job->req->ngpus);
/* setting size == 1 devotes (all of) the gpu to the job */
Jadd_int64 (child_gpu, "req_size", 1);
/* setting exclusive to true prevents multiple jobs per core */
@SteVwonder (Member):

"core" -> "gpu"

@dongahn (Member Author):

Thanks. I will fix it.

@@ -276,10 +280,15 @@ int rsreader_hwloc_load (resrc_api_ctx_t *rsapi, const char *buf, size_t len,
const char *s = rs2rank_get_digest (sig);
if (!resrc_generate_hwloc_resources (rsapi, topo, s, err_str))
goto err;

free (aux);
@SteVwonder (Member):

I believe this free is redundant given the free at the end of the function.

@dongahn (Member Author):

Agreed. Will fix! Thanks.
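To spell out the nit above, here is a hypothetical sketch of the pattern (invented code, not the actual rsreader.c): an early free on the success path plus the common cleanup at the end of the function means the same pointer is freed twice, so dropping the early free is the right fix.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch, not the real rsreader_hwloc_load: aux is freed on
     * the success path and again at the common cleanup label, so the success
     * path double-frees it.  Dropping the early free fixes this. */
    static int load_with_aux (const char *suffix)
    {
        int rc = -1;
        char *aux = strdup (suffix);
        if (!aux)
            goto done;
        /* ... use aux while generating resources ... */
        free (aux);        /* redundant early free */
        rc = 0;
    done:
        free (aux);        /* common cleanup also frees aux: double free */
        return rc;
    }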


@dongahn (Member Author) commented Apr 20, 2018

@SteVwonder: I updated this PR against the new flux-core PR: flux-framework/flux-core#1480

But somehow t2002-easy.t is failing again. If you have a moment, can you take a brief look at this and see what might have gone wrong?

FAIL: t2002-easy.t 3 - jobs scheduled in correct order 

@dongahn (Member Author) commented Apr 21, 2018

FYI the flux-core PR has been merged, so looking into the test failure on easy backfill should be a bit easier...

@coveralls commented Apr 21, 2018

Coverage Status

Coverage increased (+0.2%) to 75.956% when pulling 28164da on dongahn:gpu_support into 511dbe0 on flux-framework:master.

@SteVwonder (Member):

I spent a few minutes on this before bed. I tried running make check to reproduce the error (using flux-core/master and flux-sched/gpu-support) and got:

ERROR: t0002-waitjob.t - missing test plan
ERROR: t0002-waitjob.t - exited with status 1
ERROR: t0003-basic-install.t - missing test plan
ERROR: t0003-basic-install.t - exited with status 1
ERROR: t0004-rdltool.t - missing test plan
ERROR: t0004-rdltool.t - exited with status 1
ERROR: t1000-jsc.t - missing test plan
ERROR: t1000-jsc.t - exited with status 1
ERROR: t1001-rs2rank-basic.t - missing test plan
ERROR: t1001-rs2rank-basic.t - exited with status 1
ERROR: t1002-rs2rank-64ranks.t - missing test plan
ERROR: t1002-rs2rank-64ranks.t - exited with status 1

I will dig into this further tomorrow morning.

@dongahn (Member Author) commented Apr 21, 2018

Hmmm. I haven't seen these errors. I checked the output from a CI test, and it seemed the only tests that failed are the emulator tests...

PASS: t2001-fcfs-aware.t 1 - sim: started successfully
FAIL: t2001-fcfs-aware.t 2 - sim: scheduled and ran all jobs
FAIL: t2001-fcfs-aware.t 3 - jobs scheduled in correct order
PASS: t2001-fcfs-aware.t 4 - sim: unloaded
ERROR: t2001-fcfs-aware.t - exited with status 1
PASS: t2002-easy.t 1 - sim: started successfully
PASS: t2002-easy.t 2 - sim: scheduled and ran all jobs
FAIL: t2002-easy.t 3 - jobs scheduled in correct order
PASS: t2002-easy.t 4 - sim: unloaded
ERROR: t2002-easy.t - exited with status 1
PASS: t2003-fcfs-inorder.t 1 - sim: started successfully with queue-depth=1
FAIL: t2003-fcfs-inorder.t 2 - sim: scheduled and ran all jobs with queue-depth=1
PASS: t2003-fcfs-inorder.t 3 - jobs scheduled in correct order with queue-depth=1
PASS: t2003-fcfs-inorder.t 4 - sim: unloaded
ERROR: t2003-fcfs-inorder.t - exited with status 1

@SteVwonder (Member):

The errors that I posted were my bad. I didn't have my luarocks module loaded, so lua.posix couldn't be imported.

I think I found out why the easy-backfill test is failing. It seems that the walltime for jobs is always 0:

JscEvent being queued - JSON: {"jobid": 1, "state-pair": {"ostate": 1, "nstate": 3}, "rdesc": {"nnodes": 1, "ntasks": 16, "ncores": 16, "ngpus": 0, "walltime": 0}}, errnum: 0
[cut]
JscEvent being queued - JSON: {"jobid": 2, "state-pair": {"ostate": 1, "nstate": 3}, "rdesc": {"nnodes": 100, "ntasks": 1600, "ncores": 1600, "ngpus": 0, "walltime": 0}}, errnum: 0
[cut]
JscEvent being queued - JSON: {"jobid": 3, "state-pair": {"ostate": 1, "nstate": 3}, "rdesc": {"nnodes": 100, "ntasks": 1600, "ncores": 1600, "ngpus": 0, "walltime": 0}}, errnum: 0

My guess is a problem in the simulator with JSON unpacking. If so, I should have this fixed shortly.

@dongahn (Member Author) commented Apr 21, 2018

@SteVwonder: oh great! Thank you for looking into this (and on a Saturday)!

@SteVwonder (Member):

Ok. Problem resolved with a small fix in JSC. There was a missing s:i in the unpack format string in the get_submit_jcb function. I created a PR in flux-core: flux-framework/flux-core#1482
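To illustrate the class of bug (hypothetical code below, not the literal get_submit_jcb from flux-core; the JSON keys mirror the rdesc fields in the log above): with Jansson-style unpacking, a key that is missing from the format string is simply never filled in, so the destination variable keeps its initial value of 0.

    #include <jansson.h>
    #include <stdio.h>

    int main (void)
    {
        /* a toy rdesc object, similar in shape to the log lines above */
        json_t *rdesc = json_pack ("{s:i s:i s:i}",
                                   "nnodes", 1, "ncores", 16, "walltime", 3600);
        int nnodes = 0, ncores = 0, walltime = 0;

        /* buggy: the "s:i" pair for walltime is missing from the format string */
        json_unpack (rdesc, "{s:i s:i}", "nnodes", &nnodes, "ncores", &ncores);
        printf ("buggy: walltime = %d\n", walltime);   /* prints 0 */

        /* fixed: add the missing "s:i" pair */
        json_unpack (rdesc, "{s:i s:i s:i}", "nnodes", &nnodes,
                     "ncores", &ncores, "walltime", &walltime);
        printf ("fixed: walltime = %d\n", walltime);   /* prints 3600 */

        json_decref (rdesc);
        return 0;
    }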

@@ -46,6 +46,7 @@ typedef struct {
double time_limit;
int nnodes;
int ncpus;
int ngpus;
@SteVwonder (Member):

One small thing that I noticed while looking into the easy backfilling failure: this variable is not initialized in the blank_job function in simulator.c (I should have called the function new_job; my apologies for the poor naming there). Do you mind adding/amending a commit to initialize this to 0 in blank_job?

@dongahn (Member Author):

No problem. Will force a push.

@dongahn (Member Author) commented Apr 21, 2018

Ok. Problem resolved with a small fix in JSC. There was a missing s:i in the unpack format string in the get_submit_jcb function. I created a PR in flux-core: flux-framework/flux-core#1482

Oops. Thank you for catching this so quickly.

@codecov-io commented Apr 21, 2018

Codecov Report

Merging #313 into master will increase coverage by 0.11%.
The diff coverage is 97.22%.


@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
+ Coverage   74.14%   74.25%   +0.11%     
==========================================
  Files          49       49              
  Lines        9511     9540      +29     
==========================================
+ Hits         7052     7084      +32     
+ Misses       2459     2456       -3
Impacted Files Coverage Δ
simulator/submitsrv.c 79% <ø> (ø) ⬆️
sched/sched.c 73.63% <100%> (+1.14%) ⬆️
sched/rs2rank.c 93.27% <100%> (-1.56%) ⬇️
simulator/simulator.c 90.79% <100%> (+0.03%) ⬆️
sched/rsreader.c 96.02% <87.5%> (-0.51%) ⬇️
sched/flux-waitjob.c 84.42% <0%> (-1.64%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 511dbe0...28164da.

@dongahn (Member Author) commented Apr 21, 2018

OK. pushed.

@@ -137,6 +137,7 @@ job_t *blank_job ()
job->time_limit = 0;
job->nnodes = 0;
job->ncpus = 0;
job->ngpu = 0;
@SteVwonder (Member) commented Apr 21, 2018

  CC       libflux_sim_la-simulator.lo
../../../simulator/simulator.c: In function ‘blank_job’:
../../../simulator/simulator.c:140:8: error: ‘job_t’ has no member named ‘ngpu’
     job->ngpu = 0;

Looks like Travis failed to compile on this line. This should be job->ngpus.
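For reference, here is how the corrected hunk would read once the typo is fixed (using the ngpus field added to the job_t struct earlier in this PR):

    job->time_limit = 0;
    job->nnodes = 0;
    job->ncpus = 0;
    job->ngpus = 0;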

@dongahn (Member Author) commented Apr 21, 2018

Working while fishing is never good... I will do this after I come home. Sorry.

dongahn added 4 commits April 21, 2018 21:27
In hwloc reader mode under multiple ranks on a node,
the hwloc data reported from these ranks are exactly identical.
In this configuration, rs2rank groups those ranks
into an equivalent set and round-robins across them
for execution requests.

This is the correct semantics when we use the rdl reader,
since we use this hwloc data only to link the resrc data
with those equivalent ranks. However, when we use
hwloc-reader-only mode, it is incorrect. We should rather
treat each rank as a distinct resource set
to facilitate testing.

Fix this issue by introducing an auxiliary field
to the signature input.

Allow each reported, identical hwloc data to
generate a different signature using this field.
Merge @trws's change and augment it.

Propagate the gpu request information received
from flux submit to the request input object
for scheduling.

gpu becomes a constraint for resrc's resource
type matching logic.
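To make the first commit message above concrete, here is a hypothetical sketch (make_signature and its arguments are invented for illustration; this is not the actual rs2rank code) of how mixing an auxiliary per-rank field into the signature input lets byte-identical hwloc reports yield distinct signatures in hwloc-reader-only mode while staying identical under the rdl reader:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: append a per-rank auxiliary string to the hwloc
     * digest before it is used as a signature.  With aux == NULL (rdl reader
     * mode) identical digests stay identical, so equivalent ranks are grouped
     * and round-robined; with a per-rank aux (hwloc-reader-only mode) each
     * rank gets a distinct signature and is scheduled as a distinct set. */
    static char *make_signature (const char *hwloc_digest, const char *aux)
    {
        size_t len = strlen (hwloc_digest) + (aux ? strlen (aux) + 1 : 0) + 1;
        char *sig = malloc (len);
        if (sig) {
            if (aux)
                snprintf (sig, len, "%s:%s", hwloc_digest, aux);
            else
                snprintf (sig, len, "%s", hwloc_digest);
        }
        return sig;
    }

    int main (void)
    {
        char *a = make_signature ("digestX", "0");   /* rank 0, hwloc reader */
        char *b = make_signature ("digestX", "1");   /* rank 1, hwloc reader */
        char *c = make_signature ("digestX", NULL);  /* rdl reader */
        printf ("%s %s %s\n", a, b, c);  /* digestX:0 digestX:1 digestX */
        free (a);
        free (b);
        free (c);
        return 0;
    }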
@dongahn (Member Author) commented Apr 22, 2018

@SteVwonder: Should be ready to go in. Thank you for all the help. I know you are busy.

@SteVwonder merged commit 3ed8516 into flux-framework:master on Apr 22, 2018
@trws (Member) commented Apr 23, 2018

This looks pretty good. Am I reading correctly that this works on the sched end, but is currently not hooked up to gpu requests from core wreck/submit?

@dongahn (Member Author) commented Apr 23, 2018

@trws: It should be hooked up to the gpu request from core wreck/submit. Try the current core master.

flux-framework/flux-core#1480
flux-framework/flux-core#1482

@trws (Member) commented Apr 23, 2018

Got it. Glad to be out of date for once, thanks for all the hard work on this @dongahn and @SteVwonder!

@dongahn (Member Author) commented Apr 23, 2018

Thank you for your initial code as well @trws!

@grondo mentioned this pull request May 11, 2018
@dongahn deleted the gpu_support branch July 13, 2019 21:02