-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU scheduling support #313
Conversation
Just posted an experimental PR to flux-core: flux-framework/flux-core#1465 |
Thanks @SteVwonder for the quick review.
Depends.
Admittedly complex... but we have introduced lots of different modes over time... |
Thanks to @grondo, I now have a working butte rdl with 4 GPU. So I took liberty of adding his file and other tests around it. I believe tjis is ready for full review. Note this shouldn't go in until experimental PR to flux-core: flux-framework/flux-core#1465 lands. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dongahn! This is a huge boost in functionality.
I just have two minor nits. They are totally optional given the importance of merging this PR.
sched/sched.c
Outdated
Jadd_int64 (child_gpu, "req_qty", job->req->ngpus); | ||
/* setting size == 1 devotes (all of) the gpu to the job */ | ||
Jadd_int64 (child_gpu, "req_size", 1); | ||
/* setting exclusive to true prevents multiple jobs per core */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"core" -> "gpu"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I will fix it.
@@ -276,10 +280,15 @@ int rsreader_hwloc_load (resrc_api_ctx_t *rsapi, const char *buf, size_t len, | |||
const char *s = rs2rank_get_digest (sig); | |||
if (!resrc_generate_hwloc_resources (rsapi, topo, s, err_str)) | |||
goto err; | |||
|
|||
free (aux); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this free
is redundant given the free at the end of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will fix! Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will fix! Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will fix! Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will fix! Thanks.
54577a5
to
fd168b0
Compare
@SteVwonder: I updated this PR against the new flux-core PR: flux-framework/flux-core#1480 But somehow t2002-easy.t is failing again. If you have a moment, can you take a brief look at this and see what might have gone wrong?
|
FYI the flux core PR has been merged so looking into the test failure on easy backfill should be a bit easier... |
I spent a few minutes on this before bed. I tried running
I will dig into this further tomorrow morning. |
Hmmm. I haven't seen these errors. I checked the output from a CI test, and it seemed only tests that failed are emulator tests...
|
The errors that I posted were my bad. I didn't have my I think I found out why the
My guess is a problem in the simulator with JSON unpacking. If so, I should have this fixed shortly. |
@SteVwonder: oh great! Thank you for looking into this (and Sat)! |
Ok. Problem resolved with a small fix in JSC. There was a missing |
@@ -46,6 +46,7 @@ typedef struct { | |||
double time_limit; | |||
int nnodes; | |||
int ncpus; | |||
int ngpus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small thing that I noticed when looking into the easy backfilling failure, this variable is not initialized in the blank_job
function in simulator.c
(I should have called the function new_job
, my apologies on the poor naming there). Do you mind adding/amending a commit to initialize this to 0 in blank_job
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. Will force a push.
Oops. Thank you for catching this so quickly. |
Codecov Report
@@ Coverage Diff @@
## master #313 +/- ##
==========================================
+ Coverage 74.14% 74.25% +0.11%
==========================================
Files 49 49
Lines 9511 9540 +29
==========================================
+ Hits 7052 7084 +32
+ Misses 2459 2456 -3
Continue to review full report at Codecov.
|
OK. pushed. |
simulator/simulator.c
Outdated
@@ -137,6 +137,7 @@ job_t *blank_job () | |||
job->time_limit = 0; | |||
job->nnodes = 0; | |||
job->ncpus = 0; | |||
job->ngpu = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC libflux_sim_la-simulator.lo
../../../simulator/simulator.c: In function ‘blank_job’:
../../../simulator/simulator.c:140:8: error: ‘job_t’ has no member named ‘ngpu’
job->ngpu = 0;
Looks like travis failed on compile on this line. This should be job->ngpus
.
working while fishing is never good.... i will do this after come home. sorry. |
In hwloc reader mode under multiple ranks on a node, the hwloc data reported from the these ranks are exactly identical. In this configuration, rs2rank groups those ranks as a equivalent set and round-robins across for them for execution requests. This is the correct semantics when we use rdl reader since we use this hwloc data only to link the resrc data with those equivalent ranks. However, when we use hwloc reader only mode, it is incorrect. We should rather treat each rank as a distinct resource set to facilitate testing. Fix this issue by introducing an auxilliary field to the signature input. Allow each reported, identical hwloc data to generate a different signature using this field.
Merge @TWRS' change and augment it. Propagte the gpu request information received from flux submit to the request input object for scheduling. gpu becomes a constraint for resrc's resource type matching logic.
@SteVwonder: Should be ready to go in. Thank you for all the help. I know you are busy. |
This looks pretty good. Am I reading correctly that this works in the sched end, but is currently not hooked up to gpu requests from core wreck/submit? |
@trws: It should be hooked up to the gnu request from core wreck/submit. Try the current core master. |
Got it. Glad to be out of date for once, thanks for all the hard work on this @dongahn and @SteVwonder! |
Thank you for your initial code as well @trws! |
Note that this will fail in Travis because it needs @trws mod in flux-core. I will submit a PR to flux-core soon for that. This addresses the overscheduling issue (#311) that arises when running multiple ranks on a node.