
jobspec: add level 0 support for moldable jobspecs #3944

Closed
wants to merge 2 commits

Conversation

@dongahn (Member) commented Nov 5, 2021

Problem: job-list and jobshell currently assume
an integer value for the "count" key within
a jobspec. While this complies with
RFC 25 (Jobspec V1), it prevents users from submitting
a moldable jobspec in which the "count" key contains a
dictionary with min/max/operator/operand instead of an integer.
Because moldability will soon be required to enable
node-exclusive scheduling for our system instance work,
we need level 0 support.

Modify parse_res_level() functions within job-list and
jobshell where this assumption is made.

Unpack the "count" key as a json_t object instead
of an integer object in those functions and subsequently
handle the moldable jobspec case where its value is a dictionary.

This does not change scheduling semantics since it is
only level-0 support: the min count is used by these
components when a moldable jobspec is given.
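
For level 0, the change amounts to accepting either form of the "count" value and falling back to its "min" member. A minimal sketch of that logic (illustrative only, using jansson; not the actual flux-core patch):

/* Sketch only: accept either an RFC 25 integer count or a moldable
 * count dictionary, and use "min" per the level-0 behavior above.
 */
#include <jansson.h>

static int parse_count (json_t *count, int *result)
{
    if (json_is_integer (count)) {          /* RFC 25 fixed count */
        *result = (int)json_integer_value (count);
        return 0;
    }
    if (json_is_object (count)) {           /* moldable count dictionary */
        json_t *min = json_object_get (count, "min");
        if (!min || !json_is_integer (min))
            return -1;                      /* "min" is the only required key */
        *result = (int)json_integer_value (min);
        return 0;
    }
    return -1;
}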

@dongahn requested a review from @grondo on November 5, 2021 04:21
@grondo (Contributor) commented Nov 5, 2021

This looks good to me @dongahn -- seems like we could easily add a test to ensure the error case (at least in the shell jobspec parser) is handled correctly if we want a small coverage bump.

@dongahn (Member, Author) commented Nov 5, 2021

Ok. Sounds good. I will add some test cases. Do you have a suggestion about where to add them? Should I create a new sharness test or extend an existing one?

@grondo (Contributor) commented Nov 6, 2021

I think it will be more trouble than it is worth to test the job-info failure case.

However, for the src/shell/jobspec.c changes, you could add a unit test or two in src/shell/test/jobspec.c which would be quite straightforward I think.
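
A test along those lines might look roughly like the sketch below. This is only an illustration of the error case: the jobspec_parse() call, headers, and minimal jobspec fields are assumptions on my part; the real API and conventions are in src/shell/jobspec.h and the existing tests.

/* Rough sketch, not actual test code: assumes a jobspec_parse()-style
 * entry point and the libtap conventions used elsewhere in flux-core.
 */
#include <jansson.h>
#include "tap.h"
#include "jobspec.h"

int main (void)
{
    json_error_t error;
    /* A count dictionary without the required "min" key should fail */
    const char *input =
        "{\"version\": 1, \"tasks\": [], \"attributes\": {},"
        " \"resources\": [{\"type\": \"slot\", \"label\": \"task\","
        "   \"count\": {\"max\": 4},"
        "   \"with\": [{\"type\": \"core\", \"count\": 1}]}]}";

    plan (NO_PLAN);
    ok (jobspec_parse (input, &error) == NULL,
        "count dictionary lacking \"min\" is rejected");
    done_testing ();
    return 0;
}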

Extend the existing jobspec unit tests for
jobshell with moldable jobspecs.

Add good inputs to test that the jobspec parsing
code used by jobshell can handle both partially
and fully qualified resource count specs, where
only the "min" key is mandatory.

Limit the tested resources to "core" and "gpu", where
a moldable count spec is expected to be used
in the near term.
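
For reference, the two flavors of count being tested could be expressed like this (illustrative jansson snippets; the key set beyond the mandatory "min" follows the min/max/operator/operand fields of RFC 14):

/* Illustrative only: the two "count" forms the new tests exercise. */
#include <jansson.h>

int main (void)
{
    /* partially qualified: only the mandatory "min" key */
    json_t *partial = json_pack ("{s:i}", "min", 2);

    /* fully qualified: min, max, operator, operand */
    json_t *full = json_pack ("{s:i, s:i, s:s, s:i}",
                              "min", 2,
                              "max", 4,
                              "operator", "+",
                              "operand", 1);

    json_decref (partial);
    json_decref (full);
    return 0;
}
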
@dongahn changed the title from "[WIP] jobspec: add level 0 support for moldable jobspecs" to "jobspec: add level 0 support for moldable jobspecs" on Nov 10, 2021
@dongahn (Member, Author) commented Nov 10, 2021

FYI -- Once this goes in, flux-framework/rfc#302 should go in as well after a review (of course).

@codecov (bot) commented Nov 10, 2021

Codecov Report

Merging #3944 (20b3675) into master (c6a5843) will decrease coverage by 0.03%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #3944      +/-   ##
==========================================
- Coverage   83.55%   83.52%   -0.04%     
==========================================
  Files         359      359              
  Lines       52838    52850      +12     
==========================================
- Hits        44147    44141       -6     
- Misses       8691     8709      +18     
Impacted Files Coverage Δ
src/modules/job-list/job_state.c 76.18% <66.66%> (-0.33%) ⬇️
src/shell/jobspec.c 78.15% <66.66%> (-0.61%) ⬇️
src/modules/job-info/guest_watch.c 76.00% <0.00%> (-1.24%) ⬇️
src/modules/job-archive/job-archive.c 59.27% <0.00%> (-0.81%) ⬇️
src/broker/overlay.c 87.60% <0.00%> (-0.69%) ⬇️
src/broker/groups.c 85.86% <0.00%> (-0.35%) ⬇️

@garlick (Member) commented Nov 10, 2021

I've not been tracking this very closely, but just wanted to ask a question: Could we convert all our tools to use min instead of count and then have the scheduler alter its behavior based on configuration:

  • if node-exclusive = true, grant more than the requested core/gpu resources if necessary to avoid granting a partial node
  • if node-exclusive = false (the default, like now), just grant the min as if it were a count

Alternatively, we could leave count in place but relax the spec to allow it to be interpreted like min.

The prospect of requiring different user-facing tools on node-exclusive clusters seems kind of messed up to me. Our tools should work, even if wrappers are also available.

@grondo (Contributor) commented Nov 10, 2021

Now that @garlick mentions it, it also occurs to me that besides just the tools, this approach may break the Python and C APIs for constructing jobspec. It may also require that existing users maintain a disjoint implementation for constructing jobspec depending on whether they are talking to a system instance or any other instance, which would break the key portability feature of Flux.

Alternatively, we could leave count in place but relax the spec to allow it to be interpreted like min.

I like this idea since it wouldn't break existing jobspec and users.

Could a global configuration flag in Fluxion just tell the resource match code to treat count as min for a set of resources (e.g. core and gpu to start)?

@dongahn (Member, Author) commented Nov 10, 2021

Good thoughts. Just to bring more clarity to this idea:

Could a global configuration flag in Fluxion just tell the resource match code to treat count as min for a set of resources (e.g. core and gpu to start)?

Because Fluxion already has full moldable jobspec support, I believe this can be done through config support -- we essentially tell Fluxion to use different semantics in interpreting count.

I believe it would help to discuss a bit how this would simplify the way we submit a node-exclusive jobspec using existing interfaces like the flux mini commands. Maybe we should create an example using the flux mini batch command?

@dongahn (Member, Author) commented Nov 10, 2021

Overall I am a big fan of using existing interfaces with minimum viable changes. So I like this discussion thread.

@grondo (Contributor) commented Nov 10, 2021

I believe it would help to discuss a bit how this would simplify the way we submit a node-exclusive jobspec using existing interfaces like the flux mini commands. Maybe we should create an example using the flux mini batch command?

Good idea.

Maybe an obvious point: if min is required in place of count for all system instance job submissions, then none of the flux-mini commands can be used as-is to submit to a system instance, since they all use count.

We could switch the commands to all use min instead of count, but then in a normal single-user instance the first job to request a core would presumably use up a whole node, hurting high-throughput workloads by wasting resources.

If we have a flag that tells Fluxion to treat count as if it were min for core and gpu, and that flag were enabled only for the system instance, then flux mini batch could work in a system instance, where

$ flux mini batch -n 1 -c 1 script.sh

would allocate at least one node and execute script.sh,

whereas in any other instance the same command would allocate just one core for script.sh.

I guess that is the minimal flux mini batch example. It seems like the common case for our users would be requesting a number of nodes. Currently, flux mini batch requires that both -n and -N be set:

$ flux mini batch -N 4 script.sh
flux-mini: ERROR: Number of slots to allocate must be specified

Perhaps we could relax this constraint and default -n to -N when -n is not specified, to make a much nicer interface in the system instance case.

@dongahn (Member, Author) commented Nov 11, 2021

@grondo: sounds pretty promising. Two minor points.

  1. If fluxion is configured to be node-exclusive=true, I believe rejecting any jobspec without -N will simplify fluxion code.
$ flux mini batch -N 3 -n 3 -c 1 script.sh

instead of

$ flux mini batch -n 3 -c 1 script.sh

Do you see any issues with this?

  2. RFC 14 and 25 should be revised to the effect that the interpretation of count will be determined by the scheduler: either "at least" or "exact".

Overall, I like this direction a lot.

@garlick (Member) commented Nov 11, 2021

If fluxion is configured to be node-exclusive=true, I believe rejecting any jobspec without -N will simplify fluxion code.

That seems not ideal, since it forces the user to know how many cores there are per node, if they are submitting a job that should just run one task per core regardless of the node type.

@dongahn (Member, Author) commented Nov 11, 2021

That seems not ideal, since it forces the user to know how many cores there are per node, if they are submitting a job that should just run one task per core regardless of the node type.

@garlick: Just to make sure we are on the same page, I am proposing that #1 below (with -N) is valid for the system instance, whereas #2 (without -N) will be rejected by fluxion (with node-exclusive=true).

$ flux mini batch -N 3 -n 3 -c 1 script.sh
$ flux mini batch -n 3 -c 1 script.sh

Is your preference that system instance fluxion should accept #2 and interpret that as #1?

@garlick (Member) commented Nov 11, 2021

Is your preference that system instance fluxion should accept #2 and interpret that as #1

I was thinking fluxion should choose the minimum number of nodes to satisfy a request for some number of cores, within the exclusive node constraint.

So if I submit -n128 and nodes on this system have 64 cores, fluxion allocates me 2 nodes.

@dongahn (Member, Author) commented Nov 12, 2021

So if I submit -n128 and nodes on this system have 64 cores, fluxion allocates me 2 nodes.

Hmmm. I am not sure we are looking at the same page.

This is what fluxion would do currently. But I thought the proposal was to interpret count as min under node-exclusive=true. So with,

flux mini batch -n 128 -c 1

we are requesting 128 slots each with a minimum of 1 core and a maximum of infinity. So the slot will be specialized to be 64 cores in this case?

@dongahn (Member, Author) commented Nov 12, 2021

@grondo: Now that I think about it, interpreting count as min has another problem. With this, node[2] will request a minimum of 2 nodes, and fluxion can generate far more nodes than when the semantics of count is "exact." It seems explicit support for min is necessary? Thoughts?

@garlick (Member) commented Nov 12, 2021

we are requesting 128 slots each with a minimum of 1 core and a maximum of infinity. So the slot will be specialized to be 64 cores in this case?

Yes and the result would be the same for any -n value of 65...128 under node-exclusive = true. -N shouldn't be required since sched knows how many cores are on each node and the user might not know (and it might depend on which node the scheduler selects).

@dongahn (Member, Author) commented Nov 12, 2021

Yes and the result would be the same for any -n value of 65...128 under node-exclusive = true.

I think I am missing something. Sorry for being slow. If each slot has 64 cores, -n128 (128 slots) will result in 128 nodes (not 2 nodes)?

@garlick (Member) commented Nov 12, 2021

With this, node[2] will request a minimum of 2 nodes, and fluxion can generate far more nodes than when the semantics of count is "exact." It seems explicit support for min is necessary? Thoughts?

Under node-exclusive = true, it might make sense to only change the interpretation of count for gpus and cpus. I guess a user can argue that they are getting charged for more than they asked for in any case, but in the gpu/cpu case we can respond that the cluster policy is we don't share nodes so you have to pay for the fragmentation, too bad. But if we give them more nodes than they ask for, that would really be wrong if they haven't indicated that they want them.

@garlick (Member) commented Nov 12, 2021

I think I am missing something. Sorry for being slow. If each slot has 64 cores, -n128 (128 slots) will result in 128 nodes (not 2 nodes)?

If the user is requesting one proc per core, and 128 procs, then the scheduler should allocate the minimum number of nodes with at least 128 cores, correct?

@dongahn (Member, Author) commented Nov 12, 2021

Under node-exclusive = true, it might make sense to only change the interpretation of count for gpus and cpus.

Yes, this can be done.

@dongahn (Member, Author) commented Nov 12, 2021

If the user is requesting one proc per core, and 128 procs, then the scheduler should allocate the minimum number of nodes with at least 128 cores, correct?

Yes, and this is what Fluxion does today (under a proper match policy), as it treats count as "exact". So there is no issue with Fluxion being able to do this.

However, there is a known problem when a jobspec from a command like flux mini batch -n 1 is given. Under these semantics, Fluxion will only emit 1 core instead of 64 cores in the resulting R, which leads to underutilization of cores. This is documented in RFC 14:

The default value for max SHALL be infinite, therefore a count which specifies only the min key SHALL be considered a request for at least that number of a resource, and the scheduler SHALL generate the R that contains the maximum number of the resource that is available and subject to the operator and operand. By contrast, if a fixed count is given to the count key, the scheduler SHALL match any resource that contains at least count of the resource, but its R SHALL contain exactly count of the resource (potentially leaving excess resources unutilized).

My fear with interpreting count in different ways for different cases is that it will lead to various if/else branches within Fluxion, which will be error prone.

I believe it would be far better to do this with consistent semantics and as few corner cases as possible.

I also would like to hear from @grondo since he also has a ton of experience in this.

@grondo (Contributor) commented Nov 12, 2021

I also would like to hear from @grondo since he also has a ton of experience in this.

I don't feel like I have any more experience in this area than you or @garlick.

I hope the suggestion to support min wasn't a bad idea. If so I apologize profusely.

The end goal here I think is to be able to support some sort of configuration or policy in Fluxion such that flux mini batch -n1 emits an R with all resources on the node on which the core was allocated, instead of just the one core. At first I thought using min instead of count would work, which it might for this case, but I now do see the problem with -n128, where each of the 128 slots specifies min:1 so they would all grab a whole node (is that the problem?).

What we need is a way for Fluxion to match resources with the policy it uses now (perhaps ensuring that the number of utilized nodes after the match is at least a local minimum), then go back and add all remaining resources to the emitted R (effectively). Would something like that work for node-exclusive = true?

Sorry If I'm just muddying the waters! 😞

I think part of the problem here might be a lack of familiarity with Fluxion internals and matching policy behavior.

@dongahn (Member, Author) commented Nov 12, 2021

I don't feel like I have any more experience in this area than you or @garlick.

I feel too close to this and an extra pair of eyes really does help.

The end goal here I think is to be able to support some sort of configuration or policy in Fluxion such that flux mini batch -n1 emits an R with all resources on the node on which the core was allocated, instead of just the one core. At first I thought using min instead of count would work, which it might for this case, but I now do see the problem with -n128, where each of the 128 slots specifies min:1 so they would all grab a whole node (is that the problem?).

Exactly.

What we need is a way for Fluxion to match resources with the policy it uses now (perhaps ensuring that the number of utilized nodes after the match is at least a local minimum), then go back and add all remaining resources to the emitted R (effectively). Would something like that work for node-exclusive = true?

Seems like the best option, given that the other approaches have side effects. Call it "shadow resource emission" support or something.

I will think it through when my brain is a bit fresher.

@dongahn (Member, Author) commented Nov 12, 2021

And there is this: flux-framework/flux-sched#689

@garlick and @grondo: thank you for the discussion. This has been extremely helpful!

@dongahn (Member, Author) commented Dec 2, 2021

Closing this PR per our discussion at today's meeting.

@dongahn closed this on Dec 2, 2021