
jobspec: add level 0 support for moldable jobspecs #3944

Closed
wants to merge 2 commits

Conversation

@dongahn (Member) commented Nov 5, 2021

Problem: job-list and jobshell currently assume
an integer value for the "count" key within
a jobspec. While this complies with
RFC 25 (Jobspec V1), it prevents users from submitting
a moldable jobspec in which the "count" key contains a
dictionary with min/max/operator/operand instead of an integer.
Because moldability will soon be required to enable
node-exclusive scheduling for our system instance work,
we need level 0 support.

Modify parse_res_level() functions within job-list and
jobshell where this assumption is made.

Unpack the "count" key as a json_t object instead
of an integer object in those functions and subsequently
handle the moldable jobspec case where its value is a dictionary.

This does not change scheduling semantics since it is
only level-0 support: the min count is used by these
components when a moldable jobspec is given.
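
For level 0, the change amounts to accepting either form of the "count" value and falling back to its "min" member. A minimal sketch of that logic (illustrative only, using jansson; not the actual flux-core patch):

/* Sketch only: accept either an RFC 25 integer count or a moldable
 * count dictionary, and use "min" per the level-0 behavior above.
 */
#include <jansson.h>

static int parse_count (json_t *count, int *result)
{
    if (json_is_integer (count)) {          /* RFC 25 fixed count */
        *result = (int)json_integer_value (count);
        return 0;
    }
    if (json_is_object (count)) {           /* moldable count dictionary */
        json_t *min = json_object_get (count, "min");
        if (!min || !json_is_integer (min))
            return -1;                      /* "min" is the only required key */
        *result = (int)json_integer_value (min);
        return 0;
    }
    return -1;
}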

@dongahn requested a review from @grondo on November 5, 2021 04:21
@grondo (Contributor) commented Nov 5, 2021

This looks good to me @dongahn -- seems like we could easily add a test to ensure the error case (at least in the shell jobspec parser) is handled correctly if we want a small coverage bump.

@dongahn (Member, Author) commented Nov 5, 2021

Ok. Sounds good. I will add some test cases. Do you have a suggestion about where to add them? Should I create a new sharness test or extend an existing one?

@grondo (Contributor) commented Nov 6, 2021

I think it will be more trouble than it is worth to test the job-info failure case.

However, for the src/shell/jobspec.c changes, you could add a unit test or two in src/shell/test/jobspec.c which would be quite straightforward I think.
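
A test along those lines might look roughly like the sketch below. This is only an illustration of the error case: the jobspec_parse() call, headers, and minimal jobspec fields are assumptions on my part; the real API and conventions are in src/shell/jobspec.h and the existing tests.

/* Rough sketch, not actual test code: assumes a jobspec_parse()-style
 * entry point and the libtap conventions used elsewhere in flux-core.
 */
#include <jansson.h>
#include "tap.h"
#include "jobspec.h"

int main (void)
{
    json_error_t error;
    /* A count dictionary without the required "min" key should fail */
    const char *input =
        "{\"version\": 1, \"tasks\": [], \"attributes\": {},"
        " \"resources\": [{\"type\": \"slot\", \"label\": \"task\","
        "   \"count\": {\"max\": 4},"
        "   \"with\": [{\"type\": \"core\", \"count\": 1}]}]}";

    plan (NO_PLAN);
    ok (jobspec_parse (input, &error) == NULL,
        "count dictionary lacking \"min\" is rejected");
    done_testing ();
    return 0;
}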

Extend the existing jobspec unit tests for
jobshell with moldable jobspecs.

Add good inputs to test that the jobspec parsing
code used by jobshell can handle both partially
and fully qualified resource count specs, where
only the "min" key is mandatory.

Limit the tested resources to "core" and "gpu", where
a moldable count spec is expected to be used
in the near term.
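
For reference, the two flavors of count being tested could be expressed like this (illustrative jansson snippets; the key set beyond the mandatory "min" follows the min/max/operator/operand fields of RFC 14):

/* Illustrative only: the two "count" forms the new tests exercise. */
#include <jansson.h>

int main (void)
{
    /* partially qualified: only the mandatory "min" key */
    json_t *partial = json_pack ("{s:i}", "min", 2);

    /* fully qualified: min, max, operator, operand */
    json_t *full = json_pack ("{s:i, s:i, s:s, s:i}",
                              "min", 2,
                              "max", 4,
                              "operator", "+",
                              "operand", 1);

    json_decref (partial);
    json_decref (full);
    return 0;
}
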
@dongahn changed the title from "[WIP] jobspec: add level 0 support for moldable jobspecs" to "jobspec: add level 0 support for moldable jobspecs" on Nov 10, 2021
@dongahn (Member, Author) commented Nov 10, 2021

FYI -- Once this goes in, flux-framework/rfc#302 should go in as well after a review (of course).

@codecov (bot) commented Nov 10, 2021

Codecov Report

Merging #3944 (20b3675) into master (c6a5843) will decrease coverage by 0.03%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #3944      +/-   ##
==========================================
- Coverage   83.55%   83.52%   -0.04%     
==========================================
  Files         359      359              
  Lines       52838    52850      +12     
==========================================
- Hits        44147    44141       -6     
- Misses       8691     8709      +18     
Impacted Files Coverage Δ
src/modules/job-list/job_state.c 76.18% <66.66%> (-0.33%) ⬇️
src/shell/jobspec.c 78.15% <66.66%> (-0.61%) ⬇️
src/modules/job-info/guest_watch.c 76.00% <0.00%> (-1.24%) ⬇️
src/modules/job-archive/job-archive.c 59.27% <0.00%> (-0.81%) ⬇️
src/broker/overlay.c 87.60% <0.00%> (-0.69%) ⬇️
src/broker/groups.c 85.86% <0.00%> (-0.35%) ⬇️

@garlick (Member) commented Nov 10, 2021

I've not been tracking this very closely, but just wanted to ask a question: Could we convert all our tools to use min instead of count and then have the scheduler alter its behavior based on configuration:

  • if node-exclusive = true, grant more than the requested core/gpu resources if necessary to avoid granting a partial node
  • if node-exclusive = false (the default, like now), just grant the min as if it were a count

Alternatively, we could leave count in place but relax the spec to allow it to be interpreted like min.

The prospect of requiring different user-facing tools on node-exclusive clusters seems kind of messed up to me. Our tools should work, even if wrappers are also available.

@grondo (Contributor) commented Nov 10, 2021

Now that @garlick mentions it, it also occurs to me that besides just the tools, this approach may break the Python and C APIs for constructing jobspec. It may also require that existing users maintain a disjoint implementation for constructing jobspec depending on whether they are talking to a system instance or any other instance, which would break the key portability feature of Flux.

Alternatively, we could leave count in place but relax the spec to allow it to be interpreted like min.

I like this idea since it wouldn't break existing jobspec and users.

Could a global configuration flag in Fluxion just tell the resource match code to treat count as min for a set of resources (e.g. core and gpu to start)?

@dongahn (Member, Author) commented Nov 10, 2021

Good thoughts. Just to bring more clarity to this idea:

Could a global configuration flag in Fluxion just tell the resource match code to treat count as min for a set of resources (e.g. core and gpu to start)?

Because Fluxion already has full moldable jobspec support, I believe this can be done through config support -- we essentially tell Fluxion to use different semantics in interpreting count.

I believe it would help to discuss a bit how this would simplify the way we submit a node-exclusive jobspec using existing interfaces like the flux mini commands. Maybe we should create an example using the flux mini batch command?

@dongahn (Member, Author) commented Nov 10, 2021

Overall I am a big fan of using existing interfaces with minimum viable changes. So I like this discussion thread.

@grondo (Contributor) commented Nov 10, 2021

I believe it would help to discuss a bit how this would simplify the way we submit a node-exclusive jobspec using existing interfaces like the flux mini commands. Maybe we should create an example using the flux mini batch command?

Good idea.

Maybe an obvious point: if min is required in place of count for all system instance job submissions, then none of the flux-mini commands can be used as-is to submit to a system instance, since they all use count.

We could switch the commands to all use min instead of count, but then in a normal single-user instance the first job to request a core would presumably use up a whole node, hurting high-throughput workloads by wasting resources.

If we have a flag that tells Fluxion to treat count as if it were min for core and gpu, and that flag were enabled only for the system instance, then flux mini batch could work in a system instance, where

$ flux mini batch -n 1 -c 1 script.sh

would allocate at least one node and execute script.sh,

whereas in any other instance the same command would allocate just one core for script.sh.

I guess that is the minimal flux mini batch example. It seems like the common case for our users would be requesting a number of nodes. Currently, flux mini batch requires that both -n and -N be set:

$ flux mini batch -N 4 script.sh
flux-mini: ERROR: Number of slots to allocate must be specified

Perhaps we could relax this constraint and default -n to -N when -n is not specified, to make a much nicer interface in the system instance case.

@dongahn (Member, Author) commented Nov 11, 2021

@grondo: sounds pretty promising. Two minor points.

  1. If fluxion is configured to be node-exclusive=true, I believe rejecting any jobspec without -N will simplify fluxion code.
$ flux mini batch -N 3 -n 3 -c 1 script.sh

instead of

$ flux mini batch -n 3 -c 1 script.sh

Do you see any issues with this?

  2. RFC 14 and 25 should be revised to the effect that the interpretation of count will be determined by the scheduler: either "at least" or "exact".

Overall, I like this direction a lot.

@garlick (Member) commented Nov 11, 2021

If fluxion is configured to be node-exclusive=true, I believe rejecting any jobspec without -N will simplify fluxion code.

That seems not ideal, since it forces the user to know how many cores there are per node, if they are submitting a job that should just run one task per core regardless of the node type.

@dongahn (Member, Author) commented Nov 11, 2021

That seems not ideal, since it forces the user to know how many cores there are per node, if they are submitting a job that should just run one task per core regardless of the node type.

@garlick: Just to make sure we are on the same page, I am proposing that #1 below (with -N) is valid for the system instance, whereas #2 (without -N) will be rejected by fluxion (with node-exclusive=true).

$ flux mini batch -N 3 -n 3 -c 1 script.sh
$ flux mini batch -n 3 -c 1 script.sh

Is your preference that system instance fluxion should accept #2 and interpret that as #1?

@garlick (Member) commented Nov 11, 2021

Is your preference that system instance fluxion should accept #2 and interpret that as #1

I was thinking fluxion should choose the minimum number of nodes to satisfy a request for some number of cores, within the exclusive node constraint.

So if I submit -n128 and nodes on this system have 64 cores, fluxion allocates me 2 nodes.

@dongahn (Member, Author) commented Nov 12, 2021

So if I submit -n128 and nodes on this system have 64 cores, fluxion allocates me 2 nodes.

Hmmm. I am not sure we are looking at the same page.

This is what fluxion would do currently. But I thought the proposal was to interpret count as min under node-exclusive=true. So with,

flux mini batch -n 128 -c 1

we are requesting 128 slots each with a minimum of 1 core and a maximum of infinity. So the slot will be specialized to be 64 cores in this case?

@dongahn (Member, Author) commented Nov 12, 2021

@grondo: Now that I think about it, interpreting count as min has another problem. With this, node[2] will request a minimum of 2 nodes, and fluxion can generate far more nodes than when the semantics of count is "exact." It seems explicit support for min is necessary? Thoughts?

@garlick (Member) commented Nov 12, 2021

we are requesting 128 slots each with a minimum of 1 core and a maximum of infinity. So the slot will be specialized to be 64 cores in this case?

Yes and the result would be the same for any -n value of 65...128 under node-exclusive = true. -N shouldn't be required since sched knows how many cores are on each node and the user might not know (and it might depend on which node the scheduler selects).

@dongahn (Member, Author) commented Nov 12, 2021

Yes and the result would be the same for any -n value of 65...128 under node-exclusive = true.

I think I am missing something. Sorry for being slow. If each slot has 64 cores, -n128 (128 slots) will result in 128 nodes (not 2 nodes)?

@garlick (Member) commented Nov 12, 2021

With this, node[2] will request a minimum of 2 nodes, and fluxion can generate far more nodes than when the semantics of count is "exact." It seems explicit support for min is necessary? Thoughts?

Under node-exclusive = true, it might make sense to only change the interpretation of count for gpus and cpus. I guess a user can argue that they are getting charged for more than they asked for in any case, but in the gpu/cpu case we can respond that the cluster policy is we don't share nodes so you have to pay for the fragmentation, too bad. But if we give them more nodes than they ask for, that would really be wrong if they haven't indicated that they want them.

@garlick (Member) commented Nov 12, 2021

I think I am missing something. Sorry for being slow. If each slot has 64 cores, -n128 (128 slots) will result in 128 nodes (not 2 nodes)?

If the user is requesting one proc per core, and 128 procs, then the scheduler should allocate the minimum number of nodes with at least 128 cores, correct?

@dongahn (Member, Author) commented Nov 12, 2021

Under node-exclusive = true, it might make sense to only change the interpretation of count for gpus and cpus.

Yes, this can be done.

@dongahn (Member, Author) commented Nov 12, 2021

If the user is requesting one proc per core, and 128 procs, then the scheduler should allocate the minimum number of nodes with at least 128 cores, correct?

Yes, and this is what Fluxion does today (under a proper match policy), as it treats count as "exact". So there is no issue with Fluxion being able to do this.

However, there is a known problem when a jobspec from a command like flux mini batch -n 1 is given. Under these semantics, Fluxion will only emit 1 core instead of 64 cores in the resulting R, which leads to underutilization of cores. This is documented in RFC 14:

The default value for max SHALL be infinite, therefore a count which specifies only the min key SHALL be considered a request for at least that number of a resource, and the scheduler SHALL generate the R that contains the maximum number of the resource that is available and subject to the operator and operand. By contrast, if a fixed count is given to the count key, the scheduler SHALL match any resource that contains at least count of the resource, but its R SHALL contain exactly count of the resource (potentially leaving excess resources unutilized).

My fear with interpreting count in different ways for different cases is that it will lead to various if/else branches within Fluxion, which will be error prone.

I believe it would be far better to do this with consistent semantics and as few corner cases as possible.

I also would like to hear from @grondo since he also has a ton of experience in this.

@grondo (Contributor) commented Nov 12, 2021

I also would like to hear from @grondo since he also has a ton of experience in this.

I don't feel like I have any more experience in this area than you or @garlick.

I hope the suggestion to support min wasn't a bad idea. If so I apologize profusely.

The end goal here I think is to be able to support some sort of configuration or policy in Fluxion such that flux mini batch -n1 emits an R with all resources on the node on which the core was allocated, instead of just the one core. At first I thought using min instead of count would work, which it might for this case, but I now do see the problem with -n128, where each of the 128 slots specifies min:1 so they would all grab a whole node (is that the problem?).

What we need is a way for Fluxion to match resources with the policy it uses now (perhaps ensuring that the number of utilized nodes after the match is at least a local minimum), then go back and add all remaining resources to the emitted R (effectively). Would something like that work for node-exclusive = true?

Sorry If I'm just muddying the waters! 😞

I think part of the problem here might be a lack of familiarity with Fluxion internals and matching policy behavior.

@dongahn (Member, Author) commented Nov 12, 2021

I don't feel like I have any more experience in this area than you or @garlick.

I feel too close to this and an extra pair of eyes really does help.

The end goal here I think is to be able to support some sort of configuration or policy in Fluxion such that flux mini batch -n1 emits an R with all resources on the node on which the core was allocated, instead of just the one core. At first I thought using min instead of count would work, which it might for this case, but I now do see the problem with -n128, where each of the 128 slots specifies min:1 so they would all grab a whole node (is that the problem?).

Exactly.

What we need is a way for Fluxion to match resources with the policy it uses now (perhaps ensuring that the number of utilized nodes after the match is at least a local minimum), then go back and add all remaining resources to the emitted R (effectively). Would something like that work for node-exclusive = true?

Seems like the best option, given that the other approaches have side effects. Call it "shadow resource emission" support or something.

I will think it through when my brain is a bit fresher.

@dongahn (Member, Author) commented Nov 12, 2021

And there is this: flux-framework/flux-sched#689

@garlick and @grondo: thank you for the discussion. This has been extremely helpful!

@dongahn (Member, Author) commented Dec 2, 2021

Closing this PR per our discussion at today's meeting.

@dongahn closed this on Dec 2, 2021