Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support node exclusive allocations #4245

Merged
merged 17 commits into from
Mar 31, 2022
Merged

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Mar 29, 2022

This PR adds an --exclusive flag in the flux-mini set of commands, which only works when -N, --nodes is also specified, and sets the exclusive key to true on the node vertex in jobspec. Support for node exclusive requests is added to sched-simple (mainly for testing), and finally the number of tasks/slots in flux mini commands is defaulted to the number of nodes, and the exclusive flag enabled, when number of tasks is not otherwise specified on the command line (fixes #4228).

As noted above, the new --exclusive flag only works if -N, --nodes is also specified. I was having trouble deciding if Jobspec V1 could even support setting the exclusive flag on nodes without also specifying the number of required nodes, though I admittedly didn't spend too much time on it. The current approach addresses the main concern here, which was defaulting ntasks/slots to nnodes in a coherent manner, so that seemed good enough for now.

There is a difference in behavior between Fluxion and sched-simple when allocating nodes exclusively. Fluxion will emit R with only the minimum resources required by the jobspec on each node (but all resources on the node will show as allocated), whereas sched-simple currently emits all exclusive node resources in the emitted R. Thus, for sched-simple:

$ flux mini alloc -N4 flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
      free      4       16        0 fluke[7-10]
 allocated      0        0        0 
      down      0        0        0 

whereas with Fluxion:

$ flux mini alloc -N4 flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
      free      4        4        0 fluke[7-10]
 allocated      0        0        0 
      down      0        0        0 

I'm not sure, but this seems like something we'll want to fix. Though I may be missing some subtlety here, it seems like it would be surprising behavior if users ask for nodes exclusively but only get one core from each node in their child instance. @dongahn?

@grondo
Copy link
Contributor Author

grondo commented Mar 29, 2022

It occurs to me I should get @trws and @ryanday36's blessing on the end result here, i.e. that flux mini <command> --nodes=X now doesn't throw an error. Instead it acts the same as flux mini <command> --nodes=X --nslots=X --exclusive. I hope that is generally what we agreed to as the most reasonable solution in #4228. (I may have missed the mark though!)

@ryanday36
Copy link

That sounds about like what I was hoping for. Thanks @grondo!

@trws
Copy link
Member

trws commented Mar 29, 2022 via email

@grondo
Copy link
Contributor Author

grondo commented Mar 30, 2022

Great! Thanks @ryanday36 and @trws!

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had one comment, not a show stopper either. Great work here!

I did give this a manual poke and LGTM.

Truth be known, I seem to type -N without -n a lot, despite knowing better. It's hard to unlearn that muscle memory. So at minimum this change is going reduce reduce my annoyed-with-self time.

Comment on lines 1812 to 1813
static struct rlist *rlist_alloc_nnodes (struct rlist *rl,
struct rlist_alloc_info *ai)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup!

Not sure if this is feasible, but in functions where the struct rlist_alloc_info parameter is not modified, can it be declared const? Parameters that were obviously "in" before could now be sneakily "in/out", so the extra declaration could clarify that.

Copy link
Contributor Author

@grondo grondo Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is a good suggestion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've made that change and force-pushed the result. Thanks!

grondo added 17 commits March 30, 2022 16:30
Problem: Passing allocation parameters like nnodes, nslots, and slot
size on as separate parameters to rlist allocation functions makes
extensibility difficult.

Add a new rlist_alloc_info structure which contains allocation parameters
for rlist allocation functions. Refactor code internally to take advantage
of the new structure.
Problem: librlist doesn't expose a public allocation function that
takes the more convenient struct rlist_alloc_info structure.

Add rlist_alloc_ex() which offers an alternative to rlist_alloc()
(so that all callers do not need to be immediately updated.). Since
job constraints are (optionally) provided in the alloc_info argument
to rlist_alloc_ex(), handle constraints directly in this function,
instead of requiring the caller to filter the rlist by constraint
before calling rlist_alloc().
Problem: rlist_alloc_ex() is a preferred interface over rlist_alloc(),
but there are still a lot of users of the older rlist_alloc()
interface.

Add rl_alloc() wrapper for rlist_alloc_ex() to test/rlist.c and switch
all callers. This moves the unit test away from the older rlist_alloc()
and begins to exercise the preferred rlist_alloc_ex() interface.
Problem: If rnode_match_validate() fails, rlist_copy_constraint()
returns an error without errno set. This is inconvenient because
functions like flux_respond_error() etc. expect a nonzero errno.

Set errno to EINVAL if rnode_match_validate() fails in
rlist_copy_constraint().
Problem: rlist_alloc() no longer has any callers

Rename rlist_alloc_ex() to rlist_alloc() and remove the latter.

Update all callers.
Problem: Jobspec v1 supports an exclusive flag for node exclusive
allocations, but rlist_alloc() does not have any support for an
exclusive flag.

Add an exclusive flag to `struct rlist_alloc_info`. Support exclusive
node allocation in rlist_alloc() via the new flag, but only when
there is an explicit request for a number of nodes.
Problem: None of the librlist unit tests exercise the exclusive node
allocation option of rlist_alloc().

Add a small set of exclusive node allocation tests to
librlist/test/rlist.c
Problem: Jobspec v1 supports an exclusive flag, but the libjj jobspec
reader doesn't support reading that flag from jobspec.

Add an exclusive flags in the "jj_counts" structure returned by
libjj_get_counts() and libjj_get_counts_json().
Problem: The sched-simple test program jj-reader doesn't emit the
exclusive flags along with the other data gleaned from the provided
test jobspec, and thus cannot be used to ensure the exclusive flag
is detected.

Print the value detected for the exclusive flag in the jj-reader
output. Update existing test.
Problem: Jobspec from_*_command() constructors do not have a way to
specify the exclusive flag to request nodes exclusively.

Add an exclusive flag to all constructors, which defaults to False.
Since the exclusive flag must be specified on a node resource vertex,
raise an error if the exclusive flag is requested without also
specifying an explicit node count.
Problem: The jobspec reader in sched-simple supports reading an
exclusive node flag from jobspec, but sched-simple doesn't pass this
flag to the librlist allocator, so effectively the exclusive flag is
not supported.

Pass the exclusive flag from jobspec to rlist_alloc() in sched-simple
so that exclusive node requests are supported.
Problem: Jobspec v1 supports an exclusive flag to request exclusive
allocation of nodes, but flux-mini commands have interface to set
this flag in generated jobspec.

Add an --exclusive flag to flux-mini which simply passses the flag
down to the various Jobspec from_command() constructors.
Problem: A test in the testsuite ensure that flux mini run --nodes=2
fails due to a "node count cannot be greater than task count" error,
but soon this usage will be allowed. (ntasks will default to nnodes
with exclusive flag set)

Remove the test to avoid breaking the testsuite.
Problem: It is inconvenient to require the specification of both ntasks
and nnodes when a user wants one task/slot per node. Until recently,
it was not possible to handle this in a coherent manner, though, so
the Python Jobspec class and flux-mini commands throw an error whenever
the node count is greater than the number of requested tasks/slots.

Now that node exclusivity can be set in the jobspec, though, it is
possible to set ntasks/slots to the number of nodes (when ntasks is
not explicitly set), by also defaulting the node exclusive flag to
True for this case.

This allows `flux mini run -N4 command` to work consistently regardless
of whether or not the enclosing instance defaults to node exclusive
allocation.

Fixes flux-framework#4228
Problem: No tests in the testsuite exercise the flux-mini --exclusive
flag or implicit exclusive node requests when --nodes is used without
--ntasks or --nslots.

Add a set of tests to the flux-mini sharness scripts that exercise
expected behavior for flux-mini --exclusive, both explicit and
implicit use.
Problem: The flux-mini(1) --exclusive option is not documented in
the man page.

Add documentation for the --exclusive option. Also mention that the
exclusive option will default to true when --nodes is used without
--ntasks or --nslots.
@codecov
Copy link

codecov bot commented Mar 31, 2022

Codecov Report

Merging #4245 (8f32ad3) into master (d483acd) will increase coverage by 0.03%.
The diff coverage is 87.12%.

@@            Coverage Diff             @@
##           master    #4245      +/-   ##
==========================================
+ Coverage   83.51%   83.55%   +0.03%     
==========================================
  Files         386      386              
  Lines       64659    64719      +60     
==========================================
+ Hits        54003    54077      +74     
+ Misses      10656    10642      -14     
Impacted Files Coverage Δ
src/modules/sched-simple/sched.c 77.87% <80.00%> (+0.36%) ⬆️
src/bindings/python/flux/job/Jobspec.py 83.24% <83.33%> (-0.09%) ⬇️
src/common/librlist/rlist.c 84.47% <86.74%> (-0.17%) ⬇️
src/cmd/flux-mini.py 92.18% <93.33%> (+0.14%) ⬆️
src/bindings/python/flux/resource/Rlist.py 94.25% <100.00%> (+0.06%) ⬆️
src/modules/sched-simple/libjj.c 82.05% <100.00%> (+0.71%) ⬆️
src/common/libpmi/pmi2.c 88.60% <0.00%> (-1.27%) ⬇️
src/common/libsubprocess/subprocess.c 88.54% <0.00%> (-0.46%) ⬇️
src/broker/overlay.c 86.82% <0.00%> (+0.10%) ⬆️
... and 10 more

@garlick
Copy link
Member

garlick commented Mar 31, 2022

MWP?

@grondo
Copy link
Contributor Author

grondo commented Mar 31, 2022

Yes, setting MWP now.

@mergify mergify bot merged commit 41cb8ea into flux-framework:master Mar 31, 2022
@grondo grondo deleted the node-exclusive branch September 12, 2022 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

set ntasks to nnodes if ntasks isn't given in flux mini command (minor issue)
4 participants