
set ntasks to nnodes if ntasks isn't given in flux mini command (minor issue) #4228

Closed · ryanday36 opened this issue Mar 16, 2022 · 36 comments · Fixed by #4245

Comments

@ryanday36

I noticed when testing node exclusive scheduling that flux mini run -N2 hostname produces an error:

[day36@fluke108:~]$ flux mini run -N 2 hostname
flux-mini: ERROR: node count must not be greater than task count
[day36@fluke108:~]$

This also occurs without node exclusive scheduling, but it seems more likely for users to hit it with node exclusive scheduling on. It seems like it would be reasonable for the task count to be set to the node count if it's not specified by the user.

@grondo (Contributor) commented Mar 16, 2022

I was going to suggest this same change.

It was an early requirement that the count of tasks (or slots for mini alloc and batch, I think) not default to the number of nodes for these utilities. It does seem like this approach is an undue burden when node exclusive scheduling is in effect. I think @trws and @garlick were the main proponents of requiring an explicit use of -n, so we should get their buy-in on changing this behavior.

We do also have the proposal in #4214, which, if enabled, might mean that users could not use flux mini run directly to run jobs in the system instance (or at least this would be discouraged, since I'm not sure if there is a way to definitively determine what tool was used to submit a jobspec). In that case, maybe we only need to make the change in flux mini alloc and flux mini batch.

@dongahn (Member) commented Mar 16, 2022

Yes, this is the same as flux mini run -N 2 -n1 hostname, hence the error message.

@grondo (Contributor) commented Mar 16, 2022

Yes, this is the same as flux mini run -N 2 -n1 hostname, hence the error message.

I don't think @ryanday36 was confused about the reason for the error message; the specific question here is whether we can default the number of requested tasks/slots to the number of nodes if the number of tasks is not explicitly given on the command line.

@garlick (Member) commented Mar 16, 2022

I guess it's fine.

I think it was me arguing that the task count, as a fundamental parameter of the parallel job (its size), should not be influenced by the quantity of a particular resource requested. But that's a bit pedantic, and if it's surprising to users I'll back down 😄. Maybe a compromise would be to have flux-mini print something on stderr when the task count is unspecified and is being set based on some heuristic?

@ryanday36 (Author)

Yeah. I'd argue that the current behavior still makes an assumption about the size of the parallel job. As @dongahn points out, it's just assuming that the job has a size of 1 task, which doesn't seem like as good an assumption as 1 task per node.

@grondo (Contributor) commented Mar 16, 2022

And particularly for flux mini alloc and flux mini batch, the -n parameter does not set the job "size"; that is done by a hint to the job shell in options.per-resource.type=node, which tells it to ignore the specified slots and instead run one application (in this case, flux broker) per node resource in R.
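
For illustration only, a sketch of roughly where such a hint might sit in the generated jobspec, assuming the usual attributes.system.shell.options location for shell options (the exact layout here is an assumption, not something stated in this thread):

attributes:
  system:
    shell:
      options:
        per-resource:
          type: node   # hint: run one application per node in R, ignoring the slot count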

@grondo (Contributor) commented Mar 16, 2022

Maybe a compromise would be to have flux-mini print something on stderr when task count is unspecified and it is being set based on some heuristic?

I would be fine with this, though my guess is that after a while users will complain about the pedantic error message and ask for it to be removed or for an option to suppress it, and then they'll ask for that option to be the default...

@dongahn (Member) commented Mar 16, 2022

FWIW, -N in flux is just one kind of higher-level resource constraint; in the future we may end up introducing things like -s for socket or even -R for rack constraints.

I think a design question is: when one or more higher-level constraints are given, should our front-end tool fill in the minimum satisfying lower-level constraints, or should we make that a user requirement?

At first glance, it seems hard for users to determine that minimum, so the system doing this for them makes sense to me. If users want to learn about this, perhaps our front-end tool can print something out under a debug or verbosity flag.

@garlick (Member) commented Mar 16, 2022

Sure, good point. I mainly wanted to say that this is not the sword I want to fall on today. Go ahead and make it better!

@trws (Member) commented Mar 16, 2022

So... ok. I'm trying to swap this context back in, but this is what I think I remember:

In run, the slot-oriented interface that still needs to be finished, the thought was that resources should normally be derived from tasks. If you want one node per task, you ask for one node per task and then specify the number of tasks.

In a slurm-like interface, users are used to asking for N of something and getting some number of tasks. Having it default to one per node when none of tasks, cores per task, tasks per node, etc. is specified seems like a reasonable thing to do to make folks comfortable.

The thing I would fight strongly is if anyone suggested allowing -N to mean cores sometimes, but not always. As far as I'm concerned N always means a node, a whole node, unless a number of cores or something else is specified to say otherwise.

@grondo (Contributor) commented Mar 16, 2022

The thing I would fight strongly is if anyone suggested allowing -N to mean cores sometimes, but not always. As far as I'm concerned N always means a node, a whole node, unless a number of cores or something else is specified to say otherwise.

Ok, well put @trws! This is indeed a strong argument against a default of number of tasks equaling number of nodes when tasks is not specified. This is because the default task slot for flux mini commands is 1 core, so transitively a request of -N without -n does mean that on a normal instance you would get a single core from each node, while on a node-exclusive instance you would get whole nodes, however many cores that may be.

I think this argument makes me reconsider @garlick's suggestion to emit an error message if we do change the default behavior.

I do wish there was some clean way where we could change the default based on configuration of the current Flux instance.
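
To make the "default task slot is 1 core" point concrete, here is a sketch, in the RFC 14 style used elsewhere in this thread, of roughly what flux mini run -N2 -n2 requests (the exact shape is an assumption):

resources:
- type: node
  count: 2
  with:
  - type: slot        # one task slot per node (-n 2 spread over -N 2)
    count: 1
    label: task
    with:
    - type: core      # default slot size: a single core
      count: 1

On a node-local scheduler this matches one core on each node; on a node-exclusive scheduler the same request matches two whole nodes, which is the inconsistency described above.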

@ryanday36 (Author) commented Mar 16, 2022

As far as I'm concerned N always means a node, a whole node, unless a number of cores or something else is specified to say otherwise.

I'm not sure that this really supports not making reasonable assumptions when the user specifies -N, but not -n. This seems to me more like an argument about what -c and -g should be if the user doesn't specify them. I.e. that

flux mini run -N 1

should be the same as

flux mini run -N 1 -n 1 -c ALL_CORES_ON_NODE -g ALL_GPUS_ON_NODE

rather than

flux mini run -N 1 -n 1 -c 1 -g 0

I could also get behind that, but it seems like a different question than what should happen with ntasks with flux mini run -N 2.

@dongahn (Member) commented Mar 16, 2022

-N<num> currently specifies nothing but the node-count constraint; it doesn't imply that num whole nodes are requested. If we wanted this to mean whole nodes, we would either need to change what it specifies or introduce another option to mean that.

@dongahn (Member) commented Mar 16, 2022

@ryanday36:

W/ node-exclusive scheduling,

flux mini run -N 1 == flux mini run -N1 -n1 -c ALL_CORES_ON_NODE -g ALL_GPUS_ON_NODE

W/ node-local scheduling

flux mini run -N 1 == flux mini run -N 1 -n1 -c1

The proposal is:

Can we extend this such that:

W/ node-exclusive scheduling,

flux mini run -N 2 == flux mini run -N2 -n2 -c ALL_CORES_ON_NODE -g ALL_GPUS_ON_NODE

W/ node-local scheduling

flux mini run -N 2 == flux mini run -N 2 -n2 -c1

which I think is reasonable. Am I missing something?

@dongahn (Member) commented Mar 16, 2022

This is because the default task slot for flux mini commands is 1 core, so transitively a request of -N without -n does mean that on a normal instance you would get a single core from each node, while on a node-exclusive instance you would get whole nodes, however many cores that may be.

I feel slow. I can't wrap my head around where this can become an issue. Do you have an example, @grondo? I can see why an explicitly given flux mini run -N2 -n 1 -c 1 should error out, but I have a hard time seeing the side effect of flux mini run -N2 implying flux mini run -N2 -n2 -c 1.

@ryanday36 (Author)

@dongahn, I think that sums up my initial argument, but I'm not sure it matches what @trws was suggesting. That seemed to me more like an argument that -N without -n, -c, or -g should give users all of the resources on the node regardless of node-exclusive vs. node-local. It seems more related to #3149 about whether there should be something like Slurm's --exclusive allocation flag. My argument here is just that flux mini * -N2 should assume that the user wants enough tasks to run the job rather than error out. I'm generally agnostic on the question of whether it should tell the user that it's giving them enough tasks, though I would lean in the direction that it is what the user probably expects, so we don't need to tell them.

@trws (Member) commented Mar 16, 2022

I was in fact arguing that if a user specifies --nodes 2 then they get all the resources in two nodes unless they explicitly specify that they want less.

I have met very, very few users who haven't directly worked on the source code of a resource manager that see -N and expect to get less than the full node. It's a very common error that causes a whole lot of problems when people go to run say -N 2 -n 4 and they get two tasks on each node, both of which are bound to a single core, and try to run 20 threads each.

It's not fun, and it's probably the worst thing about slurm CLI syntax: you can't know what -N means without reading system-level configs. There's nothing we can do about -n having different behavior with node exclusive and node local, but -N could at least be consistent.

As to how this maps in to the rest of the discussion, I think I'm saying that the default task slot should be:

  • 1 core by default
  • if --nodes is specified it becomes a node
  • if --cores-per-task is specified that overrides either of the previous

Then the question of whether -N 2 should also imply a minimum of two tasks is left. In slurm's docs, if you put -N4 -n 2 it will run it and just leave nodes idle; when -n is unspecified it increases tasks to match nodes. I would be alright with that, I think; it's intuitively what a user is likely to want if they haven't specified the number of tasks. That said, it makes things more complicated and harder to reason about, so leaving it a "don't do that, user" thing seems alright as well.

@dongahn (Member) commented Mar 16, 2022

Thanks @ryanday36.

My main argument would be that our front-end tools are in a way translators from command-line options to our jobspec, so the more consistently we use terms between the two, the less confusing it is.

For me, it has been pretty easy to reason about: -N num is simply a node-count constraint (as shown below), not implying exclusivity in and of itself.

- type: node
  count: 2

If you always want whole nodes exclusively, you would set exclusive: true in your jobspec (as below); likewise, you would need a new "exclusive" switch for flux mini.

- type: node
  count: 2
  exclusive: true

If "exclusive" is not added to the jobspec, the scheduler should decide whether the node should be allocated exclusively or not.

I fear that interpreting the -N option in different ways for different things can cause grave confusion at the scheduler-code level.

@dongahn (Member) commented Mar 16, 2022

@trws: that's a good explanation. I can see where you and @grondo are coming from much better now. Having said that, with the hierarchical nature of flux and the future need to support more resource types, I am unsure whether we want to model slurm's single-level, node-centric semantics here. If a flux instance is managing half of a node and the --nodes switch is used, we can't give the whole node. I know in this case you can give all of the half-node's resources, but node is just one type of high-level resource. How about sockets? How about blades? As we start to include special casing, things can quickly become unwieldy.

That said, it makes it more complicated and harder to reason about, so leaving it a "don't do that user" thing seems alright as well.

Maybe this is a good compromise.

@trws (Member) commented Mar 16, 2022

There are kind of two points here: that would be a fine compromise for the task issue, but on the -N issue it's a question of whether it should be consistent in the face of different system configurations or not. In my mind, it should be.

A user who writes their script on a normal cluster, which will almost certainly be node exclusive, then runs that script inside a sub-instance will be surprised and annoyed if -N does something different through no fault of theirs. If anything, I'm asking for node to be treated as much like other resources as possible. It's not possible to make it completely consistent because nodes (or accessible pieces of nodes, as you point out @dongahn) are not consistent; that's a big part of why a node is a crummy thing to ask for in many cases. That said, if a user asks for a socket, do we give them a core and call it good? What if they ask for 4 cores, do we give them four hyperthreads instead?

@dongahn (Member) commented Mar 16, 2022

on the -N issue it's a question of whether it should be consistent in the face of different system configurations or not.

In general (and as you noted), this cannot be consistent, given Flux's hierarchical design and scheduler specialization.

A jobspec should express a set of constraints, and it is up to the scheduler to decide how to apply these constraints to assign resource sets according to its policy.

As you alluded to: say the user's script at the system instance uses flux mini run -N 2 app and the scheduler gives it two whole nodes. If the user runs the same script on a sub-instance (ensemble) where each broker only has half of a node, this command will run app across a completely different number of cores.

That's part of my reasoning for why we shouldn't treat node as a special case, hence my comment that "Maybe this is a good compromise."

That said, if a user asks for a socket, do we give them a core and call it good?

I'd argue again that we do this consistently: the user specifies a constraint, and it's up to the scheduler to resolve the constraint according to its policy.

-s 1 -c 1 will give 1 core if the scheduler is a node-local scheduler.

If the policy is "socket exclusive scheduling", it will give all cores underneath it.
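
A sketch of what a hypothetical -s 1 -c 1 might translate to, assuming a socket level in the resource section (the -s option itself does not exist today and is used here only for illustration):

resources:
- type: socket
  count: 1
  with:
  - type: slot
    count: 1
    label: task
    with:
    - type: core      # node-local policy: just this core;
      count: 1        # socket-exclusive policy: every core under the socket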

@ryanday36 (Author)

I'm definitely agreeing with @trws more and more here. Can the principle maybe be generalized as saying that an instance should get all of the resources that are children of the resources that the user asks for unless the user explicitly asks for fewer child resources? That is probably closest to what people would intuitively expect.

I do also wonder if I've chosen a bad example by starting this out asking about flux mini run and tasks. It's probably pretty unlikely that a user will be trying to ask for multiple nodes without specifying how many tasks they expect to run on those nodes. The more likely real-world use case is probably something like a user who is trying to interactively debug some MPI problem and wants an allocation (flux instance) with two nodes. They might plan to run multiple flux mini run commands inside of that instance with different combinations of nodes, tasks, cores, etc., so it's natural for them to only want to specify the highest level of resource that they need and just run flux mini alloc -N 2. If we are going to require that the user also specify -n, -c, etc. if they want all of the resources, we should probably at least provide a shorthand way of doing that so that users don't have to know the underlying details of how many cores per socket, sockets per node, nodes per rack, etc. there are on a given cluster.

@trws (Member) commented Mar 16, 2022

(Quoting @dongahn's earlier comment above about treating -N num as a plain node-count constraint and using exclusive: true in the jobspec when whole nodes are wanted.)

I think I see where some of the confusion and worry is here.

Exclusivity, unless something has changed, was default true inside the slot and default false outside it last I recall (much like you're saying @ryanday36, if a slot contains an instance of something the user gets the whole instance), so that may be the only difference in concept here. If you're considering -N to be outside the slot, then yes it wouldn't be exclusive, but I think that should only be the case if something more specific or smaller is specified with say -c. Looking back at jobspec v1, slot can only be inside the node level right now for some reason, which I didn't incorporate into my mental model.

Short jobspec translations of the way my brain interprets these:

run -N 2: Slot>Node[2] or Slot[2]>Node
run -N 2 -n 2 -c 2: Node[2]>Slot>Core[2]

Due to the limitations of jobspec v1, the former would have to be expressed as something like:

resources:
- type: node
  exclusive: true
  count: 2
  with:
    - type: slot
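
For contrast, the Slot[2]>Node form above might be written in canonical (non-V1) jobspec, where a slot may sit above the node level, roughly like this sketch (an assumption about the syntax, not something shown in the thread):

resources:
- type: slot          # the task slot is a whole node
  count: 2
  label: task
  with:
  - type: node
    count: 1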

The scheduler should never, ever, see -N. It should see whatever that gets transformed into by the command line interface. This is part of the problem with modeling this on slurm's interface: it's fundamentally inconsistent depending on the configuration or the combination of options. If the preference is to do exactly what slurm does, then there's an argument to be made for that, but if it's going to be different, I'd prefer to be different in the direction of more rather than less consistency.

As to your last point @ryanday36 (hi BTW, not sure we've spoken much 👋 ) I agree about shorthand. If the behavior of -N by itself has to be inconsistent, then --exclusive has to be there. It's the only way I retain my sanity working with slurm, and it's the normal way I tell people to use slurm whenever possible, because it's the only way to get any kind of consistency out of the interface.

@dongahn (Member) commented Mar 17, 2022

The scheduler should never, ever, see -N. It should see whatever that gets transformed into by the command line interface.

Yup :-) Good that the scheduler only has to deal with the RFC 14 spec.

@dongahn (Member) commented Mar 17, 2022

Sorry -- something still bothers me and I need a bit more convincing

run -N 2: Slot>Node[2] or Slot[2]>Node
run -N 2 -n 2 -c 2: Node[2]>Slot>Core[2]

Here, -N 2 is interpreted in two different ways, and I can't see why this would make our system more consistent. In a heterogeneous environment like Corona, where the per-node core count differs, how would the user specify node exclusivity while using the node-local resource shape as the selection criteria?

Say we have three nodes: the first node with 1 core, the second with 2 cores, and the third with 3 cores, and the user wants to get a whole node with at least 2 cores, so only the second or third node should be exclusively allocated.

run -N1 -n1 -c2

won't give the whole node under these semantics. The user can throw --exclusive or -Nx to make it happen. Why should run -N be different?

To me, keeping the semantics of -N the same regardless of other options seems more consistent and easier to reason about.

Having said that, I understand you want the most common use case of run -N to just work. One possibility is to support this through an additional exclusive option, or run -Nx<num> if needed.

@grondo (Contributor) commented Mar 17, 2022

Due to the limitations of jobspec v1, the former would have to be expressed as something like:

resources:
- type: node
  exclusive: true
  count: 2
  with:
    - type: slot

This would be true, except Jobspec V1 does not support exclusive at this time. This was probably a mistake, but a forgivable one because there were more fundamental things we had to work on first...

won't give the whole node under these semantics. The user can throw --exclusive or -Nx

-Nx isn't valid option-parsing syntax, so that won't work. To support it either way, we're going to have to introduce exclusive support into jobspec v1.

Here's a dumb proposal that may work for the mini commands for now:

  • Introduce exclusive support in RFC 25, optionally supported by the scheduler.
  • If only -N is specified, then set the exclusive flag on the node (a sketch of the resulting jobspec follows this list).
  • Reject jobs in the scheduler feasibility check if exclusive is set and the scheduler is not set to node exclusive, or cannot otherwise make an exclusive allocation (e.g. sched-simple would always reject exclusive requests). Thus, without node exclusive allocation enabled, flux mini run -N NNODES ... would fail as it does now.
  • If at some point exclusive works, then flux mini run -N nnodes will work the same whether node exclusive scheduling is enabled by default or not, which I think satisfies our main concern here?
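
As referenced in the second bullet, a sketch of roughly what flux mini run -N2 (with no -n) might generate under this proposal, assuming exclusive is added to RFC 25 as a boolean on the node vertex:

resources:
- type: node
  count: 2
  exclusive: true     # set because only -N was given
  with:
  - type: slot        # ntasks defaulted to the node count
    count: 1
    label: task
    with:
    - type: core
      count: 1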

@dongahn (Member) commented Mar 17, 2022

@grondo: Thank you for turning this into an actionable proposal. I can be convinced if other folks think -N alone should imply --exclusive, but as I posted above, the following modification is an alternative.

If only -N is specified, then set the exclusive flag on the node

If users want exclusive nodes, let them say so: -N 1 --exclusive.

If users want to get a whole node with at least 2 cores in a heterogeneous environment like the one above, let them say so: -N1 --exclusive -n1 -c2 (see the sketch below).
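
A sketch of the jobspec that the -N1 --exclusive -n1 -c2 form might translate to, again assuming an exclusive boolean on the node vertex:

resources:
- type: node
  count: 1
  exclusive: true     # whole node requested explicitly
  with:
  - type: slot
    count: 1
    label: task
    with:
    - type: core      # only nodes with at least 2 cores can match
      count: 2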

@trws (Member) commented Mar 17, 2022

It satisfies my concern. To your point @dongahn, I think this is key:

To me, keeping the semantics of -N the same regardless of other options seems more consistent and easier to reason about.

For me, as a user, I consider an interface to be more consistent if my command is consistently evaluated to the same thing regardless of configuration I can't see. That remains true even if arguments I specify and can see on the command line change the behavior of other such arguments.

That's my opinion, not an immutable truth, but it's where I'm coming from.

Now, the options we've been discussing give these behaviors if I understand them correctly. If not, feel free to correct:

  • -N is always "number of nodes, does not impact exclusivity or cores"
    • Exclusive Mode:
      • -N 2: all cores on two nodes
      • -N 2 -c1: one core each on two nodes
      • -N 2 -c1 -n 4: two cores each on two nodes
      • -N 2 -n 4: all cores on two nodes, split half and half between tasks
    • Local Mode:
      • -N 2: one core each on two nodes
      • -N 2 -c1: one core each on two nodes
      • -N 2 -c1 -n 4: two cores each on two nodes
      • -N 2 -n 4: two cores each on two nodes
  • -N implies exclusive while alone, not while together with -c
    • Exclusive Mode:
      • -N 2: all cores on two nodes
      • -N 2 -c1: one core each on two nodes
      • -N 2 -c1 -n 4: two cores each on two nodes
      • -N 2 -n 4: all cores on two nodes, split half and half between tasks
    • Local Mode:
      • -N 2: all cores on two nodes
      • -N 2 -c1: one core each on two nodes
      • -N 2 -c1 -n 4: two cores each on two nodes
      • -N 2 -n 4: all cores on two nodes, split half and half between tasks

Note that the lower pair match each other 100%, regardless of system
configuration. The upper pair are more internally consistent while in local
mode, but not in exclusive mode, in my opinion. If we allow adding in
--exclusive, then either set can express most things, there's just no way to
request less than a node with system-level exclusive configuration, but that's
fine since that's the policy we want.

One other thing here, is -N2 -c1 -n2 allowed to put both on the same node? Not
sure the jobspec representation explicitly specifies either way.

Either way around, it sounds like bullet one from @grondo's proposal is a must.
My preference is for taking the rest as well, but if at least the option is
there then it's tolerable.

@grondo (Contributor) commented Mar 17, 2022

FWIW, I am in total agreement with @trws as he summarized excellently in the previous comment.

If users want nodes exclusive, let them specify so. So -N 1 --exclusive

Well, I agree with @trws here: a naive user might then think that without --exclusive they will get a non-exclusive allocation, even on a system with a lonodex or hinodex configuration.

There is also a limitation of the flux mini command line interface. If a user specifies --exclusive on the command line, to what does it apply? I think at some point we have to realize that a strict, simplified command line interface as presented by flux mini is not going to be able to satisfy all objectives here.

I think the best we can hope for is some consistency, which I agree we do not have in the current situation.

Also, nothing prevents us from adding an --exclusive flag to the flux mini tools, perhaps documenting that all it does is force the exclusive flag on the node resource in these tools.

@dongahn: If we add exclusive to the Jobspec V1 RFC, perhaps only allowed at the node resource, would it be supported in instances without lonodex or hinodex match policies?

I think for sched-simple (which is really only used in testing), we would reject these jobs as infeasible (though maybe it would be easy to support exclusive node allocations, but I'm not sure I see the point). If required, Fluxion could do the same if node exclusive matching policy is not in effect?

This may be a small step forward, but I think our users would thank us for it.

@dongahn (Member) commented Mar 17, 2022

@trws and @grondo: Thanks. I have much more clarity on this issue.

Just to add a bit more detail to show where I'm coming from.

@trws:

  • -N is always "number of nodes, does not impact exclusivity or cores"
    • Exclusive Mode:
      • -N 2: all cores on two nodes
      • -N 2 -c1: one core each on two nodes
      • -N 2 -c1 -n 4: two cores each on two nodes
      • -N 2 -n 4: all cores on two nodes, split half and half between tasks
    • Local Mode:
      • -N 2: one core each on two nodes
      • -N 2 -c1: one core each on two nodes
      • -N 2 -c1 -n 4: two cores each on two nodes
      • -N 2 -n 4: two cores each on two nodes

=> My understanding is:

  • -N is always "number of nodes, does not impact exclusivity or cores"
    • Exclusive Mode:
      • -N 2: all cores on two nodes (assuming this implies -n 2)
      • -N 2 -c1: all cores on two nodes each node with at least one core (assuming this implies -n 2)
      • -N 2 -c1 -n 4: all cores on two nodes each node with at least two cores (because 2 slots must spread across 2 nodes)
      • -N 2 -n 4: all cores on two nodes, each node with at least 2 cores (because this is equal as above)
    • Local Mode:
      • -N 2: one core each on two nodes
      • -N 2 -c1: one core each on two nodes
      • -N 2 -c1 -n 4: two cores each on two nodes
      • -N 2 -n 4: two cores each on two nodes

-N implies exclusive while alone, not while together with -c

  • Exclusive Mode:

    • -N 2: all cores on two nodes
    • -N 2 -c1: one core each on two nodes
    • -N 2 -c1 -n 4: two cores each on two nodes
    • -N 2 -n 4: all cores on two nodes, split half and half between tasks
  • Local Mode:

    • -N 2: all cores on two nodes
    • -N 2 -c1: one core each on two nodes
    • -N 2 -c1 -n 4: two cores each on two nodes
    • -N 2 -n 4: all cores on two nodes, split half and half between tasks

=> My understanding is:
-N implies exclusive while alone, not while together with -c

  • Exclusive Mode:

    • -N 2: all cores on two nodes
    • -N 2 -c1: all cores on two nodes each node with at least one core (assuming this implies -n 2)
    • -N 2 -c1 -n 4: all cores on two nodes each with at least two cores (because 2 slots must spread across 2 nodes)
    • -N 2 -n 4: all cores on two nodes, each node with at least 2 cores (because this is equal as above)
  • Local Mode:

    • -N 2: all cores on two nodes
    • -N 2 -c1: one core each on two nodes
    • -N 2 -c1 -n 4: two cores each on two nodes
    • -N 2 -n 4: all cores on two nodes, split half and half between tasks

There really isn't a practical way we can get consistent behavior independent of system configurations. If users want consistency, they will have to be more explicit about exclusivity and let the scheduler decide whether to satisfy or reject.

@dongahn (Member) commented Mar 17, 2022

I think the best we can hope for is some consistency, which I agree we do not have in the current situation.

I agree. Perhaps the argument should be "what makes users happier" when we can't provide a single consistent solution.

@dongahn (Member) commented Mar 17, 2022

@dongahn: If we add exclusive to the Jobspec V1 RFC, perhaps only allowed at the node resource, would it be supported in instances without lonodex or hinodex match policies?

Yes, I think fluxion can do this. Let me test. We don't have "emit shadow resource" support in this case, but at least we can give exclusive access.

@dongahn (Member) commented Mar 17, 2022

OK. I confirmed this is supported (with no shadowed resource emission support).

t.json is -N2 -n2 -c 1 with no node exclusivity. t2.json is with node exclusivity. Using the default first policy.

ahn1@docker-desktop:/usr/src$ flux start -s 4
ahn1@docker-desktop:/usr/src$ flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
      free      4       24        0 docker-desktop,docker-desktop,docker-desktop,docker-desktop
 allocated      0        0        0
      down      0        0        0

ahn1@docker-desktop:/usr/src$ flux job submit t2.json
ƒpuUB2BR
ahn1@docker-desktop:/usr/src$ flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
      free      2       12        0 docker-desktop,docker-desktop
 allocated      2       12        0 docker-desktop,docker-desktop
      down      0        0        0

ahn1@docker-desktop:/usr/src$ flux job submit t.json
ƒwX7hApw
ahn1@docker-desktop:/usr/src$ flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
      free      2       10        0 docker-desktop,docker-desktop
 allocated      4       14        0 docker-desktop,docker-desktop,docker-desktop,docker-desktop
      down      0        0        0
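
The jobspec files themselves were not posted; a t2.json along those lines (-N2 -n2 -c1 with node exclusivity) might look roughly like the following sketch, shown as YAML for readability even though the submitted file is JSON, and with a placeholder command:

version: 1
resources:
- type: node
  count: 2
  exclusive: true
  with:
  - type: slot
    count: 1
    label: task
    with:
    - type: core
      count: 1
tasks:
- command: ["hostname"]   # placeholder
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 0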

@trws (Member) commented Mar 17, 2022

=> My understanding is:

* -N is always "number of nodes, does not impact exclusivity or cores"
  
  * Exclusive Mode:
    
    * -N 2: all cores on two nodes (assuming this implies -n 2)
    * -N 2 -c1: all cores on two nodes each node with at least one core (assuming this implies -n 2)

It takes the whole node, certainly, but if "cores-per-task" is 1, then the task will be run with single-core containment anyway right? From the scheduler perspective it's the whole node, but not from a user/binding perspective.

    * -N 2 -c1 -n 4: all cores on two nodes each node with at least two cores (because 2 slots must spread across 2 nodes)
    * -N 2 -n 4: all cores on two nodes, each node with at least 2 cores (because this is equal as above)
  * Local Mode:
    
    * -N 2: one core each on two nodes
    * -N 2 -c1: one core each on two nodes
    * -N 2 -c1 -n 4: two cores each on two nodes
    * -N 2 -n 4: two cores each on two nodes

-N implies exclusive while alone, not while together with -c

  • Exclusive Mode:

    • -N 2: all cores on two nodes
    • -N 2 -c1: one core each on two nodes
    • -N 2 -c1 -n 4: two cores each on two nodes
    • -N 2 -n 4: all cores on two nodes, split half and half between tasks
  • Local Mode:

    • -N 2: all cores on two nodes
    • -N 2 -c1: one core each on two nodes
    • -N 2 -c1 -n 4: two cores each on two nodes
    • -N 2 -n 4: all cores on two nodes, split half and half between tasks

=> My understanding is: -N implies exclusive while alone, not while together with -c

* Exclusive Mode:
  
  * -N 2: all cores on two nodes
  * -N 2 -c1: all cores on two nodes each node with at least one core (assuming this implies -n 2)

Same here: as far as the user is concerned, they get one core each on two nodes. Whether the whole node was allocated to them is potentially a cost issue, but the user's job runs the same here and below.

  * -N 2 -c1 -n 4: all cores on two nodes each with at least two cores (because 2 slots must spread across 2 nodes)
  * -N 2 -n 4: all cores on two nodes, each node with at least 2 cores (because this is equal as above)

* Local Mode:
  
  * -N 2: all cores on two nodes
  * -N 2 -c1: one core each on two nodes
  * -N 2 -c1 -n 4: two cores each on two nodes
  * -N 2 -n 4: all cores on two nodes, split half and half between tasks

There really isn't a practical way we can get consistent behavior independent of system configurations. If users want consistency, they will have to be more explicit about exclusivity and let the scheduler decide whether to satisfy or reject.

Yup, I completely agree. That's a big part of why I don't like these flags much.

@dongahn (Member) commented Mar 17, 2022

It takes the whole node, certainly, but if "cores-per-task" is 1, then the task will be run with single-core containment anyway right? From the scheduler perspective it's the whole node, but not from a user/binding perspective.

In a system instance, the command will be flux mini batch -N1 -c1 -n 4 ..., which will get a whole node. Then, within the allocation, you will have flux mini submit -N1 -c1 -n 4 ... or flux mini run -N1 -c1 -n 4 ..., and that will use node-local semantics (unless you specialize the scheduler policy with lonodex or hinodex).

If we allowed flux mini submit to be directly submitted to our system instance (are we planning for this use case, @grondo? I thought we weren't because it would thrash the I/O of the system instance too much), the scheduler will give the whole node, and it is up to the execution system to do containment and binding. Yes, I think the execution system will do what you are describing, unless I'm mistaken.

@grondo (Contributor) commented Mar 17, 2022

If we allowed flux mini submit to be directly submitted to our system instance (are we planning for this use case, @grondo? I thought we weren't because it would thrash the I/O of the system instance too much),

We have an issue open on it: #4214. Unfortunately, I can't think of a way to really determine the difference between flux mini submit|run vs flux mini alloc|batch, or even flux job submit of an arbitrary jobspec. We may have to use a heuristic or some other method if we do want to limit this usage.

grondo added this to the flux-core v0.38.0 milestone Mar 29, 2022
grondo added a commit to grondo/flux-core that referenced this issue Mar 29, 2022
Problem: It is inconvenient to require the specification of both ntasks
and nnodes when a user wants one task/slot per node. Until recently,
it was not possible to handle this in a coherent manner, though, so
the Python Jobspec class and flux-mini commands throw an error whenever
the node count is greater than the number of requested tasks/slots.

Now that node exclusivity can be set in the jobspec, though, it is
possible to set ntasks/slots to the number of nodes (when ntasks is
not explicitly set), by also defaulting the node exclusive flag to
True for this case.

This allows `flux mini run -N4 command` to work consistently regardless
of whether or not the enclosing instance defaults to node exclusive
allocation.

Fixes flux-framework#4228
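
In effect, after this change flux mini run -N4 command is translated along the lines of the proposal sketch earlier in the thread, now with ntasks defaulted to the node count (a sketch; the exact shape is an assumption):

resources:
- type: node
  count: 4
  exclusive: true     # defaulted because -n was not given
  with:
  - type: slot        # ntasks defaulted to 4, one per node
    count: 1
    label: task
    with:
    - type: core
      count: 1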
grondo added a commit to grondo/flux-core that referenced this issue Mar 29, 2022 (same commit message as above)
grondo added a commit to grondo/flux-core that referenced this issue Mar 30, 2022 (same commit message as above)
mergify bot closed this as completed in #4245 Mar 31, 2022