set ntasks to nnodes if ntasks isn't given in flux mini command (minor issue) #4228
I was going to suggest this same change. It was an early requirement that the count of tasks (or slots) be given explicitly. We do also have the proposal in #4214, which, if enabled, might mean that users could not use
Yes, this is the same as
I don't think @ryanday36 was confused by the reason for the error message; the specific question here is whether we can default the number of requested tasks/slots to the number of nodes if the number of tasks is not explicitly given on the command line.
I guess it's fine. I think it was me arguing that the task count, as a fundamental parameter of the parallel job (its size), should not be influenced by the quantity of a particular resource requested. But that's a bit pedantic, and if it is surprising to users I'll back down 😄. Maybe a compromise would be to have the tool emit an error message in this case.
Yeah. I'd argue that the current behavior still makes an assumption about the size of the parallel job. As @dongahn points out, it's just assuming that the job has a size of 1 task, which doesn't seem like as good an assumption as 1 task per node.
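To make the contrast concrete, here is a rough RFC 14-style sketch (hypothetical, not the literal output of any tool) of the "1 task per node" default for something like `flux mini run -N2 hostname`:

```yaml
# Hypothetical sketch: one single-core task slot under each of the
# two requested nodes, i.e. "-N2" implies two tasks.
resources:
  - type: node
    count: 2
    with:
      - type: slot
        count: 1
        label: task
        with:
          - type: core
            count: 1
tasks:
  - command: ["hostname"]
    slot: task
    count:
      per_slot: 1
```

The competing default of a single task overall would instead request just one slot, which is exactly the nnodes > ntasks case the current check rejects.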
and particularly for
I would be fine with this, though my guess is that after a while users will complain about the pedantic error message and ask for it to be removed or for an option to suppress it, and then they'll ask for that option to be the default...
FWIW, I think a design question is: when one or more higher-level constraints are given, should our front-end tool fill in the minimum satisfying lower-level constraints, or should we make that a user requirement? At first glance, it will become harder for users to determine the minimum, so the system doing this for them makes sense to me. If users want to learn about this, perhaps our front-end tool can print something out under a debug or verbosity flag.
Sure, good point. I mainly wanted to say that this is not the sword I want to fall on today. Go ahead and make it better!
So... ok. I'm trying to swap this context back in, but this is what I think I remember: in run, the slot-oriented interface that still needs to get finished, the thought was that normally resources should be derived from tasks. If you want one node per task, you ask for one node per task and then specify tasks. In a slurm-like interface, users are used to asking for N of something and getting some number of tasks. Having it default to one per node when none of tasks, cores per task, tasks per node, etc. are specified seems like a reasonable thing to do to make folks comfortable. The thing I would fight strongly against is if anyone suggested allowing
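A sketch of that slot-oriented reading (my illustration under stated assumptions: this uses the general RFC 14 resource shape, where a slot may contain a node, rather than anything the current flux-mini tools emit; the task count of 4 and the command name are made up):

```yaml
# Illustrative only: "one node per task" expressed resources-from-tasks,
# i.e. each task slot contains an entire node and the number of tasks
# drives the number of slots requested.
resources:
  - type: slot
    count: 4            # hypothetical: driven by the requested task count
    label: task
    with:
      - type: node
        count: 1
tasks:
  - command: ["app"]    # hypothetical command
    slot: task
    count:
      per_slot: 1
```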
Ok, well put @trws! This is indeed a strong argument against a default of the number of tasks equaling the number of nodes when tasks is not specified. This is because the default task slot for

I think this argument makes me reconsider @garlick's suggestion to emit an error message if we do change the default behavior. I do wish there was some clean way we could change the default based on the configuration of the current Flux instance.
I'm not sure that this really supports not making reasonable assumptions when the user specifies -N but not -n. This seems to me more like an argument about what -c and -g should be if the user doesn't specify them, i.e. that

should be the same as

rather than
I could also get behind that, but it seems like a different question than what should happen with ntasks with
W/ node-exclusive scheduling,

W/ node-local scheduling

The proposal is: can we extend this such that:

W/ node-exclusive scheduling,

W/ node-local scheduling

which I think is reasonable. Am I missing something?
I feel slow. I can't wrap my head around where this can become an issue. Do you have an example, @grondo? I can see why explicitly given
@dongahn, I think that sums up my initial argument, but I'm not sure it matches what @trws was suggesting. That seemed to me more like an argument that -N without -n, -c, or -g should give users all of the resources on the node regardless of node-exclusive vs. node-local scheduling. It seems more related to #3149, about whether there should be something like Slurm's
I was in fact arguing that if a user specifies

I have met very, very few users who haven't directly worked on the source code of a resource manager who see

It's not fun, and is probably the worst thing about slurm CLI syntax: you can't know what

As to how this maps into the rest of the discussion, I think I'm saying that the default task slot should be:
Then the question of whether
Thanks @ryanday36. My main argument would be that our front-end tools are in a way a translator that translates command line options to our jobspec, so the more consistently we use terms between the two, the less confusing it is. For me, it has been pretty easy to reason about: using

```yaml
- type: node
  count: 2
```

If you always want the whole nodes exclusively, you would set

```yaml
- type: node
  count: 2
  exclusive: true
```

If "exclusive" is not added to the jobspec, the scheduler should decide whether the node should be allocated exclusively or not. I fear interpreting
@trws: that's a good explanation. I can see where you and @grondo are coming from much better now. Having said that, given the hierarchical nature of Flux and the future need to support more resource types, I am unsure we want to model Slurm's single-level, node-centric semantics here, though. If a Flux instance is managing half of a node and the --nodes switch is used, we can't give the whole node. I know in this case you can give the whole half-node's resources, but node is just one type of high-level resource. How about sockets? How about blades? As we start to include special cases, things can quickly become unwieldy.

Maybe this is a good compromise.
There are kinda two points here: that would be a fine compromise for the task issue, but on the

A user that writes their script on a normal cluster, which will almost certainly be node-exclusive, then runs that script inside a sub-instance will be surprised and annoyed if
In general (and as you noted), this cannot be consistent, because of Flux's hierarchical design and scheduler specialization. A jobspec should express a set of constraints, and it is up to the scheduler to decide how to apply these constraints to assign resource sets according to its policy. As you alluded: say the user's script at the system instance uses

That's part of my reason why we shouldn't treat node as a special case. So my comment on "Maybe this is a good compromise."
I'd argue again that we do this consistently: the user specifies a constraint and it's up to the scheduler to resolve the constraint according to its policy.
If the policy is "socket exclusive scheduling", it will give all cores underneath it.
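As a purely illustrative sketch of that idea (my assumption: a scheduler and jobspec dialect that understand socket as a resource type -- this is the general RFC 14 shape rather than jobspec V1, and nothing here is emitted by the current tools), the user states only the socket-level constraint and the policy decides how much falls out underneath it:

```yaml
# Hypothetical: request one socket-level slot; under a "socket exclusive"
# policy the scheduler would hand back every core under the chosen socket.
resources:
  - type: socket
    count: 1
    with:
      - type: slot
        count: 1
        label: task
        with:
          - type: core
            count: 1      # stated minimum; the policy may grant more
```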
I'm definitely agreeing with @trws more and more here. Can the principle maybe be generalized as saying that an instance should get all of the resources that are children of the resources that the user asks for, unless the user explicitly asks for fewer child resources? That is probably closest to what people would intuitively expect. I do also wonder if I've chosen a bad example by starting this out asking about
I think I see where some of the confusion and worry is here. Exclusivity, unless something has changed, was default true inside the slot and default false outside it, last I recall (much like you're saying @ryanday36: if a slot contains an instance of something, the user gets the whole instance), so that may be the only difference in concept here. If you're considering

Short jobspec translations of the way my brain interprets these:
Due to the limitations of jobspec v1, the former would have to be expressed as something like:

```yaml
resources:
  - type: node
    exclusive: true
    count: 2
    with:
      - type: slot
```

The scheduler should never, ever, see -N. It should see whatever that gets transformed into by the command line interface. This is part of the problem with modeling this on slurm's interface: it's fundamentally inconsistent depending on the configuration or the combination of options. If the preference is to do exactly what slurm does, then there's an argument to be made for that, but if it's going to be different, I'd prefer to be different in the direction of more rather than less consistency.

As to your last point @ryanday36 (hi BTW, not sure we've spoken much 👋), I agree about shorthand. If the behavior of
Yup :-) Good that the scheduler only has to deal with the RFC 14 spec.
Sorry -- something still bothers me and I need a bit more convincing
Here, -N 2 is interpreted in two different ways, and I can't see why this would make our system more consistent. In a heterogeneous environment like Corona, where the core count per node differs, how would the user specify node exclusivity while using the node-local resource shape as the selection criterion? Say we have three nodes, the first node with 1 core, the second with 2 cores and the third with 3 cores, and the user wants to get a whole node with at least 2 cores, so only the second or third node should be exclusively allocated.

won't give the whole node under these semantics. The user can throw

To me, keeping the semantics of -N the same regardless of other options seems more consistent and easier to reason about. Having said that, I understand you want the most common use case of
This would be true, except Jobspec V1 does not support
Here's a dumb proposal that may work for now: the
@grondo: Thank you for turning this into an actionable proposal. I can be convinced if other folks find
If users want exclusive nodes, let them specify so.

If users want to get a whole node with at least 2 cores in a heterogeneous environment like the one here, let them specify so.
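Sticking with the heterogeneous example above, a rough sketch of what "a whole node with at least 2 cores" could look like once an exclusive flag is available in the jobspec (my assumed form; as noted earlier in the thread, jobspec V1 did not yet support this at the time of the discussion):

```yaml
# Hypothetical: the user states both constraints explicitly --
# exclusivity on the node and a minimum of 2 cores beneath it.
# The scheduler may then pick any node satisfying both
# (in the example above, the 2-core or 3-core node).
resources:
  - type: node
    count: 1
    exclusive: true
    with:
      - type: slot
        count: 1
        label: task
        with:
          - type: core
            count: 2
```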
It satisfies my concern. To your point @dongahn, I think this is key:
For me, as a user, I consider an interface to be more consistent if my command is consistently evaluated to the same thing regardless of configuration I can't see. That remains true even if arguments I specify and can see on the command line change the behavior of other such arguments. That's my opinion, not an immutable truth, but it's where I'm coming from. Now, the options we've been discussing give these behaviors if I understand them correctly. If not, feel free to correct:
Note that the lower pair match each other 100%, regardless of system configuration.

One other thing here is

Either way around, it sounds like bullet one from @grondo's proposal is a must.
FWIW, I am in total agreement with @trws as he summarized excellently in the previous comment.
Well, I agree with @trws here: then a naive user might think that without

There is also a limitation of the

I think the best we can hope for is some consistency, which I agree we do not have in the current situation. Also, nothing prevents us from also adding an

@dongahn: If we add

I think for sched-simple (which is really only used in testing), we would reject these jobs as infeasible (though maybe it would be easy to support exclusive node allocations, but I'm not sure I see the point). If required, Fluxion could do the same if a node-exclusive matching policy is not in effect? This may be a small step forward, but I think our users would thank us for it.
@trws and @grondo: Thanks. I have much more clarity on this issue. Just to add a bit more detail to show where I'm coming from.
=> My understanding is:
=> My understanding is:
There really isn't a practical way we can get consistent behavior independent of system configurations. If users want consistency, they will have to be more explicit about exclusivity and let the scheduler decide whether to satisfy or reject.
I agree. Perhaps the argument should be "what makes users happier" when we can't provide a single consistent solution.
Yes, I think fluxion can do this. Let me test. We don't have `emit shadow resource` support in this case, but at least we can give exclusive access.
OK. I confirmed this is supported (with no shadowed resource emission support).
It takes the whole node, certainly, but if "cores-per-task" is 1, then the task will be run with single-core containment anyway, right? From the scheduler's perspective it's the whole node, but not from a user/binding perspective.
Same here: as far as the user is concerned, they get one core each on two nodes. Whether the whole node was allocated to them is potentially a cost issue, but the user's job runs the same here and below.
Yup, I completely agree. That's a big part of why I don't like these flags much.
In a system instance, the command will be

If we allowed
We have an issue open on it: #4214. Unfortunately, I can't think of a way to really determine the difference between
Problem: It is inconvenient to require the specification of both ntasks and nnodes when a user wants one task/slot per node. Until recently, it was not possible to handle this in a coherent manner, though, so the Python Jobspec class and flux-mini commands throw an error whenever the node count is greater than the number of requested tasks/slots. Now that node exclusivity can be set in the jobspec, though, it is possible to set ntasks/slots to the number of nodes (when ntasks is not explicitly set), by also defaulting the node exclusive flag to True for this case. This allows `flux mini run -N4 command` to work consistently regardless of whether or not the enclosing instance defaults to node exclusive allocation. Fixes flux-framework#4228
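To illustrate the behavior this change describes, here is a rough sketch (assumed shape based on the description above, not the literal output of the tools) of the jobspec request that `flux mini run -N4 command` would now imply: four exclusive nodes, with one slot and one task per node.

```yaml
# Assumed sketch: ntasks defaults to nnodes (4) and the nodes are marked
# exclusive, so each node carries one single-core slot running one task.
version: 1
resources:
  - type: node
    count: 4
    exclusive: true
    with:
      - type: slot
        count: 1
        label: task
        with:
          - type: core
            count: 1
tasks:
  - command: ["command"]
    slot: task
    count:
      per_slot: 1
```

Because the nodes are requested exclusively, the result is the same whether or not the enclosing instance defaults to node-exclusive allocation, which is the consistency property the commit message calls out.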
I noticed when testing node exclusive scheduling that

```
flux mini run -N2 hostname
```

produces an error:

This also occurs without node exclusive scheduling, but it seems more likely for users to hit it with node exclusive scheduling on. It seems like it would be reasonable for the task count to be set to the node count if it's not specified by the user.