enforcing per-user limits #121
Comments
Max submitted jobs per user could be enforced by a jobtap plugin (either the multifactor fairshare plugin or a standalone plugin). The plugin can keep a count of currently active jobs per user and reject any job that would exceed the maximum.
#131 has now landed, so this issue is halfway to being completed. An active-jobs count is now tracked for each user in the flux-accounting DB, and any job that would exceed the maximum is rejected. What is left is to add a max_nodes limit.
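For reference, a rough sketch of the bookkeeping this kind of limit needs, with hypothetical names; the real enforcement lives in a C jobtap plugin, so this Python is only illustrative:

```python
# Hypothetical sketch (not the actual flux-accounting jobtap plugin):
# track active jobs per user and reject submissions over the limit.
class ActiveJobLimiter:
    def __init__(self, max_active_jobs):
        self.max_active_jobs = max_active_jobs  # per-user limit, e.g. from the DB
        self.active = {}                        # userid -> current active job count

    def validate(self, userid):
        """Called when a job is submitted; reject if it would exceed the limit."""
        if self.active.get(userid, 0) >= self.max_active_jobs:
            raise ValueError(
                f"user {userid} has reached the max active jobs limit "
                f"({self.max_active_jobs})"
            )
        self.active[userid] = self.active.get(userid, 0) + 1

    def job_inactive(self, userid):
        """Called when a job becomes inactive; decrement the count."""
        if self.active.get(userid, 0) > 0:
            self.active[userid] -= 1
```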
I briefly started to take a look at the requirements for adding a max_nodes limit. If I submit a simple job:

```console
$ flux mini submit -n 1 hostname
```

then the jobspec looks like this:

```json
{
  "resources": [
    {
      "type": "slot",
      "count": 1,
      "with": [
        {
          "type": "core",
          "count": 1
        }
      ]
    }
  ]
}
```

If I submit a job that specifies a number of nodes, like the following job:

```console
$ flux mini submit -N 1 hostname
```

then the jobspec looks like:

```json
{
  "resources": [
    {
      "type": "node",
      "count": 1,
      "with": [
        {
          "type": "slot",
          "count": 1,
          "with": [
            {
              "type": "core",
              "count": 1
            }
          ],
          "label": "task"
        }
      ]
    }
  ]
}
```

Would a correct approach be to try and extract a node count from the jobspec's `resources` section? One note I think is worth mentioning: I believe that currently, when a user submits a job without specifying a number of nodes, like the first example above, a whole node is still allocated.

EDIT (9/8): there are still a couple of open-ended questions to consider from today's coffee hour.
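As a rough illustration of the jobspec walk in question, a minimal sketch; the function name is made up, and it only counts node vertices that appear explicitly in the jobspec, so it returns 0 for the first example above:

```python
# Hypothetical sketch: count explicitly requested nodes by recursively
# walking the jobspec "resources" section.
def requested_nodes(jobspec):
    def walk(resources):
        total = 0
        for entry in resources:
            if entry.get("type") == "node":
                total += entry.get("count", 0)
            total += walk(entry.get("with", []))
        return total

    return walk(jobspec.get("resources", []))

# requested_nodes(jobspec) -> 1 for the "-N 1" example above,
# 0 for the "-n 1" example (node count unknown until allocation)
```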
Unfortunately, you will have to fetch R for the job in order to count the number of allocated nodes. Since there are already two use cases for counting allocated resources from a jobtap plugin, I wonder if there is something Flux can do to help here. One thing we could do is standardize some kind of resource summary in the alloc response from the scheduler, maybe just a count of resources of each type, made available to the job-manager and its plugins. That would save a round trip to the KVS and some processing of R for the flux-accounting and cray plugins. (Not to mention they'd each be fetching R separately when used together!)

Edit: another solution would be to have the scheduler respond with R and have the job-manager do the kvs_put(), or have the scheduler respond with a copy of R. I'll open an issue in flux-core to get some feedback on the various solutions.
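For reference, a rough sketch of counting allocated nodes from R, assuming the RFC 20 style layout where each `execution.R_lite` entry carries a `rank` idset like `"0-3,7"`; the idset handling below is deliberately simplified and the function name is made up:

```python
# Hypothetical sketch: count allocated nodes (distinct broker ranks) from an
# R object in the assumed RFC 20 layout.
def allocated_nodes(R):
    ranks = set()
    for entry in R["execution"]["R_lite"]:
        for part in str(entry["rank"]).split(","):
            if "-" in part:
                lo, hi = part.split("-")
                ranks.update(range(int(lo), int(hi) + 1))
            else:
                ranks.add(int(part))
    return len(ranks)
```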
One other question: how does the max_nodes limit work? Is the intent to attempt to reject job requests that exceed the max_nodes limit, or is it a limit on the total number of nodes currently in use by a user (in which case this support might have to be implemented in the scheduler after all)?
Good question. I would assume we would want to enforce the same behavior as the max_jobs limit.
That might be a difficult limit to enforce in the same way as the max_jobs limit, since the number of nodes a job will be allocated isn't known when it is submitted.
Because I was tagged earlier I'll throw in my two bits.
I am pretty sure that there are systems on LC that have both a maximum on nodes per job and a maximum on nodes per user across all active jobs (I could try to find examples if desired). I haven't checked whether the max_jobs limit is a limit on submitted or active jobs, and the conversation has been a little confusing, but if you had a limit on active jobs per bank/user, that would give you an OK replacement for a limit on nodes per bank/user across all active jobs: the total number of nodes in use could never exceed the per-job node maximum times the active-jobs maximum.
So it sounds like a jobtap plugin that waits for the alloc response is the likely approach here. I think there would be a lot of users, experienced with Slurm or LSF, who would be extremely frustrated to have to wait for their Flux jobs to be rejected due to invalid resource requirements. It would help a lot if there were a submission-time jobspec validator that would reject obviously over-limit requests up front. That front-end validation could end up covering almost every job if there were a wrapper around the job submission commands.
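A rough sketch of the kind of front-end check being suggested; this is not the actual flux-core job validator plugin API, and the limit table, user names, and helper below are all placeholders:

```python
# Hypothetical sketch of a submission-time check: reject a jobspec up front
# when its explicit node request exceeds a per-user limit.
USER_NODE_LIMITS = {"alice": 4}  # hypothetical per-user max_nodes values


def count_node_request(jobspec):
    """Return the number of top-level 'node' resources explicitly requested."""
    return sum(
        r.get("count", 0)
        for r in jobspec.get("resources", [])
        if r.get("type") == "node"
    )


def check_submission(jobspec, user):
    """Return an error string if the request exceeds the user's node limit."""
    limit = USER_NODE_LIMITS.get(user)
    nnodes = count_node_request(jobspec)
    if limit is not None and nnodes > limit:
        return f"requested {nnodes} nodes exceeds max_nodes limit of {limit}"
    return None  # request passes this front-end check
```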
Yes, @jameshcorbett, you are correct. This problem is much more tractable once we solve the problem of node-centric scheduling in Flux (and that is probably where a max_nodes limit makes the most sense). Then any jobtap plugin can keep a running total of requested nodes per bank/user/whatever, and immediately reject jobs that exceed that limit.
It has been a little while since I have gotten the chance to circle back to this, but now that node-exclusive scheduling capability has landed in flux-sched, I think it might be good for me to start work on tackling the max_nodes limit.

As a refresher to myself, this limit would enforce the maximum number of nodes that can be used across all of a user/bank combo's running jobs. I know it was mentioned earlier in this thread that the number of nodes that will eventually be allocated to a pending job request is not known at the time of submission. Does that change with the introduction of node-exclusive scheduling?
I had an offline conversation with @ryanday36 about the current behavior for enforcing a max_nodes limit. In another conversation with @grondo, he mentioned that a jobtap plugin will not have knowledge of how many nodes are eventually allocated to a job when it is first submitted, which might make it difficult for the plugin itself to enforce this limit. Perhaps we need a way to send this limit elsewhere, such as to the scheduler, to be enforced there.

I can say that, for the flux-accounting side, sending information to its required destination (in this case, maybe the scheduler) via RPC should be relatively straightforward, since it already sends similar information to the priority plugin.

One side question that is not really related to the above, but had me curious as I was writing this up: when a user submits a job and specifies the number of nodes they want for their job (e.g.
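As a rough illustration of the RPC path mentioned above, a minimal sketch using the flux-core Python bindings; the topic string and payload shape are made up for illustration and do not correspond to an existing scheduler interface:

```python
# Hypothetical sketch: send a user/bank max_nodes limit to some service via
# an RPC. "sched.limits.update" is a placeholder topic, not a real one.
import flux


def send_max_nodes_limit(userid, bank, max_nodes):
    h = flux.Flux()
    payload = {"userid": userid, "bank": bank, "max_nodes": max_nodes}
    return h.rpc("sched.limits.update", payload).get()
```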
@cmoussa1: does your jobtap plugin get a callback whenever each job gets allocated (i.e., when the scheduler reply to the alloc request arrives)?
@dongahn, this is a good suggestion. However, I'm not sure how this could be done without a race unless the priority plugin holds all jobs and only releases one at a time to the scheduler. Otherwise, while the priority plugin is fetching the R for the most recent job, the scheduler could allocate resources for the next job and put the user/bank combination over their node limit. In a high-throughput scenario, it is possible that the limit could be exceeded by quite a bit.

It seems to me that only the scheduler can reasonably apply this limit, since only it will know how many nodes would be allocated to the next job, and it can put that job on hold if it would exceed the configured maximum (unless, as suggested above, the priority plugin holds all jobs but one for each user/bank combo). I may be missing some simpler solution though?
@grondo: great point. Yes, unless we can augment the Resource Allocation Protocol Version 1 RFC (https://github.com/flux-framework/rfc/blob/master/spec_27.rst), it will be difficult to avoid a race. Sorry, I didn't think that through. Revising that protocol would require substantial changes. A better approach could be to introduce bank/user awareness into fluxion and enforce the limits there. It is hard to scope how much work this would require, though.
This issue is pretty general, and two of the three limits mentioned at the start of the issue (
If I need to instead open this issue in a different repository, please let me know and I'll move it.
The second issue that came up as a result of breaking up flux-sched #638 is the need to enforce per-user limits in the following ways:

- a maximum number of jobs a user can have submitted at any one time
- a maximum number of nodes a user can have in use across all of their jobs
Currently, these limits are not defined in the flux-accounting database, so a good first step would be to add additional fields for every association in the database, one for each limit.
I've just picked a default value of 5 for both limits, but we should probably come to a consensus as to what makes sense as a default value for users. We'll also need a way to pass this information to the appropriate places, where these limits can be:
a) compared to a user's current usage both in terms of how many nodes they are using and how many jobs they have submitted, and
b) enforced if a user has reached either of these limits.
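To make the database side concrete, here is a minimal sketch of adding two limit columns with a default of 5; the table and column names are placeholders rather than the actual flux-accounting schema:

```python
# Hypothetical sketch: add per-association limit columns with a default of 5.
import sqlite3


def add_limit_columns(db_path):
    conn = sqlite3.connect(db_path)
    with conn:
        # Column and table names below are placeholders for illustration only.
        conn.execute(
            "ALTER TABLE association_table ADD COLUMN max_jobs int(11) DEFAULT 5"
        )
        conn.execute(
            "ALTER TABLE association_table ADD COLUMN max_nodes int(11) DEFAULT 5"
        )
    conn.close()
```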