
enforcing per-user limits #121

Closed
cmoussa1 opened this issue May 13, 2021 · 15 comments
Assignees: cmoussa1
Labels: high priority, new feature

Comments

@cmoussa1
Member

If I need to instead open this issue in a different repository, please let me know and I'll move it.

The second issue that came up as a result of breaking up flux-sched #638 is the need to enforce per-user limits in the following ways:

  • max number of nodes used across all of a user's running jobs
  • max number of running jobs (in practice, the "max number of nodes used across all of a user's running jobs" limit is used)
  • max number of submitted jobs per user

Currently, these limits are not defined in the flux-accounting database, so a good first step would be to add additional fields for every association in the database:

max_nodes           int(11)     DEFAULT 5   NOT NULL,
max_submitted_jobs  int(11)     DEFAULT 5   NOT NULL

I've just picked a default value of 5 for both limits, but we should probably come to a consensus as to what makes sense for a default value for users.
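
For illustration, a rough sketch of what adding these columns might look like (the table name association_table and the exact statements are assumptions for the example, not a final schema):

-- hypothetical sketch: add the limit columns with defaults so existing
-- associations automatically pick up the default values
ALTER TABLE association_table ADD COLUMN max_nodes          int(11) NOT NULL DEFAULT 5;
ALTER TABLE association_table ADD COLUMN max_submitted_jobs int(11) NOT NULL DEFAULT 5;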

We'll also need a way to pass this information to the places where these limits can be:

a) compared to a user's current usage both in terms of how many nodes they are using and how many jobs they have submitted, and
b) enforced if a user has reached either of these limits.

@grondo
Contributor

grondo commented May 13, 2021

Max submitted jobs per-user could be enforced by a jobtap plugin (either the multifactor fairshare plugin or a standalone plugin). The plugin can keep a count of currently active jobs per user and reject any job that would exceed the maximum in the job.validate callback.
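
A minimal sketch of that idea (hypothetical standalone plugin; the limit is hard-coded and the count is a single global rather than a real per-user table, just to illustrate the callbacks involved):

/* count active jobs and reject submissions that would exceed the limit;
 * in practice the count would be tracked per user/bank and the limit
 * would come from the flux-accounting DB */
#include <flux/core.h>
#include <flux/jobtap.h>

static int max_jobs = 5;      /* placeholder limit */
static int active_jobs = 0;   /* placeholder global count */

static int validate_cb (flux_plugin_t *p, const char *topic,
                        flux_plugin_arg_t *args, void *arg)
{
    int userid;

    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s:i}", "userid", &userid) < 0)
        return -1;
    /* reject any job that would put the user over their limit */
    if (active_jobs >= max_jobs)
        return flux_jobtap_reject_job (p, args,
                                       "user has reached max active jobs limit");
    return 0;
}

static int new_cb (flux_plugin_t *p, const char *topic,
                   flux_plugin_arg_t *args, void *arg)
{
    active_jobs++;            /* job accepted; count it */
    return 0;
}

static int inactive_cb (flux_plugin_t *p, const char *topic,
                        flux_plugin_arg_t *args, void *arg)
{
    active_jobs--;            /* job finished; release its slot */
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    if (flux_plugin_add_handler (p, "job.validate", validate_cb, NULL) < 0
        || flux_plugin_add_handler (p, "job.new", new_cb, NULL) < 0
        || flux_plugin_add_handler (p, "job.state.inactive", inactive_cb, NULL) < 0)
        return -1;
    return 0;
}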

@cmoussa1 cmoussa1 self-assigned this Jun 30, 2021
@cmoussa1 cmoussa1 added the high priority and new feature labels Aug 3, 2021
@cmoussa1
Member Author

cmoussa1 commented Sep 1, 2021

#131 has now landed, so this issue is halfway to being completed. An active jobs count is now tracked for each user in the flux-accounting DB, and the plugin will reject any job that would exceed the maximum. What is left is to add a max_nodes limit, which is the max number of nodes used across all of a user/bank combination's running jobs.

@cmoussa1
Member Author

cmoussa1 commented Sep 7, 2021

I briefly started to take a look at the requirements for adding a max_nodes limit today and I figured I'd write some initial thoughts down so I don't forget them later:

  • thanks to the great and detailed feedback I got in plugin: add a per-user/bank max jobs limit #131, adding a max_nodes limit should be fairly straightforward since that PR did a lot of the legwork to create a bank_info struct for each user/bank combination in the flux-accounting DB. After adding a max_nodes column in the DB, I can add a new item to the struct (similarly named max_nodes) and send/receive it with the other user/bank information via flux_jobtap_job_aux_set/get ().
  • the increment/decrement mechanism for max_nodes should also be similar to the way max_jobs is enforced in the plugin - it will keep a running count of used nodes per user, and when a job is finished, it will decrement however many nodes were allocated. If a user is already at their limit, then we should hold any newly submitted jobs that would exceed this limit until they are once again under their limit.
  • my initial understanding is that the way to extract allocated node information per user would be through the jobspec of a job - is this the right approach? I did some playing around in the Docker container today to see how to get this kind of information, and that is where I started; if I submit a job that only specifies the number of cores, like the following job:
$ flux mini submit -n 1 hostname

then the jobspec looks like this:

{
  "resources": [
    {
      "type": "slot",
      "count": 1,
      "with": [
        {
          "type": "core",
          "count": 1
        }
      ]
    }
  ]
}

If I submit a job that specifies a number of nodes, like the following job:

$ flux mini submit -N 1 hostname

then the jobspec looks like:

{
  "resources": [
    {
      "type": "node",
      "count": 1,
      "with": [
        {
          "type": "slot",
          "count": 1,
          "with": [
            {
              "type": "core",
              "count": 1
            }
          ],
          "label": "task"
        }
      ]
    }
  ],
}

Would a correct approach be to try to extract a count of nodes if it exists in the jobspec? Maybe this is a dumb question and there is a different, better way to find out the number of nodes allocated for a job.

One note worth mentioning: I believe that currently, when a user submits a job without specifying a number of nodes, like the first example above, the whole node is still allocated.
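
For illustration, a minimal sketch (assuming the jobspec has already been unpacked as a jansson json_t, e.g. from the job.validate callback args; the helper name is made up) of pulling a node count out of the top level of the resources section when one is present:

#include <string.h>
#include <jansson.h>

/* hypothetical helper: return the node count from the top level of the
 * jobspec "resources" section, or -1 if the request is not node-based
 * (e.g. a slot/core-only request like the first example above) */
static int jobspec_node_count (json_t *jobspec)
{
    json_t *resources = json_object_get (jobspec, "resources");
    size_t index;
    json_t *entry;

    json_array_foreach (resources, index, entry) {
        const char *type = json_string_value (json_object_get (entry, "type"));
        json_t *count = json_object_get (entry, "count");

        if (type && strcmp (type, "node") == 0 && json_is_integer (count))
            return (int) json_integer_value (count);
    }
    return -1;
}

Of course, this only answers the question when the user explicitly asked for nodes; when the jobspec only specifies slots/cores, the allocated node count is not known until the scheduler responds.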

EDIT (9/8): still a couple open-ended questions to consider from today's coffee hour:

  • node-exclusive scheduling in Flux
  • "estimating resources" when looking at jobspec?
  • potentially get information from scheduler that modifies jobspec after a job is submitted?

@grondo
Contributor

grondo commented Sep 8, 2021

Would a correct approach be to try to extract a count of nodes if it exists in the jobspec? Maybe this is a dumb question and there is a different, better way to find out the number of nodes allocated for a job.

Unfortunately, you will have to fetch the R for the job in order to count the number of allocated nodes.
This is what @jameshcorbett had to do in his cray-libpals port distributor plugin.

Since there are two use cases already for counting the number of allocated resources from a jobtap plugin, I wonder if there is something Flux can do to help here.

One thing we could do is standardize some kind of resource summary in the alloc response from the scheduler. Maybe just to make a count of resources of each type available to the job-manager and its plugins? That would save a round trip to the KVS and some processing of R for the flux-accounting and cray plugins. (Not to mention they'd each be fetching R separately when used together!)

Edit: another solution would be to just have the scheduler respond with R, and have the job-manager do the kvs_put(), or have the scheduler respond with a copy of R. I'll open an issue in flux-core to get some feedback on the various solutions.

@grondo
Contributor

grondo commented Sep 8, 2021

One other question: How does the max_nodes limit work? Is the intent to attempt to reject job requests that exceed the max_nodes limit, or is it a limit on the total number of nodes currently in use by a user (in which case this support might have to be implemented in the scheduler after all)?

@cmoussa1
Member Author

cmoussa1 commented Sep 8, 2021

One other question: How does the max_nodes limit work? Is the intent to attempt to reject job requests that exceed the max_nodes limit, or is it a limit on the total number of nodes currently in use by a user (in which case this support might have to be implemented in the scheduler after all)?

Good question. I would assume we would want to enforce the same behavior as the max_jobs limit to keep things consistent. Right now, jobs submitted that would exceed the max_jobs limit are rejected upon submission, so if a job is submitted that would exceed the max_nodes limit, that job would be rejected.

@grondo
Contributor

grondo commented Sep 8, 2021

That might be a difficult limit to enforce in the same way as max_jobs, since the number of nodes that will eventually be allocated to a pending job request is not known at the time of submission.

@jameshcorbett
Member

jameshcorbett commented Sep 8, 2021

Because I was tagged earlier I'll throw in my two bits.

One other question: How does the max_nodes limit work? Is the intent to attempt to reject job requests that exceed the max_nodes limit, or is it a limit on the total number of nodes currently in use by a user (in which case this support might have to be implemented in the scheduler after all)?

I am pretty sure that there are systems on LC that have both a maximum on nodes per job, and a maximum on nodes per user across all active jobs---I could try to find examples if desired.

I haven't checked whether the max_jobs limit is a limit on submitted or active jobs, and the conversation has been a little confusing---but if you had a limit on active jobs per bank/user, that would give you an OK replacement for a limit on nodes per bank/user across all active jobs: max_nodes * max_active_jobs (e.g., a 4-node-per-job cap combined with an 8-active-job cap bounds a user at 32 nodes).

the number of nodes that will eventually be allocated to a pending job request is not known at the time of submission.

So it sounds like a jobtap plugin that waits for the alloc event or run state would be necessary?

I think there would be a lot of users, experienced with Slurm or LSF, who would be extremely frustrated to have to wait to have their Flux jobs rejected due to invalid resource requirements. It would help a lot if there were a submission-time jobspec validator that would reject mini batch/mini alloc jobs with invalid --nnodes counts.

That front-end validation could end up covering almost every job if there were a wrapper around the mini batch/mini alloc utilities that only accepted --nnodes N and then translated it to --nslots M --nnodes N -c L. I suspect a wrapper like that would get a lot of use.

@grondo
Contributor

grondo commented Sep 8, 2021

Yes, @jameshcorbett you are correct. This problem is much more tractable when we solve the problem of node-centric scheduling in Flux (and that is probably where a max_nodes limit makes the most sense). Then any jobtap plugin can keep a running total of requested nodes per bank/user/whatever, and reject jobs immediately that exceed that limit.

@cmoussa1
Member Author

cmoussa1 commented Mar 2, 2022

It has been a little while since I have gotten the chance to circle back to this, but now that node-exclusive scheduling capability has landed in flux-sched, I think it might be good for me to start work on tackling the max_nodes per-user limit.

As a refresher to myself, this limit would enforce the max number of nodes that can be used across all of a user/bank combo's running jobs. I know earlier in this thread it was mentioned that the number of nodes that will eventually be allocated to a pending job request is not known at the time of submission. Does that change with the introduction of node-exclusive scheduling?

@cmoussa1
Member Author

I had an offline conversation with @ryanday36 about the current behavior for enforcing a max_nodes per-user/bank combo limit, and he mentioned that if a user/bank combo reaches their max_nodes limit, future jobs that are submitted will get held until already running job(s) complete and release the nodes they were using.

In another conversation with @grondo, he mentioned that a jobtap plugin will not have knowledge of how many nodes are eventually allocated to a job when it is first submitted, which might make it difficult for the plugin itself to enforce this limit. Perhaps we need a way to send this max_nodes limit information to the scheduler, which presumably has knowledge of how many nodes are allocated to a single job? @dongahn: do you have any thoughts on how we might be able to enforce such a limit?

I can say that for the flux-accounting side, sending information to its required destination (in this case, maybe the scheduler) via RPC should be relatively straightforward, since it already sends similar information to the priority plugin.


One side question that is not really related to the above, but it had me curious as I was writing this up: when a user submits a job and specifies the number of nodes they want for their job (e.g. flux mini submit -N 2 hostname), would a jobtap plugin then know in advance how many nodes would be allocated for said job?

@dongahn
Member

dongahn commented Mar 16, 2022

@cmoussa1: does your jobtap plugin get a callback whenever each job gets allocated (i.e., when the scheduler replies to sched.alloc with success)? If so, perhaps you can tap into the R associated with each job and fetch the node count from there? This could be as simple as counting the number of nodes in the nodelist key of R.
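
A rough sketch of that suggestion (hypothetical helper; it assumes R has already been fetched for the job as a JSON string, e.g. from the KVS after the alloc callback), counting nodes by decoding the hostlists under execution.nodelist:

#include <jansson.h>
#include <flux/hostlist.h>

/* hypothetical helper: given R as a JSON string, return the number of
 * nodes listed under execution.nodelist, or -1 on a decode error */
static int node_count_from_R (const char *R)
{
    json_t *o;
    json_t *nodelist;
    size_t index;
    json_t *entry;
    int count = 0;

    if (!(o = json_loads (R, 0, NULL)))
        return -1;
    nodelist = json_object_get (json_object_get (o, "execution"), "nodelist");
    json_array_foreach (nodelist, index, entry) {
        struct hostlist *hl = hostlist_decode (json_string_value (entry));
        if (hl) {
            count += hostlist_count (hl);
            hostlist_destroy (hl);
        }
    }
    json_decref (o);
    return count;
}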

@grondo
Contributor

grondo commented Mar 17, 2022

@dongahn, this is a good suggestion. However, I'm not sure how this could be done without a race unless the priority plugin holds all jobs and only releases one at a time to the scheduler. Otherwise, while the priority plugin is fetching the R for the most recent job, the scheduler could allocate resources for the next job and put the user/bank combination over their node limit. In a high throughput scenario, it is possible that the limit could be exceeded by quite a bit.

It seems to me that only the scheduler can reasonably apply this limit, since only it will know how many nodes would be allocated to the next job, and can put that job on hold if it would exceed the configured maximum (unless, as suggested above, the priority plugin holds all jobs but one for each user/bank combo)

I may be missing some simpler solution though?

@dongahn
Member

dongahn commented Mar 17, 2022

@grondo: great point. Yes, unless we can augment the Resource Allocation Protocol Version 1 RFC (https://github.com/flux-framework/rfc/blob/master/spec_27.rst), it will be difficult to avoid a race. Sorry, I didn't think that through.

Revising that protocol would require substantial changes.

A better approach could be to introduce bank/user into fluxion and enforce the limits. It is hard to scope how much work this will require though.

@cmoussa1
Member Author

This issue is pretty general, and two of the three limits mentioned at the start of the issue (max-running-jobs and max-submitted-jobs) are both enforced by flux-accounting's jobtap plugin, so I think I should close this issue and open a more specific one about adding support for enforcing a max-nodes limit.
