per-queue user limits #402

ryanday36 · 2023-12-01T16:52:58Z

We've had a request for more limits that we can set on specific queues. We can currently set limits on how many resources a specific job can use (max nodes, min nodes, wall time). We'd also like to set limits on how many resources a specific user's running jobs can use in a given queue. Specifically, we'd like to be able to set the following for a queue:

max running jobs per user
max nodes per user (across all running jobs, similar to #349)

I've also been giving some thought to whether some sort of 'max committed node-hours per user' (total nodes * requested walltime, summed over all of a user's running jobs) would be useful. I'm not so sure about that though.
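
To make the 'committed node-hours' idea concrete, here is a minimal sketch of the calculation. It is plain Python, not flux-accounting code; the `RunningJob` record, its field names, and the queue names are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    # Hypothetical record of a running job; not a flux-accounting type.
    user: str
    queue: str
    nnodes: int
    walltime_hours: float  # requested wall time, in hours


def committed_node_hours(jobs, user, queue):
    """Sum nnodes * requested walltime over one user's running jobs in one queue."""
    return sum(
        job.nnodes * job.walltime_hours
        for job in jobs
        if job.user == user and job.queue == queue
    )


jobs = [
    RunningJob("alice", "pbatch", 4, 2.0),
    RunningJob("alice", "pbatch", 2, 0.5),
    RunningJob("alice", "pdebug", 1, 1.0),
    RunningJob("bob", "pbatch", 8, 1.0),
]

# alice in pbatch: 4 * 2.0 + 2 * 0.5 = 9.0 node-hours
print(committed_node_hours(jobs, "alice", "pbatch"))
```

A limit on this value would be checked the same way as a running-jobs limit, except that a single large or long job can consume most of the budget.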

Tasks

support enforcing a max running jobs limit per association in a specific queue
support enforcing a max node count limit across an association's running jobs in a specific queue

cmoussa1 · 2023-12-01T19:20:03Z

Thanks for opening this @ryanday36. To the best of my knowledge, the limit flux-accounting is best positioned to enforce at the moment is max running jobs per user in a given queue. It already enforces a max running jobs limit per user across all of their jobs, so I think enforcing it per queue would be reasonable. (mostly thinking out loud here) This would entail:

adding a max running jobs limit column in the queue_table; a column that specifies how many running jobs a user can have in this queue at a given moment
storing this information in the priority plugin
checking the queue the job is submitted under when it reaches job.state.depend
looking at the number of running jobs the user already has in this queue (I wonder if there is a convenient way to fetch this information at the moment? If not, perhaps flux-accounting needs to keep track of job IDs per queue, similar to how it holds job IDs for all held jobs per user?)
if at this max running jobs limit for the queue, add a job dependency to the job with a "max-running-jobs in queue" title (or something descriptive)
when a currently running job in this queue reaches job.state.inactive, remove the dependency on the first submitted job that was held because of a "max-running-jobs in queue" limit
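
The bookkeeping in the steps above can be sketched as a small model. This is plain Python, not the priority plugin or the jobtap API: the class, method names, and queue names are all hypothetical, and the model simplifies by counting a job against the limit as soon as it passes the depend check rather than when it actually starts running.

```python
from collections import defaultdict, deque


class QueueRunningJobsLimiter:
    """Toy model of a per-queue, per-user max running jobs limit.

    All names are hypothetical; this is not flux-accounting code.
    """

    def __init__(self, queue_limits):
        # queue name -> max running jobs per user (the proposed queue_table column)
        self.queue_limits = queue_limits
        # (user, queue) -> job IDs currently counted against the limit
        self.active = defaultdict(set)
        # (user, queue) -> job IDs held by a dependency, in submission order
        self.held = defaultdict(deque)
        # job ID -> (user, queue), so job_inactive() only needs the job ID
        self.jobs = {}

    def job_depend(self, jobid, user, queue):
        """Model of the job.state.depend step. Returns True if the job is held."""
        key = (user, queue)
        self.jobs[jobid] = key
        limit = self.queue_limits.get(queue)
        if limit is not None and len(self.active[key]) >= limit:
            # stands in for adding a "max-running-jobs in queue" dependency
            self.held[key].append(jobid)
            return True
        self.active[key].add(jobid)
        return False

    def job_inactive(self, jobid):
        """Model of the job.state.inactive step. Returns the released job ID, if any."""
        key = self.jobs.pop(jobid)
        if jobid in self.held[key]:
            # a held job went inactive (e.g. was cancelled) before it ever ran
            self.held[key].remove(jobid)
            return None
        self.active[key].discard(jobid)
        if self.held[key]:
            # release the first job submitted among those held for this queue
            released = self.held[key].popleft()
            self.active[key].add(released)
            return released
        return None


limiter = QueueRunningJobsLimiter({"pdebug": 2})
print(limiter.job_depend(1, "alice", "pdebug"))  # False: under the limit
print(limiter.job_depend(2, "alice", "pdebug"))  # False: now at the limit
print(limiter.job_depend(3, "alice", "pdebug"))  # True: held
print(limiter.job_depend(4, "bob", "pdebug"))    # False: limit is per user
print(limiter.job_inactive(1))                   # 3: first held job is released
```

Keeping a per-(user, queue) set of job IDs answers the "how many running jobs does the user already have in this queue" question directly, at the cost of the plugin holding that state itself.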

This assumes that the max-running-jobs limit is the same for all users in any given queue, i.e., if a queue has a max-running-jobs limit of 5 jobs, then every user in that queue has a max-running-jobs limit of 5 jobs. Do I have this assumption correct?

@grondo not sure if you have any suggestions on my thought process outlined above of how implementing this might work, or if I made any dumb mistakes above and forgot to include something, but any feedback/suggestions here would be welcome. :-)

cmoussa1 · 2023-12-01T19:23:54Z

max nodes per user (across all running jobs, similar to #349)

I could also be wrong here, but I believe enforcing a max nodes per-user limit, both per-queue and in general, would require some coordination between flux-accounting and other Flux components, as nicely summarized by @grondo in a comment in #349:

Thanks @cmoussa1. Re-reading above it seems like the current summary is:

as a first cut, implement holistic limits in flux-accounting instead of a max-nodes limit. I.e. impose a max-nodes+max-cores limit for users across all jobs

as a prerequisite, the accounting plugin will need access to the actual resource counts assigned to jobs, so that nnodes for core-only requests and ncores for nodes-only requests can be accounted. Therefore "jobtap: add allocated resource information in job.state.run callbacks" (flux-core#3851) should be solved first.

Edit: Forgot to mention that more design work is needed on how to loop the scheduler into these limits so that a real max-nodes limit could be imposed.

Also, there is probably a race condition we should consider if the flux-accounting jobtap plugin is enforcing these limits. E.g. a max-nodes limit could be exceeded or hit during one job's job.state.run callback, while at the same time the scheduler is allocating more nodes to that user before the plugin has a chance to hold all pending jobs for the user.
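
A rough illustration of that race, as a plain-Python model rather than plugin code. The `UserUsage` class and its methods are hypothetical; the only point is that when usage is recorded after allocation (as it would be in a run-state callback), the check can only notice that a limit was exceeded, not prevent it.

```python
from dataclasses import dataclass


@dataclass
class UserUsage:
    # Hypothetical per-user accounting of allocated resources across all jobs.
    max_nodes: int
    max_cores: int
    nodes: int = 0
    cores: int = 0

    def job_run(self, nnodes, ncores):
        """Record resources already allocated to a job that just started.

        Returns True if the user is now at or over either limit, meaning
        the user's pending jobs should be held.
        """
        self.nodes += nnodes
        self.cores += ncores
        return self.nodes >= self.max_nodes or self.cores >= self.max_cores

    def job_inactive(self, nnodes, ncores):
        """Release a finished job's resources. Returns True if back under both limits."""
        self.nodes -= nnodes
        self.cores -= ncores
        return self.nodes < self.max_nodes and self.cores < self.max_cores


usage = UserUsage(max_nodes=4, max_cores=1000)

# Two 3-node jobs are allocated back to back, before any hold takes effect.
print(usage.job_run(3, 108))  # False: 3 of 4 nodes in use
print(usage.job_run(3, 108))  # True: limit reached, but only after the fact
print(usage.nodes)            # 6, already over the 4-node limit
```

In this model the second job cannot be stopped because the accounting only sees it once its resources are assigned, which is why looping the scheduler into the limit would be needed for a hard max-nodes cap.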

It has been a little while since we've discussed this, though, so perhaps we are better suited to tackle this now than before.

cmoussa1 added the "new feature" and "plugin" (related to the multi-factor priority plugin) labels Dec 1, 2023

cmoussa1 mentioned this issue Oct 3, 2024

plugin: add enforcement of max running jobs limit for a queue per-association #491

Open

cmoussa1 added the "feature tracking" (tracking issue for larger feature made up of smaller issues) label and removed the "new feature" and "plugin" labels Dec 19, 2024
