job manager needs new interface for job "limits" #4309
Comments
I like the fact that with this design
The code that actually implements the limit wouldn't necessarily have to deal with the above. It wouldn't even have to parse the limit configuration if it's copied into the So it's a nice separation of concerns, if that could work.
As pointed out by @garlick in #4430, I can confirm that the following limits described at the top of this issue are currently handled in flux-accounting and were added in the following PRs:
flux-framework/flux-accounting#201 was the PR that added a user max active jobs limit to the multi-factor priority plugin.
flux-framework/flux-accounting#131, flux-framework/flux-accounting#177, and flux-framework/flux-accounting#202 were the PRs that added support for enforcing a user max running jobs limit. Both of these limits are currently enforced on a per-user basis when the multi-factor priority plugin is loaded, so they should be marked as completed in #4431 (which I see is marked as completed! thanks @garlick!). Let me know if further confirmation or details are needed. :-)
Thanks!
While discussing user job and queue limits (related: #4302, #4306), we realized that it may be impossible to handle certain use cases for limits given the current job manager design, assuming limits are evaluated in the job.validate plugin callback (as I believe is currently implemented in the flux-accounting mf_priority.so plugin).
To review, there are a few classes of limits which could be handled within the job manager and thus via jobtap plugins, including:
The discussion here only relates to the first two bullets (at least for now). The mechanism for a scheduler to apply resource limits is outside of the scope of this issue.
A problem with the current approach of using existing jobtap plugin callbacks to implement these limits is that each plugin is operating in isolation, and therefore there is no way to implement a plugin that overrides either fatal or hold limits. One idea floated by @garlick is to treat limits like dependencies, with add/remove events that can add or remove specific limits. This could reuse the current DEPEND state, or perhaps we would want to add a new state specific to limits.
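To make the dependency analogy concrete, here is a minimal Python sketch of replaying add/remove events to work out which limits are still outstanding on a job. The event names `limit-add` and `limit-remove` and the limit names are hypothetical, chosen only to mirror how dependencies are described above; nothing here is flux-core API.

```python
def outstanding_limits(events):
    """Replay (event_name, limit_name) pairs and return the limits still held.

    The event names "limit-add" and "limit-remove" are hypothetical.
    """
    held = []
    for name, limit in events:
        if name == "limit-add" and limit not in held:
            held.append(limit)
        elif name == "limit-remove" and limit in held:
            held.remove(limit)
    return held


# Hypothetical event log for one job
eventlog = [
    ("limit-add", "max-running-jobs"),
    ("limit-add", "queue-max-active"),
    ("limit-remove", "max-running-jobs"),
]

# In this model the job could leave the DEPEND (or a new limit-specific)
# state only once this list is empty.
remaining = outstanding_limits(eventlog)
print(remaining)
```

Because the outstanding set is derived purely from events, any plugin can post a remove event for a limit that a different plugin added.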
Instead of rejecting or holding jobs, plugins that enforce limits would add a fatal or nonfatal "limit" to the job. A plugin that overrides limits could remove limits that have been added up to that point in the plugin call stack (perhaps we could also allow a plugin to push an override event that clears even future limits, so that plugin order doesn't matter). After all plugin 'limit' callbacks have been called, the current state of limits is applied. If there are outstanding nonfatal limits, then the job stays in the LIMIT (or DEPEND) state. The list of outstanding limits would be available via job listing utilities, similar to dependencies. One or more outstanding fatal limits would cause the job to be rejected.
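The evaluation order described above could be modeled roughly as follows. This is only a sketch under stated assumptions: the `Job` class, the three plugin functions, the `userid == 0` override condition, and the returned state names are all hypothetical and are not part of the jobtap interface. It uses the order-independent override variant, where an override clears limits added both before and after it in the stack.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    userid: int
    limits: dict = field(default_factory=dict)  # limit name -> fatal flag
    override_all: bool = False                  # set by an order-independent override

    def add_limit(self, name, fatal=False):
        self.limits[name] = fatal

    def remove_limit(self, name):
        self.limits.pop(name, None)


def max_running_plugin(job):
    # Hypothetical nonfatal limit: hold the job rather than reject it.
    job.add_limit("max-running-jobs", fatal=False)


def max_active_plugin(job):
    # Hypothetical fatal limit: causes rejection unless it is removed.
    job.add_limit("max-active-jobs", fatal=True)


def override_plugin(job):
    # Hypothetical override for one privileged user; it takes effect
    # regardless of where this plugin sits in the stack.
    if job.userid == 0:
        job.override_all = True


def apply_limits(job, plugins):
    """Call every plugin's limit callback, then apply the resulting state."""
    for plugin in plugins:
        plugin(job)
    if job.override_all:
        job.limits.clear()
    if any(job.limits.values()):
        return "rejected"    # at least one fatal limit is outstanding
    if job.limits:
        return "held"        # stays in LIMIT (or DEPEND)
    return "released"


stack = [override_plugin, max_running_plugin, max_active_plugin]
print(apply_limits(Job(userid=1000), stack))                 # rejected
print(apply_limits(Job(userid=0), stack))                    # released
print(apply_limits(Job(userid=1000), [max_running_plugin]))  # held
```

Deferring the decision until every callback has run is what lets the override plugin appear first in the stack and still win; the remaining entries in `job.limits` are what a job listing utility would display.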
This scheme allows a lot of flexibility in how limits, at least those limits which can be enforced by the job manager, are applied and removed. Mainly, it allows limits to be added and removed by separate plugins, and allows some insight into which limits may be holding up a job.