-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
plugin: add max_active_jobs per-user/bank limit #201
Conversation
shares int(11) DEFAULT 1 NOT NULL ON CONFLICT REPLACE DEFAULT 1, | ||
job_usage real DEFAULT 0.0 NOT NULL, | ||
fairshare real DEFAULT 0.5 NOT NULL, | ||
max_jobs int(11) DEFAULT 5 NOT NULL ON CONFLICT REPLACE DEFAULT 5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I know this is still a draft, but I have one suggestion: is max_jobs
actually max_running_jobs
? If so, it might be nice to rename now rather than later since max_jobs
and max_active_jobs
could be easily confused (at least for me). By renaming max_jobs
to max_running_jobs
or max_runnable_jobs
or whatever, it would make it clear throughout which limit applies to the total number of active jobs vs just runnable jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're one step ahead of me! I was pushing that work off until I got the max_active_jobs
named throughout the project; changing max_jobs
to max_running_jobs
or something is still left to do (and I agree it's probably a necessary change) and will most likely result in a number of changes throughout the project as well, not just in the plugin. Sorry, I probably should have written that as a TODO in the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I apologize for jumping the gun!
10359d8
to
728cbf1
Compare
OK, I think I'll take this out of [WIP] and mark it as ready for an initial review whenever convenient. I prepended the set of commits that actually add the |
If it would make more sense, you could propose the renaming as a separate PR. This might help users browsing the release notes since we have an entry for each PR, generally. One reason to squash the rename into one big commit is so that no one commit breaks the build/test of the project. I would leave it up to your discretion whether you think there is value in keeping the commits separate. (I would personally lean toward just one commit if all the commit messages are largely the same) |
OK! I like that suggestion. I'll squash those commits and propose them in a separate PR. Thanks! |
5be3007
to
f51fef3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! Nice work.
Just some minor suggestions inline, and one possible bug (though I may have read the code wrong, apologies if so!)
src/cmd/flux-account.py
Outdated
@@ -77,6 +77,12 @@ def add_add_user_arg(subparsers): | |||
default=5, | |||
metavar="MAX_RUNNING_JOBS", | |||
) | |||
subparser_add_user.add_argument( | |||
"--max-active-jobs", | |||
help="max number of both submitted and running jobs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/submitted/pending/?
src/cmd/flux-account.py
Outdated
@@ -128,6 +134,12 @@ def add_edit_user_arg(subparsers): | |||
default=None, | |||
metavar="MAX_RUNNING_JOBS", | |||
) | |||
subparser_edit_user.add_argument( | |||
"--max-active-jobs", | |||
help="max number of both submitted and running jobs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/submitted/pending/?
src/plugins/mf_priority.cpp
Outdated
// if a user/bank has reached their max_active_jobs limit, subsequently | ||
// submitted jobs will be rejected | ||
if (max_run_jobs > 0 && cur_active_jobs >= max_active_jobs) | ||
return flux_jobtap_reject_job (p, args, "user has max submitted jobs"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/submitted/active/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, thanks for pointing this out. I just squashed and pushed up a change to fix the use of submitted
--> active
and submitted
--> pending
, where appropriate.
src/plugins/mf_priority.cpp
Outdated
|
||
// if a user's fairshare value is 0, that means they shouldn't be able | ||
// to run jobs on a system | ||
if (fairshare == 0) | ||
return flux_jobtap_reject_job (p, args, "user fairshare value is 0"); | ||
|
||
// if a user/bank has reached their max_active_jobs limit, subsequently | ||
// submitted jobs will be rejected | ||
if (max_run_jobs > 0 && cur_active_jobs >= max_active_jobs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be:
if (max_active_jobs > 0 && cur_active_jobs >= max_active_jobs)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, you're absolutely right. This check should be for max_active_jobs
. I just squashed and pushed up a change that fixes this if
condition.
test_debug "cat max_active_jobs.out" && | ||
grep "user has max submitted jobs" max_active_jobs.out && | ||
flux job cancel $jobid1 && | ||
flux job cancel $jobid2 && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: In one of these tests ensure that a user that has hit max_active_jobs limit can submit a new job after one of the active jobs has been canceled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this suggestion! Thanks. I just squashed and pushed up a change to this test file where I only cancel one job initially, and submit a new job and ensure that it does not get rejected since the user is now under their max_active_jobs
limit. Let me know what you think.
Add a new field to the association_table, max_active_jobs, which represents the maximum number of pending+running jobs a user/bank combo can have at any given time.
Add max_active_jobs to the payload data that is sent from the flux-accounting DB to the priority plugin.
Rename current_jobs to cur_run_jobs and max_running_jobs to max_run_jobs to better describe a user/bank row's "running" jobs count in preparation for adding an "active" jobs count to the plugin.
Integrate the max_active_jobs field for user/bank rows into the priority plugin. Enforce its value as a limit for a user/bank row; if the number of pending+running jobs reaches this limit, subsequently submitted jobs will be rejected with a message saying that the max_active_jobs limit is reached. The active counter for this limit is incremented as a job is exiting job.validate and is decremented as a job enters job.state.inactive.
Add basic tests for enforcing the max_active_jobs limit for a user/bank row.
f51fef3
to
1e72edc
Compare
Codecov Report
@@ Coverage Diff @@
## master #201 +/- ##
==========================================
+ Coverage 83.76% 83.87% +0.11%
==========================================
Files 23 23
Lines 1004 1011 +7
==========================================
+ Hits 841 848 +7
Misses 163 163
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! LGTM now!
Woohoo! Thanks for the feedback and review. Setting MWP |
Problem
The changes introduced in #177 left somewhat of a "bad behavior" loophole in the priority plugin when it comes to handling job submissions from a user/bank row. As a quick refresher, #177 introduced a change to the way the plugin kept track of a running jobs for a user/bank row, so it only counts jobs that had entered the RUN state. So, for example, if a user/bank row had a
max_jobs
limit of 2, and they submitted 4 jobs at once, the first two jobs could enter the RUN state, but the other 2 would have dependencies added to them (thus, holding the jobs), until one of the first two jobs entered the INACTIVE state, which would remove the dependency on one of the held jobs and allow it to enter the RUN state.However, there is currently no way to enforce the amount of active jobs from a user/bank row. So, technically, someone with a
max_jobs
limit of 2 could submit 100 jobs at once, leaving 2 jobs to run at a time while 98 others sit in the queue. This behavior should be prevented and enforced for every user/bank combination.This
[WIP]PR introduces a new user/bank row limit to both theassociation_table
in the flux-accounting database and to the multi-factor priority plugin, calledmax_active_jobs
. This limit represents the maximum number of jobs able to be pending + running. This limit can be set at create time for a row and can be modified at any time for a row in theassociation_table
.In the priority plugin, this limit is integrated into every
bank_info
struct for every user/bank. Its limit is enforced when a job entersjob.validate
. If the number of pending + running jobs reaches themax_active_jobs
limit, subsequently submitted jobs will be rejected with a message saying that the limit has been reached, which I believe lines up with current behavior on our systems.This counter is incremented as a job exits
job.validate
and is decremented as a job entersjob.state.inactive
.Additional tests were added to
t1005-max-jobs-limits.t
to test the new limit by submitting a max number of jobs and ensuring an error message is output for jobs submitted while themax_active_jobs
limit is reached.Fixes #196