Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plugin: add max_active_jobs per-user/bank limit #201

Merged
merged 6 commits into from
Feb 7, 2022

Conversation

cmoussa1
Copy link
Member

@cmoussa1 cmoussa1 commented Feb 2, 2022

Problem

The changes introduced in #177 left somewhat of a "bad behavior" loophole in the priority plugin when it comes to handling job submissions from a user/bank row. As a quick refresher, #177 introduced a change to the way the plugin kept track of a running jobs for a user/bank row, so it only counts jobs that had entered the RUN state. So, for example, if a user/bank row had a max_jobs limit of 2, and they submitted 4 jobs at once, the first two jobs could enter the RUN state, but the other 2 would have dependencies added to them (thus, holding the jobs), until one of the first two jobs entered the INACTIVE state, which would remove the dependency on one of the held jobs and allow it to enter the RUN state.

However, there is currently no way to enforce the amount of active jobs from a user/bank row. So, technically, someone with a max_jobs limit of 2 could submit 100 jobs at once, leaving 2 jobs to run at a time while 98 others sit in the queue. This behavior should be prevented and enforced for every user/bank combination.


This [WIP] PR introduces a new user/bank row limit to both the association_table in the flux-accounting database and to the multi-factor priority plugin, called max_active_jobs. This limit represents the maximum number of jobs able to be pending + running. This limit can be set at create time for a row and can be modified at any time for a row in the association_table.

In the priority plugin, this limit is integrated into every bank_info struct for every user/bank. Its limit is enforced when a job enters job.validate. If the number of pending + running jobs reaches the max_active_jobs limit, subsequently submitted jobs will be rejected with a message saying that the limit has been reached, which I believe lines up with current behavior on our systems.

This counter is incremented as a job exits job.validate and is decremented as a job enters job.state.inactive.

Additional tests were added to t1005-max-jobs-limits.t to test the new limit by submitting a max number of jobs and ensuring an error message is output for jobs submitted while the max_active_jobs limit is reached.

Fixes #196

@cmoussa1 cmoussa1 added new feature new feature high priority items that must be worked on for major milestones labels Feb 2, 2022
shares int(11) DEFAULT 1 NOT NULL ON CONFLICT REPLACE DEFAULT 1,
job_usage real DEFAULT 0.0 NOT NULL,
fairshare real DEFAULT 0.5 NOT NULL,
max_jobs int(11) DEFAULT 5 NOT NULL ON CONFLICT REPLACE DEFAULT 5,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I know this is still a draft, but I have one suggestion: is max_jobs actually max_running_jobs? If so, it might be nice to rename now rather than later since max_jobs and max_active_jobs could be easily confused (at least for me). By renaming max_jobs to max_running_jobs or max_runnable_jobs or whatever, it would make it clear throughout which limit applies to the total number of active jobs vs just runnable jobs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're one step ahead of me! I was pushing that work off until I got the max_active_jobs named throughout the project; changing max_jobs to max_running_jobs or something is still left to do (and I agree it's probably a necessary change) and will most likely result in a number of changes throughout the project as well, not just in the plugin. Sorry, I probably should have written that as a TODO in the PR description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I apologize for jumping the gun!

@cmoussa1 cmoussa1 force-pushed the add.max-active-jobs branch 2 times, most recently from 10359d8 to 728cbf1 Compare February 3, 2022 16:40
@cmoussa1
Copy link
Member Author

cmoussa1 commented Feb 3, 2022

OK, I think I'll take this out of [WIP] and mark it as ready for an initial review whenever convenient.

I prepended the set of commits that actually add the max_active_jobs limits with a series of commits that rename max_jobs to max_running_jobs throughout the flux-accounting project: in the DB, the Python bindings and commands, the existing sharness tests, and the priority plugin. I tried to separate them where I felt appropriate, but if you think they should just be squashed into one larger "renaming" commit, I'm totally fine with that and can squash.

@cmoussa1 cmoussa1 marked this pull request as ready for review February 3, 2022 17:03
@cmoussa1 cmoussa1 changed the title [WIP] plugin: add max_active_jobs per-user/bank limit plugin: add max_active_jobs per-user/bank limit Feb 3, 2022
@grondo
Copy link
Contributor

grondo commented Feb 3, 2022

If it would make more sense, you could propose the renaming as a separate PR. This might help users browsing the release notes since we have an entry for each PR, generally.

One reason to squash the rename into one big commit is so that no one commit breaks the build/test of the project. I would leave it up to your discretion whether you think there is value in keeping the commits separate. (I would personally lean toward just one commit if all the commit messages are largely the same)

@cmoussa1
Copy link
Member Author

cmoussa1 commented Feb 3, 2022

OK! I like that suggestion. I'll squash those commits and propose them in a separate PR. Thanks!

@cmoussa1 cmoussa1 force-pushed the add.max-active-jobs branch 2 times, most recently from 5be3007 to f51fef3 Compare February 7, 2022 15:54
@cmoussa1
Copy link
Member Author

cmoussa1 commented Feb 7, 2022

Just rebased to catch up after #202 and #203.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! Nice work.

Just some minor suggestions inline, and one possible bug (though I may have read the code wrong, apologies if so!)

@@ -77,6 +77,12 @@ def add_add_user_arg(subparsers):
default=5,
metavar="MAX_RUNNING_JOBS",
)
subparser_add_user.add_argument(
"--max-active-jobs",
help="max number of both submitted and running jobs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/submitted/pending/?

@@ -128,6 +134,12 @@ def add_edit_user_arg(subparsers):
default=None,
metavar="MAX_RUNNING_JOBS",
)
subparser_edit_user.add_argument(
"--max-active-jobs",
help="max number of both submitted and running jobs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/submitted/pending/?

// if a user/bank has reached their max_active_jobs limit, subsequently
// submitted jobs will be rejected
if (max_run_jobs > 0 && cur_active_jobs >= max_active_jobs)
return flux_jobtap_reject_job (p, args, "user has max submitted jobs");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/submitted/active/?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, thanks for pointing this out. I just squashed and pushed up a change to fix the use of submitted --> active and submitted --> pending, where appropriate.


// if a user's fairshare value is 0, that means they shouldn't be able
// to run jobs on a system
if (fairshare == 0)
return flux_jobtap_reject_job (p, args, "user fairshare value is 0");

// if a user/bank has reached their max_active_jobs limit, subsequently
// submitted jobs will be rejected
if (max_run_jobs > 0 && cur_active_jobs >= max_active_jobs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be:

if (max_active_jobs > 0 && cur_active_jobs >= max_active_jobs)

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, you're absolutely right. This check should be for max_active_jobs. I just squashed and pushed up a change that fixes this if condition.

test_debug "cat max_active_jobs.out" &&
grep "user has max submitted jobs" max_active_jobs.out &&
flux job cancel $jobid1 &&
flux job cancel $jobid2 &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: In one of these tests ensure that a user that has hit max_active_jobs limit can submit a new job after one of the active jobs has been canceled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this suggestion! Thanks. I just squashed and pushed up a change to this test file where I only cancel one job initially, and submit a new job and ensure that it does not get rejected since the user is now under their max_active_jobs limit. Let me know what you think.

Add a new field to the association_table, max_active_jobs,
which represents the maximum number of pending+running jobs
a user/bank combo can have at any given time.
Add max_active_jobs to the payload data that is sent from
the flux-accounting DB to the priority plugin.
Rename current_jobs to cur_run_jobs and max_running_jobs
to max_run_jobs to better describe a user/bank row's
"running" jobs count in preparation for adding an "active"
jobs count to the plugin.
Integrate the max_active_jobs field for user/bank rows
into the priority plugin. Enforce its value as a limit for a
user/bank row; if the number of pending+running jobs reaches
this limit, subsequently submitted jobs will be rejected with
a message saying that the max_active_jobs limit is reached.

The active counter for this limit is incremented as a job is exiting
job.validate and is decremented as a job enters job.state.inactive.
Add basic tests for enforcing the max_active_jobs limit
for a user/bank row.
@cmoussa1 cmoussa1 force-pushed the add.max-active-jobs branch from f51fef3 to 1e72edc Compare February 7, 2022 17:02
@codecov
Copy link

codecov bot commented Feb 7, 2022

Codecov Report

Merging #201 (ff4fc3a) into master (ff1d321) will increase coverage by 0.11%.
The diff coverage is 100.00%.

❗ Current head ff4fc3a differs from pull request most recent head 1e72edc. Consider uploading reports for the commit 1e72edc to get more accurate results

@@            Coverage Diff             @@
##           master     #201      +/-   ##
==========================================
+ Coverage   83.76%   83.87%   +0.11%     
==========================================
  Files          23       23              
  Lines        1004     1011       +7     
==========================================
+ Hits          841      848       +7     
  Misses        163      163              
Impacted Files Coverage Δ
src/plugins/mf_priority.cpp 88.03% <100.00%> (+0.76%) ⬆️

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! LGTM now!

@cmoussa1
Copy link
Member Author

cmoussa1 commented Feb 7, 2022

Woohoo! Thanks for the feedback and review. Setting MWP

@mergify mergify bot merged commit 565535e into flux-framework:master Feb 7, 2022
@cmoussa1 cmoussa1 deleted the add.max-active-jobs branch February 7, 2022 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
high priority items that must be worked on for major milestones merge-when-passing new feature new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

plugin: enforce a max_submit_jobs limit
2 participants