Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Priority plugin work-in-progress #290

Merged
merged 2 commits into from
Feb 9, 2018
Merged

Priority plugin work-in-progress #290

merged 2 commits into from
Feb 9, 2018

Conversation

morrone
Copy link
Contributor

@morrone morrone commented Feb 9, 2018

This pull request preserves the priority plugin work-in-progress of Don and Chris for posterity.

lipari and others added 2 commits February 8, 2018 15:57
The new priority plugin can be loaded in addition to the existing fcfs
or backfill plugins.

With the introduction of the priority plugin, the existing "sched_plugin"
had to be renamed to the "behavior_plugin".  "sched_plugin" now refers to
any plugin loaded by the sched module.  The two categories of sched
plugins, with different API's, are the behavior and priority plugins.
This is a continuation (but not a completion) of the multi-factor
priority plugin that was initiated by Don Lipari.  Don retired
before there was time to complete the plugin.

The code was uplifted from C to C++.  The idea behind that change
is that some of the core scheduler code is also C++, and a native
C++ priority plugin would allow an API where the parts can directly
exchange STL data structures such as vectors and remain RAII through
out the process rather than dropping down to C and manual memory
management.

The parser for the account/user information from the SLURM sacct
command had numerous bugs.  Many are now fixed.  Also, the parser
and the rest of the code were improved to allow the sacct information
to be reloaded and adjusted online (previously the data was completly
static after initial read).

Some of the sacctmgr parsing bugs were:

Add better error handling for missing "|".  A missing "|" would have
resulted in the an entry containing information from the previous line.
Now at least the bad line will be ignored with and error.

The code failed to properly copy temporary strings.  That is also resolved.
prio_node_create() makes copies of its inputs.  Proper error handling is
added for memory allocation failures, and prio_node_destroy() is added
to clean up the structure.

Also we correct the code to allow the "if (user)" check to function.
Previously "user" was a pointer to an empty string, meaning that
the condition would never be false.  Now the parser checks for an
empty field and sets "user" to NULL rather than pointing to an
empty string.  If white space were ever to appear in the field,
this check would not function correctly.

Also, often in the code the term "child" was used when it really
meant "sibling".  Naming was changed to better reflect the behavior.

A simple first test command is added.

norm_usage/norm_shares were caluclated incorrectly.  That has been
fixed.

The code comment describing the algorithm did not really match
the algorithm.  That has been somewhat improved.

Under the "sched" directory make a new "priority" sub-directory.
The "priority" direcotry will contain the various priority plugins.
The first priority plugin is in the mod_fair_tree (Modified Fair Tree)
subdirectory.

The files in the plugin are first compiled into a libtool convenience
library which is in turn used to both build the shared library and
directly linked against by test applications.  Test applications can
include either of the headers, depending on testing needs.  The shared
library will only export the functions that appear in
the "priority_external_api.hxx" file, so use of the internal header
and the convenience library give additional symbol access for testing.

The implementation of the fair share algorithm is improved to begin
genericizing it from some of the odd sacctmgr designs (e.g. the
"parent" account changes column from either "Parent" or "Account"
based on the values of _other_ columns).

The recursion is cleaned up considerably.

Make it possible to add and remove accounts/users (WIP).  While
accounts can be removed (just load in a new set of data from sacctmgr
that has the account missing), there is perhaps more work to be
done here.  For instance, if a user has a job running and meanwhile
that user is removed from the job's current account, what needs to
happen?  Presumably we need to keep record of that user until all
jobs from that user are complete so that resource usage can correctly
be applied to the parent accounts up the tree.  So the user
perhaps the user needs to be marked deleted and no longer eligible
for new jobs, but kept around until all current jobs are complete.

Duplicate accounts are ignored.

operator<< functions are provided for debugging purposes.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.6%) to 75.958% when pulling 3818f8e on morrone:priority_wip into b16e2dd on flux-framework:master.

@codecov-io
Copy link

codecov-io commented Feb 9, 2018

Codecov Report

Merging #290 into master will decrease coverage by 0.49%.
The diff coverage is 44.28%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #290     +/-   ##
=========================================
- Coverage   74.69%   74.19%   -0.5%     
=========================================
  Files          49       49             
  Lines        9080     9176     +96     
=========================================
+ Hits         6782     6808     +26     
- Misses       2298     2368     +70
Impacted Files Coverage Δ
resrc/resrc.h 100% <ø> (ø) ⬆️
src/common/libutil/shortjansson.h 88.33% <100%> (ø) ⬆️
sched/plugin.c 55.64% <38.77%> (-14.02%) ⬇️
sched/sched.c 74.44% <56.09%> (-0.93%) ⬇️
sched/sched_backfill.c 90.19% <0%> (-0.66%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b16e2dd...3818f8e. Read the comment docs.

@morrone morrone merged commit a4dae42 into flux-framework:master Feb 9, 2018
@grondo grondo mentioned this pull request May 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants