Speed up final constraint enforcement #1286
Conversation
problem: resource pools and relations don't have move constructors or move assignment

solution: add them, defaulted, and remove the unnecessary explicit constructors
problem: we're accidentally hashing the unique_id twice in some cases, and using a very slow and not great hash for the pointer in the shared_ptr case

solution: specialize the rc storage to more efficiently hash shared_ptr, and only compute the unique_id hash once
Force-pushed from 2eec4cd to dad5626
problem: scoring_api does a lot of extra std::map lookups that aren't really necessary, and uses a map where an interned_key_vec would do at least as well

solution: swap the std::map for the interned_key_vec, and remove a whole lot of calls to handle new keys, which actually hadn't been necessary for a bit
problem: we have typedefs for common maps of subsystem to string or set of strings, but we're using std::map directly instead

solution: stop doing that
problem: we store a mapping of subsystems in every edge, but each edge is only ever in one subsystem, they're effectively an edge type, so it's a waste of time and space both in memory and in all our serialization formats that carry it like JGF

solution: swap the std::map<string,subsystem_t> for a single subsystem_t, fix the breakage in all the readers and writers and fix all the JGFs in the repository
problem: the files _start_ with json but then have other output after that, so putting them through jq causes errors

solution: use test_cmp instead
Force-pushed from dad5626 to cd63a6a
problem: In commit b96899b, as part of adding node exclusive, enforce_dfv was added to precisely mark unconstrained resources between the root of the containment tree and the constrained resources. This works fine, but it's a _full traversal of the graph_ that gets run even when selecting a single core. Usually this isn't terrible performance because if the Node type is constrained then all the node vertices get skipped after the constrained one is found. When the job is only constrained on a core or socket however, or if the node isn't found quickly, it can be as bad as a full traversal. Literally 90% of the time spent matching a small single core job in the perf test case in issue flux-framework#1171.

solution: The short version is: Remove enforce_dfv completely. The longer version is that instead of traversing the entire graph, we now re-use the iteration over constrained resources in dfu_impl_t::enforce and walk upward from each constrained vertex in the dominant subsystem until we find a vertex where we've already set best_k (since any above that must already have been set). That way we walk the minimum amount of the tree necessary. Honestly it's still doing more work than it should, because we're still iterating over resources that are children of exclusive resources unnecessarily, and we're having to iterate over `in_edges` rather than directly knowing the parent or subsetting on subsystem, but fixing either of those would be a meaningful refactor so I'm stopping here for now with the 15-20x speedup.

fixes flux-framework#1171
Force-pushed from 0d5065a to 85d61d8
problem: several files have extra _seconds_ of compile time due to over-including headers in headers

solution: include fewer things we do not need
Wow! Huge improvement here:
```
Submitting 100 jobs: │███████████████████████████████████████│100.0% 150.5 job/s

real 0m4.229s
user 0m0.063s
sys  0m0.008s

{
  "V": 1654785,
  "E": 1654784,
  "by_rank": {
    "[0-16383]": 101
  },
  "load-time": 75.918168234999996,
  "graph-uptime": 40,
  "time-since-reset": 116,
  "match": {
    "succeeded": {
      "njobs": 100,
      "njobs-reset": 100,
      "max-match-jobid": 1950452023297,
      "max-match-iters": 1,
      "stats": {
        "min": 0.022669928999999998,
        "max": 0.028772064,
        "avg": 0.027917891049999994,
        "variance": 1.0197218638951509e-6
      }
    },
    "failed": {
      "njobs": 0,
      "njobs-reset": 0,
      "max-match-jobid": 0,
      "max-match-iters": 0,
      "stats": {
        "min": 0.0,
        "max": 0.0,
        "avg": 0.0,
        "variance": 0.0
      }
    }
  }
}
```
One question: will this break existing JGF on clusters that have it, or if Fluxion is reloaded from an older version on some system?

Oh, and also, does `flux ion-R` need an update? (Sorry if I missed something obvious)
It's subtle, but all the JGF handling in there is done by that other file, come to find out. It actually took me a shockingly long time to find that. As to breaking existing JGF, it's possible if we're using JGF in something on those systems, which we might be in some cases. I don't see where, though; it looks like we're just storing R for the R key on tioga, for example, so I don't think there's anything to break there. This is part of why I asked about evolving JGF at the last meeting. If we need it, I can make the reader ingest either the new or old format and always turn it into this internal representation, but it will not be able to emit the old format without some loss.
Ok, sounds good. Perhaps @jameshcorbett could comment on whether any systems are actively using JGF (though I think he mentioned it shouldn't be an issue to upgrade).
Problem: flux-framework/flux-sched/pull/1286 changed the way edges are represented in JGF. Update the expected JGF output of flux-dws2jgf.
Codecov Report. Attention: Patch coverage is

Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##           master   #1286     +/-  ##
=======================================
  Coverage    75.3%   75.3%
=======================================
  Files         111     110      -1
  Lines       15361   15236    -125
=======================================
- Hits        11575   11482     -93
+ Misses       3786    3754     -32
```
This includes fixes for various performance issues in the scoring API and resource data types (it should even speed up graph creation by making the resource types movable). The main item, though, is below. It takes the performance of the benchmark from #1171 from 10 jobs/s for constrained jobs and 15 jobs/s for unconstrained to 100-150 jobs/s for constrained and 150-200 jobs/s for unconstrained. For some reason this patch also consistently hits the issue in the progress bar where it will reach past the end of the style, so if that comes up make sure to be testing with flux-core master.
problem: In commit b96899b, as part of adding node exclusive,
enforce_dfv was added to precisely mark unconstrained resources between
the root of the containment tree and the constrained resources. This
works fine, but it's a full traversal of the graph that gets run even
when selecting a single core. Usually this isn't terrible performance
because if the Node type is constrained then all the node vertices get
skipped after the constrained one is found. When the job is only
constrained on a core or socket however, or if the node isn't found
quickly, it can be as bad as a full traversal. Literally 90% of the
time spent matching a small single core job in the perf test case in
issue #1171.
solution: The short version is: Remove enforce_dfv completely. The
longer version is that instead of traversing the entire graph, we now
re-use the iteration over constrained resources in dfu_impl_t::enforce
and walk upward from each constrained vertex in the dominant subsystem
until we find a vertex that we've already set best_k (since any above
that must already have been set). That way we walk the minimum amount
of the tree necessary. Honestly it's still doing more work than it
should, because we're still iterating over resources that are children
of exclusive resources unnecessarily, and we're having to iterate over
`in_edges` rather than directly knowing the parent or subsetting on
subsystem, but fixing either of those would be a meaningful refactor so
I'm stopping here for now with the 15-20x speedup.
fixes #1171