possible performance issue in sched.resource-status RPC #1137
Comments
Related #1039
Thanks for the helpful reproducer @grondo. The first thing I notice is that while Fluxion is certainly slower than sched-simple, the majority of the time (>55%) is spent outside of […]:

SCHEDULER NNODES T(sched.resource-status) T(resource.status)
sched-simple 128 0.097 0.091
sched-simple 256 0.101 0.086
sched-simple 512 0.118 0.090
sched-simple 1024 0.176 0.089
sched-simple 2048 0.337 0.086
sched-simple 4096 0.668 0.088
sched-simple 8192 1.374 0.089
sched-simple 16384 2.858 0.085
Elapsed times: R all 0.0302172 R down 0.0274029 R alloc 0.026943
fluxion 128 0.185 0.100
Elapsed times: R all 0.068368 R down 0.0651569 R alloc 0.061074
fluxion 256 0.395 0.088
Elapsed times: R all 0.119063 R down 0.103589 R alloc 0.102197
fluxion 512 0.750 0.099
Elapsed times: R all 0.230068 R down 0.208475 R alloc 0.198806
fluxion 1024 1.481 0.098
Elapsed times: R all 0.474467 R down 0.421471 R alloc 0.412955
fluxion 2048 3.038 0.090
Elapsed times: R all 0.899421 R down 0.845332 R alloc 0.832985
fluxion 4096 6.018 0.098
Elapsed times: R all 1.89333 R down 1.66922 R alloc 1.69468
fluxion 8192 12.195 0.094
Elapsed times: R all 3.67655 R down 3.43581 R alloc 3.28851
fluxion 16384 24.251 0.098
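For reference, here is a minimal sketch of how per-RPC timings like these can be collected with the Flux Python bindings. This is not the original reproducer script (which is not shown here); only the topic strings come from the measurements above, everything else is an assumption about the method.

```python
# Hedged sketch: time blocking sched.resource-status and resource.status RPCs.
# Assumes a running Flux instance with a scheduler module loaded.
import time
import flux

def time_rpc(handle, topic, repeats=3):
    """Return the mean wall-clock time of a blocking RPC to `topic`."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        handle.rpc(topic).get()  # send the request and wait for the response
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples)

if __name__ == "__main__":
    h = flux.Flux()
    for topic in ("sched.resource-status", "resource.status"):
        print(f"{topic}: {time_rpc(h, topic):.3f}s")
```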
If I repeat the test twice in a row for each scheduler, the Fluxion […]:

SCHEDULER NNODES T(sched.resource-status) T(sched.resource-status) T(resource.status)
sched-simple 128 0.112 0.115 0.102
sched-simple 256 0.106 0.127 0.102
sched-simple 512 0.131 0.130 0.095
sched-simple 1024 0.201 0.167 0.111
sched-simple 2048 0.464 0.254 0.108
sched-simple 4096 0.703 0.406 0.114
sched-simple 8192 1.398 0.624 0.084
sched-simple 16384 2.820 1.185 0.093
fluxion 128 0.200 0.185 0.090
fluxion 256 0.399 0.329 0.100
fluxion 512 0.759 0.428 0.092
fluxion 1024 1.541 0.758 0.093
fluxion 2048 3.050 1.418 0.091
fluxion 4096 6.116 2.699 0.089
fluxion 8192 12.087 5.355 0.093
fluxion 16384 24.434 10.514 0.110

Most of the relative speedup is likely due to Fluxion not needing to initialize the second time.
Here are the timings if I implement a cache on […]:

SCHEDULER NNODES T(sched.resource-status) T(sched.resource-status) T(resource.status)
sched-simple 128 0.096 0.095 0.083
sched-simple 256 0.100 0.098 0.086
sched-simple 512 0.120 0.123 0.095
sched-simple 1024 0.176 0.145 0.095
sched-simple 2048 0.373 0.225 0.084
sched-simple 4096 0.713 0.412 0.086
sched-simple 8192 1.380 0.651 0.088
sched-simple 16384 2.843 1.280 0.105
fluxion 128 0.180 0.116 0.090
fluxion 256 0.390 0.150 0.088
fluxion 512 0.754 0.198 0.088
fluxion 1024 1.519 0.298 0.092
fluxion 2048 3.050 0.537 0.103
fluxion 4096 5.927 0.910 0.102
fluxion 8192 12.168 1.771 0.089
fluxion 16384 24.197 3.412 0.095

There are probably some clever ways to cache […]
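The caching approach here amounts to memoizing the prepared response and rebuilding it only when the resource state has changed since the last request. A small illustrative sketch follows; the class and method names are invented for illustration and are not Fluxion's actual internals.

```python
# Illustrative cache for an expensive, rarely-changing RPC response.
# Names are hypothetical; this is not Fluxion's actual implementation.
class ResourceStatusCache:
    def __init__(self, build_response):
        self._build = build_response  # the expensive traversal/formatting step
        self._payload = None          # last fully formatted response
        self._dirty = True            # set whenever allocations or status change

    def invalidate(self):
        """Call from the scheduler loop whenever the resource graph changes."""
        self._dirty = True

    def get(self):
        """Return the cached payload, rebuilding only when it is stale."""
        if self._dirty or self._payload is None:
            self._payload = self._build()
            self._dirty = False
        return self._payload
```

With something like this, repeat queries between scheduling events can be answered immediately, which is consistent with the second-call speedup in the table above.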
After further thought, a much better way to get the allocated state is just to query the root […]:

SCHEDULER NNODES T(sched.resource-status) T(sched.resource-status) T(resource.status)
sched-simple 128 0.119 0.097 0.084
sched-simple 256 0.098 0.097 0.081
sched-simple 512 0.115 0.113 0.092
sched-simple 1024 0.174 0.142 0.085
sched-simple 2048 0.332 0.211 0.082
sched-simple 4096 0.772 0.380 0.088
sched-simple 8192 1.377 0.619 0.084
sched-simple 16384 2.712 1.238 0.085
fluxion 128 0.152 0.085 0.096
fluxion 256 0.329 0.084 0.081
fluxion 512 0.640 0.083 0.081
fluxion 1024 1.323 0.083 0.084
fluxion 2048 2.630 0.090 0.082
fluxion 4096 5.329 0.086 0.081
fluxion 8192 10.623 0.081 0.100
fluxion 16384 21.286 0.081 0.095

With […], the Flux config needs to specify pruning filters for the appropriate resources:

[resource]
noverify = true
norestrict = true
[[resource.config]]
hosts = "test[1-${NNODES}]"
cores = "0-63"
gpus = "0-8"
[sched-fluxion-resource]
prune-filters="ALL:node,ALL:core,ALL:gpu" If it's not in use already, using pruning filters should be done in production as it will significantly accelerate matches and reservations. The remaining difficulty has to do with formatting the output of {"version":1,"execution":{"R_lite":[{"rank":"0-127","children":{"core":"0-63","gpu":"0-8"}}],"nodelist":["test[1-128]"],"starttime":0,"expiration":0}} @grondo do you have any suggestions for coercing an output like |
Nice result!
I do not think it is possible to convert aggregate counts like this into Rv1.
That's true, there isn't any way to map the counts back to the allocated resources without a traversal. Unfortunately, updating the allocation-to-resource mappings will require the roughly 3-second search at 16K nodes every time the allocations change (which will be every scheduler loop). The two solutions I can think of right away are caching the allocation searches and then throttling the updates (e.g., updating the mappings at most every X seconds). The other option would be to start a thread for this work.

Any other suggestions are most welcome.
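A rough sketch of the throttling idea: rebuild the expensive mapping at most once per interval and serve the previous result in between. The names and the interval value are illustrative, not a proposal for specific values.

```python
# Hypothetical throttle around the ~3s allocation-to-resource search:
# rebuild at most once per `interval` seconds, otherwise serve the old copy.
import time

class ThrottledMapping:
    def __init__(self, rebuild, interval=10.0):
        self._rebuild = rebuild    # callable performing the expensive search
        self._interval = interval  # minimum seconds between rebuilds
        self._stamp = 0.0
        self._value = None

    def get(self):
        now = time.monotonic()
        if self._value is None or now - self._stamp >= self._interval:
            self._value = self._rebuild()
            self._stamp = now
        return self._value  # may be up to `interval` seconds stale
```

The trade-off is that responses can be up to one interval out of date, which ties into the validity-time discussion below.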
We should, of course, devote time to improving the performance of traversals. That's complementary to the two ideas and will be needed anyway.
I analyzed the […]

and this comparison in the traverser: […]
If I short-circuit the searches and comparisons (because they will be […]):

SCHEDULER NNODES T(sched.resource-status) T(sched.resource-status) T(resource.status)
sched-simple 128 0.093 0.092 0.096
sched-simple 256 0.101 0.116 0.122
sched-simple 512 0.127 0.138 0.084
sched-simple 1024 0.206 0.147 0.081
sched-simple 2048 0.374 0.214 0.086
sched-simple 4096 0.674 0.397 0.100
sched-simple 8192 1.531 0.870 0.169
sched-simple 16384 3.701 1.776 0.142
fluxion 128 0.250 0.182 0.145
fluxion 256 0.410 0.172 0.128
fluxion 512 0.830 0.227 0.134
fluxion 1024 1.457 0.272 0.111
fluxion 2048 2.866 0.408 0.116
fluxion 4096 5.692 0.675 0.086
fluxion 8192 11.106 1.347 0.090
fluxion 16384 22.257 2.553 0.108

That's with a cache on […]
If you think about the use case in a system instance, there are likely to be many users running […].

I wonder if Fluxion could produce the […]?

I don't really understand the 20s initialization penalty here. Is it only a factor just after loading the module, or would there be a penalty after each scheduling loop changes the graph?

Another idea would be for us to somehow move the resource-status RPC out of the scheduler module. The flux-core […]
I think the validity time would probably need to be extended to 10s or beyond to be usable for hundreds of system users. Even at 10s, that means about 25% of the scheduler time is absorbed by updating […]
I was thinking along those lines, too. Fluxion could accumulate […]
Sorry, my previous discussion wasn't very clear. The 20s times you see above include Fluxion initialization when loading the module plus the […]
I think this is a good idea, and I think it would include assembling some sort of graph deltas that could be used to update […]
I could put together a PR with the caching and optimizations I performed so far pretty quickly. Assembling graph deltas and moving the resource-status RPC out of Fluxion would have a bigger payoff, but will take longer in my estimation. It will also supersede several of the changes I tried in this discussion. Do you think it's helpful to make a PR with the changes so far?
I was thinking more of a service provided by flux-core which could operate on Rv1 directly. Since the […]
It definitely would be beneficial!
I think a lot in terms of graphs, so what I meant was that, to me, updating Rv1 is a graph transformation (of course that can be done without explicit graph transformations). I don't think dealing with the Fluxion Boost resource graph will work well here, and I agree that operating on Rv1 directly is a good idea.
That sounds like a very good approach! I'm happy to help with this if needed.
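As a concrete (and deliberately simplified) illustration of operating on Rv1-level state directly, a service could keep allocated/down rank sets and apply small deltas as the scheduler loop allocates and frees resources, with no graph traversal at query time. Plain Python sets stand in for idsets here, and the delta format is invented purely for illustration.

```python
# Simplified sketch of maintaining resource status from allocation deltas.
# The event format ({"op": ..., "ranks": [...]}) is hypothetical.
class StatusTracker:
    def __init__(self, all_ranks):
        self.all = set(all_ranks)   # every broker rank in the instance
        self.allocated = set()
        self.down = set()

    def apply(self, event):
        """Apply one delta from the scheduler loop or resource module."""
        ranks = set(event["ranks"])
        if event["op"] == "alloc":
            self.allocated |= ranks
        elif event["op"] == "free":
            self.allocated -= ranks
        elif event["op"] == "down":
            self.down |= ranks
        elif event["op"] == "up":
            self.down -= ranks

    def status(self):
        """Answer a resource-status query without touching the graph."""
        free = self.all - self.allocated - self.down
        return {"free": sorted(free),
                "allocated": sorted(self.allocated),
                "down": sorted(self.down)}
```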
Even when I enable caching of all three outputs ([…]):

$ for i in {1..10}; do echo repetition $i; time -p flux resource list; done
repetition 1
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 11.05
user 8.54
sys 0.04
repetition 2
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 8.63
user 8.59
sys 0.03
repetition 3
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 8.78
user 8.74
sys 0.03
repetition 4
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 11.25
user 8.78
sys 0.03
repetition 5
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 8.96
user 8.92
sys 0.04
repetition 6
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 9.07
user 9.03
sys 0.03
repetition 7
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 11.34
user 8.80
sys 0.04
repetition 8
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 8.72
user 8.67
sys 0.04
repetition 9
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 9.22
user 9.17
sys 0.04
repetition 10
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 11.35
user 8.87
sys 0.04

As compared with the RPC, which exhibits a maximum return time approximately equal to the […]:

$ for i in {1..11}; do echo repetition $i; time -p rpc "sched.resource-status"; sleep 2; done
repetition 1
real 2.57
user 0.09
sys 0.01
repetition 2
real 0.10
user 0.08
sys 0.01
repetition 3
real 0.09
user 0.08
sys 0.01
repetition 4
real 0.10
user 0.08
sys 0.01
repetition 5
real 0.09
user 0.07
sys 0.01
repetition 6
real 0.09
user 0.08
sys 0.01
repetition 7
real 0.11
user 0.09
sys 0.01
repetition 8
real 0.10
user 0.08
sys 0.02
repetition 9
real 0.11
user 0.09
sys 0.01
repetition 10
real 0.10
user 0.08
sys 0.01
repetition 11
real 2.64
user 0.08
sys 0.02

(Note the 2.64s time at the 11th iteration.)
@grondo I forgot to highlight the considerable additional overhead involved in […]
Yes, that is a known issue (though I'd forgotten how bad it was). The command is in Python and also has to go through gyrations to collect identical lines of output. I will try to open an issue if we don't have one already.
Since flux-framework/flux-core#5796 is being merged, solving this issue is not a priority.
I haven't seen this issue recently. I'll close, and we can reopen if necessary, or open a new issue.
As mentioned in a meeting yesterday, during preparation for scale testing it was noted that flux resource list response times were excessively slow when Fluxion was the loaded scheduler module. To probe the performance of this RPC, the following script was developed which can test the response times of the RPCs used by flux resource list at different node sizes. Results for sched-simple are included as well for comparison:

[…]

And here are the results:

[…]
I would expect Fluxion to have a similar response time to sched-simple, but instead it is approximately 8x slower (and even 4s is probably too slow for production use). The other worry is that for most of this ~30s, the resource module is busy assembling the response and cannot work on scheduling jobs. On a production system it could be common for many resource-listing queries per second to be issued to the scheduler, so this could keep the module occupied for a large percentage of the time.
As a first cut, it might be useful to think about making a cached response to this RPC that could be sent immediately when valid.