Yield after channel send and move cpu tasks to thread #1163
Conversation
```rust
@@ -408,14 +411,16 @@ impl<'a, Provider: ResolverProvider> Resolver<'a, Provider> {
            }
        }
    }
    // Yield after sending on a channel to allow the subscribers to continue
    tokio::task::yield_now().await;
```
Are you able to explain what this is doing / why it works?
The channel sending is sync, so my theory is that the task doesn't yield until all the work for a certain section (e.g. (pre-)fetching all version maps) is done, and thereby blocks the receiving end from taking over (the requests and the solver are both on the main thread as far as I can tell). I've confirmed that there is a gap between e.g. a prefetch request being sent and it being received/executed.

You can see this in the span blocks for warm-cache jupyter on main:

We shouldn't have to wait until all the Simple API requests are done to start the metadata requests.

In comparison, on this branch everything happens concurrently:

The spans are also shorter, which makes sense: previously they were only missing the chance to complete because another task was blocking the thread.
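To make the theory above concrete, here is a minimal, self-contained sketch (not the puffin code; it only assumes the `tokio` and `futures` crates) of a synchronous unbounded send starving the receiver on a single-threaded runtime until the sender explicitly yields:

```rust
use futures::StreamExt;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, mut rx) = futures::channel::mpsc::unbounded::<u32>();

    // Stand-in for the request-processing side (e.g. fetching version maps).
    let receiver = tokio::spawn(async move {
        while let Some(request) = rx.next().await {
            println!("processing request {request}");
        }
    });

    for request in 0..4 {
        // `unbounded_send` is synchronous and never awaits, so on a single
        // thread the receiver cannot run until the sender hits an await point.
        tx.unbounded_send(request).unwrap();
        // Hand control back to the scheduler so the receiver can pick the
        // request up now instead of only after the whole loop has finished.
        tokio::task::yield_now().await;
    }
    drop(tx); // close the channel so the receiver loop terminates

    receiver.await.unwrap();
}
```

Removing the `yield_now()` call makes all four "processing" lines print only after the send loop is done, which is the blocking behavior described above.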
Great, thank you!
I've updated the OP with better plots and a proper description of how to read them.
@konstin Can you move your lovely comment into the source here? Obviously the images won't be able to make it in, but even just the text would be very helpful.
We can try with bounded channels, seems straightforward?
@konstin - I defer to you to try this out, but seems like it could be a good solution if it also lets us remove the yields.
@BurntSushi Why is a bounded channel better, and what would a reasonable channel size be? Put differently, what is the advantage of stopping the producer, given that we know our problem is finite and reasonably sized?
I don't have anything concrete to say in the context of puffin other than the hand-wavy "it provides back-pressure." As for why back-pressure is important, I'll point to the following:
- https://lucumr.pocoo.org/2020/1/1/async-pressure/
- https://medium.com/@jayphelps/backpressure-explained-the-flow-of-data-through-software-2350b3e77ce7
- https://ferd.ca/handling-overload.html
That is, the point of bounded channels isn't really that they "stop," but rather, that they react to things downstream that have gotten clogged. Namely, if a bounded channel stops, that's usually only a symptom of either a capacity that is too small or something downstream that has gotten overloaded. In the former case, we fix it by tuning the capacity. In the latter case, well, it's a feature and not a bug that the bounded channel has stopped sending more input into something that is already overloaded.
As for what a reasonable capacity would be... I'm not sure. It's not that dissimilar from `Vec::with_capacity`. Usually starting with a small (or even zero) capacity is good enough. But one can experimentally arrive at a different figure. Maybe start with the number of cores? (Or just hard-code 16, since that's probably only a little more than what most folks have, and a capacity smaller than the number of cores doesn't mean that not all cores will be used.) Just a guess.
Basically, my position is that bounded channels should be the default, because they (hand-wavily) lead to overall better holistic properties of the system. They provide a means for back-pressure. Unbounded channels have no mechanism for back-pressure, and I think they are generally an anti-pattern unless there is some specific motivation for them.
Also, a nice historical anecdote: in `std::sync::mpsc`, the "async" channels are unbounded and the "sync" channels are bounded.
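As a rough, generic illustration of that back-pressure argument (a tokio sketch with made-up numbers, not puffin code): with a bounded channel, a producer that outruns a slow consumer eventually waits in `send` instead of queueing work without limit.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Tiny capacity chosen only for the demo; real capacities are a tuning choice.
    let (tx, mut rx) = mpsc::channel::<u32>(4);

    // Slow consumer: drains one item every 10 ms.
    let consumer = tokio::spawn(async move {
        while let Some(item) = rx.recv().await {
            tokio::time::sleep(Duration::from_millis(10)).await;
            println!("consumed {item}");
        }
    });

    for item in 0..16 {
        // Once four items are buffered, this await parks the producer until
        // the consumer frees a slot: downstream overload propagates upstream.
        tx.send(item).await.unwrap();
        println!("produced {item}");
    }
    drop(tx); // close the channel so the consumer loop terminates

    consumer.await.unwrap();
}
```

With an unbounded channel the same producer would finish immediately and the queue would just keep growing, which is the overload scenario the links above describe.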
Switched to `tokio::sync::mpsc::channel` with a bound of 50.
The `home-assistant` improvement is epic.
It is still quite weird to me that you need an explicit `yield_now()` here. Basically, it should be just like a `std::thread::yield_now()`, which means it should be incredibly rare to use it.

Have you tried using tokio's bounded channels instead? Its `send` operation is async: https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.html#method.send

(I would generally rather we use bounded channels with some kind of reasonable capacity everywhere, since they provide back-pressure. But it's hard to point to concrete specifics here, and it's kind of hand-wavy.)
Woah, 2.5x? That's insane. I didn't even notice that.
@konstin - I see a 1.5x speedup for me, how is your warm
Surprisingly, I don't really see any speedup for the cold benchmarks.
No idea, but the difference remains after a rebase with more runs. Could this be an OS scheduler or fs performance thing?

```console
# From konsti/yield-after-send with unambiguous `gt up`
$ gt down && cargo build --profile profiling && mv target/profiling/puffin-dev target/profiling/main-dev && mv target/profiling/puffin target/profiling/main && gt up && cargo build --profile profiling && mv target/profiling/puffin-dev target/profiling/branch-dev && mv target/profiling/puffin target/profiling/branch
$ hyperfine --warmup 3 --runs 100 "target/profiling/main pip compile scripts/requirements/home-assistant.in" "target/profiling/branch pip compile scripts/requirements/home-assistant.in"
Benchmark 1: target/profiling/main pip compile scripts/requirements/home-assistant.in
  Time (mean ± σ): 225.0 ms ± 2.0 ms [User: 195.7 ms, System: 61.5 ms]
  Range (min … max): 222.6 ms … 238.0 ms 100 runs

Benchmark 2: target/profiling/branch pip compile scripts/requirements/home-assistant.in
  Time (mean ± σ): 92.7 ms ± 5.6 ms [User: 302.4 ms, System: 174.8 ms]
  Range (min … max): 80.7 ms … 110.9 ms 100 runs

Summary
  target/profiling/branch pip compile scripts/requirements/home-assistant.in ran
    2.43 ± 0.15 times faster than target/profiling/main pip compile scripts/requirements/home-assistant.in
```

I'd be interested in how the differences between main and this PR look for other people!
Cold runs fluctuate between 1000 ms and 2000 ms; they are far too volatile for me to do any comparative benchmarking. The span structure looks very similar between main and this PR, so I expect no change.

You can see that we often have to wait for fetching the metadata of a specific version before starting prefetches, with the network idling in between. While the request parallelism seems to be well used in general (this is good!), I also see room for improvement, though for lower gains than the original change.

Proposal: We try to speculate about the versions we would pick for other packages and prefetch as if the solver were deciding on whichever package arrived first (instead of in a deterministic order, as we actually do), speculating that the resolution in arbitrary order is the same as the one in deterministic order. The solver itself remains deterministic; if the guess was wrong, we only made an unnecessary request, and we could consider cancelling requests that turned out to be wrong. There are different designs, such as sharing the partial solution, or forking the pubgrub state with non-deterministic decisions and rebasing it whenever deterministic decisions are made.

An interesting property would be to prefetch packages deep in the dependency tree: Let us define jupyter as a dependency of order 0; jupyter requiring notebook makes notebook an order 1 dependency; then (given I did the graph traversal right) pycparser is an order 6 dep and types-python-dateutil (elsewhere in the tree) an order 7 dep (Fig. 4); a small traversal sketch follows the figure list below. The branch and main examples both have to wait on pycparser as the last and second-to-last choose-version case, at the right-hand end of the plot, where we're not doing much except waiting for those packages (Fig. 1 and 2). If we could prefetch in a way that fetches those deps earlier, we could avoid this nearly-idle tail. This becomes even more apparent with …

I find it surprising that we spend way more time waiting for metadata than for versions, even though the average duration of Simple API and metadata requests would have me expect it the other way round. I guess this is because there are more version maps already present when we need them?

An easy option is to review the size of the …

Fig 1: Cold main
Fig 2: Cold branch
Fig 3: Refresh branch
Fig 4: Dependency graph
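As a toy illustration of the "order" notion above (a hypothetical helper over a shortened, made-up dependency chain, not the real jupyter graph), a breadth-first traversal can assign each package the depth at which it is first reached from the root requirements:

```rust
use std::collections::{HashMap, VecDeque};

/// Assign each package the depth ("order") at which it is first reached
/// from the root requirements, via a breadth-first traversal.
fn dependency_orders<'a>(
    deps: &HashMap<&'a str, Vec<&'a str>>,
    roots: &[&'a str],
) -> HashMap<&'a str, usize> {
    let mut order = HashMap::new();
    let mut queue = VecDeque::new();
    for &root in roots {
        order.insert(root, 0);
        queue.push_back(root);
    }
    while let Some(pkg) = queue.pop_front() {
        let depth = order[pkg];
        for &dep in deps.get(pkg).into_iter().flatten() {
            // First visit wins: BFS guarantees this is the minimal depth.
            if !order.contains_key(dep) {
                order.insert(dep, depth + 1);
                queue.push_back(dep);
            }
        }
    }
    order
}

fn main() {
    // Shortened, illustrative chain: jupyter -> notebook -> ... -> pycparser.
    let mut deps = HashMap::new();
    deps.insert("jupyter", vec!["notebook"]);
    deps.insert("notebook", vec!["argon2-cffi"]);
    deps.insert("argon2-cffi", vec!["cffi"]);
    deps.insert("cffi", vec!["pycparser"]);

    // With the full graph, deep packages like pycparser end up with a high order.
    println!("{:?}", dependency_orders(&deps, &["jupyter"]));
}
```

Prefetching could then prioritize (or at least not starve) high-order packages so the resolution doesn't end with an idle tail waiting on them.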
```rust
@@ -5,7 +5,9 @@ use std::sync::Arc;

use anyhow::Result;
use dashmap::{DashMap, DashSet};
use futures::channel::mpsc::UnboundedReceiver;
use futures::channel::mpsc::{
    UnboundedReceiver as MpscUnboundedReceiver, UnboundedSender as MpscUnboundedSender,
```
Could we remove this rename? Either use the names directly, or the fully-qualified names? I just find this a little more confusing as a reader.
I'll make them consistently fully qualified (there are too many slightly different channels).
I'm seeing a 1.5x improvement:
Pretty awesome! I don't have any hypotheses for the discrepancy. @konstin How many CPU cores do you have? If you have a lot, and since this PR is improving parallelism efficiency, that might explain why you're seeing a greater improvement. (Although I kind of have a lot of cores too: I have 16 physical cores and 24 logical cores, i.e., 8 of my physical cores are hyper-threaded.)
The actual raw numbers are kind of crazy to me though, like Konsti's is 92.7 ms!
Thank you @BurntSushi, that's the answer! I'm using an i9-13900K with 8 performance cores, 16 efficiency cores and 32 threads. Limiting the number of cores reduces the speedup (output cropped for readability):

```console
$ taskset -c 0-3 hyperfine --warmup 3 "target/profiling/main pip compile scripts/requirements/home-assistant.in" "target/profiling/branch pip compile scripts/requirements/home-assistant.in"
1: Time (mean ± σ): 229.8 ms ± 1.4 ms [User: 198.1 ms, System: 60.4 ms]
2: Time (mean ± σ): 170.9 ms ± 6.1 ms [User: 335.1 ms, System: 127.7 ms]
target/profiling/branch pip compile scripts/requirements/home-assistant.in ran
  1.34 ± 0.05 times faster than target/profiling/main pip compile scripts/requirements/home-assistant.in

$ taskset -c 0-7 hyperfine --warmup 3 "target/profiling/main pip compile scripts/requirements/home-assistant.in" "target/profiling/branch pip compile scripts/requirements/home-assistant.in"
1: Time (mean ± σ): 224.4 ms ± 1.3 ms [User: 194.0 ms, System: 56.6 ms]
2: Time (mean ± σ): 112.6 ms ± 5.7 ms [User: 301.4 ms, System: 140.0 ms]
target/profiling/branch pip compile scripts/requirements/home-assistant.in ran
  1.99 ± 0.10 times faster than target/profiling/main pip compile scripts/requirements/home-assistant.in

$ taskset -c 0-15 hyperfine --warmup 3 "target/profiling/main pip compile scripts/requirements/home-assistant.in" "target/profiling/branch pip compile scripts/requirements/home-assistant.in"
1: Time (mean ± σ): 222.4 ms ± 1.5 ms [User: 191.7 ms, System: 56.8 ms]
2: Time (mean ± σ): 88.6 ms ± 1.9 ms [User: 276.0 ms, System: 137.6 ms]
target/profiling/branch pip compile scripts/requirements/home-assistant.in ran
  2.51 ± 0.06 times faster than target/profiling/main pip compile scripts/requirements/home-assistant.in

$ taskset -c 0-31 hyperfine --warmup 3 "target/profiling/main pip compile scripts/requirements/home-assistant.in" "target/profiling/branch pip compile scripts/requirements/home-assistant.in"
1: Time (mean ± σ): 226.3 ms ± 1.8 ms [User: 197.7 ms, System: 59.3 ms]
2: Time (mean ± σ): 90.4 ms ± 5.4 ms [User: 294.2 ms, System: 174.1 ms]
target/profiling/branch pip compile scripts/requirements/home-assistant.in ran
  2.50 ± 0.15 times faster than target/profiling/main pip compile scripts/requirements/home-assistant.in
```
## Summary

This PR introduces parallelism to the resolver. Specifically, we can perform PubGrub resolution on a separate thread, while keeping all I/O on the tokio thread. We already have the infrastructure set up for this with the channel and `OnceMap`, which makes this change relatively simple. The big change needed to make this possible is removing the lifetimes on some of the types that need to be shared between the resolver and pubgrub thread.

A related PR, #1163, found that adding `yield_now` calls improved throughput. With optimal scheduling we might be able to get away with everything on the same thread here. However, in the ideal pipeline with perfect prefetching, the resolution and prefetching can run completely in parallel without depending on one another. While this would be very difficult to achieve, even with our current prefetching pattern we see a consistent performance improvement from parallelism.

This does also require reverting a few of the changes from #3413, but not all of them. The sharing is isolated to the resolver task.

## Test Plan

On smaller tasks performance is mixed with ~2% improvements/regressions on both sides. However, on medium-large resolution tasks we see the benefits of parallelism, with improvements anywhere from 10-50%.

```
./scripts/requirements/jupyter.in
Benchmark 1: ./target/profiling/baseline (resolve-warm)
  Time (mean ± σ): 29.2 ms ± 1.8 ms [User: 20.3 ms, System: 29.8 ms]
  Range (min … max): 26.4 ms … 36.0 ms 91 runs

Benchmark 2: ./target/profiling/parallel (resolve-warm)
  Time (mean ± σ): 25.5 ms ± 1.0 ms [User: 19.5 ms, System: 25.5 ms]
  Range (min … max): 23.6 ms … 27.8 ms 99 runs

Summary
  ./target/profiling/parallel (resolve-warm) ran
    1.15 ± 0.08 times faster than ./target/profiling/baseline (resolve-warm)
```

```
./scripts/requirements/boto3.in
Benchmark 1: ./target/profiling/baseline (resolve-warm)
  Time (mean ± σ): 487.1 ms ± 6.2 ms [User: 464.6 ms, System: 61.6 ms]
  Range (min … max): 480.0 ms … 497.3 ms 10 runs

Benchmark 2: ./target/profiling/parallel (resolve-warm)
  Time (mean ± σ): 430.8 ms ± 9.3 ms [User: 529.0 ms, System: 77.2 ms]
  Range (min … max): 417.1 ms … 442.5 ms 10 runs

Summary
  ./target/profiling/parallel (resolve-warm) ran
    1.13 ± 0.03 times faster than ./target/profiling/baseline (resolve-warm)
```

```
./scripts/requirements/airflow.in
Benchmark 1: ./target/profiling/baseline (resolve-warm)
  Time (mean ± σ): 478.1 ms ± 18.8 ms [User: 482.6 ms, System: 205.0 ms]
  Range (min … max): 454.7 ms … 508.9 ms 10 runs

Benchmark 2: ./target/profiling/parallel (resolve-warm)
  Time (mean ± σ): 308.7 ms ± 11.7 ms [User: 428.5 ms, System: 209.5 ms]
  Range (min … max): 287.8 ms … 323.1 ms 10 runs

Summary
  ./target/profiling/parallel (resolve-warm) ran
    1.55 ± 0.08 times faster than ./target/profiling/baseline (resolve-warm)
```
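For context, a hedged sketch (with made-up names, not the actual resolver code) of the split described in the summary above: the CPU-bound solver loop runs on its own OS thread and talks to the async I/O side through a bounded tokio channel.

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
enum Request {
    FetchMetadata(String),
}

#[tokio::main]
async fn main() {
    let (req_tx, mut req_rx) = mpsc::channel::<Request>(50);

    // CPU-bound "resolution" on a dedicated OS thread: it never touches the
    // async runtime and only communicates through the channel.
    let solver = std::thread::spawn(move || {
        for pkg in ["jupyter", "notebook"] {
            // blocking_send is the sync counterpart of the async `send`.
            req_tx
                .blocking_send(Request::FetchMetadata(pkg.to_string()))
                .unwrap();
        }
        // Dropping req_tx here closes the channel and ends the I/O loop below.
    });

    // The I/O side stays on the tokio runtime and services requests as they arrive.
    while let Some(request) = req_rx.recv().await {
        println!("servicing {request:?}");
    }

    solver.join().unwrap();
}
```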
Summary

Previously, we were blocking operations that could run in parallel. We would send requests through our main requests channel, but not yield, so the receiver could only start processing requests much later than necessary. We solve this by switching to the async `tokio::sync::mpsc::channel`, where `send` is an async function that yields. Due to the increased parallelism, cache deserialization and the conversion from a Simple API request to a version map became bottlenecks, so I moved them to `spawn_blocking`. Together these result in a 30-60% speedup for larger warm-cache resolutions. Small cases such as black already resolve in 5.7 ms on my machine, so there's no speedup to be gained; refresh and no-cache runs were too noisy to get a signal from.

Note for the future: Revisit the bounded channel if we want to produce requests from `process_request`, too (this would be good for prefetching), to avoid deadlocks.
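A minimal sketch of the `spawn_blocking` part (the `parse_cached_response` helper and `VersionMap` type below are illustrative stand-ins, not puffin's real types): the CPU-heavy deserialization runs on tokio's blocking thread pool so it doesn't stall the async thread that drives the channels and the solver.

```rust
use tokio::task;

#[derive(Debug)]
struct VersionMap(Vec<String>);

// Stand-in for CPU-heavy work: cache deserialization plus conversion of a
// Simple API response into a version map.
fn parse_cached_response(bytes: Vec<u8>) -> VersionMap {
    VersionMap(
        String::from_utf8_lossy(&bytes)
            .lines()
            .map(str::to_owned)
            .collect(),
    )
}

async fn load_version_map(bytes: Vec<u8>) -> VersionMap {
    // Run the CPU-bound work on the blocking pool; the async worker thread
    // stays free to service channel sends/receives in the meantime.
    task::spawn_blocking(move || parse_cached_response(bytes))
        .await
        .expect("blocking task panicked")
}

#[tokio::main]
async fn main() {
    let fake_cache_entry = b"1.0.0\n1.1.0\n2.0.0".to_vec();
    let map = load_version_map(fake_cache_entry).await;
    println!("{map:?}");
}
```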
Details

We can look at the behavior change through the spans:

Below, you can see how on main we have discrete phases: all (cached) Simple API requests in parallel, then all (cached) metadata requests in parallel, repeat until done. The solver is mostly waiting until it has its version map from the Simple API query to be able to choose a version. The main thread is blocked by processing requests.

In the PR branch, the Simple API requests succeed much earlier, allowing the solver to advance and also to schedule more prefetching. Due to that, `parse_cache` and `from_metadata` became bottlenecks, so I moved them off the main thread (green color; their spans can now overlap because they can run on multiple threads in parallel). The main thread isn't blocked on `process_request` anymore; instead it has frequent idle times. The spans are all much shorter, which indicates that on main they could have finished much earlier, but a task didn't yield, so they weren't scheduled to finish (though I haven't dug deep enough to understand the exact scheduling between the process-request stream and the solver here).

main

PR

Benchmarks