Parallelize Resolver #3627
Conversation
Very cool!
Tagging @konstin and @BurntSushi to review.
Force-pushed from 0a03bca to 411bdb4.
Force-pushed from 411bdb4 to 33fd177.
This is pretty awesome. I think this largely makes sense to me overall. I am a little concerned about the switch to unbounded channels/streams though. I convinced us a while back to switch from unbounded channels to bounded channels. I believe this was my argument: #1163 (comment)
.map(|request| self.process_request(request).boxed_local())
// Allow as many futures as possible to start in the background.
// Backpressure is provided at a more granular level by `DistributionDatabase`
// and `SourceDispatch`, as well as the bounded request channel.
Can this comment be unpacked a bit more? Also, which bounded request channel is this referring to?
This is an old comment; I didn't touch any of the `fetch` code. It's referring to the channel between the prefetcher and solver. I'll update it to make it clearer.
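For context on the backpressure point, here's a minimal sketch of how a bounded tokio channel between a solver and a prefetcher keeps the producer from running arbitrarily far ahead. The `Request` enum, buffer size, and task names are illustrative, not uv's actual types:

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
enum Request {
    Prefetch(String),
}

#[tokio::main]
async fn main() {
    // Small buffer: the solver can only run ahead of the prefetcher by 16 requests.
    let (tx, mut rx) = mpsc::channel::<Request>(16);

    // "Prefetcher": drains requests and performs the (simulated) I/O.
    let prefetcher = tokio::spawn(async move {
        while let Some(Request::Prefetch(name)) = rx.recv().await {
            println!("prefetching {name}");
        }
    });

    // "Solver": emits prefetch requests. `send().await` suspends when the
    // channel buffer is full, which is the backpressure being discussed.
    for i in 0..64 {
        tx.send(Request::Prefetch(format!("package-{i}")))
            .await
            .unwrap();
    }
    drop(tx);

    prefetcher.await.unwrap();
}
```

With an unbounded channel, `send` never waits, so the only limit on queued work is memory; the bounded variant trades a little latency for a cap on how far the producer can outrun the consumer.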
Force-pushed from 484da3d to a447910.
Love to see a 50% improvement :)
pub struct PythonEnvironment(Arc<SharedPythonEnvironment>);

#[derive(Debug, Clone)]
struct SharedPythonEnvironment {
Just for my knowledge, is this the typical naming scheme for this pattern?
I usually use `Foo` and `FooInner` personally. I don't think I've seen `SharedFoo` much? I like adding a suffix, so I'd prefer `FooShared` (or whatever). But I don't have a strong opinion.
Cool, thanks! `Inner` makes a bit more sense to me.
I sort of avoid `Inner` because it feels like a catch-all naming convention. A suffix seems slightly better for readability; I'll switch to that.
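For reference, a minimal sketch of the shared-inner newtype pattern with the suffix naming discussed here. The field, method, and path names are illustrative, not uv's actual API:

```rust
use std::path::{Path, PathBuf};
use std::sync::Arc;

/// Cheap-to-clone handle: cloning bumps the `Arc` refcount rather than
/// copying the inner data, so the value can be shared across threads
/// without borrowing or lifetimes.
#[derive(Debug, Clone)]
pub struct PythonEnvironment(Arc<PythonEnvironmentShared>);

#[derive(Debug)]
struct PythonEnvironmentShared {
    root: PathBuf,
}

impl PythonEnvironment {
    pub fn new(root: PathBuf) -> Self {
        Self(Arc::new(PythonEnvironmentShared { root }))
    }

    pub fn root(&self) -> &Path {
        &self.0.root
    }
}

fn main() {
    let env = PythonEnvironment::new(PathBuf::from("/usr/local/python3"));
    // A clone is just a refcount bump; both handles point at the same data.
    let handle = env.clone();
    std::thread::spawn(move || println!("{}", handle.root().display()))
        .join()
        .unwrap();
}
```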
Amazing work! Do you know why it gets so much faster, i.e. how the solver is blocking? I've been looking at the spans, but I don't really understand why the prefetches don't get queued anyway on main. Airflow, main vs. PR:

Here are some perf numbers from my machine:
Love it. I like the lifetime removal a lot too. Nice work.
@konstin The elapsed user time doesn't change much, and if you look at the profile for the resolver thread you'll see a lot of time spent in pubgrub. That suggests the prefetches may have been queued on the single-threaded version, but we simply didn't have enough time to get to them, or if we did, they took time away from the solver. My hunch is that the solver and prefetcher were fighting for time slices.
## Summary

Move completely off tokio's multi-threaded runtime. We've slowly been making changes to be smarter about scheduling in various places instead of depending on tokio's general-purpose work-stealing, notably #3627 and #4004. We now no longer benefit from the multi-threaded runtime, as we run all I/O on the main thread. There's one remaining instance of `block_in_place` that can be swapped for `rayon::spawn`.

This change is a small performance improvement due to removing some unnecessary overhead of the multi-threaded runtime (e.g. spawning threads), but nothing major. It also removes some noise from profiles.

## Test Plan

```
Benchmark 1: ./target/profiling/uv (resolve-warm)
  Time (mean ± σ):      14.9 ms ±  0.3 ms    [User: 3.0 ms, System: 17.3 ms]
  Range (min … max):    14.1 ms … 15.8 ms    169 runs

Benchmark 2: ./target/profiling/baseline (resolve-warm)
  Time (mean ± σ):      16.1 ms ±  0.3 ms    [User: 3.9 ms, System: 18.7 ms]
  Range (min … max):    15.1 ms … 17.3 ms    162 runs

Summary
  ./target/profiling/uv (resolve-warm) ran
    1.08 ± 0.03 times faster than ./target/profiling/baseline (resolve-warm)
```
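As a rough illustration of what "moving off the multi-threaded runtime" means, not uv's actual setup, a current-thread tokio runtime can be built explicitly so that no worker-thread pool is spawned:

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // One OS thread drives all async I/O; no work-stealing worker pool.
    let runtime = Builder::new_current_thread().enable_all().build()?;

    runtime.block_on(async {
        // All I/O futures run here, on the main thread. CPU-heavy work (like
        // the resolver) lives on its own thread and communicates over
        // channels, rather than going through `block_in_place`.
        println!("running on the current-thread runtime");
    });

    Ok(())
}
```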
Summary

This PR introduces parallelism to the resolver. Specifically, we can perform PubGrub resolution on a separate thread, while keeping all I/O on the tokio thread. We already have the infrastructure set up for this with the channel and `OnceMap`, which makes this change relatively simple. The big change needed to make this possible is removing the lifetimes on some of the types that need to be shared between the resolver and pubgrub thread.

A related PR, #1163, found that adding `yield_now` calls improved throughput. With optimal scheduling we might be able to get away with everything on the same thread here. However, in the ideal pipeline with perfect prefetching, the resolution and prefetching can run completely in parallel without depending on one another. While this would be very difficult to achieve, even with our current prefetching pattern we see a consistent performance improvement from parallelism.

This does also require reverting a few of the changes from #3413, but not all of them. The sharing is isolated to the resolver task.
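A minimal sketch of the shape described above, with a dedicated solver thread talking to the tokio I/O side over a channel. The `Request` type and all names are illustrative, not uv's actual implementation:

```rust
use tokio::sync::{mpsc, oneshot};

enum Request {
    // The solver asks the I/O side for package metadata and waits for a reply.
    Metadata {
        name: String,
        reply: oneshot::Sender<String>,
    },
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Request>(16);

    // Solver thread: a plain OS thread, so CPU-bound resolution never blocks
    // the async executor. It talks to the I/O side through the channel.
    let solver = std::thread::spawn(move || {
        for name in ["a", "b", "c"] {
            let (reply_tx, reply_rx) = oneshot::channel();
            tx.blocking_send(Request::Metadata {
                name: name.to_string(),
                reply: reply_tx,
            })
            .unwrap();
            // Block this thread (not the runtime) until the metadata arrives.
            let metadata = reply_rx.blocking_recv().unwrap();
            println!("resolved {name}: {metadata}");
        }
    });

    // I/O side: stays on the tokio runtime, serving the solver's requests.
    while let Some(Request::Metadata { name, reply }) = rx.recv().await {
        // A real implementation would hit the network or a cache here.
        let _ = reply.send(format!("metadata for {name}"));
    }

    solver.join().unwrap();
}
```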
Test Plan
On smaller tasks performance is mixed with ~2% improvements/regressions on both sides. However, on medium-large resolution tasks we see the benefits of parallelism, with improvements anywhere from 10-50%.