
Parallelize Resolver #3627

Merged 11 commits into astral-sh:main on May 17, 2024

Conversation

@ibraheemdev (Member) commented May 16, 2024

Summary

This PR introduces parallelism to the resolver. Specifically, we can perform PubGrub resolution on a separate thread, while keeping all I/O on the tokio thread. We already have the infrastructure set up for this with the channel and `OnceMap`, which makes this change relatively simple. The big change needed to make this possible is removing the lifetimes on some of the types that need to be shared between the resolver and the PubGrub thread.
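The shape of the split can be sketched with a stdlib-only toy (names like `resolve_all` are illustrative, not uv's actual API): the CPU-bound solve runs on its own OS thread, while the caller — in uv, async tasks on the tokio runtime — feeds it metadata over a channel.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical sketch: run a CPU-bound "solve" on a dedicated thread,
/// consuming metadata sent over a channel by the I/O side.
fn resolve_all(packages: &[&str]) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    // Dedicated solver thread, analogous to the PubGrub thread in this PR.
    let solver = thread::spawn(move || {
        let mut resolved = Vec::new();
        for metadata in rx {
            // Stand-in for CPU-bound PubGrub work (unit propagation etc.).
            resolved.push(format!("resolved {metadata}"));
        }
        resolved
    });

    // "I/O" side: in uv this would be async fetches on the tokio thread.
    for pkg in packages {
        tx.send((*pkg).to_string()).unwrap();
    }
    drop(tx); // Closing the channel lets the solver loop exit.

    solver.join().unwrap()
}

fn main() {
    let resolved = resolve_all(&["jupyter", "boto3", "airflow"]);
    assert_eq!(resolved.len(), 3);
}
```

Because the solver thread owns everything it touches (no borrowed data crosses the channel), no lifetimes are needed on the shared types — which is the motivation for the lifetime removal mentioned above.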

A related PR, #1163, found that adding yield_now calls improved throughput. With optimal scheduling we might be able to get away with everything on the same thread here. However, in the ideal pipeline with perfect prefetching, the resolution and prefetching can run completely in parallel without depending on one another. While this would be very difficult to achieve, even with our current prefetching pattern we see a consistent performance improvement from parallelism.

This does also require reverting a few of the changes from #3413, but not all of them. The sharing is isolated to the resolver task.

Test Plan

On smaller resolution tasks, performance is mixed, with ~2% improvements or regressions in either direction. On medium-to-large resolution tasks, however, we see the benefits of parallelism, with improvements anywhere from 10-50%.

```
./scripts/requirements/jupyter.in
Benchmark 1: ./target/profiling/baseline (resolve-warm)
  Time (mean ± σ):      29.2 ms ±   1.8 ms    [User: 20.3 ms, System: 29.8 ms]
  Range (min … max):    26.4 ms …  36.0 ms    91 runs

Benchmark 2: ./target/profiling/parallel (resolve-warm)
  Time (mean ± σ):      25.5 ms ±   1.0 ms    [User: 19.5 ms, System: 25.5 ms]
  Range (min … max):    23.6 ms …  27.8 ms    99 runs

Summary
  ./target/profiling/parallel (resolve-warm) ran
    1.15 ± 0.08 times faster than ./target/profiling/baseline (resolve-warm)

./scripts/requirements/boto3.in
Benchmark 1: ./target/profiling/baseline (resolve-warm)
  Time (mean ± σ):     487.1 ms ±   6.2 ms    [User: 464.6 ms, System: 61.6 ms]
  Range (min … max):   480.0 ms … 497.3 ms    10 runs

Benchmark 2: ./target/profiling/parallel (resolve-warm)
  Time (mean ± σ):     430.8 ms ±   9.3 ms    [User: 529.0 ms, System: 77.2 ms]
  Range (min … max):   417.1 ms … 442.5 ms    10 runs

Summary
  ./target/profiling/parallel (resolve-warm) ran
    1.13 ± 0.03 times faster than ./target/profiling/baseline (resolve-warm)

./scripts/requirements/airflow.in
Benchmark 1: ./target/profiling/baseline (resolve-warm)
  Time (mean ± σ):     478.1 ms ±  18.8 ms    [User: 482.6 ms, System: 205.0 ms]
  Range (min … max):   454.7 ms … 508.9 ms    10 runs

Benchmark 2: ./target/profiling/parallel (resolve-warm)
  Time (mean ± σ):     308.7 ms ±  11.7 ms    [User: 428.5 ms, System: 209.5 ms]
  Range (min … max):   287.8 ms … 323.1 ms    10 runs

Summary
  ./target/profiling/parallel (resolve-warm) ran
    1.55 ± 0.08 times faster than ./target/profiling/baseline (resolve-warm)
```

@ibraheemdev ibraheemdev marked this pull request as ready for review May 16, 2024 16:17
@charliermarsh (Member): Very cool!

@charliermarsh (Member): Tagging @konstin and @BurntSushi to review.

@charliermarsh charliermarsh added the performance Potential performance improvement label May 16, 2024
@ibraheemdev ibraheemdev force-pushed the prefetch-spawn branch 2 times, most recently from 0a03bca to 411bdb4 Compare May 16, 2024 16:36
@BurntSushi (Member) left a comment:

This is pretty awesome. I think this largely makes sense to me overall. I am a little concerned about the switch to unbounded channels/streams though. I convinced us a while back to switch from unbounded channels to bounded channels. I believe this was my argument: #1163 (comment)
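The backpressure concern can be illustrated with a stdlib-only sketch (names are illustrative): a bounded `sync_channel` makes a fast producer block once the queue is full, so a slow consumer naturally throttles it, whereas an unbounded channel would let the queue grow without limit.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Produce `items` values through a channel of the given `capacity`;
/// with a bounded channel, `send` blocks whenever the queue is full,
/// which is the backpressure argument for bounded channels.
fn produce_with_backpressure(items: usize, capacity: usize) -> usize {
    let (tx, rx) = mpsc::sync_channel::<usize>(capacity);

    let consumer = thread::spawn(move || {
        let mut count = 0;
        for _item in rx {
            // Simulate a slow consumer (e.g. the solver thread).
            thread::sleep(Duration::from_millis(1));
            count += 1;
        }
        count
    });

    for i in 0..items {
        // Blocks once `capacity` items are queued, throttling the producer.
        tx.send(i).unwrap();
    }
    drop(tx);
    consumer.join().unwrap()
}

fn main() {
    assert_eq!(produce_with_backpressure(10, 2), 10);
}
```

Switching to an unbounded `mpsc::channel` removes that blocking behavior, which is the trade-off being questioned here.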

crates/uv-resolver/src/resolver/batch_prefetch.rs (review comment, resolved)
crates/uv-resolver/src/resolver/mod.rs (review comments, resolved)
```rust
.map(|request| self.process_request(request).boxed_local())
// Allow as many futures as possible to start in the background.
// Backpressure is provided at a more granular level by `DistributionDatabase`
// and `SourceDispatch`, as well as the bounded request channel.
```
Member:

Can this comment be unpacked a bit more? Also, which bounded request channel is this referring to?

Member Author:

This is an old comment; I didn't touch any of the fetch code. It's referring to the channel between the prefetcher and the solver. I'll update it to make that clearer.

crates/uv-resolver/src/resolver/mod.rs (review comment, resolved)
@zanieb (Member) commented May 17, 2024

Love to see a 50% improvement :)

Comment on lines 16 to 19:

```rust
pub struct PythonEnvironment(Arc<SharedPythonEnvironment>);

#[derive(Debug, Clone)]
struct SharedPythonEnvironment {
```
Member:
Just for my knowledge, is this the typical naming scheme for this pattern?

Member:
I usually use Foo and FooInner personally. I don't think I've seen SharedFoo much? I like adding a suffix personally. So I'd prefer FooShared (or whatever). But I don't have a strong opinion.

Member:
Cool thanks! Inner makes a bit more sense to me.

Member Author (@ibraheemdev), May 17, 2024:
I sort of avoid Inner because it feels like a catch-all naming convention. A suffix seems slightly better for readability, so I'll switch to that.
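The pattern under discussion, with the suffix naming settled on above, can be sketched as follows (the `root` field and constructor are illustrative, not uv's actual fields): the public type is a thin newtype over an `Arc`, so clones are cheap reference-count bumps and the value can be shared across threads without lifetimes.

```rust
use std::sync::Arc;

/// Public handle: cloning bumps the Arc refcount rather than deep-copying.
#[derive(Debug, Clone)]
pub struct PythonEnvironment(Arc<PythonEnvironmentShared>);

/// Private shared state, named with the `Shared` suffix per the thread above.
#[derive(Debug)]
struct PythonEnvironmentShared {
    root: String,
}

impl PythonEnvironment {
    pub fn new(root: impl Into<String>) -> Self {
        Self(Arc::new(PythonEnvironmentShared { root: root.into() }))
    }

    pub fn root(&self) -> &str {
        &self.0.root
    }
}

fn main() {
    let env = PythonEnvironment::new("/usr/local");
    let clone = env.clone(); // Cheap: shares the same allocation.
    assert_eq!(env.root(), clone.root());
    assert!(Arc::ptr_eq(&env.0, &clone.0));
}
```

Keeping the `Arc` inside the newtype (rather than passing `Arc<PythonEnvironment>` around) keeps the sharing an implementation detail of the type.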

@konstin (Member) commented May 17, 2024

Amazing work!

Do you know why it gets so much faster, i.e. how the solver is blocking? I've been looking at the spans, but I don't really understand why the prefetches don't get queued anyway on main.

Airflow, main vs. PR: [span profile screenshots for main and for the PR]

Here are some perf numbers from my machine:

```
jupyter:
  Time (mean ± σ):      15.9 ms ±   1.3 ms    [User: 14.1 ms, System: 21.0 ms]
  Time (mean ± σ):      15.4 ms ±   1.1 ms    [User: 14.1 ms, System: 20.1 ms]
boto3:
  Time (mean ± σ):     383.5 ms ±   3.3 ms    [User: 343.5 ms, System: 60.8 ms]
  Time (mean ± σ):     325.3 ms ±   3.8 ms    [User: 351.7 ms, System: 62.0 ms]
airflow:
  Time (mean ± σ):     192.6 ms ±   3.9 ms    [User: 172.5 ms, System: 151.3 ms]
  Time (mean ± σ):     147.0 ms ±   3.3 ms    [User: 168.4 ms, System: 146.6 ms]
```

@BurntSushi (Member) left a comment:

Love it. I like the lifetime removal a lot too. Nice work.

crates/uv-resolver/src/resolver/index.rs (review comments, one resolved)
@ibraheemdev (Member, Author) commented May 17, 2024

@konstin The elapsed user time doesn't change much, and if you look at the profile for the resolver thread you'll see a lot of time spent in PubGrub. This suggests that the prefetches may have been queued on the single-threaded version, but we simply didn't have enough time to get to them, or, if we did, they took time away from the solver. My hunch is that the solver and the prefetcher were fighting for time slices.

@ibraheemdev ibraheemdev merged commit 39af09f into astral-sh:main May 17, 2024
44 checks passed
ibraheemdev added a commit that referenced this pull request Jul 9, 2024
## Summary

Move completely off tokio's multi-threaded runtime. We've slowly been
making changes to be smarter about scheduling in various places instead
of depending on tokio's general purpose work-stealing, notably
#3627 and
#4004. We now no longer benefit from
the multi-threaded runtime, as we run all I/O on the main thread.
There's one remaining instance of `block_in_place` that can be swapped
for `rayon::spawn`.

This change is a small performance improvement due to removing some
unnecessary overhead of the multi-threaded runtime (e.g. spawning
threads), but nothing major. It also removes some noise from profiles.

## Test Plan

```
Benchmark 1: ./target/profiling/uv (resolve-warm)
  Time (mean ± σ):      14.9 ms ±   0.3 ms    [User: 3.0 ms, System: 17.3 ms]
  Range (min … max):    14.1 ms …  15.8 ms    169 runs
 
Benchmark 2: ./target/profiling/baseline (resolve-warm)
  Time (mean ± σ):      16.1 ms ±   0.3 ms    [User: 3.9 ms, System: 18.7 ms]
  Range (min … max):    15.1 ms …  17.3 ms    162 runs
 
Summary
  ./target/profiling/uv (resolve-warm) ran
    1.08 ± 0.03 times faster than ./target/profiling/baseline (resolve-warm)
```
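The commit above mentions swapping the last `block_in_place` for `rayon::spawn`. As a rough stdlib-only sketch of that idea (rayon's global pool is the real mechanism; `expensive_sum` is a hypothetical stand-in for blocking work): run the blocking computation on another thread and receive the result over a channel, instead of blocking the runtime's thread in place.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical blocking, CPU-bound work (stand-in for e.g. hashing a wheel).
fn expensive_sum(data: &[u64]) -> u64 {
    data.iter().copied().sum()
}

/// Offload blocking work to another thread; the caller gets a receiver it
/// can await or poll. With rayon this would be `rayon::spawn` onto its pool.
fn spawn_blocking_sum(data: Vec<u64>) -> mpsc::Receiver<u64> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The expensive part happens off the caller's thread.
        let _ = tx.send(expensive_sum(&data));
    });
    rx
}

fn main() {
    let rx = spawn_blocking_sum(vec![1, 2, 3, 4]);
    // The caller could keep doing I/O here, then collect the result.
    assert_eq!(rx.recv().unwrap(), 10);
}
```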