PERF: proof of concept, parallel installation of wheels #12816
Hello,
This is a proof of concept to install wheels in parallel.
Mostly to demonstrate that it's possible.
It took a whole 10 minutes of ChatGPT to parallelize your code :D (and a whole week of debugging pip for other performance issues to understand how to go about it).
I don't expect this to be merged, however if somebody wants to add a proper optional argument
--parallel-install=N
, it should be possible to merge. There are two ways to go about parallelizing the pip installation:
It's pretty easy to do with a ThreadPool. It could be done with a ProcessPool for much larger performance improvements, but it's problematic to manage errors with subprocesses, and I'm not sure fork/ProcessPool is available on all systems that pip has to support.
So let's investigate what gain you can get with a simple ThreadPool :)
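A minimal sketch of the ThreadPool approach described above; `install_one` is a hypothetical stand-in for pip's per-wheel install step, not pip's actual API, and error collection happens in the parent thread so failures are easy to report:

```python
from concurrent.futures import ThreadPoolExecutor

def install_one(wheel):
    # Placeholder for the real work: extract the wheel and copy its
    # files into site-packages. In real pip this is the expensive,
    # mostly-I/O step that the threads would overlap.
    ...

def install_all(wheels, workers=2):
    """Install wheels concurrently; return a list of (wheel, error) pairs."""
    errors = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(install_one, w): w for w in wheels}
        for fut, wheel in futures.items():
            try:
                fut.result()  # re-raises any exception from the worker
            except Exception as exc:
                errors.append((wheel, exc))
    return errors
```

Because the workers are threads, exceptions propagate naturally through `Future.result()`, which is exactly the error handling that gets awkward with subprocesses.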
Turns out, there isn't much to get because of the GIL, unless your filesystem has very high latency (NFS).
You get a small gain with 2 threads, usually no more improvement with 3 threads, it gets worse with 4+ threads.
I note there are multiple critical bugs in pip that make the extraction very slow and inefficient and hold the GIL.
There should be room for more improvements after all these PRs are merged:
pending PR: #12803
pending PR: #12782
pending PR for download (don't try to parallelize downloads without that fix): #12810
Some benchmarks below on different types of disks:
above: run on NFS, NFS is high latency and the gains are substantial
above: run on two different disks: /tmp, which should be in memory (or close to it), and a local volume that is block storage on a virtual machine.
above: run with larger packages to compare.
tensorflow is 1-2 GB extracted
pandas+numpy, numba+llvmlite, scikit-learn are around 50-100 MB each.
I do note that pip seems to install packages alphabetically, which is not ideal.
tensorflow (and torch) fall toward the very end, and I'd rather they start toward the beginning. A lot of the extraction time is spent waiting at the end for the final package, tensorflow, to finish extracting. It would complete sooner if it started sooner.
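The ordering fix could be as simple as scheduling the largest wheels first, so a multi-gigabyte package starts extracting immediately instead of last. A hedged sketch (`order_by_size` is a hypothetical helper, not pip code):

```python
import os

def order_by_size(wheel_paths):
    """Return wheel paths sorted largest-first, so big packages like
    tensorflow begin extracting at the start of the batch rather than
    being the straggler everyone waits on at the end."""
    # Negative size sorts descending; the path breaks ties deterministically.
    return sorted(wheel_paths, key=lambda p: (-os.path.getsize(p), p))
```

With a thread pool this is the classic longest-job-first heuristic: it minimizes the chance that one huge extraction is still running alone after all the small ones have finished.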