path_to_url called millions of times for ~1000 offline wheel installs #12320
Comments
Some questions (treating this purely as an optimisation question, rather than specifically focusing on
I'd be interested to know how much benefit you'd get from implementing these two ideas. And a question for context - reading the Home Assistant website, it seems that the recommended installation approach is to use a prebuilt environment ("Home Assistant Operating System" or "Home Assistant Container"). Installing with pip ("Home Assistant Core") is described as an "advanced installation process" - which seems like a very reasonable way of viewing the options. The Home Assistant Core approach installs just the homeassistant package, letting pip do the dependency resolution. So I'm not 100% clear how realistic the "installing pre-resolved wheels offline" scenario is in practice. Is it a step used when preparing the container or OS forms of HA? If so, then presumably that's a relatively rare exercise?

To reiterate, I'm not at all against performance improvements for pip, and I'm fine with fixing "low hanging fruit" no matter how it gets identified. But if we're going to focus on performance fixes in general, I'd prefer working on improvements that give benefits to the majority of our users, rather than ones that focus on unusual or very specific use cases. For this particular example, I'd expect that the "normal" case of installing thousands of dependencies would involve the resolver and downloading multiple potential candidates (i.e., much like the HA Core approach), and therefore the savings from improving path_to_url would be a much smaller share of the total time.
Also, if the intent is basically to sidestep all of pip, why not just use https://installer.pypa.io/en/stable/?
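(For reference, the installer library performs only the final step of unpacking one already-built wheel into an environment; its documented usage looks roughly like the sketch below. The wheel filename is just a placeholder, and there is no resolution or dependency checking involved.)

```python
import sys
import sysconfig

from installer import install
from installer.destinations import SchemeDictionaryDestination
from installer.sources import WheelFile

# Install into the current interpreter's scheme paths.
destination = SchemeDictionaryDestination(
    sysconfig.get_paths(),
    interpreter=sys.executable,
    script_kind="posix",
)

# Unpack one pre-built wheel; no resolver, no dependency validation.
with WheelFile.open("sampleproject-2.0.0-py3-none-any.whl") as source:
    install(
        source=source,
        destination=destination,
        additional_metadata={"INSTALLER": b"my-offline-installer 0.1.0"},
    )
```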
Then please reply in issue #12314, which is the real-world use case; this issue was opened to address a specific performance bottleneck, as I explained here: #12314 (comment)
Because I am trying to reproduce #12314 as closely as possible, and it would be good for Pip to still validate that the dependencies are correct.
Well, firstly, because doing so was the first way I could reproduce what the OP described in #12314. Secondly, because this is a specific issue about one of the performance bottlenecks I found while trying to reproduce the issue described in #12314, and it isn't meant to address the larger picture. Thirdly, because I am unsure what alternative you mean; I can not pass ~1000 wheels on the command line.
My understanding is that #12314 is about the steps to build the container. But I am anecdotally aware of users who roll their own container and therefore need to go through the build process themselves.
Well, in an ideal world you presumably want to rebuild the container and run your tests any time new requirements are released, which with over 1000 requirements I assume would be near constantly.
There are an increasing number of applications that use a lot of dependencies - WinPython, Airflow, Superset, etc. These are all probably in the 300 to 500 range, but there is no reason users can not depend on several of these projects at once. And most projects are moving, or being encouraged to move, to wheels rather than sdists. So I would be happy to see performance improvements in Pip in this area.
I was going to raise other issues around this; I believe that when I look at the call stack using the network instead of a local install, I see similar issues where similar savings could be made. But in #12314 the user is clearly seeing that most of the time is not being spent on the network. However, I started with what I saw as the simplest and least controversial gain, to test the waters on how receptive Pip maintainers would be to this approach. I was not expecting to provide this level of justification until I got into some of the more complex issues of trying to cache the return results of packaging. This issue is clean and reproducible, and the PR is a double-digit % win on those reproducible steps, for very little cost or complexity.
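(As a rough illustration of the caching idea mentioned above: the function body below mirrors what pip's path_to_url helper does today, but the memoisation decorator is an assumption of mine, not pip's actual code or the linked PR. Caching is only safe if the inputs are effectively stable, e.g. absolute paths with an unchanged working directory.)

```python
import functools
import os
import urllib.parse
import urllib.request


@functools.lru_cache(maxsize=None)
def path_to_url(path: str) -> str:
    """Convert a filesystem path to a file:// URL, memoising repeat calls.

    During a large offline install the same wheel paths are converted many
    times, so caching the pure string-to-string conversion avoids rework.
    """
    path = os.path.normpath(os.path.abspath(path))
    return urllib.parse.urljoin("file:", urllib.request.pathname2url(path))
```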
I am not; Pip is critical for resolving and checking the resolution here.
Fair enough. My personal view is that I see massive dependency graphs like this as the exception rather than the rule, and while I'm happy enough to improve things for these use cases, I'd rather see evidence that improvements are of benefit (or at least neutral) in the normal case (which I consider to be up to an order of magnitude of maybe 100 dependencies in one install). Performance is a case where "fixes" are notorious for improving one area at the cost of others, and so I'm skeptical of results that don't show the effect across the full range of use cases we support.

I won't comment any further on the Home Assistant use case. I assume they have decided for themselves that the approach they are taking is the best one for them, and I have no experience to allow me to say otherwise. But the one thing I will say is that they are definitely pushing the limits of scale as far as pip is concerned, and that's something they have to take into account, whether we find performance improvements or not.

Having said all of this, you have definitely nerd-sniped me into looking at this. I've made an alternative suggestion in #12322 which you might like to try out.
Well, probably they should move to rip once it is released; I have already been helping rip by testing it against lots of real-world complex dependency graphs and reporting various issues from the results. But personally I would like Pip to be able to handle the ~1000 dependencies scenario smoothly in almost all use cases, as I see this as the future for a lot of large Python applications (or at least the 500-1000 range). And I'm not quite sure what it says about Python if, in its future, all of its supporting infrastructure is written in Rust.
That's quite a fair point; unfortunately there's no Pip performance benchmark suite that can be leveraged or contributed to. In that sense I do understand the caution in reviewing performance improvements.
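(A minimal timing harness for this scenario could start from something like the sketch below; the find-links directory and requirements file are placeholders for the pre-downloaded wheels described later in this issue.)

```python
import statistics
import subprocess
import sys
import time

# Placeholders: point these at a pre-downloaded wheel directory and the
# requirements file that was used to populate it.
FIND_LINKS = "file:///path/to/download_directory"
REQUIREMENTS = "requirements_all.txt"


def time_dry_run_install() -> float:
    """Time one offline, pre-resolved, dry-run install."""
    start = time.perf_counter()
    subprocess.run(
        [
            sys.executable, "-m", "pip", "install", "--dry-run",
            "--only-binary=:all:", "--no-index", "--ignore-installed",
            "--find-links", FIND_LINKS, "-r", REQUIREMENTS,
        ],
        check=True,
        capture_output=True,
    )
    return time.perf_counter() - start


samples = [time_dry_run_install() for _ in range(3)]
print(f"median {statistics.median(samples):.1f}s over runs {samples}")
```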
I'm curious - do you have evidence of this? My experience is that there are a lot of Python applications, but smaller ones that you install via pip or pipx usually have few dependencies, and larger ones either bundle their dependencies, or otherwise avoid the whole "install a bunch of dependencies into a virtual environment" experience for their end user installs - because virtual environment management sucks as an end user experience.

The only really big applications I know of are Apache Airflow, and Home Assistant, and both of these seem to have their own unique challenges. That's why I like the HA approach of having container and OS builds. Those sound like much better user experiences than "install this 1000-dependency package into a virtual environment".
Not right now on a Saturday morning while brewing my first coffee, but I will try and put something together.
WinPython and Apache Superset have at least as many dependencies as Apache Airflow, and there are definitely a lot more such projects; I will make a notebook and try to provide more definitive evidence in the future. I have seen several projects list Apache Airflow as a dependency, but right now I don't have a good way to search for them (I thought there was a website that let you search Python dependencies globally, but I can't seem to find it).
Well, this reduces the problem of installing all these packages from being a user problem to a developer problem, which is better but not exactly solved; if the problem becomes too difficult for the developer, then the process breaks anyway. Further, it only works when a user is using these Python projects as a standalone application. If the user wants to use them as a Python library, modify the container to work in a different environment, or make many other kinds of changes, they still have to install everything anyway.
No worries, I'm not trying to put pressure on you here. I'm just surprised that our experiences differ so significantly. Not least, we get (as far as I know) few if any issues raised by these larger projects apart from Airflow and (occasionally) HA. I don't honestly know what to make of that information, but I do find it interesting.

As I've said, my personal interest is very much more in the area of making pip fast and convenient for small installs that are run frequently - disposable virtual environments, tools like tox, nox and pip-run, etc. I'd rather the packaging community found a better way to build and distribute applications than making some sort of "master" package with huge numbers of dependencies that get installed via pip, and that we encouraged shallower, wider dependency trees - but I think I'm a lone voice in the wilderness on that subject, unfortunately.
But isn't a big problem here that we are using Pip as the main installer, and Pip does not guarantee dependencies between installs? For example, once we move to a world where Pip is considered low-level install machinery and we are mainly using a tool that manages our environment rather than just our individual installs, won't it be a lot easier to separate out into smaller packages and not worry that a user will break their environment when they install something?
For applications, that should be solved by pinning, surely? But we're very off-topic here and I don't have any experience managing a big application like this, so let's drop this line of discussion.
You've asked me a clarifying question and asked to drop this line of discussion; I will do both, in that order. It's a problem for applications because, if an application were split into sub-applications, they could not guarantee the user would get the correct dependencies. So in this sense Pip encourages large monolithic applications with giant dependency trees.
As I was investigating a similar issue, I also threw py-spy at this scenario. We can indeed see path_to_url dominating the profile.
Yes, I believe these are two different
Description
The fastest way to install Python packages, and the generally advised best practice, is to use wheels.
However, when you actually try to install ~1000 wheels from a single set of requirements, there are significant performance bottlenecks in Pip.
Expected behavior
Installing pre-resolved wheels offline should be very, very fast.
pip version
23.3.dev0
Python version
Python 3.11
OS
Linux
How to Reproduce
python3.11 -m venv .venv
source .venv/bin/activate
git clone https://github.com/home-assistant/core
cd core/
python -m pip download -d {download_directory} --prefer-binary -r requirements_all.txt
cd {download_directory}
for file in $(ls *.tar.gz); do pip wheel --no-deps "$file" && mv "$file" "$file".built ; done
for file in $(ls *.zip); do pip wheel --no-deps "$file" && mv "$file" "$file".built ; done
cd -
time python -m pip install --dry-run --only-binary=:all: --no-index --ignore-installed --find-links file://{download_directory} -r requirements_all.txt
Output
On my machine:
Running a profile on this, I see that pip._internal.utils.url.path_to_url is called over 3,000,000 times and takes ~25% of the total run time.
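(One way to reproduce that kind of profile is to drive pip in-process under cProfile and filter the stats, roughly as sketched below; note that pip._internal is not a supported API, and the find-links path and requirements file are placeholders.)

```python
import cProfile
import pstats

# Unsupported internal entry point, used here only for convenient in-process profiling.
from pip._internal.cli.main import main as pip_main

with cProfile.Profile() as profiler:
    pip_main([
        "install", "--dry-run", "--only-binary=:all:", "--no-index",
        "--ignore-installed", "--find-links", "file:///path/to/download_directory",
        "-r", "requirements_all.txt",
    ])

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats("path_to_url")  # restrict output to the helper discussed above
```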