Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise logic for displaying packages at the end of pip install #12791

Merged
merged 3 commits into from
Jun 27, 2024

Conversation

morotti
Copy link
Contributor

@morotti morotti commented Jun 24, 2024

PERF: 5% faster pip install. algorithmic bug at the end of pip install in the code to display installed packages. O(n^2) to enumerate installed packages due to accidental loop in loop.

get_distribution(package_name) does a loop over all installed packages.
that is quite surprising and unexpected. it should not be used inside a loop.

I rewrote the loop to iterate with iter_all_distributions(). that's what was called under the hood.

by the way, this code to display installed packages is more complex than it ought to be. I think it was intentional to display the package name exactly as it was provided by the user.

on a pip install run that takes 11 seconds:
the loop takes 0.735960 seconds on main branch.
the loop takes 0.064672 seconds with this fix.

@morotti morotti force-pushed the perf-do-not-loop-in-loop branch from bcb47f2 to f023907 Compare June 24, 2024 17:09
@pradyunsg
Copy link
Member

pradyunsg commented Jun 24, 2024

Wheee! Nice find @morotti!

Could you add a news fragment (as described in the section linked from the details of the failure)? Specifically, it should be 12791.bugfix.rst, not bugfix.12791.rst.

(the approval above is subject to CI being happy, which it isn't yet due to the issue with the missing news fragment!)

@pradyunsg pradyunsg added the type: enhancement Improvements to functionality label Jun 24, 2024
@ichard26
Copy link
Member

I will note that most of the overhead of get_distribution on Python 3.11+ will be eliminated once #12656 is landed. This makes sense either way as an algorithmic optimization though! Also, I strongly suspect that the percentage performance increase varies based on how many packages are already installed beforehand. I'd avoid mentioning a specific figure in the NEWS entry IMO.

@morotti
Copy link
Contributor Author

morotti commented Jun 25, 2024

Hello, I think I fixed everything but there is a hook left failing "- hook id: end-of-file-fixer" rejecting the news file. How is the news file supposed to be formatted?

@morotti morotti force-pushed the perf-do-not-loop-in-loop branch from ea5a1ee to 3e5dd90 Compare June 25, 2024 10:32
@pfmoore
Copy link
Member

pfmoore commented Jun 25, 2024

Hello, I think I fixed everything but there is a hook left failing "- hook id: end-of-file-fixer" rejecting the news file.

From what I recall, that means you don't have a newline at the end of the file.

@morotti
Copy link
Contributor Author

morotti commented Jun 25, 2024

I've done a bit more testing with various types and amount of packages to install.

I am seeing more than 25% of the runtime of "pip install" is taken to do the final print statement when you reach a few hundreds packages to install. Obviously it's increasing O(n^2) with the number of packages.

It's amazing, I haven't seen a bug like this for years. That might genuinely make pip the third slowest package manager ever written, right behind easy_install that had a similar O(n^2) bug to process package with distinct versions and the windows update system in windows XP that was O(n^2) with the number of updates. :)

image

@morotti morotti changed the title PERF: 5% faster pip install. algorithmic bug at the end of pip install to display packages. PERF: 25% faster pip install. algorithmic bug at the end of pip install to display packages. Jun 25, 2024
@pradyunsg pradyunsg changed the title PERF: 25% faster pip install. algorithmic bug at the end of pip install to display packages. Optimise logic for displaying packages Jun 25, 2024
@pradyunsg
Copy link
Member

I've retitled this PR to something that more clearly reflects the change made, rather than the effect of the change.

@morotti I'd like to ask you to read https://cbea.ms/git-commit/ and rephrase the commit message to follow the guidance provided there.

(for other maintainers: when merging this, please either squash merge this with the commit message modified to follow the typical format or rewrite the commit message via a push yourself before merging this!)

@pradyunsg pradyunsg changed the title Optimise logic for displaying packages Optimise logic for displaying packages at the end of pip install Jun 25, 2024
morotti pushed a commit to man-group/pip that referenced this pull request Jun 25, 2024
get_distribution("package_name") does a loop over all packages
to find the intended package name.

the code to display installed packages was accidentally O(N^2)
for iterating over packages and calling get_distribution() on each.

Resolves: pypa#12791
@morotti morotti force-pushed the perf-do-not-loop-in-loop branch from 3e5dd90 to d77af5a Compare June 25, 2024 12:30
@morotti
Copy link
Contributor Author

morotti commented Jun 25, 2024

I've read the blog article and adjusted the commit message. I hope that suits you.

I notice the windows tests are failing, for reasons unrelated to this commit, and there is a format check still complaining about the news, I tried to adjust the formatting but I'm sorry to say I don't get it.

Can I offer to any maintainer to make the commit and news as you expect and merge? I really don't mind having or not having the commit under my account. I would just like for fixes to be merged and make the world a faster place.
It is quite challenging to onboard to a new project and learn the hundreds of CI and small rules to complete a trivial PR.

@pfmoore
Copy link
Member

pfmoore commented Jun 25, 2024

As I said previously, you simply need to add a newline at the end of the news file. I'm afraid I don't know enough git to fix this PR in place, so it'll be simpler if you do it, sorry.

get_distribution("package_name") does a loop over all packages
to find the intended package name.

the code to display installed packages was accidentally O(N^2)
for iterating over packages and calling get_distribution() on each.

Resolves: pypa#12791
@morotti morotti force-pushed the perf-do-not-loop-in-loop branch from d77af5a to 0beffc7 Compare June 25, 2024 16:24
@morotti
Copy link
Contributor Author

morotti commented Jun 26, 2024

All green now, this can be merged.

news/12791.bugfix.rst Outdated Show resolved Hide resolved
src/pip/_internal/commands/install.py Outdated Show resolved Hide resolved
src/pip/_internal/commands/install.py Show resolved Hide resolved
@ichard26
Copy link
Member

Oh, and thank you for the PR! It's nice to see the low-hanging fruit for optimising pip be addressed 🙂

@ichard26
Copy link
Member

It's amazing, I haven't seen a bug like this for years. That might genuinely make pip the third slowest package manager ever written, right behind easy_install that had a similar O(n^2) bug to process package with distinct versions and the windows update system in windows XP that was O(n^2) with the number of updates. :)

I'm sure you'll be amused to hear that the old logic was also quadratic and was similarly linearised ~6 years ago cbc21a5 :)

@morotti
Copy link
Contributor Author

morotti commented Jun 27, 2024

All adjusted as per the last comments. should be good to merge.

I'm not sure if you squash-merge or if you expect me to squash in the branch?

@pradyunsg pradyunsg merged commit 0d1cde3 into pypa:main Jun 27, 2024
28 checks passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 13, 2024
@ichard26 ichard26 added the type: performance Commands take too long to run label Dec 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bot:chronographer:provided type: enhancement Improvements to functionality type: performance Commands take too long to run
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants