Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vendor resolvelib 1.1.0 #13001

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

notatallshaw
Copy link
Member

@notatallshaw notatallshaw commented Oct 11, 2024

Fixes #12768
Fixes #12317
Fixes #12305
Fixes #12497
Fixes #12754 (comment)

Draft PR of resolvelib beta to see if it passes tests. Will mark as ready to review and ask for review/merge when final is ready.

@notatallshaw
Copy link
Member Author

notatallshaw commented Oct 18, 2024

So I checked how resolution performed using https://github.com/notatallshaw/Pip-Resolution-Scenarios-and-Benchmarks, and I have good news and bad news.

Firstly, as expected, due to correctness fixes #12768 and #12317 resolve instead of showing ResolutionImpossible.

Some unexpected good news is this resolves #12305.

Some slightly bad news is this turns #12990 from a Build Failure to a ResolutionTooDeep, but this can be fixed if #13017 lands.

Some worse news is that the requirements in astral-sh/uv#1398 go from resolving (after some time), to ResolutionTooDeep, and I don't have an immediate optimization to fix this. It will require deeper investigation, but I don't think that should stop pip from vendoring resolvelib as it fixes correctness errors and other cases that do produce ResolutionTooDeep.

@notatallshaw
Copy link
Member Author

Ready for review / merge once 24.3 is closed.

@notatallshaw
Copy link
Member Author

It turns out I accidentally broke depth by changing some resolvelib behavior: pdm-project/pdm#3235 (comment)

I am going to investigate further on this to see if it's causing the performance regression I see in astral-sh/uv#1398, if so I'm going to make a PR on resolvelib, or if removing depth doesn't make any difference on performance then it might be worth dropping as a preference like PDM has.

@notatallshaw
Copy link
Member Author

notatallshaw commented Nov 10, 2024

Okay, I've tried fixing depths and removing depths and I ran the scenarios in https://github.com/notatallshaw/Pip-Resolution-Scenarios-and-Benchmarks to see what is the difference (the left hand side is fixing, the right hand side is removing):

uv run compare.py --github-repo-1 "notatallshaw/pip" --git-commit-1 3334e33066c30bef609d54b4d888030ad5dbef61 --github-repo-2 "notatallshaw/pip" --git-commit-2  03d9fd86ab28e9b61066d0e5b1395ddc842b29f6
Reading inline script metadata from `compare.py`
Installed 8 packages in 10ms
Difference for scenario scenarios/problematic.toml - autogluon:
    	Number of requirements processed: 197 -> 177
    	Number of packages processed: 261 -> 241

Difference for scenario scenarios/problematic.toml - TTS:
    	Success: False -> True.
    	Failure Reason: Resolution Too Deep -> None.
    	Number of requirements processed: 143 -> 156
    	Number of packages processed: 357 -> 222

Difference for scenario scenarios/problematic.toml - boto3-urllib3-transient:
    	Number of requirements processed: 429 -> 141
    	Number of packages processed: 811 -> 2675  (Both ended up with ResolutionTooDeep)

Difference for scenario scenarios/problematic.toml - apache-airflow-beam:
    	Number of packages processed: 312 -> 161

Difference for scenario scenarios/problematic.toml - kedro-test:
    	Number of requirements processed: 335 -> 337
    	Number of packages processed: 522 -> 565

Difference for scenario scenarios/problematic.toml - sphinx:
    	Number of requirements processed: 71 -> 79
    	Number of packages processed: 166 -> 199

Difference for scenario scenarios/problematic.toml - backtracks-to-old-scipy:
    	Number of requirements processed: 74 -> 86
    	Number of packages processed: 111 -> 129 (Both end in build failure)

Difference for scenario scenarios/problematic.toml - django-stubs:
    	Number of packages processed: 36 -> 25

Difference for scenario scenarios/big-packages.toml - apache-airflow-all:
    	Not the same install files.
    	Number of requirements processed: 601 -> 597
    	Number of packages processed: 733 -> 696

Difference for scenario scenarios/big-packages.toml - acryl-datahub-all:
    	Number of requirements processed: 347 -> 334
    	Number of packages processed: 711 -> 504

There are some interesting takeaways:

  1. Using depth as a preference causes the TTS Scenario to result in ResolutionTooDeep, vendoring resolvelib 1.1.0 causes this scenario to resolve because it breaks the depth preference
  2. Some scenarios resolve faster with depth removed as a preference, some are faster with depth, except for ResolutionTooDeep cases it seems removing is in general better
  3. Cases where both hit ResolutionTooDeep (i.e. boto3-urllib3-transient) the number of packages processed (and hence the wall clock time) can end up being much bigger with removing depths because it goes deep on one package (in this case botocore because it has thousands of releases)
  4. Of the resolutions which succeeded only the apache-airflow-all scenario produced a different installation result, and without going into too much details it's because apache airflow doesn't put lower bounds on their providers packages: Put reasonable lower bounds on providers in Airflow's extras apache/airflow#39100, making their resolution search space massive

My conclusions from this are:

  1. On the whole, removing depth seems to be a net positive, probably because resolvelib now backjumps and skips of unnecessary backtracking state, and it’s worth removing like PDM already has
  2. It may be worth reducing the maximum number of backtracking steps before hitting ResolutionTooDeep

@notatallshaw notatallshaw force-pushed the vendor-resolvelib-1.1.0 branch from d309e2c to 3a157bb Compare November 10, 2024 20:14
@notatallshaw notatallshaw marked this pull request as ready for review November 10, 2024 20:35
@notatallshaw
Copy link
Member Author

notatallshaw commented Nov 10, 2024

This is ready for review now.

I have removed depth as a preference, which unfortunately means removing the only unit tests that were unit testing the PipProvider (though it is tested greatly as part of functional tests), but I will add additional unit tests to get_preference, and friends, in a follow up PR (that will replace #13017) once this lands.

@ichard26 ichard26 requested a review from uranusjr December 9, 2024 04:02
@notatallshaw
Copy link
Member Author

notatallshaw commented Dec 14, 2024

I'm marking this back to draft while we discuss performance issues with resolvelib 1.1, see discussion here: sarugaku/resolvelib#180

At some point we might have to decide whether to vendor this version of resolvelib, which includes an important correctness fix, and some performance improvements, but some big performance regressions, or decide to wait for a resolvelib 1.2 which might be able to regain some of the performance back in very deep backtracking situations (e.g. boto3 / urllib3).

I would very much like input from other maintainers on that, both here and over on the resolvelib performance issue.

@ichard26 ichard26 mentioned this pull request Dec 16, 2024
@ichard26
Copy link
Member

@notatallshaw I'm generally in favour of correctness improvements as AFAIU the user usually has workarounds to deal with excessive backtracking / ResolutionImpossible, i.e. restrict the version ranges pip considers. In contrast, issues like #12317 cause pip to have straight-up broken behaviour. A valid solution exists (and within the backtrack limit), but pip skips over it. Last time I had to debug this with a colleague, it was quite annoying and frustrating.

Also, there are additional optimizations in the works, right? I see that you have a draft optimistic backjumping PR on resolvelib's side, and then there's #12317 (prefer direct conflicts on backtrack) on pip's end.

How common are the known scenarios where this PR is a net negative? As long as they're not common, I'd say the small amount of fallout doesn't outweigh the correctness improvements.

cc @pfmoore @sbidoul

@sbidoul sbidoul added the C: dependency resolution About choosing which dependencies to install label Jan 12, 2025
@pfmoore
Copy link
Member

pfmoore commented Jan 12, 2025

I agree with @ichard26 - we should prioritise correctness over performance. Claiming there is no solution when there is one, is a bug, and we should prioritise fixing that even if it requires a slower approach.

Pip's focus is on standards compliance and correctness, with performance being secondary (although clearly important as long as the first two criteria don't suffer). If users need performance and are willing to compromise on the other factors, then they are likely to be choosing uv anyway (and even if they prefer to use pip in general, "use uv" is a reasonable workaround for cases where pip's performance is an issue and strict correctness doesn't matter).

@notatallshaw
Copy link
Member Author

notatallshaw commented Jan 12, 2025

Thanks both your feedback, the problem here is performance means the user will see a "ResolutionTooDeep", there are definitely real cases involving boto3 and urllib3 as dependencies where this will happen with resolvelib 1.1.0 where it did not happen in the previous version. So if this is merged we are likely to see additional users complain.

If other maintainers are comfortable with that, knowing we are achieving correctness as best we can, at the cost of some users hitting resolution problems for now, then I am going to mark this as ready for review, if we can merge soon, I can create another PR to add a few simple performance improvements with the new API before pip 25.0.

I took a look at implementing a fallback inside resolvelib but also found it would cause at least one known "ResolutionTooDeep" that resolvelib 1.1.0 doesn't 🙁. So I'm still on the fence about that.

I'm going to take a deeper look in a few weeks if we can have our cake and eat it, i.e. introduce this optimization while remaining formally correct, but my conclusion may end up being that the resolvelib API needs to change in non-trivial ways, or we need to consider writing a new resolver, one that learns the lessons from resolvelib, uv, and Poetry.

@notatallshaw notatallshaw marked this pull request as ready for review January 12, 2025 14:43
Copy link
Member

@ichard26 ichard26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's to another day where I am annoyed that botocore has thousands of releases... even if it keeps us honest.

@@ -0,0 +1 @@
Upgrade resolvelib to 1.1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably call out that this improves correctness, but may introduce worse performance in certain situations. I'm being hand-wavy with my wording as I assume you can come up with something better.

Copy link
Member Author

@notatallshaw notatallshaw Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the news item, and seperately I think I will make ResolutionTooDeep a diagnostic exception and point users to a specific pip issue where I can collect scenarios where this is happens.

@pfmoore
Copy link
Member

pfmoore commented Jan 12, 2025

Here's to another day where I am annoyed that botocore has thousands of releases... even if it keeps us honest.

I'll be blunt, I think that "thousands of releases" should stress a backtracking resolver (or for that matter any form of resolver). If users are writing requirements that state that any of many thousands of botocore releases are equally valid, then I'd question those requirements. Obviously, no-one is actually going to bother trying to add lower bounds to every botocore requirement out there - and it only takes one unbounded requirement, selected at the wrong point in the process, to blow up. But if we're optimising for the use case of "please pick any one of these 3,000 candidate versions" then something is seriously skewed1.

Maybe what we should have is some sort of limit in the finder - by default, only consider the latest 1002 versions of a project. Then, people using a pathological project like botocore would have to explicitly opt-in with --max-candidates=2000. This would cause some users pain, I'm sure, but it would put the focus on the actual cause of the problem, which is projects that publish unreasonable numbers of releases.

Footnotes

  1. And IMO, what's skewed is botocore's release policy, not our resolver 🙁

  2. I picked that number out of the air - if I were doing this for real, I'd do some analysis before choosing the default.

@notatallshaw notatallshaw force-pushed the vendor-resolvelib-1.1.0 branch from bd61563 to 96ef7e2 Compare January 12, 2025 18:11
@notatallshaw notatallshaw force-pushed the vendor-resolvelib-1.1.0 branch from 96ef7e2 to aaa3ce7 Compare January 12, 2025 18:11
@notatallshaw
Copy link
Member Author

notatallshaw commented Jan 12, 2025

I'll be blunt, I think that "thousands of releases" should stress a backtracking resolver (or for that matter any form of resolver). If users are writing requirements that state that any of many thousands of botocore releases are equally valid, then I'd question those requirements. Obviously, no-one is actually going to bother trying to add lower bounds to every botocore requirement out there - and it only takes one unbounded requirement, selected at the wrong point in the process, to blow up. But if we're optimising for the use case of "please pick any one of these 3,000 candidate versions" then something is seriously skewed

Setting aside the problem of thousands of releases, these complex dependency graphs naturally arise from how Python packaging handles dependencies. For example, Python package dependencies are an NP hard problem, other languages (e.g. rust and NodeJS) sidestep this by allowing multiple versions of the same library as a solution. Even without boto3 there are many other examples where resolution is very hard, just boto3 sticks out more obviously because it can cause a huge number of downloads.

Maybe what we should have is some sort of limit in the finder - by default, only consider the latest 1002 versions of a project. Then, people using a pathological project like botocore would have to explicitly opt-in with --max-candidates=2000. This would cause some users pain, I'm sure, but it would put the focus on the actual cause of the problem, which is projects that publish unreasonable numbers of releases.

I'm dubious but would be happy to discuss elsewhere, I think this would break resolutions in a non-obvious way. Let's say you have some package foo that has versions 1 to 10'000, and out limit is 1000. If the user depends on A and B, and the latest version of A depends on foo > 5000 and the latest version of B depends on foo < 8000, the user would get a ResolutionImpossible but only after backtracking through every version of B, unless an old version of B didn't put an upper bound on foo but was functionally incompatible the user would find a solution but one in which B was functionally broken.

And the problem is the number of possible solutions can grow exponentially along dimensions other than number of candidates, e.g. number of requirements. So this doesn’t work when the number of requirements are causing the issue, in which case you may need a very small number of max candidates (e.g. 10) to avoid a ResolutionTooDeep, and you may just get a ResolutionImpossible instead and you would have to do significant analysis to understand why.

@pfmoore
Copy link
Member

pfmoore commented Jan 12, 2025

Setting aside the problem of thousands of releases, these complex dependency graphs naturally arise from how Python packaging handles dependencies.

Agreed 100%. I'm not trying to minimise the complexity of the problem here, but I think that packages like botocore with thousands of releases are (and should be treated as) very much outliers. I don't think we should compromise correctness, or the performance of simpler cases, just to accommodate such projects.

I'm dubious but would be happy to discuss elsewhere, I think this would break resolutions in a non-obvious way.

It's quite possible it would. I haven't done any sort of analysis at this point. I'm happy to drop the idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment