Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer closest conflicting causes when backtracking #12459

Closed

Conversation

notatallshaw
Copy link
Member

@notatallshaw notatallshaw commented Jan 2, 2024

This PR makes Pip far more efficent at resolving when there are many possible causes of as part of a backtracking conflict. E.g. sphinx sphinx-rtd-theme sphinx-toolbox myst-parser sphinxcontrib-bibtex nbsphinx currently produces "ResolutionTooDeep" but solves quickly under this PR.

This PR comes in three commits:

  1. A pseudo vendoring of resolvelib, this commit can be deleted for when/if sarugaku/resolvelib#145 is accepted and resolvelib is next released and vendored
  2. Moves the existing logic of prefering backtracking from get_preference to a new method filter_unsatisfied_names, which solves this users issue: #10621 (comment)
  3. Implements causes_with_conflicting_parent and causes_with_no_candidates functions and uses them as part of the new filter_unsatisfied_names to prefer conflicting causes

Still to do:

  1. Get the resolvelib PR agreed and accepted, I am making this PR now because I think the approach is ready for review and it will also confirm the need for the resolvelib PR
  2. Confirm correctly testing emptyness of specifier set: packaging#760
  3. Write some tests, particularly all known scenarios for causes_with_conflicting_parent and causes_with_no_candidates

@notatallshaw
Copy link
Member Author

Seems new version of setuptools is causing a lot of test failures at the moment, I'll rebase once #12458 hopefully lands.

@notatallshaw notatallshaw force-pushed the prefer-conflicting-causes branch from 3ed5bb2 to 0d9f8eb Compare January 3, 2024 23:31
@notatallshaw
Copy link
Member Author

Another thing to mention, with this optimization it is possible to remove backjumping and still stay performant on complex requirements, see sarugaku/resolvelib#142.

@notatallshaw
Copy link
Member Author

notatallshaw commented Jan 5, 2024

I've broken up the previous function conflicting_causes into two functions causes_with_conflicting_parent and causes_with_conflicting_specifiers.

I beleive individually they are easier to read, their intent more obvious, will be easier to add tests, and actually improves the performance because of the way filter_unsatisfied_names can now check one, then the other.

@notatallshaw notatallshaw changed the title Prefer conflicting causes when backtracking Prefer closest conflicting causes when backtracking Jan 5, 2024
@notatallshaw notatallshaw force-pushed the prefer-conflicting-causes branch 2 times, most recently from cfb8d33 to 97d7a36 Compare January 5, 2024 05:46
@notatallshaw
Copy link
Member Author

notatallshaw commented Jan 5, 2024

It seems I got confused about how packaging handles the intersection of SpecifierSets, I thought it in some sense it "resolved" them and I could use that to work out if two SpecifierSets were mutually exclusive. This was incorrect.

I've created a fairly simple _is_mutually_exclusive_specifier_sets, which works in simple cases, but I have an open question about boundry conditions: pypa/packaging#762 (comment)

I also need to retest all my performance results.

@notatallshaw notatallshaw force-pushed the prefer-conflicting-causes branch from 97d7a36 to 9b19347 Compare January 5, 2024 08:46
@notatallshaw
Copy link
Member Author

I've been working on, and exhaustively testing, a function that can determine if two specifiers are mutually exclusive. I think this would depend on upgrading the packaging library, because I don't want to think about legacy versions, also I think I may have found a bug in the packaging library (pypa/packaging#766).

It occurs to me, that if possible, a much better approach to this would be to check if all the specifiers produce 0 possible packages from all possible packages, but I don't know if this is possible at provider resolver time, I will investigate. If this is obvious to anyone, feedback would be welcome.

This PR is becoming a bit messy from me repeatedly updating it as I test and self-review. I may delete it and recreate it so it's easier to review when it's at that stage.

@notatallshaw notatallshaw force-pushed the prefer-conflicting-causes branch from 9b19347 to 0bbfd8c Compare January 6, 2024 19:13
@notatallshaw notatallshaw force-pushed the prefer-conflicting-causes branch from 0bbfd8c to ec890bf Compare January 6, 2024 19:16
@notatallshaw
Copy link
Member Author

notatallshaw commented Jan 6, 2024

It occurs to me, that if possible, a much better approach to this would be to check if all the specifiers produce 0 possible packages from all possible packages, but I don't know if this is possible at provider resolver time, I will investigate. If this is obvious to anyone, feedback would be welcome.

I was able to implement this, I replaced the function causes_with_conflicting_specifiers with causes_with_no_candidates. This works well in the sense that it doesn't require Pip to understand the intricacies of specifiers, all the specifier logic remains outsourced to packaging.

However this has the downside that it may cause more downloads to get candidate metadata and getting that metadata might throw an exception if an sdist fails to build. I have therefore made causes_with_no_candidates return the first cause pair it finds rather than searching for all possible causes, and ordered causes_with_conflicting_parent infront of it. This seems to work well in the test scenarios I have run so far, I will spend some time rerunning a wide range of scenarios and building performance numbers.

It may be worth adding back causes_with_conflicting_specifiers, it could run before causes_with_no_candidates, to have the best of both worlds. But the more I work on a function which can identify mutually exclusive package specifiers the more it feels like it's complex and novel enough to be academic paper worthy. I would be inclined to add this as a seperate PR at a later date.

@notatallshaw notatallshaw force-pushed the prefer-conflicting-causes branch from 5c73c71 to d5a569b Compare January 7, 2024 19:21
@notatallshaw
Copy link
Member Author

notatallshaw commented Jan 7, 2024

Interestingly the PR, in it's current state, seems to fix all known cases of #12317, this was not intentional but I do have some intuitive understanding of why it might have happened.

I have written about it in a little more details here sarugaku/resolvelib#142 (comment)

@pfmoore
Copy link
Member

pfmoore commented Jan 27, 2024

I think it's important that we try to make progress on some of the resolver improvements going on here, so I took a look at this PR. On a high level, though, I'm not 100% sure how the various changes that you have discussed hang together (the ones to do with backjumping, this one, and possibly others). So ideally, it would be helpful if you could put together a unified summary of what you are actually proposing, and what are superseded approaches. Otherwise, I'm concerned that by looking at PRs individually, I'll miss the "big picture". It's quite possible it's actually very simple - my view of the work is based on a flow of github notifications from the pip and resolvelib projects, which is not the most organised way of getting information!!! If so, that's great, and I apologise for my confusion.

Having said that, my biggest concern with this PR is not actually with the PR itself, but with the associated resolvelib change. You're introducing a new filter_unsatisfied_names method, which somehow helps with the resolution process. But as a client of the resolvelib API, I'm not at all sure I understand what that method does, or how it interacts with the existing get_preference method. To be fair, I never understood get_preference - it seemed like something you sort of mostly ignored, or guessed at something that might help, but it wasn't at all clear what it did. From what I recall, @uranusjr felt the same way, and he's the library author!

So what I'd like here from pip's side of things is much clearer documentation in resolvelib on what the get_preference and filter_unsatisfied_names methods are actually expected to do, and how clients are expected to implement them - in the abstract sense, not tied to details like "this fixes the following pathological case in pip". To put it another way, I'd like to see the methods documented in a way that meant that I could look at the resolvelib docs and see what I needed to implement in pip, and why - and then reproduce this PR based on my understanding.

The reason I'm asking for this is that my biggest concern here is that we could end up with some sort of heuristic that no-one really understands, and which mostly "just works" as long as no-one touches it. That would be a nightmare for maintenance. The idea of the new resolver being based on resolvelib was to make sure that the pip maintainers didn't have to manage the details of the resolution process, and conversely to ensure that if someone wanted to write a standalone package resolver, they would not need to copy chunks of pip's logic to do so.

@notatallshaw
Copy link
Member Author

notatallshaw commented Jan 27, 2024

So ideally, it would be helpful if you could put together a unified summary of what you are actually proposing, and what are superseded approaches ...
... my view of the work is based on a flow of github notifications from the pip and resolvelib projects, which is not the most organised way of getting information

You haven't overlooked anything; the situation is a bit tangled right now. Real-world examples revealing the limitations of the current approach have come in slowly, giving me different ideas on how to speed up the resolution over time, and when making this PR I found some of my assumptions faulty and had to pivot parts of the approach accordingly.

I'm thinking of replacing the current issue and PR and create a new issue outlining the overall approach more coherently and following up with a clean PR showcasing the implementation attempt. Would you find this approach more helpful or more frustrating?

my biggest concern with this PR is not actually with the PR itself, but with the associated resolvelib change. You're introducing a new filter_unsatisfied_names method ...
... I'm not at all sure I understand what that method does, or how it interacts with the existing get_preference method ...
... it seemed like something you sort of mostly ignored, or guessed at something that might help

Introducing the filter_unsatisfied_names method wasn't a shot in the dark. It aligns with the primary goal of get_preference, guiding resolvelib through the dependency graph. However, they differ in their approach. While get_preference is called once for each unresolved requirement, filter_unsatisfied_names is called once for all requirements.

This is crucial when it would make sense to short-circuit checking remaining requirements, or when the checks are not independent of checking other requirements, which even in Pip currently, results in duplicated work (and O(n^2) situations).

The approach in this PR doesn't strictly need filter_unsatisfied_names to address the issue of complex backtracking causing ResolutionTooDeep. However, if implemented using get_preference, it would significantly slow down simple backtracking scenarios. If necessary, I can back this up with empirical data using the Pip resolver benchmark tool.

So what I'd like here from pip's side of things is much clearer documentation in resolvelib on what the get_preference and filter_unsatisfied_names methods are actually expected to do

I agree that clearer guidance on get_preference and filter_unsatisfied_names in resolvelib would be helpful. I can open this as an issue on the resolvelib side and work towards implementing it for filter_unsatisfied_names in my PR there.

The idea of the new resolver being based on resolvelib was to make sure that the pip maintainers didn't have to manage the details of the resolution process, and conversely to ensure that if someone wanted to write a standalone package resolver, they would not need to copy chunks of pip's logic to do so.

Currently, resolvelib and pip segregate the generic resolution aspects to resolvelib and the Python packaging nuances to pip. However, addressing the intricacies of Python packaging requires accounting for many of those nuances, and solving complex backtracking efficiently, using a DFS-style algorithm, demands additional insights.

I don't see how these nuances could be moved to resolvelib, to move them out of Pip I think you would need to make another abstracting library, e.g. a Python packaging specific resolvelib.

@pfmoore
Copy link
Member

pfmoore commented Jan 27, 2024

Introducing the filter_unsatisfied_names method wasn't a shot in the dark.

Sorry if it sounded like I was suggesting it was. What I was trying to say was that it's important that using that method (whether in pip, or in some other client of resolvelib) shouldn't be a matter of guesswork. That was the problem with get_preference - everyone basically started by just having it return 1, with a comment along the lines of "no idea what to do with this", and then over time added more logic based on what always felt like either guesswork, or a far too intimate knowledge of the implementation of resolvelib. (And when I refer to "guesswork", I'm specifically meaning that's how I made changes to that function!)

Currently, resolvelib and pip segregate the generic resolution aspects to resolvelib and the Python packaging nuances to pip. However, addressing the intricacies of Python packaging requires accounting for many of those nuances, and solving complex backtracking efficiently, using a DFS-style algorithm, demands additional insights.

Sorry to be blunt here, but that sounds awfully hand-wavy. I'm not suggesting that resolvelib knows about Python packaging - definitely not, having it be a generic resolver library is important IMO. But conversely, resolvelib should be able to describe its API in its own terms, without referring to Python packaging. And that means basically describing "what does this knob do" so that when clients want to tweak it, they can do so in an informed way, rather than just (for example) "let's prefer packages with lots of dependencies over packages with only a few and see what that does" (which is how I always felt I was reasoning about get_preference).

Maybe it'd be better to wait and see the resolvelib PR - debating principles is probably not very productive at this point.

@notatallshaw
Copy link
Member Author

notatallshaw commented Jan 27, 2024

Maybe it'd be better to wait and see the resolvelib PR - debating principles is probably not very productive at this point.

I do not see resolvelib adding a new API unless there is agreement from Pip maintainers it would be a good idea.

Some changes on the resolvelib side are currently stuck, awaiting input from Pip maintainers: sarugaku/resolvelib#134 (comment).

I would really prefer that the outcome of these discussions not lead to a circular lock.

Sorry to be blunt here, but that sounds awfully hand-wavy. I'm not suggesting that resolvelib knows about Python packaging - definitely not, having it be a generic resolver library is important IMO. But conversely, resolvelib should be able to describe its API in its own terms, without referring to Python packaging. And that means basically describing "what does this knob do" so that when clients want to tweak it, they can do so in an informed way, rather than just (for example) "let's prefer packages with lots of dependencies over packages with only a few and see what that does" (which is how I always felt I was reasoning about get_preference).

Well, as I didn't design or implement resolvelib or the new resolver, I can't really take offense. I'm just describing to you the current situation.

The simplest way I can think to answer the question of "what does get_preference do?" is as follows: Imagine implementing a DFS algorithm on a data structure of nodes and child nodes, where each node has an unordered set of its child nodes.

In the DFS, how do you pick which child node should be chosen? You implement a method for choosing them, and in resolvelib, this is done with "get_preference".

Now, each node has conditions on what the final global state should be, and you're trying to select a collection of nodes so none of them disagree about the final global state. To do that performantly, you should understand the conditions different nodes have and the relationships between different conditions, and select child nodes that you think would find this selection of satisfying conditions quicker.

resolvelib, at least currently, does not assume almost anything about those conditions and how they may or may not contradict. It is, therefore, a poor algorithm to solve for complex dependency graphs in a performant manner, e.g. taking less time than the heat death of the universe. The provider must, therefore, encode what it knows about these conditions and guide resolvelib, and this currently happens entirely via get_preference.

@notatallshaw
Copy link
Member Author

As requested, made a cleaner PR #12499 and high level "plan": #12498

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants