Dependency resolution: Limit unnecessary backtracking by inferring constraints from dependencies #11575
I do think it makes sense to scan the top-level requirements and collect the requirements that have only one matching version. But I'm not sure how easy this would be to implement? It would need to check all possible locations Pip could get packages from first (e.g. all indexes, all file locations, etc.). But otherwise it seems reasonable to me; I'm not a Pip maintainer, but I'm sure if someone submits a PR it will be considered seriously.
Maybe I'm misunderstanding you, but I don't think this logic holds beyond top-level requirements. For example: say your top-level requirements are A>1, B>1 and the latest versions are A=3 and B=3. Now say B has a requirement C>1, the latest version is C=3, and C=3 has a requirement A<3. The latest version satisfying A<3 is A=2, so A=2 has to be downloaded. But A=2 has a requirement C<3, and C=2 is the latest version satisfying C<3 - and it turns out C=2 has no dependency on A at all, so A=3 can be used after all, just with C=2. Even though A=3 has no requirement on C and C=2 has no requirement on A, we had to perform backtracking to find this solution. This kind of situation, where the requirements across versions of dependencies and transitive dependencies are not stable, is what stops you from being able to make lots of simplifying assumptions.
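The scenario above can be made concrete with a tiny brute-force sketch (the index data is hypothetical, and predicates stand in for version specifiers like `A<3`):

```python
from itertools import product

# Toy index mirroring the scenario: dependency sets are not stable
# across versions, so the resolver cannot assume the newest pins work.
INDEX = {
    ("A", 3): {},                      # A=3 has no requirement on C
    ("A", 2): {"C": lambda v: v < 3},  # A=2 requires C<3
    ("B", 3): {"C": lambda v: v > 1},  # B=3 requires C>1
    ("C", 3): {"A": lambda v: v < 3},  # C=3 requires A<3
    ("C", 2): {},                      # C=2 has no requirement on A
}

def valid(pins):
    """Do all pinned candidates' requirements hold for the pinned set?"""
    return all(
        check(pins[dep])
        for pkg_ver in pins.items()
        for dep, check in INDEX[pkg_ver].items()
    )

# Walk candidate combinations newest-first, roughly as a resolver would;
# the top-level requirements A>1 and B>1 fix the version lists below.
solution = next(
    {"A": a, "B": b, "C": c}
    for a, b, c in product([3, 2], [3], [3, 2])
    if valid({"A": a, "B": b, "C": c})
)
print(solution)  # {'A': 3, 'B': 3, 'C': 2}
```

The newest combination (A=3, B=3, C=3) fails on C=3's A<3 requirement, and only backtracking C to C=2 yields the valid pin set A=3, B=3, C=2 - the backtrack described above.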
AFAIK pip already does this. If pip complains about a version constraint that it couldn't satisfy, it will show all versions:
So it has already collected the list of versions.
What I mean (and what your example didn't cover, if I understand it correctly) is that pip can handle the dependencies of a package as top-level (or whatever it is called; requirements that pip knows for sure need to be installed, like user requirements) if that package is itself top-level and has only one possible set of requirements (which it has if there's only one version available). I'm now using the term top-level for requirements that pip knows for sure are needed, but correct me if that's not what you meant by it. If not, what is the correct term for that?
Right, but that's only if Pip can't find any version that satisfies the requirements. I don't think that if the user specifies 10 indexes and Pip finds 1 matching version on the first index, Pip will then check the next 9 indexes. But if Pip wants to guarantee there is only 1 matching version, it will need to check all 10 indexes to make sure there is exactly one version available. So in some cases at least, this will slow down finding a solution.
I still don't understand what benefit we can extract from this knowledge. A has its dependencies cached, and they will be applied when searching through some transitive package T that depends on A. Also, we cannot assume that adding some version of T which depends on A will not further restrict or break any dependencies currently defined for A; e.g. A's dependency was B=3.0, but the latest version of T depends on C=1.0, which depends on B=2.0. We also cannot assume that just because the latest version of T depends on A, any previous version of T will also depend on A, so T and T's dependencies may need to be carefully backtracked through even if they specify A.
Yes, I use top-level requirements to mean everything that was initially provided and cannot change, e.g. a requirements file the user directly provided.
Does it? Doesn't it try to get the highest version available, even if that means searching through multiple indexes?
If I run

We cannot assume anything, but if any requirement breaks any top-level constraint, it's automatically invalid instead of possibly being backtracked. That's the benefit of having A's dependencies as top-level requirements.
That's why it should only infer dependencies as top-level if all the versions have the same dependencies. If there's a single available version, it always has the same dependencies, so its dependencies can always be inferred as top-level (if it is itself top-level, at least).
"Top level" means specified by the user on the command line or in a requirements file. Everything else is "dependency data". In a sense, "top level" requirements are ones that pip "knows for sure" are needed - but dependency data can refine or further restrict the top level requirement on a project, so that adds complexity. In particular, the requirements that pip's resolver code works with internally often aren't identifiable as "top level or not". That's not to say that we couldn't track that information, but it could be a lot harder than you think, as multiple requirements for a single project (some of which may be top level, others of which might not) get combined into a single requirement deeper in the code.
It's certainly possible that pip is doing unnecessary work. The problem is that it's far harder than you'd imagine to determine if pip knows something is unnecessary - you (as a human, with an understanding of the packages involved) might know, but pip has a much more limited view, and typically the problem is that pip simply doesn't know something that you know. It might be possible to change the code so that pip has access to that extra knowledge, but that's often not as easy as it seems at first.
Here's a good example of this. You simply can't say "A depends on B < 2.0". Dependencies are not defined per project, but per individual version of a project (per candidate, to use the term pip's code uses). Sure, A 1.0 may depend on B < 2.0. But A 2.0 may depend on something completely different, not even depending on B at all. And pip won't know this without downloading A 2.0 and extracting its dependency data (a costly operation, so one we avoid doing until the last possible moment). Pip may not even know if versions of A other than 1.0 exist.

The key function in the resolver that works out "what is available" returns a list of candidates (in priority order) that satisfy a requirement. That code is called in a context where it's not possible to tell if the requirement is a "top level" requirement (one that the user specified, so will always have to be satisfied) or a dependency. And indeed, as I noted above, it might even be a combination of both. So while you might be right, and pip can't usefully get any benefit from trying B 1.0, the problem lies in making pip aware of that information. And equally importantly, doing so in a way that doesn't slow down other, simpler, sets of requirements.

(The problem that the resolver is solving is, in theory, exponentially hard, so there will always be cases that are unacceptably slow. We can't avoid that even in theory, so the best we can do is make a set of compromises that mean the bad cases aren't ones people care about in real life. So "does it slow down other cases" is a critical measure of whether an optimisation is acceptable.)

Unfortunately, it really isn't possible to evaluate a suggestion like this without doing a pretty deep dive into how the resolver works. Most of the "low hanging fruit" has been picked off by now, I'm afraid. If you are interested in looking further into this, the places to look are the source of the
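A tiny sketch of that per-candidate point (the data and helper are made up, not pip's code): dependency metadata is keyed by (project, version), and since fetching it is expensive, it is deferred until a candidate is actually considered and then cached:

```python
from functools import lru_cache

# Illustrative metadata store: dependencies exist per candidate
# (project, version), never per project. A 1.0 and A 2.0 can disagree.
FAKE_METADATA = {
    ("A", "1.0"): ["B<2.0"],
    ("A", "2.0"): [],          # A 2.0 may not depend on B at all
}

@lru_cache(maxsize=None)
def get_dependencies(name, version):
    # In real life this triggers a download and possibly a build,
    # which is why it's done as late as possible and cached.
    return FAKE_METADATA[(name, version)]

print(get_dependencies("A", "1.0"))  # ['B<2.0']
print(get_dependencies("A", "2.0"))  # []
```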
I'm going to assume you made a typo here, as A 1.0, B 1.0 is a valid solution to the requirements you gave (you need A>=1.0 and B>=1.0, and A depends on B<2.0). It doesn't affect the points I make above.
I might be wrong, but I don't think so; Pip has no guarantees on the "highest version available", just that "some" solution is found given your requirements, if that solution exists.
Ah yes, what you're getting at is clear to me now; this totally makes sense, and I agree that on the whole it would probably be an improvement. In practice, though, I don't think any special step would need to be taken to make these dependencies top-level requirements beyond "collect the requirements of top-level packages that have only 1 match first". Because they would be the last set of requirements to remove when backtracking, I think those requirements become effectively top-level requirements for free. And I think it would make sense to do this step of "collecting the requirements and going through the ones that have only 1 match" at each step where a new set of requirements is added. For at least the cases where single matches are found, it could potentially significantly speed up backtracking.

@GideonBear were you looking at creating a PR, or is this just a suggestion?

@uranusjr Any input on whether you think this is a good idea? And any opinions on how someone might go about implementing this? I guess we need to see if this needs to be done with
That's not what we mean by "top level", which might explain some of the confusion.

You're right, it could be possible to (carefully!) transitively expand the set of "known to be mandatory" requirements as we go. That's not how resolvelib works at the moment, as far as I am aware, but it may be possible to add something of this nature. We'd probably have to benchmark such a change to see if it was an improvement, particularly as it might trigger extra package metadata extraction or index requests (which are costly). But in principle it sounds like a possibility. It might be quite a significant change to resolvelib, though...
I think if implemented correctly it shouldn't require any extra metadata, because at some point Pip has to match against these packages anyway (either there is only 1 match, or they are the dependencies of the 1 match). However, the potential extra index lookups seem like a cause for concern to me, and difficult to benchmark because they will mostly affect people with private index setups. Maybe only apply this optimization when there is just one index? (That should cover most users.) And problematically, Pip doesn't have any benchmarks for backtracking anyway, so beyond collecting some real-world examples and showing it improves them, there isn't a good way to show improvements right now.
I understand it might be really hard, I just opened this issue to let you know my thoughts about what might improve the dependency resolver.
Yes, sorry, with that I meant "
That wasn't a typo. Yes,
@pfmoore Do you know how pip does this?
This was just a suggestion, but I would be interested in at least looking into it. I can't guarantee that I can actually do something with it, though, as I've never even looked at the resolver. BTW, thanks to you both for your thoughts on this!
Ah, I see.
You're correct: if pip can do this, it will avoid checking B 2.0.

To an extent, what you're proposing here is a hybrid between two different types of resolver. Pip's current resolver (resolvelib) is a backtracking resolver, which works its way progressively through the "tree" of requirements, backtracking if it hits a requirement it can't satisfy. Other types of resolver (SAT solvers or PubGrub, for example) work by looking "globally" at the full dependency tree. The problem for Python is that extracting dependency metadata is a costly operation, to the point where building the full dependency tree in advance is prohibitively expensive. So we have to use an algorithm that builds the tree "on demand", and backtracking algorithms are the only ones that do that.

Your idea - to partially examine the parts of the tree that we know from the start (the top level requirements) and build as much of the tree from there as we can without ever making a decision that we may need to backtrack over - effectively pre-builds some of the tree for the backtracking algorithm, without doing extra work that may be unnecessary. So I agree with @notatallshaw that this shouldn't incur extra cost in the resolve phase.

Looking at your suggestion this way also suggests a possible way of implementing this to me. The resolver has a mechanism to let it determine a "preference" for what candidate to consider next when faced with a choice. If we could feed the information "this is the only possibility" into that mechanism, it may prefer the "mandatory" parts of the tree without pre-building them (which would have the same effect in practice). The problem is that the preference mechanism is rather a "black art" - even the code's author isn't 100% clear on how it affects the algorithm!
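As a rough illustration of the preference idea, here is a toy model (not resolvelib's actual API - its real hook is `AbstractProvider.get_preference`, with a different signature): given the number of candidates satisfying each open requirement, order the requirements so that forced, single-candidate choices are pinned before any real guess is made.

```python
# Toy preference: sort open requirements so that "only one possibility"
# requirements come first, then those with the smallest fan-out.
# Names and data shapes here are illustrative, not pip's code.
def resolution_order(candidate_counts):
    return sorted(
        candidate_counts,
        # (is_not_forced, fan_out): forced picks first, then the
        # requirements with the fewest alternatives to backtrack over.
        key=lambda name: (candidate_counts[name] != 1, candidate_counts[name]),
    )

print(resolution_order({"A": 5, "B": 1, "C": 3}))  # ['B', 'C', 'A']
```

Pinning B first costs nothing here (there was never a choice to make), but it lets every later decision be checked against B's constraints immediately.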
Without digging into the code, I don't have much more than I posted above. The place to start is in pip._internal.resolution.resolvelib, and the

Having dug that far, I see that

So I guess the tl;dr here is that we do fetch a list of everything, and that list might get passed up the call chain to the point where the resolver could use it. But the call chain is a mess because of the legacy resolver, and making use of that information could be more difficult than we'd like because of that.
Welcome to the fun! The resolver is one of the most tricky bits of pip, but it's also one of the most interesting, in that it's full of genuinely complex and interesting issues to work on. (It's also full of pain-in-the-a** workarounds and edge cases to handle situations that frankly I wish we could just reject, but that's life, I guess 😉)
No problem, it's nice to get back into the resolver code 🙂
Well, I've only ever made one significant contribution to Pip's code, and it was to the resolver. It sure took a lot of work and understanding, but it was fun and I learnt a lot, and I get to feel proud because that code speeds up a lot of people's work without them even knowing. So I would say it's worth looking at. Also, I'll take a look and try to come up with some ideas on how to implement this, but I can't commit to anything right now.
This is resolved with sarugaku/resolvelib#113. |
Hi @pradyunsg! Just making sure you understand my issue correctly; in

When using the main branch of
I don't know if it's actually because of the PR you linked (because it seems like it isn't released yet?), but it's certainly solved! |
FYI, the main branch of Pip does not yet include sarugaku/resolvelib#113, which will be a huge improvement to backtracking. It does, however, include sarugaku/resolvelib#111, which seems to have itself made some big improvements in certain situations.
I thought so, yeah. But the issue was resolved anyway. |
@pradyunsg After some testing, the current improvement comes from sarugaku/resolvelib#111. Perhaps sarugaku/resolvelib#113 would have independently improved it, but there is no improvement in the backtracking with sarugaku/resolvelib#113 applied on top of current pip main for this specific set of requirements.
Yes, agreed. I tested with Pip main and with Pip main plus sarugaku/resolvelib#113 applied, and there are no differences. The one backtrack will always be required because Pip will try to grab the latest versions, do a BFS with the user requirements, and then backtrack on any conflicting requirements.
What's the problem this feature will solve?
Currently, I depend on `A>=1.0` and `B>=1.0`. `A` depends on `B<2.0`.

When I install my dependencies, pip first downloads `B-2.0` and `A-1.0`, as those are the latest versions that match my requirements. Then it sees that `A-1.0` wants `B<2.0`, so it downloads `B-1.0`.

This is unnecessary backtracking, as A always has `B<2.0` as a constraint.

In my case, pip was even backtracking package C just to make it compatible with `B-2.0`, which isn't even compatible in the first place!

Real-life example:
I run `pip install "pylint>=2.16.dev" "perflint>=0.7.3"`. `perflint` depends on `pylint<3.0`, but `pylint-3.0.0a5` is still installed.

After that it begins backtracking `astroid` and different versions of `pylint-3.0.0`; after it's done, it notices that `pylint-3.0.0` isn't compatible with `perflint`, after which it tries `pylint-2.16.0.dev1`, which works perfectly.

Complete output:
Describe the solution you'd like
Pip could always download packages which have only one matching version before packages with multiple matching versions, because it knows for sure it doesn't have to (or can't) backtrack them.
Then, for every package for which pip has downloaded all the versions with the same set of requirements (which includes the packages with only one matching version), and which it knows for sure is actually required (for example, from user requirements), handle its requirements the same way as requirements from the user (or at least, stop backtracking them).
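A minimal sketch of the proposed inference (the function name and data shapes are invented for illustration, not pip's API): if a top-level package has exactly one matching version, its dependencies are guaranteed to be needed and can be merged into the top-level requirement set.

```python
# Hypothetical helper: promote the dependencies of single-match
# top-level packages into the top-level constraint set.
def promote_single_match_deps(top_level, matching_versions, deps):
    """top_level: {name: set of constraint strings}
    matching_versions: {name: list of versions satisfying the constraints}
    deps: {(name, version): {dep_name: constraint string}}"""
    inferred = {name: set(cs) for name, cs in top_level.items()}
    for name in top_level:
        if len(matching_versions[name]) == 1:  # only one way to satisfy it
            only = matching_versions[name][0]
            for dep, constraint in deps.get((name, only), {}).items():
                # These constraints are as mandatory as user requirements.
                inferred.setdefault(dep, set()).add(constraint)
    return inferred

# The A/B example from above: A has exactly one matching version.
top_level = {"A": {">=1.0"}, "B": {">=1.0"}}
matching_versions = {"A": ["1.0"], "B": ["2.0", "1.0"]}
deps = {("A", "1.0"): {"B": "<2.0"}}
print(promote_single_match_deps(top_level, matching_versions, deps))
```

With `<2.0` attached to `B` at the top level, the resolver would never pin B 2.0 in the first place, so the backtrack described above disappears.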
This should reduce backtracking, as it's essentially the same as specifying `B<2.0` in the requirements: pip can infer that `B<2.0` is actually needed (just like it can infer that from the user requirements) without installing `B`.

Alternative Solutions
You could always just let it run (it takes only a couple of backtracking rounds to resolve correctly), but an easy workaround is to specify A's requirement on package B in your own requirements, which results in no backtracking at all:
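For the abstract example above, the workaround amounts to restating A's constraint on B at the top level, e.g. in a requirements file (package names taken from the example):

```
A>=1.0
B>=1.0,<2.0
```

With `B<2.0` stated up front, pip never considers B 2.0. The real-life analogue would be adding `pylint<3.0` alongside `perflint>=0.7.3`.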
Additional context
It seems to me this is possible, but I don't know how dependency resolution works, and it's possible my logic is flawed, so sorry if this is an impossible request.
If any part of my request is unclear, please let me know.
Pip version: `pip 22.3 from ... (python 3.9)`