Populate find_all_candidates cache from threadpool #10480
base: main
Conversation
(force-pushed from 2d57052 to ea48612)
The general idea looks good to me, but IIRC we can't use …
@McSinyx emerges from the ground. Yup, it's not portable because some exotic platforms don't have proper semaphore support. There's utils.parallel wrapping …
(force-pushed from 98d4cbf to 24f76da)
While this looks like it should work, the implementation has some code smells that I feel should be improved. This includes the (somewhat weird) `pass` blocks, accessing a private attribute on `factory`, and relying on `find_all_candidates` being LRU-cached. Refactoring is needed.
Put the finder into a public attribute to avoid accessing a private attribute. What do you see as the options for not relying on the caching behavior of `find_all_candidates`? Re: the `pass`-es, I think we need to consume the `imap` iterable since it's lazily generated, but there's nothing to be done with the result of the function.
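(For illustration of the consume-the-iterator point: a minimal, self-contained sketch, not pip's actual code. The `warm_cache` function, the project-name list, and the choice of `multiprocessing.dummy.Pool` are all stand-ins for the real finder call and pip's parallel helpers.)

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API

def warm_cache(name: str) -> None:
    # Hypothetical stand-in for finder.find_all_candidates(name); only the
    # side effect of populating a cache matters, not the return value.
    print(f"fetching candidate pages for {name}")

names = ["requests", "urllib3", "idna"]
with Pool(4) as pool:
    # Drain the imap iterator so every call has finished before we move on;
    # the results themselves are discarded.
    for _ in pool.imap(warm_cache, names):
        pass
```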
One simple solution would be to implement a cache layer on the factory (e.g. a …).
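(One possible shape for such a factory-level cache, sketched under assumptions: `CandidateFactory`, `find_candidates`, and `_candidates_cache` are made-up names, not pip's real internals.)

```python
from typing import Callable, Dict, List

class CandidateFactory:
    """Hypothetical factory that owns its own candidate cache instead of
    relying on the finder's LRU cache."""

    def __init__(self, find_all_candidates: Callable[[str], List[str]]) -> None:
        self._find_all_candidates = find_all_candidates
        self._candidates_cache: Dict[str, List[str]] = {}

    def find_candidates(self, project_name: str) -> List[str]:
        # Consult the factory's own cache first; fall back to the finder.
        if project_name not in self._candidates_cache:
            self._candidates_cache[project_name] = self._find_all_candidates(project_name)
        return self._candidates_cache[project_name]
```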
(force-pushed from 3b5ac13 to 8ee1589)
I'm sorry, but I still don't think I understand what you're targeting. I think in order for this to be of use we need to parallelize when we have the list of projects available, rather than at the point when `find_all_candidates` is called on a per-project basis. This approach is only possible because we're exploiting the fact that `find_all_candidates` is:
I figured that since pip owns the implementation of `find_all_candidates`, it would be OK to rely on `find_all_candidates` being cached? I'd appreciate it if you could give this another quick look-over and let me know in which direction you'd like to see it go. Thanks.
Not ready for merge, but not sure if draft PRs are just hidden from the review queue.
What I'm trying to say is we should implement a separate cache layer in the resolver, instead of relying on the cache layer in the finder. Like how we're doing a separate caching layer for Requirement objects instead of relying on packaging's caching (which it does not have, but that's the point: pip doesn't need to know whether packaging has a caching layer, and the resolver does not need to know about the finder's cache layer either).
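(For comparison, the caller-side caching pattern being referenced looks roughly like this; `ResolverState`, `make_requirement`, and `_requirement_cache` are illustrative names, not pip's actual internals.)

```python
from typing import Dict

from packaging.requirements import Requirement

class ResolverState:
    """Illustrative only: the caller keeps its own cache rather than
    assuming the library underneath caches anything."""

    def __init__(self) -> None:
        self._requirement_cache: Dict[str, Requirement] = {}

    def make_requirement(self, req_string: str) -> Requirement:
        # Parse each requirement string at most once, regardless of whether
        # packaging itself does any caching.
        if req_string not in self._requirement_cache:
            self._requirement_cache[req_string] = Requirement(req_string)
        return self._requirement_cache[req_string]
```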
I think the way in which this is different is that the resolver never calls …
Hmm, good point. Alright, let's do this then. We'll first need to resolve the conflicts, and could you investigate how difficult it would be to add a test for the caching behaviour (e.g. mocking out some internals of …)?
(force-pushed from 8ee1589 to f5a70cc)
How about adding a test to the finder which demonstrates that calling …?
How difficult would it be to initiate the call from the resolver? Because there's no real guarantee the resolver will always call …
You mean for the tests? Or do you want the `prime_cache` method to move?
Sorry, I meant the test, responding to your comment before that.
```python
def test_resolver_cache_population(resolver: Resolver) -> None:
    resolver._finder.find_all_candidates.cache_clear()
```
I feel we should add this to the `resolver` fixture to keep tests deterministic. Also, the cleanup should probably happen after the test so nothing is left behind.
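(A sketch of what that suggestion could look like, assuming a pytest fixture named `resolver` whose finder exposes an lru_cache-decorated `find_all_candidates`, as this PR relies on; `make_test_resolver` is a hypothetical helper fixture, not an existing one.)

```python
import pytest

@pytest.fixture
def resolver(make_test_resolver):  # make_test_resolver is a hypothetical helper fixture
    resolver = make_test_resolver()
    # Start from a clean cache so each test is deterministic ...
    resolver._finder.find_all_candidates.cache_clear()
    yield resolver
    # ... and clear it again afterwards so nothing leaks into other tests.
    resolver._finder.find_all_candidates.cache_clear()
```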
(force-pushed from 366b5d1 to b34933d)
Looks like this is the failure: …
Looks like a Sphinx bug, let's not worry about that here. sphinx-doc/sphinx#9512
Is there any chance of this getting merged / included in a release soon? It would be highly appreciated 😉 (while I was just waiting another 3 minutes for pip to re-collect wheels that were all already in the cache...)
We probably want to double-check that our networking stack is thread-safe.
@pradyunsg Going to ping @nateprewitt to confirm the thread-safety properties of … We have an in-progress PR which defers the closing of connections at the HTTPConnectionPool level until the connection pool is no longer referenced, instead of during PoolManager eviction, which would solve this issue.
Amazing, thank you for the update/explanation.
@sethmlarson yeah, the pool manager should be Requests' only contention point for …
So is this good to go now? Ping @pradyunsg in case there are still concerns.
I think the options are:
I think 2 is more invasive, less safe, and will eventually be unnecessary, so I am in favor of waiting it out (as an outsider it seems pretty close). But if you feel strongly that we should go with 2, let me know.
From the look of things it seems a 2.0.0 release is relatively close (please correct me otherwise), so waiting for that sounds like the better choice.
urllib3 2.0.1 is available! https://github.com/urllib3/urllib3/releases/tag/2.0.1
(force-pushed from 10ca6a9 to f1922cc)
Requests released 2.30.0 with urllib3 2.x support last week. I'll do the vendor update to unblock this.
I think now that pip has vendored urllib3 1.26.16, which contained a backport of a fix for a thread-safety issue, this is unblocked?
Just to note, this will have to wait until after 23.2 is released - I don't want something this significant added to 23.2 at the last minute. I'm assuming that someone still needs to verify that the urllib3 fix does actually fix the problem that affected this PR before it can be merged, anyway?
Absolutely, just wanted to try to push this out of limbo state.
I don't think there was ever an observed issue with this PR. We could try to produce one pre-urllib3 1.26.16 and then verify that it doesn't happen with the updated version of urllib3. The difficulty would be that:
I realize that's a pretty unsatisfactory answer, so if anyone has any good ideas for other testing they'd like to see done (or, even better, another test that could be added), let me know.
No worries, I just didn't want my comment on the release schedule to imply I had much of a clue about the status of this PR :-) I'm happy to leave any further review to @uranusjr and/or @pradyunsg, who have been following this more closely than I have.
(force-pushed from 4e31045 to 6ddecdf)
Could this go into 23.3? Are there any other updates/checks needed?
The PR description mentions there's a potential trade-off, albeit one that at the time was still considered worth making:
Is it possible that the performance win will be smaller now that (a) …? If so, would it be worth waiting until #12256 and #12257 land, and then re-benchmarking with those + …
Fetching pages from PyPI to determine which versions are available is the rate-limiting step of package collection. There's a bit of a trade-off here in that, by pre-populating the `find_all_candidates` cache in full before doing conflict resolution, there's a chance that more work is done, since all pages will be fetched even if there is a conflict between the first two packages. I think this still may make sense, though, as the wall-clock time of collecting packages decreases significantly, and it's nice that the order in which packages are processed is unchanged and that part still effectively takes place in series.
Time spent on package collection decreases from ~40s to ~10s on the sample case from #10467.
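(A rough picture of the approach described above, not the PR's exact code: the `prime_cache` name, the use of `concurrent.futures.ThreadPoolExecutor`, and the worker count are all illustrative assumptions.)

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

def prime_cache(
    find_all_candidates: Callable[[str], List[str]],
    project_names: Iterable[str],
    max_workers: int = 10,
) -> None:
    """Fetch index pages for every project up front so that later, sequential
    calls to find_all_candidates() hit its cache instead of the network."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces all fetches to finish before resolution starts;
        # the results themselves are thrown away, only the warmed cache matters.
        list(executor.map(find_all_candidates, project_names))
```

The trade-off quoted above is visible here: every project's pages are fetched eagerly, even if resolution would have stopped early on a conflict, in exchange for the fetches overlapping in time.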