Populate find_all_candidates cache from threadpool #10480
base: main
Conversation
(force-pushed from 2d57052 to ea48612)
The general idea looks good to me, but IIRC we can't use …
@McSinyx emerges from the ground. Yup, it's not portable because some exotic platforms don't have proper semaphore support. There's utils.parallel wrapping …
(force-pushed from 98d4cbf to 24f76da)
While this looks like it should work, the implementation has some code smells that I feel should be improved. This includes the (somewhat weird) `pass` blocks, accessing a private attribute on `factory`, and relying on `find_all_candidates` being LRU-cached. Refactoring is needed.
Put the finder into a public attribute to avoid accessing a private attribute. What do you see as the options for not relying on the caching behavior of `find_all_candidates`? Re: the `pass`-es, I think we need to consume the `imap` iterable since it's lazily generated, but there's nothing to be done with the result of the function.
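(For illustration of the consume-the-iterator point: a minimal, self-contained sketch, not pip's actual code. The `warm_cache` function, the project-name list, and the choice of `multiprocessing.dummy.Pool` are all stand-ins for the real finder call and pip's parallel helpers.)

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API

def warm_cache(name: str) -> None:
    # Hypothetical stand-in for finder.find_all_candidates(name); only the
    # side effect of populating a cache matters, not the return value.
    print(f"fetching candidate pages for {name}")

names = ["requests", "urllib3", "idna"]
with Pool(4) as pool:
    # Drain the imap iterator so every call has finished before we move on;
    # the results themselves are discarded.
    for _ in pool.imap(warm_cache, names):
        pass
```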
One simple solution would be to implement a cache layer on the factory (e.g. a …).
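(One possible shape for such a factory-level cache, sketched under assumptions: `CandidateFactory`, `find_candidates`, and `_candidates_cache` are made-up names, not pip's real internals.)

```python
from typing import Callable, Dict, List

class CandidateFactory:
    """Hypothetical factory that owns its own candidate cache instead of
    relying on the finder's LRU cache."""

    def __init__(self, find_all_candidates: Callable[[str], List[str]]) -> None:
        self._find_all_candidates = find_all_candidates
        self._candidates_cache: Dict[str, List[str]] = {}

    def find_candidates(self, project_name: str) -> List[str]:
        # Consult the factory's own cache first; fall back to the finder.
        if project_name not in self._candidates_cache:
            self._candidates_cache[project_name] = self._find_all_candidates(project_name)
        return self._candidates_cache[project_name]
```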
(force-pushed from 3b5ac13 to 8ee1589)
I'm sorry, but I still don't think I understand what you're targeting. I think in order for this to be of use we need to parallelize when we have the list of projects available, rather than at the point when `find_all_candidates` is called on a per-project basis. This approach is only possible because we're exploiting the fact that `find_all_candidates` is:
I figured that since pip owns the implementation of `find_all_candidates`, it would be OK to rely on `find_all_candidates` being cached? I'd appreciate it if you could give this another quick look-over and let me know in which direction you'd like to see it go. Thanks.
Not ready for merge, but not sure if draft PRs are just hidden from the review queue.
What I'm trying to say is we should implement a separate cache layer in the resolver, instead of relying on the cache layer in the finder. Like how we're doing a separate caching layer for Requirement objects instead of relying on packaging's caching (which it does not have, but that's the point: pip doesn't need to know whether packaging has a caching layer, and the resolver does not need to know about the finder's cache layer either).
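(For comparison, the caller-side caching pattern being referenced looks roughly like this; `ResolverState`, `make_requirement`, and `_requirement_cache` are illustrative names, not pip's actual internals.)

```python
from typing import Dict

from packaging.requirements import Requirement

class ResolverState:
    """Illustrative only: the caller keeps its own cache rather than
    assuming the library underneath caches anything."""

    def __init__(self) -> None:
        self._requirement_cache: Dict[str, Requirement] = {}

    def make_requirement(self, req_string: str) -> Requirement:
        # Parse each requirement string at most once, regardless of whether
        # packaging itself does any caching.
        if req_string not in self._requirement_cache:
            self._requirement_cache[req_string] = Requirement(req_string)
        return self._requirement_cache[req_string]
```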
I think the way in which this is different is that the resolver never calls …
Hmm, good point. Alright, let's do this then. We'll first need to resolve the conflicts, and could you investigate how difficult it would be to add a test for the caching behaviour (e.g. mocking out some internals of …)?
(force-pushed from 8ee1589 to f5a70cc)
How about adding a test to the finder which demonstrates that calling …?
How difficult would it be to initiate the call from the resolver? Because there's no real guarantee the resolver will always call …
You mean for the tests? Or do you want the `prime_cache` method to move?
Sorry, I meant the test, responding to your comment before that.
```python
def test_resolver_cache_population(resolver: Resolver) -> None:
    resolver._finder.find_all_candidates.cache_clear()
```
I feel we should add this to the `resolver` fixture to keep tests deterministic. Also, the cleanup should probably happen after the test so nothing is left behind.
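(A sketch of what that suggestion could look like, assuming a pytest fixture named `resolver` whose finder exposes an lru_cache-decorated `find_all_candidates`, as this PR relies on; `make_test_resolver` is a hypothetical helper fixture, not an existing one.)

```python
import pytest

@pytest.fixture
def resolver(make_test_resolver):  # make_test_resolver is a hypothetical helper fixture
    resolver = make_test_resolver()
    # Start from a clean cache so each test is deterministic ...
    resolver._finder.find_all_candidates.cache_clear()
    yield resolver
    # ... and clear it again afterwards so nothing leaks into other tests.
    resolver._finder.find_all_candidates.cache_clear()
```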
(force-pushed from 366b5d1 to b34933d)
Looks like this is the failure: …
Looks like a Sphinx bug, let's not worry about that here. sphinx-doc/sphinx#9512
Is there any chance of this getting merged / included in a release soon? It would be highly appreciated 😉 (while I was just waiting another 3 minutes for pip to re-collect wheels that were all already in the cache...)
We probably want to double-check that our networking stack is thread-safe.
@pradyunsg Going to ping @nateprewitt to confirm the thread-safety properties of … We have an in-progress PR which defers the closing of connections at the HTTPConnectionPool level until the connection pool is no longer referenced, instead of during PoolManager eviction, which would solve this issue.
Amazing, thank you for the update/explanation.
@sethmlarson yeah, the pool manager should be Requests' only contention point for …
So is this good to go now? Ping @pradyunsg in case there are still concerns.
I think the options are:
I think 2 is more invasive, less safe, and will eventually be unnecessary, so I am in favor of waiting it out (as an outsider it seems pretty close). But if you feel strongly that we should go with 2, let me know.
From the look of things it seems a 2.0.0 release is relatively close (please correct me otherwise), so waiting for that sounds like the better choice.
urllib3 2.0.1 is available! https://github.com/urllib3/urllib3/releases/tag/2.0.1
(force-pushed from 10ca6a9 to f1922cc)
Requests released 2.30.0 with urllib3 2.x support last week. I'll do the vendor update to unblock this.
I think now that pip has vendored urllib3 1.26.16, which contained a backport of a fix for a thread-safety issue, this is unblocked?
Just to note, this will have to wait until after 23.2 is released - I don't want something this significant added to 23.2 at the last minute. I'm assuming that someone still needs to verify that the urllib3 fix does actually fix the problem that affected this PR before it can be merged, anyway?
Absolutely, just wanted to try to push this out of limbo state.
I don't think there was ever an observed issue with this PR. We could try to produce one pre-urllib3 1.26.16 and then verify that it doesn't happen with the updated version of urllib3. The difficulty would be that:
I realize that's a pretty unsatisfactory answer, so if anyone has any good ideas for other testing they'd like to see done (or, even better, another test that could be added), let me know.
No worries, I just didn't want my comment on the release schedule to imply I had much of a clue about the status of this PR :-) I'm happy to leave any further review to @uranusjr and/or @pradyunsg, who have been following this more closely than I have.
(force-pushed from 4e31045 to 6ddecdf)
Could this go into 23.3? Are there any other updates/checks needed?
The PR description mentions there's a potential trade-off, albeit one that at the time was still considered worth making:
Is it possible that the performance win will be smaller now that (a) …? If so, would it be worth waiting until #12256 and #12257 land, and then re-benchmarking with those + …
Fetching pages from PyPI to determine which versions are available is the rate-limiting step of package collection. There's a bit of a trade-off here in that, by pre-populating the `find_all_candidates` cache in full before doing conflict resolution, there's a chance that more work is done, since all pages will be fetched even if there is a conflict between the first two packages. I think this still may make sense, though, as the wall-clock time of collecting packages decreases significantly, and it's nice that the order in which packages are processed is unchanged and that part still effectively takes place in series.
Time spent on package collection decreases from ~40s to ~10s on the sample case from #10467.
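(A rough picture of the approach described above, not the PR's exact code: the `prime_cache` name, the use of `concurrent.futures.ThreadPoolExecutor`, and the worker count are all illustrative assumptions.)

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

def prime_cache(
    find_all_candidates: Callable[[str], List[str]],
    project_names: Iterable[str],
    max_workers: int = 10,
) -> None:
    """Fetch index pages for every project up front so that later, sequential
    calls to find_all_candidates() hit its cache instead of the network."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces all fetches to finish before resolution starts;
        # the results themselves are thrown away, only the warmed cache matters.
        list(executor.map(find_all_candidates, project_names))
```

The trade-off quoted above is visible here: every project's pages are fetched eagerly, even if resolution would have stopped early on a conflict, in exchange for the fetches overlapping in time.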