persistent cache for link parsing and interpreter compatibility #12258

Draft · wants to merge 7 commits into main
Conversation


cosmicexplorer (Contributor) commented Sep 3, 2023

Closes #12184.

This change is built on top of #12257; see the +264/-116 diff against it at https://github.com/cosmicexplorer/pip/compare/link-parsing-cache...cosmicexplorer:pip:interpreter-compatibility-cache?expand=1.

Background: Caching Link Parsing In-Memory

In #7729, we cached the result of parsing an index for Links in memory, to avoid redoing it more than once per run. Since that result changes as new versions of packages are uploaded to PyPI, it didn't seem feasible to cache it persistently the way we can with #12256.

Proposal: Cache Link Parsing, then Interpreter Compatibility

See #12184 for an investigation into caching strategies.

If (as per #12257) we recognize that we have a cached HTTP response (meaning no new versions of a package have been uploaded since we last checked), we can then layer two more levels of caching on top (a minimal sketch follows this list):

  1. We can cache the result of parsing the HTML and scanning it for valid Links, writing it to a JSON file in the cache directory.
  2. After we scan all the resulting Links to find the ones that match the current interpreter/platform, we can write that result to a cache file as well (we have to ensure both levels are invalidated when the HTTP response changes).
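
A minimal sketch of this two-level scheme in Python (hypothetical helper names, not pip's actual API): both cache files live in a directory keyed by the index URL and a checksum of the cached HTTP response body, so a changed response automatically invalidates the parsed-links entry and any evaluation entries together.

import bz2
import hashlib
import json
import os
from typing import List, Optional

CACHE_DIR = os.path.expanduser("~/.cache/pip/fetch-resolve")

def _entry_dir(index_url: str, page_checksum: str) -> str:
    # One directory per (index page, response body) pair: when a new
    # version is uploaded, the response checksum changes, so both cache
    # levels stored under this directory are invalidated together.
    digest = hashlib.sha256(
        "\x00".join([index_url, page_checksum]).encode()
    ).hexdigest()
    return os.path.join(CACHE_DIR, digest)

def write_cached(index_url: str, page_checksum: str,
                 name: str, records: List[dict]) -> None:
    # bz2-compressed JSON, matching the on-disk format inspected below.
    path = _entry_dir(index_url, page_checksum)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, name), "wb") as f:
        f.write(bz2.compress(json.dumps(records).encode()))

def read_cached(index_url: str, page_checksum: str,
                name: str) -> Optional[List[dict]]:
    try:
        with open(os.path.join(_entry_dir(index_url, page_checksum), name), "rb") as f:
            return json.loads(bz2.decompress(f.read()))
    except OSError:
        return None  # cache miss: re-parse the HTML or re-evaluate links

Level 1 would store its records under a name like parsed-links, and level 2 under a *.evaluation name, matching the file names inspected below.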

Result: Some Performance Improvement

This all turns out to be possible, but with some caveats:

  1. It takes up more filesystem space (2.5MB) than cache metadata lookups for sdists and lazy wheels #12256 or send HTTP caching headers for index pages to further reduce bandwidth usage #12257 (which, for comparison, wrote only an additional 2KB and 60KB to disk, respectively, after the same large resolve); a Python equivalent of the shell inspection below appears after this list:
> du -b -s ~/.cache/pip/fetch-resolve/
2536640 /home/cosmicexplorer/.cache/pip/fetch-resolve/
> find ~/.cache/pip/fetch-resolve/ -name parsed-links | parallel bzcat | jq '.[]'
...
{
  "url": "https://files.pythonhosted.org/packages/59/c0/85c204545c4b32f9f4e6b5e49faf4311d0610815aa98aa0cbb5c67f3ea85/keras_nightly-2.5.0.dev2021021400-py2.py3-none-any.whl",
  "comes_from": "https://pypi.org/simple/keras-nightly/",
  "requires_python": null,
  "yanked_reason": null,
  "metadata_file_data": null,
  "hashes": {
    "sha256": "a46f37cd4486d8d0802f91230611bc1ffe1030762cdb8461c8cd807f4d67f20c"
  }
}
> find ~/.cache/pip/fetch-resolve/ -name '*.evaluation' | parallel bzcat | jq '.[]'
...
{
  "name": "astunparse",
  "version": "1.6.3",
  "link": {
    "url": "https://files.pythonhosted.org/packages/f3/af/4182184d3c338792894f34a62672919db7ca008c89abee9b564dd34d8029/astunparse-1.6.3.tar.gz",
    "comes_from": "https://pypi.org/simple/astunparse/",
    "requires_python": null,
    "yanked_reason": null,
    "metadata_file_data": null,
    "hashes": {
      "sha256": "5ad93a8456f0d084c3456d059fd9a92cce667963232cbf763eac3bc5b7940872"
    }
  }
}
  2. It has only a mild performance benefit, converting a 5.9-second resolve into a 3.9-second one (comparing send HTTP caching headers for index pages to further reduce bandwidth usage #12257 against this PR):
    • This is technically a 1.5x speedup, but it doesn't scale with the size of the resolve: it will only ever shave about 2 seconds off the runtime.
# check out #12257
> git checkout link-parsing-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m5.940s
user    0m4.323s
sys     0m0.131s
> git checkout interpreter-compatibility-cache # apply the change in this PR
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m3.937s
user    0m2.774s
sys     0m0.169s
  3. The implementation is a little complex.
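
For reference, here is a Python equivalent of the find | parallel bzcat | jq inspection above, assuming only what the output shows (each parsed-links and *.evaluation file is a bz2-compressed JSON array):

import bz2
import json
import os

root = os.path.expanduser("~/.cache/pip/fetch-resolve")
for dirpath, _dirnames, filenames in os.walk(root):
    for name in filenames:
        if name == "parsed-links" or name.endswith(".evaluation"):
            # Each cache file is a bz2-compressed JSON array of records.
            with open(os.path.join(dirpath, name), "rb") as f:
                for record in json.loads(bz2.decompress(f.read())):
                    print(json.dumps(record, indent=2))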

Discussion

I've made this a draft because I'd like to gather opinions on whether this would be worth merging.

When performing `install --dry-run`, if PEP 658 .metadata files are available to guide the resolve, do not download the associated wheels.

Instead, use the distribution information directly from the .metadata files when reporting the results on the CLI and in the --report file.

- describe the new --dry-run behavior
- finalize linked requirements immediately after resolve
- introduce is_concrete
- funnel InstalledDistribution through _get_prepared_distribution() too
- add test for new install --dry-run functionality (no downloading)
- catch an exception when parsing metadata which only occurs in CI
- handle --no-cache-dir
- call os.makedirs() before writing to cache too
- catch InvalidSchema when attempting git urls with BatchDownloader
- fix other test failures
- reuse should_cache(req) logic
- gzip compress link metadata for a slight reduction in disk space
- only cache built sdists
- don't check should_cache() when fetching
- cache lazy wheel dists
- add news
- turn debug logs in fetching from cache into exceptions
- use scandir over listdir when searching normal wheel cache
- handle metadata email parsing errors
- correctly handle mutable cached requirement
- use bz2 over gzip for an extremely slight improvement in disk usage
- pipe in headers arg
- provide full context in Link.comes_from
- pull in etag and date and cache the outputs
- handle --no-cache-dir
- add NEWS
- remove quotes from etag and use binary checksum to save a few bytes
- parse http modified date to compress the cached representation
Successfully merging this pull request may close: memoizing dependency lookups with a local resolve index (#12184)