persistent cache for link parsing and interpreter compatibility #12258

Draft · wants to merge 7 commits into main
Conversation


cosmicexplorer (Contributor) commented Sep 3, 2023

Closes #12184.

This change is built on top of #12257; see the +264/-116 diff against it at https://github.com/cosmicexplorer/pip/compare/link-parsing-cache...cosmicexplorer:pip:interpreter-compatibility-cache?expand=1.

Background: Caching Link Parsing In-Memory

In #7729, we cached the result of parsing an index for Links in memory, to avoid redoing it more than once per run. Since that result changes as new versions of packages are uploaded to PyPI, it didn't seem feasible to cache it persistently the way we can with #12256.

Proposal: Cache Link Parsing, then Interpreter Compatibility

See #12184 for an investigation into caching strategies.

If (as per #12257) we recognize that we have a cached HTTP response (meaning no new versions of a package have been uploaded since we last checked), we can then layer two more levels of caching on top (a minimal sketch follows this list):

  1. We can cache the result of parsing the HTML and scanning it for valid Links, writing it to a JSON file in the cache directory.
  2. After we scan all the resulting Links to find the ones that match the current interpreter/platform, we can write that result to a cache file as well (we have to ensure both levels are invalidated when the HTTP response changes).
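
A minimal sketch of this two-level scheme in Python (hypothetical helper names, not pip's actual API): both cache files live in a directory keyed by the index URL and a checksum of the cached HTTP response body, so a changed response automatically invalidates the parsed-links entry and any evaluation entries together.

import bz2
import hashlib
import json
import os
from typing import List, Optional

CACHE_DIR = os.path.expanduser("~/.cache/pip/fetch-resolve")

def _entry_dir(index_url: str, page_checksum: str) -> str:
    # One directory per (index page, response body) pair: when a new
    # version is uploaded, the response checksum changes, so both cache
    # levels stored under this directory are invalidated together.
    digest = hashlib.sha256(
        "\x00".join([index_url, page_checksum]).encode()
    ).hexdigest()
    return os.path.join(CACHE_DIR, digest)

def write_cached(index_url: str, page_checksum: str,
                 name: str, records: List[dict]) -> None:
    # bz2-compressed JSON, matching the on-disk format inspected below.
    path = _entry_dir(index_url, page_checksum)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, name), "wb") as f:
        f.write(bz2.compress(json.dumps(records).encode()))

def read_cached(index_url: str, page_checksum: str,
                name: str) -> Optional[List[dict]]:
    try:
        with open(os.path.join(_entry_dir(index_url, page_checksum), name), "rb") as f:
            return json.loads(bz2.decompress(f.read()))
    except OSError:
        return None  # cache miss: re-parse the HTML or re-evaluate links

Level 1 would store its records under a name like parsed-links, and level 2 under a *.evaluation name, matching the file names inspected below.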

Result: Some Performance Improvement

This all turns out to be possible, but with some caveats:

  1. It takes up more filesystem space (2.5MB) than cache metadata lookups for sdists and lazy wheels #12256 or send HTTP caching headers for index pages to further reduce bandwidth usage #12257 (which, for comparison, wrote only an additional 2KB and 60KB to disk, respectively, after the same large resolve); a Python equivalent of the shell inspection below appears after this list:
> du -b -s ~/.cache/pip/fetch-resolve/
2536640 /home/cosmicexplorer/.cache/pip/fetch-resolve/
> find ~/.cache/pip/fetch-resolve/ -name parsed-links | parallel bzcat | jq '.[]'
...
{
  "url": "https://files.pythonhosted.org/packages/59/c0/85c204545c4b32f9f4e6b5e49faf4311d0610815aa98aa0cbb5c67f3ea85/keras_nightly-2.5.0.dev2021021400-py2.py3-none-any.whl",
  "comes_from": "https://pypi.org/simple/keras-nightly/",
  "requires_python": null,
  "yanked_reason": null,
  "metadata_file_data": null,
  "hashes": {
    "sha256": "a46f37cd4486d8d0802f91230611bc1ffe1030762cdb8461c8cd807f4d67f20c"
  }
}
> find ~/.cache/pip/fetch-resolve/ -name '*.evaluation' | parallel bzcat | jq '.[]'
...
{
  "name": "astunparse",
  "version": "1.6.3",
  "link": {
    "url": "https://files.pythonhosted.org/packages/f3/af/4182184d3c338792894f34a62672919db7ca008c89abee9b564dd34d8029/astunparse-1.6.3.tar.gz",
    "comes_from": "https://pypi.org/simple/astunparse/",
    "requires_python": null,
    "yanked_reason": null,
    "metadata_file_data": null,
    "hashes": {
      "sha256": "5ad93a8456f0d084c3456d059fd9a92cce667963232cbf763eac3bc5b7940872"
    }
  }
}
  2. It has only a mild performance benefit, converting a 5.9-second resolve into a 3.9-second one (comparing send HTTP caching headers for index pages to further reduce bandwidth usage #12257 against this PR):
    • This is technically a 1.5x speedup, but it doesn't scale with the size of the resolve: it will only ever shave about 2 seconds off the runtime.
# check out #12257
> git checkout link-parsing-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m5.940s
user    0m4.323s
sys     0m0.131s
> git checkout interpreter-compatibility-cache # apply the change in this PR
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m3.937s
user    0m2.774s
sys     0m0.169s
  3. The implementation is a little complex.
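
For reference, here is a Python equivalent of the find | parallel bzcat | jq inspection above, assuming only what the output shows (each parsed-links and *.evaluation file is a bz2-compressed JSON array):

import bz2
import json
import os

root = os.path.expanduser("~/.cache/pip/fetch-resolve")
for dirpath, _dirnames, filenames in os.walk(root):
    for name in filenames:
        if name == "parsed-links" or name.endswith(".evaluation"):
            # Each cache file is a bz2-compressed JSON array of records.
            with open(os.path.join(dirpath, name), "rb") as f:
                for record in json.loads(bz2.decompress(f.read())):
                    print(json.dumps(record, indent=2))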

Discussion

I've made this a draft because I'd like to gather opinions on whether this would be worth merging.

When performing `install --dry-run`, if PEP 658 .metadata files are available to guide the resolve, do not download the associated wheels.

Instead, use the distribution information directly from the .metadata files when reporting the results on the CLI and in the --report file.

- describe the new --dry-run behavior
- finalize linked requirements immediately after resolve
- introduce is_concrete
- funnel InstalledDistribution through _get_prepared_distribution() too
- add test for new install --dry-run functionality (no downloading)
- catch an exception when parsing metadata which only occurs in CI
- handle --no-cache-dir
- call os.makedirs() before writing to cache too
- catch InvalidSchema when attempting git urls with BatchDownloader
- fix other test failures
- reuse should_cache(req) logic
- gzip compress link metadata for a slight reduction in disk space
- only cache built sdists
- don't check should_cache() when fetching
- cache lazy wheel dists
- add news
- turn debug logs in fetching from cache into exceptions
- use scandir over listdir when searching normal wheel cache
- handle metadata email parsing errors
- correctly handle mutable cached requirement
- use bz2 over gzip for an extremely slight improvement in disk usage
- pipe in headers arg
- provide full context in Link.comes_from
- pull in etag and date and cache the outputs
- handle --no-cache-dir
- add NEWS
- remove quotes from etag and use binary checksum to save a few bytes
- parse http modified date to compress the cached representation
Successfully merging this pull request may close: memoizing dependency lookups with a local resolve index (#12184)