Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lazier lazy_wheel #11481

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

lazier lazy_wheel #11481

wants to merge 9 commits into from

Conversation

dholth
Copy link
Member

@dholth dholth commented Oct 3, 2022

This version of lazy_wheel is returning to pip after a detour in https://pypi.org/project/conda-package-streaming/

It avoids the HEAD request. In conda-package-streaming we also explicitly prefetch the file we are interested in (by finding its bytes range from ZipInfo) and guarantee 2 or a maximum of 3 requests.

Reducing the # of requests makes a big difference when latency is high.

# The 416 response message contains a Content-Range indicating an
# unsatisfied range (that is a '*') followed by a '/' and the current
# length of the resource. E.g. Content-Range: bytes */12777
content_length = int(e.response.headers["content-range"].rsplit("/", 1)[-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want some error handling here (and other places where we read the header) in case the server sends something invalid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors here probably devolve to "range request not supported" and should trigger a non-lazy retry.

Comment on lines +100 to +101
# lowercase content-range to support s3
self._length = int(tail.headers["content-range"].partition("/")[-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought headers is a special case-insensitive mapping anyway, does lower-casing actually matter?

Copy link
Member Author

@dholth dholth Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used a requests -> boto3 adapter that is case sensitive, so for conda we can point this at s3 if wanted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is an S3 issue? I’d very much prefer this weird context to not get into pip. It’s OK we use lower-cased keys here (because it does not really matter for us), but the comment confuses things.

def __exit__(self, *exc: Any) -> None:
self._file.__exit__(*exc)
def __exit__(self, *exc: Any) -> bool | None:
print(self._request_count, "requests to fetch metadata from", self._url[107:])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray debugging code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. A test makes sure _request_count is <= the maximum of 3. It might be possible to construct a pathological wheel where the .dist-info files were all over, that would rather have > 3 requests...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which test are you referring to here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use log.debug() like we do elsewhere in this change?

headers = base_headers.copy()
headers["Range"] = f"bytes={start}-{end}"
log.debug("%s", headers["Range"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should make the log message more useful.

infolist = zf.infolist()
for info in infolist:
# should be (wheel filename without extension etc) + (.dist-info/)
if ".dist-info/" in info.filename:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there’s a better detection logic somewhere (probably in wheel installation code) for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pypa/wheel wheelfile.py it derives the .dist-info dirname from the parsed wheel filename. https://github.com/pypa/wheel/blob/main/src/wheel/wheelfile.py#L49, there probably is similar code in pip.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is similar code existing in the code base, we should extract the common logic.

@dholth
Copy link
Member Author

dholth commented Feb 9, 2023

Anybody try this?

@pfmoore
Copy link
Member

pfmoore commented Feb 9, 2023

Not in pip, but I do keep meaning to extract the code and try it in my metadata downlaoder utility. Too many jobs, too little time, though 😉

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label Mar 12, 2023
Copy link
Contributor

@cosmicexplorer cosmicexplorer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome!!

@cosmicexplorer
Copy link
Contributor

@dholth after fixing a typo and trying this at https://github.com/cosmicexplorer/pip/tree/lazier-wheel, I'm finding this error when trying this against pypi:

> python3.8 -m pip -vvv install --report test.json --dry-run --ignore-installed --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
Collecting keras==2.4.3
  Obtaining dependency information from keras 2.4.3
  bytes=-10240
  Looking up "https://files.pythonhosted.org/packages/44/e1/dc0757b20b56c980b5553c1b5c4c32d378c7055ab7bfa92006801ad359ab/Keras-2.4.3-py2.py3-none-any.whl" in the cache
  Request header has "no-cache", cache bypassed
  Starting new HTTPS connection (1): files.pythonhosted.org:443
  https://files.pythonhosted.org:443 "GET /packages/44/e1/dc0757b20b56c980b5553c1b5c4c32d378c7055ab7bfa92006801ad359ab/Keras-2.4.3-py2.py3-none-any.whl HTTP/1.1" 501 462
ERROR: Could not install packages due to an OSError.
Traceback (most recent call last):
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/commands/install.py", line 377, in run
    requirement_set = resolver.resolve(
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
    result = self._result = resolver.resolve(
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
    state = resolution.resolve(requirements, max_rounds=max_rounds)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_vendor/resolvelib/resolvers.py", line 397, in resolve
    self._add_to_criteria(self.state.criteria, r, parent=None)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
    if not criterion.candidates:
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_vendor/resolvelib/structs.py", line 156, in __bool__
    return bool(self._sequence)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
    return any(self)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
    return (c for c in iterator if id(c) not in self._incompatible_ids)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
    candidate = func()
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
    self._link_candidate_cache[link] = LinkCandidate(
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/candidates.py", line 293, in __init__
    super().__init__(
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/candidates.py", line 156, in __init__
    self.dist = self._prepare()
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/candidates.py", line 225, in _prepare
    dist = self._prepare_distribution()
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/resolution/resolvelib/candidates.py", line 304, in _prepare_distribution
    return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/operations/prepare.py", line 532, in prepare_linked_requirement
    metadata_dist = self._fetch_metadata_only(req)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/operations/prepare.py", line 385, in _fetch_metadata_only
    ) or self._fetch_metadata_using_lazy_wheel(req.link)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/operations/prepare.py", line 452, in _fetch_metadata_using_lazy_wheel
    return dist_from_wheel_url(name, url, self._session)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/network/lazy_wheel.py", line 36, in dist_from_wheel_url
    with LazyZipOverHTTP(url, session) as zf:
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/network/lazy_wheel.py", line 67, in __init__
    tail = self._stream_response(start="", end=CONTENT_CHUNK_SIZE)
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_internal/network/lazy_wheel.py", line 210, in _stream_response
    response.raise_for_status()
  File "/home/cosmicexplorer/tools/pip/dist/pip-23.3.dev0-py3-none-any.whl/pip/_vendor/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
pip._vendor.requests.exceptions.HTTPError: 501 Server Error: Unsupported client range for url: https://files.pythonhosted.org/packages/44/e1/dc0757b20b56c980b5553c1b5c4c32d378c7055ab7bfa92006801ad359ab/Keras-2.4.3-py2.py3-none-any.whl

Apparently files.pythonhosted.org does not support Range: -10240 without knowing the file's complete size:

> curl -L -H 'Range: bytes=-10240' 'https://files.pythonhosted.org/packages/44/e1/dc0757b20b56c980b5553c1b5c4c32d378c7055ab7bfa92006801ad359ab/Keras-2.4.3-py2.py3-none-any.whl'

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>501 Unsupported client range</title>
  </head>
  <body>
    <h1>Error 501 Unsupported client range</h1>
    <p>Unsupported client range</p>
    <h3>Error 54113</h3>
    <p>Details: cache-ewr18138-EWR 1691348077 3195631842</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

Is Range: bytes=-N also a feature specific to S3, or is that an HTTP standard that Varnish should be respecting?

@dholth
Copy link
Member Author

dholth commented Aug 6, 2023 via email

@cosmicexplorer
Copy link
Contributor

@dholth It looks like prefetch_dist_info() does most of the work of reducing the number of separate requests made to extract the metadata info here; would you accept a version which still employs an initial HEAD request, or am I missing something that allowed this to work against pypi in #8670 (comment)?

@cosmicexplorer
Copy link
Contributor

Negative end-of-file ranges are standard http. Real shame if proxy rejects
them.

Ah, thanks so much for clarifying this; that is indeed a shame, although for pypi itself we now have PEP 658 metadata being generated at <url>.metadata for most wheels (not the keras one I just tested against, though). However, fast-deps and this change are still necessary for --find-links repos and indices that don't extract and serve metadata like pypi does.

@dholth
Copy link
Member Author

dholth commented Aug 6, 2023

Pypi used to support http range from end of file. IIUC the two requests minimum "with head request" defeats the speedup for high latency but my own Internet is low latency.

Maybe after the redirect, fastly can accept negative range?

Full range request can even support multiple ranges in one request! This is too fancy for me. S3 supports a maximum of one range per request.

@cosmicexplorer
Copy link
Contributor

Thanks so much for that context! The merge conflicts on this PR are extremely trivial, so I'm going to try adding a fallback to HEAD for when negative range requests aren't supported, and expand test cases to cover both with and without negative byte ranges (probably making use of the mock server for testing metadata downloads added in #12197). I'll post here when I've done that.

@cosmicexplorer
Copy link
Contributor

Ok, I just created #12208. I'm pretty confident I'll have the bandwidth to help get this in soon (I did the same e.g. with #12186), but if I don't succeed in getting it merged soon and you find time yourself, feel free to take it back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs rebase or merge PR has conflicts with current master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants