Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Negative Offset Range Requests on files.pythonhosted.org #12823

Closed
EmmaJaneBonestell opened this issue Jan 12, 2023 · 15 comments
Closed
Labels
CDN/network Issues related to our CDN, users having problems connecting to PyPI

Comments

@EmmaJaneBonestell
Copy link

EmmaJaneBonestell commented Jan 12, 2023

Describe the bug
While absolute range headers (e.g. bytes=0-64) work properly, negative values in a range header return a 501 not implemented error (with a nonstandard message stating "Unsupported client range", even though range requests are otherwise supported.)

Expected behavior
Negative values are supposed to return a content-range offset from the end of the fille to the end. I.e. bytes=-64 should give content-range={content-length - 64}-{content-length}.

To Reproduce
Any range request may illustrate this. Here's a curl command.

curl -i -H "Range: bytes=-64" "https://files.pythonhosted.org/packages/09/bd/2410905c76ee14c62baf69e3f4aa780226c1bbfc9485731ad018e35b0cb5/pip-22.3.1-py3-none-any.whl"

Returns:

HTTP/2 501 
...
content-type: text/html; charset=utf-8
accept-ranges: bytes
...
content-length: 470


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>501 Unsupported client range</title>
  </head>
  <body>
    <h1>Error 501 Unsupported client range</h1>
    <p>Unsupported client range</p>
    <h3>Error 54113</h3>
    <p>Details: (removed for privacy) </p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

Should return:

one-any.whl"
HTTP/2 206 
...
x-goog-stored-content-encoding: identity
x-goog-stored-content-length: 2051534
content-type: application/octet-stream
accept-ranges: bytes
...
access-control-allow-methods: GET, OPTIONS
access-control-allow-headers: Range
access-control-allow-origin: *
x-pypi-file-python-version: py3
x-pypi-file-version: 22.3.1
x-pypi-file-package-type: bdist_wheel
x-pypi-file-project: pip
content-range: bytes 2051470-2051533/2051534
content-length: 64

Additional context
If this is explicitly not implemented for some reason, please consider returning a 405 (Method Not Allowed) response instead, since 501 implies the request is not recognized.
501 is appropriate. Ignore me on that (:

@EmmaJaneBonestell EmmaJaneBonestell added bug 🐛 requires triaging maintainers need to do initial inspection of issue labels Jan 12, 2023
@ewdurbin ewdurbin added CDN/network Issues related to our CDN, users having problems connecting to PyPI and removed requires triaging maintainers need to do initial inspection of issue bug 🐛 labels Jan 13, 2023
@ewdurbin
Copy link
Member

HTTP 501 indicates "Not Implemented" so that seems semantically correct. Range support is implemented by our CDN. I don't think we'd consider this a bug per se.

I'm not sure how much (if any) control we have over changing this response for negative offsets, so I don't think this is likely to change.

@EmmaJaneBonestell
Copy link
Author

@ewdurbin
I understand. Thanks for the prompt response anyway. (And apologies for filing as a bug; I wasn't going to, but when checking if it was already reported, I saw a similar issue filed as a bug.)

@njsmith
Copy link
Contributor

njsmith commented Jan 20, 2023

@ewdurbin So I just ran into this as well, and..... I didn't change my code. So this worked a few weeks ago, but just stopped working?

And I just checked some other package repos that use Fastly as their CDN, and they seem to be supporting it fine. It's only pythonhosted.org that's erroring out:

❯ http https://rubygems.org/downloads/fastly-4.0.0.gem range:bytes=-10
HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Age: 3618959
Cache-Control: max-age=31536000
Connection: keep-alive
Content-Length: 10
Content-Range: bytes 911862-911871/911872
Content-Type: binary/octet-stream
Date: Fri, 20 Jan 2023 11:13:18 GMT
ETag: "5dafe4c4090d8356f01c40482d859083"
Last-Modified: Fri, 09 Dec 2022 13:56:06 GMT
Server: RubyGems.org
Via: 1.1 varnish, 1.1 varnish
X-Backend: F_S3 3.5.82.178:443, fastlyshield--shield_ssl_cache_bfi_krnt7300089_BFI 167.82.143.89:443
X-Cache: HIT, HIT
X-Cache-Hits: 22, 0
X-Served-By: cache-bfi-krnt7300089-BFI, cache-sjc10027-SJC
X-Timer: S1674213198.372793,VS0,VE1
x-amz-id-2: adakGQsWWTCKqcLh6RJBZ6R3jsaF9eUvEjP2yTypfC/SkrY29jhRlKxDje9bThF23wvVhjA0DdTubxFlxyUe6Q==
x-amz-replication-status: COMPLETED
x-amz-request-id: H8SZZ3N4PV0DFHEQ
x-amz-version-id: PZEhRw5eG5oNtEeU6EQMfTUWBtcN3ENT



+-----------------------------------------+
| NOTE: binary data not shown in terminal |
+-----------------------------------------+


~ 
❯ http https://hackage.haskell.org/package/ghc-9.4.4/ghc-9.4.4.tar.gz range:bytes=-10
HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Age: 411
Cache-Control: public, no-transform, max-age=2592000
Connection: keep-alive
Content-Length: 10
Content-MD5: wcAzB7pxOV0QGMYvRVGapw==
Content-Range: bytes 5125255-5125264/5125265
Content-Type: application/x-gzip
Date: Fri, 20 Jan 2023 11:13:23 GMT
ETag: "c1c03307ba71395d1018c62f45519aa7"
Last-modified: Sat, 31 Dec 2022 18:44:10 GMT
Server: nginx/1.18.0 (Ubuntu)
Via: 1.1 varnish
X-Cache: HIT
X-Cache-Hits: 0
X-Served-By: cache-sjc10072-SJC
X-Timer: S1674213203.447684,VS0,VE0



+-----------------------------------------+
| NOTE: binary data not shown in terminal |
+-----------------------------------------+


~ 
❯ http https://files.pythonhosted.org/packages/6e/77/7b69133bf0f3a6b0000cdb6133ff5292734182ca0cd107ad7ff4c46e7bc1/numpy-1.24.1-cp310-cp310-macosx_10_9_x86_64.whl range:bytes=-10
HTTP/1.1 501 Unsupported client range
Accept-Ranges: bytes
Connection: close
Content-Length: 463
Content-Type: text/html; charset=utf-8
Date: Fri, 20 Jan 2023 11:13:41 GMT
Retry-After: 0
Server: Varnish
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Cache: MISS
X-Cache-Hits: 0
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-Permitted-Cross-Domain-Policies: none
X-Robots-Header: noindex
X-Served-By: cache-sjc10067-SJC
X-Timer: S1674213222.925774,VS0,VE0
X-XSS-Protection: 1; mode=block

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>501 Unsupported client range</title>
  </head>
  <body>
    <h1>Error 501 Unsupported client range</h1>
    <p>Unsupported client range</p>
    <h3>Error 54113</h3>
    <p>Details: cache-sjc10067-SJC 1674213222 3968417666</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

So this seems to be a recent, PyPI-specific regression?

(Also as a side note, if there's an error here, it really should be 416 Range Not Satisfiable. The problem with 501 is that while it's not exactly wrong, it's a very generic "something is broken" error, so clients will usually give up, while code that's attempting range requests should have smart fallback logic for 416, like switching to requesting the whole file.)

@di
Copy link
Member

di commented Jan 20, 2023

@njsmith Any idea when it last worked? On Dec 12th we enabled segmented caching at the request of Fastly, and I wonder if this is a side effect.

@ewdurbin
Copy link
Member

Hmmm, yep. Confirmed by reverting Segmented Caching rollout on test-files.pythonhosted.org temporarily, which restored negative offset support:

Before Segmented Caching:

$ curl -H"Range: -64" https://test-files.pythonhosted.org/packages/b7/b6/7e000d22d54f3c8939b366a0c85f2bbfcf973ce18b894a829b882a9109e5/hi-ml-0.1.1.post1159.tar.gz
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

After:

$ curl -H"Range: -64" https://test-files.pythonhosted.org/packages/b7/b6/7e000d22d54f3c8939b366a0c85f2bbfcf973ce18b894a829b882a9109e5/hi-ml-0.1.1.post1159.tar.gz

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>501 Unsupported client range</title>
  </head>
  <body>
    <h1>Error 501 Unsupported client range</h1>
    <p>Unsupported client range</p>
    <h3>Error 54113</h3>
    <p>Details: cache-bur-kbur8200118-BUR 1674228917 2094377810</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

@ewdurbin
Copy link
Member

Fastly considered getting us moved over to Segmented Caching for files.pythonhosted.org urgent, so I don't see us reverting that.

@dstufft
Copy link
Member

dstufft commented Jan 20, 2023

It's possible we can mention it to Fastly and it's something they might be able to implement on their end, but even if they were, no idea what time frame that would be.

@njsmith
Copy link
Contributor

njsmith commented Jan 20, 2023

$ curl -H"Range: -64"

The (very confusing) syntax for range headers is Range: bytes=-64, so I think your pasted results are technically showing differences in error handling – the no-segmented-cache system responds to invalid Range: headers by sending the full body (a 200 response), while the segmented-cache system responds them by giving an error. But I think you're right that segmented caching must be the issue.

We probably should send Fastly a bug report about this? I feel like CDN companies generally like to know when their HTTP implementation has a bug, even if it's not a show-stopper :-).

BTW, the reason this matters is because it's useful for wheel METADATA fetches, like pip does here: https://github.com/pypa/pip/blob/main/src/pip/_internal/network/lazy_wheel.py
To fetch a specific file out of a zip without downloading the whole thing, you start by reading the zip index that's at the end of the file, so the first thing you need to do is to fetch the end of the file. Pip's implementation does this by first doing a HEAD request to find out how long the file is, and then doing a Range request to get the end, but using two round trips for this is inefficient when you could accomplish both with a single negative range request. So... there is a workaround, and when PEP 658 (#8254) support lands it'll let us skip this whole dance anyway. And I guess it's lucky that pip was inefficient, because turning on Segmented Caching would have been a disaster!

(See also: gtsystem/python-remotezip#15)

@ewdurbin
Copy link
Member

Support request has been submitted to Fastly.

@ewdurbin
Copy link
Member

Fastly support confirms that Segmented Caching does not support such negative offsets.

In addition they offered responses regarding suggested HTTP status codes for the response:

I got some details on the proposed responses from the linked Github page.

"405 Method Not Allowed" would not be the appropriate status code, its purpose is, quoting from https://www.rfc-editor.org/rfc/rfc7231#section-6.5.5, to indicate:
that the method received in the request-line is known by the origin server but not supported by the target resource.

"416 Range Not Satisfiable" on the other hand would be valid, quoting from https://www.rfc-editor.org/rfc/rfc7233#section-4.4, it indicates:
that none of the ranges in the request's Range header field (Section 3.1) overlap the current extent of the selected resource or that the set of ranges requested has been rejected due to invalid ranges or an excessive request of small or overlapping ranges.

Another valid handling would be to ignore an invalid Range and to send a status 200 response with the whole object, which is what e.g. apache httpd does in a case like this, but that would be a very wasteful response to what is basically an invalid request.

There is a wrinkle with sending a status 416 response because a valid status 416 response is supposed to have a Content-Range response header field with the size of the complete object filled in. That means that we need to access at least one fragment in order to be able to send a valid status 416 response. That again seems a bit wasteful a response to what is basically an invalid request.

@ewdurbin
Copy link
Member

ewdurbin commented Jan 23, 2023

Seems our only recourse would be to write custom VCL to handle the requests, either

  1. Disabling segmented caching for such requests, which would be problematic for Fastly (who asked us to enable it in the first place).
  2. Disabling caching and passing these requests directly to our backends. Which would lead to much more traffic to arbitrary traffic to our origins, defeating the purpose of our CDN for files.pythonhosted.org.
  3. Creating a synthetic response at the edge.

It seems the workaround making two requests (HEAD, then GET with Range:) is valid enough. Should we consider documenting that along side the lack of support for these negative offset requests and the potentially confusing error responses?

@di
Copy link
Member

di commented Jan 23, 2023

I'm trying to understand the use case here: is it just to retrieve metadata? Would working towards implementing PEP 658 be a better use of our time?

@njsmith
Copy link
Contributor

njsmith commented Jan 23, 2023

That again seems a bit wasteful a response to what is basically an invalid request.

My engineer-brain insists that I point out: this is a weird way to describe an RFC-standard request that Fastly themselves support in their normal mode :-) Cf. RFC 7233 page 5:

   A client can request the last N bytes of the selected representation
   using a suffix-byte-range-spec.

     suffix-byte-range-spec = "-" suffix-length
     suffix-length = 1*DIGIT

is it just to retrieve metadata?

yes

Would working towards implementing PEP 658 be a better use of our time?

yes

@dstufft
Copy link
Member

dstufft commented May 23, 2023

I'm going to close this. I don't think there's anything we can do here to move this issue forward, and PEP 658 is rolling out now which is a better way to handle this anyways.

@cosmicexplorer
Copy link

To corroborate the above, while pip will already prefer PEP 658 metadata when available, pypa/pip#12208 will fall back to performing a HEAD request to extract the Content-Length if it receives a 501 Not Implemented upon attempting a negative offset range (all according to the HTTP standard): see https://github.com/cosmicexplorer/pip/blob/e6f3f1aeec87fa70d4b8759478462e67a14b0a91/src/pip/_internal/network/lazy_wheel.py#L286-L303. Further range requests which explicitly specify the start and end of the range will then succeed. Non-pip resolvers which also want to use the fast-deps technique when PEP 658 metadata is not available are recommended to do the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDN/network Issues related to our CDN, users having problems connecting to PyPI
Projects
None yet
Development

No branches or pull requests

6 participants