
Content-Length is not checked, resulting in short reads with no error #4956

Closed
debugmiller opened this issue Feb 1, 2019 · 9 comments
debugmiller commented Feb 1, 2019

Currently, requests does not detect when the length of a response body does not match the Content-Length header. This results in undetected failures when, for example, a TCP connection is closed early, and is contrary both to the behavior of other tools (e.g. curl) and libraries (e.g. reqwest), which fail if a body length does not match Content-Length [3], and to the HTTP 1.1 RFC [5]. Frustration with this behavior is well documented ([1][2][3][4]), and it was first reported as a bug five years ago [5]. In [1] it was agreed that the underlying checking should be done in urllib3. This was implemented in urllib3 v1.17 with the enforce_content_length setting [6], but the setting defaults to False. At the time this was implemented in urllib3, a merge on requests:proposed/3.0.0 [7] set enforce_content_length to True so that short reads would be detected (edit: contrary to what I first wrote, it was not later reverted), but no such merge happened on the requests 2.x branch.

Requests master now requires urllib3>=1.21, so enforce_content_length is guaranteed to be available. There is a strong argument to be made that enforce_content_length=True should be the default, but at the very least it should be possible for a user to opt into it.

[1] #2833
[2] #2275
[3] https://blog.petrzemek.net/2018/04/22/on-incomplete-http-reads-and-the-requests-library-in-python/
[4] https://news.ycombinator.com/item?id=16896899
[5] #1855
[6] urllib3/urllib3#949
[7] #3563

Expected Result

HTTP bodies that do not match the HTTP Content-Length header should be detected.

Actual Result

HTTP bodies that do not match the HTTP Content-Length header are not detected.

Reproduction Steps

#2833 (comment)
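For reference, the failure can also be reproduced self-contained using only the standard library plus requests. This is a sketch; the throwaway server below is ours, not taken from the linked comment:

```python
import http.server
import threading

import requests

# A tiny server that advertises 10000 bytes but sends only 14.
class ShortBodyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', '10000')
        self.end_headers()
        self.wfile.write(b'fourteen chars')

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), ShortBodyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# With urllib3 < 2.0 the short body is returned silently; with urllib3 >= 2.0
# (where enforce_content_length defaults to True) requests raises instead.
try:
    body = requests.get('http://127.0.0.1:%d/test' % port).text
    detected = False
except requests.exceptions.RequestException:
    body = None
    detected = True

server.shutdown()
print(body, detected)
```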

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.7"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.0"
  },
  "platform": {
    "release": "18.2.0",
    "system": "Darwin"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.21.0"
  },
  "system_ssl": {
    "version": "1000211f"
  },
  "urllib3": {
    "version": "1.24"
  },
  "using_pyopenssl": false
}
@debugmiller (Author)

I should also add that I'm happy to port over #3563, but I'm unclear on why it was reverted or whether the approach applies to requests 2.x.

@debugmiller (Author)

I have realized it wasn't reverted; I misread the PR that I thought reverted it.


ExplodingCabbage commented Jan 11, 2020

I can confirm that Requests ignores Content-Length when used in the most obvious possible way, even though both urllib3 and the built-in http module check it when used in the most obvious possible ways (and even though unrelated tools for making HTTP requests, like curl, check it too).

For a simple demonstration, first install Flask, copy the following code into a file, and run it:

from flask import Flask, make_response

app = Flask(__name__)

@app.route('/test', methods=['GET', 'POST'])
def send_incomplete_response():
    response = make_response('fourteen chars')
    response.headers['Content-Length'] = '10000'
    return response

app.run()

You're now running a server on localhost that, when you make a request to http://localhost:5000/test, returns a response with a 14-byte-long body but with a Content-Length header claiming that the body contains 10000 bytes. Now observe how different tools handle making a request to that endpoint:

mark@fractal:~$ curl http://localhost:5000/test
curl: (18) transfer closed with 9986 bytes remaining to read
fourteen charsmark@fractal:~$ 
mark@fractal:~$ python3.8
Python 3.8.0 (default, Oct 28 2019, 16:14:01) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import http.client
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test')
>>> response = conn.getresponse()
>>> response.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/http/client.py", line 467, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 610, in _safe_read
    raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(14 bytes read, 9986 more expected)
>>> 
>>> import urllib3
>>> pool_manager = urllib3.PoolManager()
>>> response = pool_manager.request('GET', 'http://localhost:5000/test')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 503, in read
    data = self._fp.read() if not fp_closed else b""
  File "/usr/lib/python3.8/http/client.py", line 467, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 610, in _safe_read
    raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(14 bytes read, 9986 more expected)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 685, in urlopen
    response = self.ResponseCls.from_httplib(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 589, in from_httplib
    resp = ResponseCls(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 258, in __init__
    self._body = self.read(decode_content=decode_content)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(14 bytes read, 9986 more expected)', IncompleteRead(14 bytes read, 9986 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/urllib3/request.py", line 75, in request
    return self.request_encode_url(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/request.py", line 97, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/poolmanager.py", line 330, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 747, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 747, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 747, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /test (Caused by ProtocolError('Connection broken: IncompleteRead(14 bytes read, 9986 more expected)', IncompleteRead(14 bytes read, 9986 more expected)))
>>> 
>>> import requests
>>> response = requests.get('http://localhost:5000/test')
>>> response.text
'fourteen chars'

Used in the most straightforward possible way:

  • Curl outputs the response, but warns us about the content length not matching what was claimed in the header
  • http.client raises an IncompleteRead
  • urllib3 raises a MaxRetryError caused by an IncompleteRead
  • requests happily gives us the incomplete response with no indication that anything went wrong.

Whoops. So, why did this happen? Requests is built on top of urllib3, which is built on top of http.client, and Requests isn't explicitly catching and discarding an IncompleteRead error anywhere, so how can it fail to surface an error that the layers beneath it emit?

The answer is that there are possible ways to use both http.client and urllib3 that prevent this error from being raised, and Requests is using them. Starting from the bottom...

How http.client is broken

While http.client raises an IncompleteRead if you call .read() with no argument on an incomplete http.client.HTTPResponse object, it doesn't do so if you pass an argument to .read(). This is true whether you're reading a bit at a time or everything at once:

>>> import http.client
>>> 
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test')
>>> response = conn.getresponse()
>>> response.read(1000)
b'fourteen chars'
>>> 
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test')
>>> response = conn.getresponse()
>>> while True:
...     next_byte = response.read(1)
...     if next_byte == b'':
...         break
...     print(next_byte)
... 
b'f'
b'o'
b'u'
b'r'
b't'
b'e'
b'e'
b'n'
b' '
b'c'
b'h'
b'a'
b'r'
b's'
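For completeness, the same inconsistency can be reproduced as a single self-contained script, using a throwaway standard-library server instead of Flask (the handler class and fetch helper below are ours):

```python
import http.client
import http.server
import threading

# The server advertises 10000 bytes but sends only 14, then closes.
class ShortBodyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', '10000')
        self.end_headers()
        self.wfile.write(b'fourteen chars')

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), ShortBodyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def fetch():
    conn = http.client.HTTPConnection('127.0.0.1', port)
    conn.request('GET', '/test')
    return conn.getresponse()

# .read() with no argument notices the short body and raises IncompleteRead...
try:
    fetch().read()
    unbounded_read_raised = False
except http.client.IncompleteRead:
    unbounded_read_raised = True

# ...but reading with an amount silently returns the truncated body.
response = fetch()
chunks = []
while True:
    chunk = response.read(1000)
    if not chunk:
        break
    chunks.append(chunk)
body = b''.join(chunks)

server.shutdown()
print(unbounded_read_raised, body)
```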

This was a deliberate decision; the handling of reads like this was apparently even more broken before (per https://bugs.python.org/issue15633, back in 2013 they not only wouldn't raise but also wouldn't mark the connection as closed), and @pitrou partially fixed it in
python/cpython@beec61a#diff-b1e875fb060aa473c57a8e87bf903a76R514-R517, but refrained from raising an error to avoid breaking backwards compatibility - presumably with programs that were relying on the old behaviour to be able to request stuff from servers that spit out erroneous Content-Length headers.

How urllib3 is broken

urllib3 used to always fail to raise an IncompleteRead error in this situation, because it calls the aforementioned http.client.HTTPResponse.read(...) method with an argument rather than reading everything at once.

This was understood and then sort of fixed by @nateprewitt in urllib3/urllib3#949, by having urllib3 independently check, in the .read() method of the urllib3.response.HTTPResponse class, whether the number of bytes read matches the Content-Length header, and explicitly raise a urllib3.exceptions.IncompleteRead otherwise.

(Edit note: next paragraph and everything after it rewritten. I got some stuff quite wrong here at first.)

But that check isn't enabled by default; in urllib3, enforce_content_length defaults to False. So, unless you explicitly ask for it, the check is never applied. However, you get some protection from the fact that, by default, an HTTPResponse is constructed with preload_content=True, which causes it to call self.read() without an argument, which in turn calls http.client.HTTPResponse.read() without an argument, which is the scenario in which http.client.HTTPResponse.read() enforces Content-Length properly.

However, things go wrong when we use the .stream() method. If we construct a response object with preload_content=False, we can call .stream() on it, which defers to .read() but calls it with a length argument. That length argument gets passed through to http.client.HTTPResponse.read(), invoking the scenario where http.client does not enforce Content-Length. So, for a user who isn't setting enforce_content_length, using .stream() instead of .read() effectively disables Content-Length enforcement:

>>> import urllib3
>>> pool_manager = urllib3.PoolManager()
>>> response = pool_manager.request(
...     'GET',
...     'http://localhost:5000/test',
...     preload_content=False
... )
>>> list(response.stream())
[b'fourteen chars']
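Passing enforce_content_length=True restores the check even for .stream(). Here's a self-contained sketch (the throwaway server is ours; on urllib3 1.x the flag must be forwarded through request() explicitly, while on 2.x it is already the default):

```python
import http.server
import threading

import urllib3

# Server that claims 10000 bytes but sends only 14.
class ShortBodyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', '10000')
        self.end_headers()
        self.wfile.write(b'fourteen chars')

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), ShortBodyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

kwargs = {'preload_content': False}
if urllib3.__version__.startswith('1.'):
    # Must be requested explicitly on 1.x; it defaults to True on 2.x.
    kwargs['enforce_content_length'] = True

pool = urllib3.PoolManager()
response = pool.request('GET', 'http://127.0.0.1:%d/test' % port, **kwargs)
try:
    data = b''.join(response.stream(1024))
    short_read_detected = False
except (urllib3.exceptions.ProtocolError, urllib3.exceptions.IncompleteRead):
    short_read_detected = True

server.shutdown()
print(short_read_detected)
```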

How Requests is broken

Unfortunately, Requests constructs its responses with preload_content=False, does not pass enforce_content_length=True, and requests.models.Response.iter_content() (which is invoked under the hood by e.g. the .text property) calls self.raw.stream(), which is precisely the awful combination of details needed to bypass both the incomplete read check in http.client and the incomplete read check in urllib3. This means that Requests fails to perform this check even in completely ordinary usage scenarios.

The fix

I think urllib3 should just make enforce_content_length=True by default. I can't really see a good reason for it to default to False, besides the (rather weak, IMO) point about backwards-compatibility that motivated not checking for this in http.client. There's no sane reason why Content-Length checking should be enabled when calling .read() but disabled when calling .read(999999), which is the status quo, and if anybody really does need that behaviour for some strange reason they can always change their code to explicitly pass enforce_content_length=False. @nateprewitt, as the author of the PR that added enforce_content_length to urllib3, what do you think?

If urllib3 makes that change, then Requests can simply bump their minimum required urllib3 version to whatever release includes the change, and be done.

Failing that, Requests should pass enforce_content_length=True explicitly.

(Edit note: I've rewritten this significantly to fix mistakes, and deleted some comments I left subsequently after integrating their content into here. I also realised after writing that @debugmiller had already pointed out that enforce_content_length was the thing that matters here; I started writing this up before finding this issue, and didn't scrutinise it properly before posting. Still, maybe this overview is still helpful. To anyone I spammed with emails from my multiple comments: sorry that I didn't get my ducks in a row before posting!)

@nateprewitt (Member)

It looks like the audit trail was followed for the most part here. We did merge this into Requests in #3563, but it's in a separate branch that was intended for 3.0. Adding the flag is a breaking change for the 2.x branch, so we're unable to resolve this until the next major version.

@nateprewitt (Member)

Just providing an update, I've opened urllib3/urllib3#2514 to change the default here in urllib3 2.0.


hterik commented May 11, 2022

It's great that the next versions of both urllib3 and requests have the fixes, but those branches seem to have a very unpredictable timeline for release. "Before PyCon 2020"?

Any suggestions on how to work around this effectively for us mortals still on the released versions?

Right now we are looking at wrapping every file downloaded with requests in our own content-length and checksum checking to be safe. That is a good idea anyway, but it isn't always possible, since you don't always have this side-channel data.

Considering the severity of this bug (silently corrupted files), is there any chance of getting this into the 2.x branch?
What's worse is that this bug also affects many middle-layer libraries today, which would all need a similar major-version bump of requests, meaning the rollout would take even longer, or might never happen. Until then, any library that uses requests under the hood to download big files without being aware of this limitation (likely many) could return corrupted downloads to the main application without notice.

@bruceduhamel

I ran into this problem today and found no fix in requests 2.x and no 3.x release to upgrade to.

I found this blog article, which offers a workaround for 2.x:
https://blog.petrzemek.net/2018/04/22/on-incomplete-http-reads-and-the-requests-library-in-python/

I'm still torn between treating JSON parsing errors the same as transport errors, adding the above blog's quick fix in a custom session, or trying out httpx instead.
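For anyone landing here, the blog's quick-fix idea can be sketched as a response hook (the hook name is ours, not a requests API; note it forces the whole body into memory, so it doesn't help with stream=True, and the length comparison can misfire on responses requests has decompressed):

```python
import requests

def raise_on_short_read(response, *args, **kwargs):
    # Compare the bytes we actually got against the advertised Content-Length.
    expected = response.headers.get('Content-Length')
    if expected is not None and len(response.content) != int(expected):
        raise IOError(
            'incomplete read: got %d bytes, expected %s'
            % (len(response.content), expected)
        )
    return response

# Install the check on every response made through this session.
session = requests.Session()
session.hooks['response'].append(raise_on_short_read)
```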

@dimbleby (Contributor)

I guess the release of urllib3 2.0 fixes this, via urllib3/urllib3#2514?

(I'm slightly reluctant to mention this, because if you're happy to let it be fixed in a minor release via urllib3... then it seems to me that you might just as well have fixed it directly in requests long ago. So I'm worried you're now going to feel the need to break it again, for compatibility. Please don't!)

@nateprewitt (Member)

@dimbleby, yes, it is fixed in urllib3 2. We're aware of the change, as we made it :) Users are free to upgrade to 2.0; since we mandated no changes in Requests, this is acceptable within a minor version. Requests is still compatible with urllib3 1.26.x, as well as 2.x, which includes this change.

I'll close this out now since there's a path forward for users with this issue.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 24, 2024