
Content-Length is not checked, resulting in short reads with no error #4956

Closed
debugmiller opened this issue Feb 1, 2019 · 9 comments
debugmiller commented Feb 1, 2019

Currently, requests does not detect when the length of a response body does not match the Content-Length header. This results in undetected failures when, for example, a TCP connection is closed early, and is contrary both to the behavior of other tools (e.g. curl) and libraries (e.g. reqwest), which fail if a body length does not match Content-Length [3], and to the HTTP 1.1 RFC [5]. Frustration with this behavior is well documented ([1][2][3][4]), and it was first reported as a bug five years ago [5]. In [1] it was agreed that the underlying checking should be done in urllib3. This was implemented in urllib3 v1.17 with the enforce_content_length setting [6], but the setting defaults to False. At the time this was implemented in urllib3, a merge on requests:proposed/3.0.0 [7] set enforce_content_length to True so that short reads would be detected (edit: contrary to what I first wrote, it was not later reverted), but no such merge happened on the requests 2.x branch.

Requests master now requires urllib3>=1.21, so enforce_content_length is guaranteed to be available. There is a strong argument to be made that enforce_content_length=True should be the default, but at the very least it should be possible for a user to opt into it.

[1] #2833
[2] #2275
[3] https://blog.petrzemek.net/2018/04/22/on-incomplete-http-reads-and-the-requests-library-in-python/
[4] https://news.ycombinator.com/item?id=16896899
[5] #1855
[6] urllib3/urllib3#949
[7] #3563

Expected Result

HTTP bodies that do not match the HTTP Content-Length header should be detected.

Actual Result

HTTP bodies that do not match the HTTP Content-Length header are not detected.

Reproduction Steps

#2833 (comment)
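For reference, the failure can also be reproduced self-contained using only the standard library plus requests. This is a sketch; the throwaway server below is ours, not taken from the linked comment:

```python
import http.server
import threading

import requests

# A tiny server that advertises 10000 bytes but sends only 14.
class ShortBodyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', '10000')
        self.end_headers()
        self.wfile.write(b'fourteen chars')

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), ShortBodyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# With urllib3 < 2.0 the short body is returned silently; with urllib3 >= 2.0
# (where enforce_content_length defaults to True) requests raises instead.
try:
    body = requests.get('http://127.0.0.1:%d/test' % port).text
    detected = False
except requests.exceptions.RequestException:
    body = None
    detected = True

server.shutdown()
print(body, detected)
```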

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.7"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.0"
  },
  "platform": {
    "release": "18.2.0",
    "system": "Darwin"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.21.0"
  },
  "system_ssl": {
    "version": "1000211f"
  },
  "urllib3": {
    "version": "1.24"
  },
  "using_pyopenssl": false
}
@debugmiller (Author)

I should also add that I'm happy to port over #3563, but I'm unclear on why it was reverted or whether the approach applies to requests 2.x.

@debugmiller (Author)

I have realized it wasn't reverted; I misread the PR that I thought reverted it.


ExplodingCabbage commented Jan 11, 2020

I can confirm that Requests ignores Content-Length when used in the most obvious possible way, even though both urllib3 and the built-in http module check it when used in the most obvious possible ways (and even though unrelated tools for making HTTP requests, like curl, check it too).

For a simple demonstration, first install Flask, copy the following code into a file, and run it:

from flask import Flask, make_response

app = Flask(__name__)

@app.route('/test', methods=['GET', 'POST'])
def send_incomplete_response():
    response = make_response('fourteen chars')
    response.headers['Content-Length'] = '10000'
    return response

app.run()

You're now running a server on localhost that, when you make a request to http://localhost:5000/test, returns a response with a 14-byte-long body but with a Content-Length header claiming that the body contains 10000 bytes. Now observe how different tools handle making a request to that endpoint:

mark@fractal:~$ curl http://localhost:5000/test
curl: (18) transfer closed with 9986 bytes remaining to read
fourteen charsmark@fractal:~$ 
mark@fractal:~$ python3.8
Python 3.8.0 (default, Oct 28 2019, 16:14:01) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import http.client
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test')
>>> response = conn.getresponse()
>>> response.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/http/client.py", line 467, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 610, in _safe_read
    raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(14 bytes read, 9986 more expected)
>>> 
>>> import urllib3
>>> pool_manager = urllib3.PoolManager()
>>> response = pool_manager.request('GET', 'http://localhost:5000/test')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 503, in read
    data = self._fp.read() if not fp_closed else b""
  File "/usr/lib/python3.8/http/client.py", line 467, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 610, in _safe_read
    raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(14 bytes read, 9986 more expected)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 685, in urlopen
    response = self.ResponseCls.from_httplib(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 589, in from_httplib
    resp = ResponseCls(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 258, in __init__
    self._body = self.read(decode_content=decode_content)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(14 bytes read, 9986 more expected)', IncompleteRead(14 bytes read, 9986 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/urllib3/request.py", line 75, in request
    return self.request_encode_url(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/request.py", line 97, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/poolmanager.py", line 330, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 747, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 747, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 747, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /test (Caused by ProtocolError('Connection broken: IncompleteRead(14 bytes read, 9986 more expected)', IncompleteRead(14 bytes read, 9986 more expected)))
>>> 
>>> import requests
>>> response = requests.get('http://localhost:5000/test')
>>> response.text
'fourteen chars'

Used in the most straightforward possible way:

  • Curl outputs the response, but warns us about the content length not matching what was claimed in the header
  • http.client raises an IncompleteRead
  • urllib3 raises a MaxRetryError caused by an IncompleteRead
  • requests happily gives us the incomplete response with no indication that anything went wrong.

Whoops. So, why did this happen? Requests is built on top of urllib3, which is built on top of http.client, and Requests isn't explicitly catching and discarding an IncompleteRead error anywhere, so how can it fail to surface an error that the layers beneath it emit?

The answer is that there are possible ways to use both http.client and urllib3 that prevent this error from being raised, and Requests is using them. Starting from the bottom...

How http.client is broken

While http.client raises an IncompleteRead if you call .read() with no argument on an incomplete http.client.HTTPResponse object, it doesn't do so if you pass an argument to .read(). This is true whether you're reading a bit at a time or everything at once:

>>> import http.client
>>> 
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test')
>>> response = conn.getresponse()
>>> response.read(1000)
b'fourteen chars'
>>> 
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test')
>>> response = conn.getresponse()
>>> while True:
...     next_byte = response.read(1)
...     if next_byte == b'':
...         break
...     print(next_byte)
... 
b'f'
b'o'
b'u'
b'r'
b't'
b'e'
b'e'
b'n'
b' '
b'c'
b'h'
b'a'
b'r'
b's'
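For completeness, the same inconsistency can be reproduced as a single self-contained script, using a throwaway standard-library server instead of Flask (the handler class and fetch helper below are ours):

```python
import http.client
import http.server
import threading

# The server advertises 10000 bytes but sends only 14, then closes.
class ShortBodyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', '10000')
        self.end_headers()
        self.wfile.write(b'fourteen chars')

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), ShortBodyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def fetch():
    conn = http.client.HTTPConnection('127.0.0.1', port)
    conn.request('GET', '/test')
    return conn.getresponse()

# .read() with no argument notices the short body and raises IncompleteRead...
try:
    fetch().read()
    unbounded_read_raised = False
except http.client.IncompleteRead:
    unbounded_read_raised = True

# ...but reading with an amount silently returns the truncated body.
response = fetch()
chunks = []
while True:
    chunk = response.read(1000)
    if not chunk:
        break
    chunks.append(chunk)
body = b''.join(chunks)

server.shutdown()
print(unbounded_read_raised, body)
```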

This was a deliberate decision; the handling of reads like this was apparently even more broken before (per https://bugs.python.org/issue15633, back in 2013 they not only wouldn't raise but also wouldn't mark the connection as closed), and @pitrou partially fixed it in
python/cpython@beec61a#diff-b1e875fb060aa473c57a8e87bf903a76R514-R517, but refrained from raising an error to avoid breaking backwards compatibility - presumably with programs that were relying on the old behaviour to be able to request stuff from servers that spit out erroneous Content-Length headers.

How urllib3 is broken

urllib3 used to always fail to raise an IncompleteRead error in this situation, because it calls the aforementioned http.client.HTTPResponse.read(...) method with an argument rather than reading everything at once.

This was understood and then sort of fixed by @nateprewitt in urllib3/urllib3#949, by having urllib3 independently check, in the .read() method of the urllib3.response.HTTPResponse class, whether the number of bytes read matches the Content-Length header, and explicitly raise a urllib3.exceptions.IncompleteRead otherwise.

(Edit note: next paragraph and everything after it rewritten. I got some stuff quite wrong here at first.)

But that check isn't enabled by default; in urllib3, enforce_content_length defaults to False. So, unless you explicitly ask for it, the check is never applied. However, you get some protection from the fact that, by default, an HTTPResponse is constructed with preload_content=True, which causes it to call self.read() without an argument, which in turn calls http.client.HTTPResponse.read() without an argument, which is the scenario in which http.client.HTTPResponse.read() enforces Content-Length properly.

However, things go wrong when we use the .stream() method. If we construct a response object with preload_content=False, we can call .stream() on it, which defers to .read() but calls it with a length argument. That length argument gets passed through to http.client.HTTPResponse.read(), invoking the scenario where http.client does not enforce Content-Length. So, for a user who isn't setting enforce_content_length, using .stream() instead of .read() effectively disables Content-Length enforcement:

>>> import urllib3
>>> pool_manager = urllib3.PoolManager()
>>> response = pool_manager.request(
...     'GET',
...     'http://localhost:5000/test',
...     preload_content=False
... )
>>> list(response.stream())
[b'fourteen chars']
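Passing enforce_content_length=True restores the check even for .stream(). Here's a self-contained sketch (the throwaway server is ours; on urllib3 1.x the flag must be forwarded through request() explicitly, while on 2.x it is already the default):

```python
import http.server
import threading

import urllib3

# Server that claims 10000 bytes but sends only 14.
class ShortBodyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', '10000')
        self.end_headers()
        self.wfile.write(b'fourteen chars')

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), ShortBodyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

kwargs = {'preload_content': False}
if urllib3.__version__.startswith('1.'):
    # Must be requested explicitly on 1.x; it defaults to True on 2.x.
    kwargs['enforce_content_length'] = True

pool = urllib3.PoolManager()
response = pool.request('GET', 'http://127.0.0.1:%d/test' % port, **kwargs)
try:
    data = b''.join(response.stream(1024))
    short_read_detected = False
except (urllib3.exceptions.ProtocolError, urllib3.exceptions.IncompleteRead):
    short_read_detected = True

server.shutdown()
print(short_read_detected)
```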

How Requests is broken

Unfortunately, Requests constructs its responses with preload_content=False, does not pass enforce_content_length=True, and requests.models.Response.iter_content() (which is invoked under the hood by e.g. the .text property) calls self.raw.stream(), which is precisely the awful combination of details needed to bypass both the incomplete read check in http.client and the incomplete read check in urllib3. This means that Requests fails to perform this check even in completely ordinary usage scenarios.

The fix

I think urllib3 should just make enforce_content_length=True by default. I can't really see a good reason for it to default to False, besides the (rather weak, IMO) point about backwards-compatibility that motivated not checking for this in http.client. There's no sane reason why Content-Length checking should be enabled when calling .read() but disabled when calling .read(999999), which is the status quo, and if anybody really does need that behaviour for some strange reason they can always change their code to explicitly pass enforce_content_length=False. @nateprewitt, as the author of the PR that added enforce_content_length to urllib3, what do you think?

If urllib3 makes that change, then Requests can simply bump their minimum required urllib3 version to whatever release includes the change, and be done.

Failing that, Requests should pass enforce_content_length=True explicitly.

(Edit note: I've rewritten this significantly to fix mistakes, and deleted some comments I left subsequently after integrating their content into here. I also realised after writing that @debugmiller had already pointed out that enforce_content_length was the thing that matters here; I started writing this up before finding this issue, and didn't scrutinise it properly before posting. Still, maybe this overview is still helpful. To anyone I spammed with emails from my multiple comments: sorry that I didn't get my ducks in a row before posting!)

@nateprewitt (Member)

It looks like the audit trail was followed for the most part here. We did merge this into Requests in #3563, but it's in a separate branch that was intended for 3.0. Adding the flag is a breaking change for the 2.x branch, so we're unable to resolve this until the next major version.

@nateprewitt (Member)

Just providing an update, I've opened urllib3/urllib3#2514 to change the default here in urllib3 2.0.


hterik commented May 11, 2022

It's great that the next versions of both urllib3 and requests have the fixes, but those branches seem to have a very unpredictable timeline for release. "Before PyCon 2020"?

Any suggestions on how to work around this effectively for us mortals still on the released versions?

Right now we are looking at wrapping every file downloaded with requests in our own content-length and checksum checking to be safe. That is a good idea anyway, but it isn't always possible, since you don't always have this side-channel data.

Considering the severity of this bug (silently corrupted files), is there any chance of getting this into the 2.x branch?
What's worse is that this bug also affects many middle-layer libraries today, which would all need a similar major-version bump of requests, meaning the rollout would take even longer, or might never happen. Until then, any library that uses requests under the hood to download big files without being aware of this limitation (likely many) could return corrupted downloads to the main application without notice.

@bruceduhamel

I ran into this problem today and found no fix in requests 2.x and no 3.x release to upgrade to.

I found this blog article, which offers a workaround for 2.x:
https://blog.petrzemek.net/2018/04/22/on-incomplete-http-reads-and-the-requests-library-in-python/

I'm still torn between treating JSON parsing errors the same as transport errors, adding the above blog's quick fix in a custom session, or trying out httpx instead.
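For anyone landing here, the blog's quick-fix idea can be sketched as a response hook (the hook name is ours, not a requests API; note it forces the whole body into memory, so it doesn't help with stream=True, and the length comparison can misfire on responses requests has decompressed):

```python
import requests

def raise_on_short_read(response, *args, **kwargs):
    # Compare the bytes we actually got against the advertised Content-Length.
    expected = response.headers.get('Content-Length')
    if expected is not None and len(response.content) != int(expected):
        raise IOError(
            'incomplete read: got %d bytes, expected %s'
            % (len(response.content), expected)
        )
    return response

# Install the check on every response made through this session.
session = requests.Session()
session.hooks['response'].append(raise_on_short_read)
```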

@dimbleby (Contributor)

I guess the release of urllib3 2.0 fixes this, via urllib3/urllib3#2514?

(I'm slightly reluctant to mention this, because if you're happy to let it be fixed in a minor release via urllib3... then it seems to me that you might just as well have fixed it directly in requests long ago. So I'm worried you're now going to feel the need to break it again, for compatibility. Please don't!)

@nateprewitt (Member)

@dimbleby, yes, it is fixed in urllib3 2. We're aware of the change, as we made it :) Users are free to upgrade to 2.0; since we mandated no changes in Requests, this is acceptable within a minor version. Requests is still compatible with urllib3 1.26.x, as well as 2.x, which includes this change.

I'll close this out now since there's a path forward for users with this issue.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 24, 2024