_write_to_stream() in the Download class forces decompression of the response even if "Accept-Encoding: gzip" was present in the request
#49
Labels
priority: p2
🚨
This issue needs some love.
type: bug
Note for Googlers: This is the follow-up for an internal-only-visible issue at https://issuetracker.google.com/issues/115343385.
The Decompressive Transcoding docs for Cloud Storage [1] state that if an object's Content-Encoding is set to "gzip", we'll decompress the object when someone performs a media request to download it, unless one of two conditions is true. One of them is that the requester sets the "Accept-Encoding: gzip" header on the request; in that case we'll serve the bytes in gzipped form.
This works as expected from the service, but when downloading a gzipped object using the Cloud Storage Python client library [2] (which uses this library), the bytes end up being written to the desired file as if they were served decoded.
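One quick way to tell which form actually landed in the output is to look for the gzip magic number (the bytes 0x1f 0x8b) at the start of the data. The following is only a minimal sketch; the helper name looks_gzipped is hypothetical and is not part of this library or of the Cloud Storage client.

```python
import gzip
import io

# Every gzip member starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(stream):
    """Return True if a seekable binary stream starts with the gzip magic number.

    The stream position is restored before returning, so the caller can
    keep reading from where it left off.
    """
    start = stream.tell()
    header = stream.read(2)
    stream.seek(start)
    return header == GZIP_MAGIC


# Example with in-memory data standing in for a downloaded file.
compressed = io.BytesIO(gzip.compress(b"some object contents"))
decoded = io.BytesIO(b"some object contents")
print(looks_gzipped(compressed), looks_gzipped(decoded))
```

For a file on disk, the same check works on a file opened in "rb" mode.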
It turns out the bytes are being served gzip-encoded, but this library is decoding them before writing them out to the desired file/stream. I used this Python script to reproduce the issue:
and before running it, I set a breakpoint directly before this line:
https://github.com/GoogleCloudPlatform/google-resumable-media-python/blob/50f4c4d22cdaea71c794639226e819197f11f555/google/resumable_media/requests/download.py#L121
...and if you get to that breakpoint and execute response.raw.read() in the debugger, it gives you the gzipped bytes for the object. The problem comes from using iter_content() -- that method should NOT be used here if response.request.headers contains the ("accept-encoding", "gzip") pair, since it will explicitly decompress the stream, as mentioned in the warning box in the web docs for the Requests library [3]:
"""
Note
An important note about using Response.iter_content versus Response.raw. Response.iter_content will automatically decode the gzip and deflate transfer-encodings. Response.raw is a raw stream of bytes – it does not transform the response content. If you really need access to the bytes as they were returned, use Response.raw.
"""
We should be checking for that header, and if it's present, manually chunking/writing the bytes to the stream instead of using iter_content().

[1] https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
[2] https://github.com/GoogleCloudPlatform/google-cloud-python/
[3] http://docs.python-requests.org/en/master/user/quickstart/#raw-response-content
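The check proposed above could look roughly like the sketch below. It is not the library's actual code: the names wants_gzip, write_to_stream and CHUNK_SIZE are hypothetical, and it assumes response.raw.read() returns the bytes as served, which is what the debugger session described above showed. The response object is treated as duck-typed, so the demo at the bottom uses simple stand-ins instead of a real Requests response.

```python
import gzip
import io
from types import SimpleNamespace

CHUNK_SIZE = 8192  # hypothetical chunk size, not taken from the library


def wants_gzip(request_headers):
    """Return True if the request's Accept-Encoding header lists gzip."""
    for name, value in request_headers.items():
        if name.lower() == "accept-encoding":
            # Drop any ";q=..." parameters and compare case-insensitively.
            encodings = [part.split(";")[0].strip().lower() for part in value.split(",")]
            return "gzip" in encodings
    return False


def write_to_stream(response, stream, chunk_size=CHUNK_SIZE):
    """Copy the response body into ``stream``.

    If the caller asked for gzip, read from ``response.raw`` so the bytes
    are written exactly as served; otherwise fall back to ``iter_content``.
    """
    if wants_gzip(response.request.headers):
        while True:
            chunk = response.raw.read(chunk_size)
            if not chunk:
                break
            stream.write(chunk)
    else:
        for chunk in response.iter_content(chunk_size=chunk_size):
            stream.write(chunk)


# Demo with a stand-in response object (no network access needed).
payload = gzip.compress(b"hello world")
fake_response = SimpleNamespace(
    request=SimpleNamespace(headers={"Accept-Encoding": "gzip"}),
    raw=io.BytesIO(payload),
    iter_content=lambda chunk_size: iter([b"hello world"]),
)
output = io.BytesIO()
write_to_stream(fake_response, output)
print(output.getvalue() == payload)
```

Reading in fixed-size chunks keeps memory use bounded for large objects, which is the same reason iter_content() is normally used.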