
_write_to_stream() in the Download class forces decompression of the response even if "Accept-Encoding: gzip" was present in the request #49

Closed
houglum opened this issue Sep 19, 2018 · 2 comments · Fixed by #87
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@houglum

houglum commented Sep 19, 2018

Note for Googlers: This is the follow-up for an internal-only-visible issue at https://issuetracker.google.com/issues/115343385.

The Decompressive Transcoding docs for Cloud Storage [1] state that if an object's Content-Encoding is set to "gzip", the service will decompress the object when someone performs a media request to download it, unless one of two conditions holds. One of those conditions is the requester setting the "Accept-Encoding: gzip" header on the request; in that case, the bytes are served in gzipped form.

This works as expected on the service side, but when downloading a gzipped object using the Cloud Storage Python client library [2] (which uses this library), the bytes end up being written to the destination file as if they had been served decoded.

It turns out the bytes are being served gzip-encoded, but this library decodes them before writing them out to the desired file/stream. I used this Python script to reproduce the issue:

import os
from google.cloud import storage

BUCKET='set your bucket name here'
GZIPPED_OBJECT='set your object name here'

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json/keyfile.json'

# The object {GZIPPED_OBJECT} contains gzipped text, and I've set its
# Content-Encoding header to the value "gzip". 
client = storage.Client()
bkt = client.get_bucket(BUCKET)
blob = bkt.get_blob(GZIPPED_OBJECT)

# Binary mode, since download_to_file() writes bytes.
with open('/tmp/gzipped-from-cli-lib', 'wb') as f:
  blob.download_to_file(f, client)

and before running it, I set a breakpoint directly before this line:
https://github.com/GoogleCloudPlatform/google-resumable-media-python/blob/50f4c4d22cdaea71c794639226e819197f11f555/google/resumable_media/requests/download.py#L121

...and if you stop at that breakpoint and execute response.raw.read() in the debugger, it gives you the gzipped bytes for the object. The problem comes from using iter_content() -- that method should NOT be used here when response.request.headers contains the ("accept-encoding", "gzip") pair, since it will transparently decompress the stream, as mentioned in the warning box in the docs for the Requests library [3]:
"""
Note
An important note about using Response.iter_content versus Response.raw. Response.iter_content will automatically decode the gzip and deflate transfer-encodings. Response.raw is a raw stream of bytes – it does not transform the response content. If you really need access to the bytes as they were returned, use Response.raw.
"""

We should be checking for that header, and if it's present, manually chunking/writing the bytes to the stream instead of using iter_content().
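A minimal sketch of that check, assuming a hypothetical write_to_stream helper (the real _write_to_stream lives in download.py; the name, chunk size, and header handling here are illustrative, not the library's actual implementation):

```python
DEFAULT_CHUNK_SIZE = 8192


def write_to_stream(response, stream, chunk_size=DEFAULT_CHUNK_SIZE):
    """Write a response body to ``stream`` without undoing gzip transcoding.

    If the request asked for gzip via Accept-Encoding, read the raw socket
    bytes so the transport does not transparently decompress them;
    otherwise fall back to iter_content().
    """
    accept_encoding = response.request.headers.get("Accept-Encoding", "")
    if "gzip" in accept_encoding.lower():
        # response.raw does not transform the payload.
        while True:
            chunk = response.raw.read(chunk_size)
            if not chunk:
                break
            stream.write(chunk)
    else:
        # No gzip requested: transparent decoding is the desired behavior.
        for chunk in response.iter_content(chunk_size=chunk_size):
            stream.write(chunk)
```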

[1] https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
[2] https://github.com/GoogleCloudPlatform/google-cloud-python/
[3] http://docs.python-requests.org/en/master/user/quickstart/#raw-response-content

@dhermes
Contributor

dhermes commented Sep 19, 2018

This was something I was worried could happen but never adequately followed up, thanks @houglum. There was discussion in #35 and #36 (and related issues).

@houglum
Author

houglum commented Sep 20, 2018

Thanks for the quick response!

I'd write up a fix for this, but I've got other higher-priority work lined up for the next couple of weeks. And the bulk of my time on this would be spent looking through the tests and figuring out how to add a test for it :P

Is it likely this will be fixed anytime soon? If not, I can take a stab at it when I get the time.

@JustinBeckwith JustinBeckwith added triage me I really want to be triaged. 🚨 This issue needs some love. labels Dec 8, 2018
@tseaver tseaver added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Jan 16, 2019
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Mar 18, 2019
tseaver added a commit that referenced this issue Aug 14, 2019
Allowing 'requests' to expand / inflate the file's body defeats downstream
expectations.

Closes #37.
Closes #49.
Closes #50.
tseaver added a commit that referenced this issue Aug 27, 2019
Don't expect 'Content-Length' header for chunked downloads.

Closes #37.
Closes #49.
Closes #50.
Closes #56.
Closes #76.