MD5 validation broken? #34

Closed
danqing opened this issue Oct 16, 2017 · 14 comments · Fixed by #36

@danqing

danqing commented Oct 16, 2017

I have a file that I uploaded and then downloaded with google-resumable-media under the hood. With 0.3, the library reports that the MD5 in the X-Goog-Hash header differs from the one it computed. The downloaded file, however, is intact.

The only unusual thing is that my file is gzipped. Does gzip break the MD5 calculation?

Thanks!


Also to be closed: googleapis/google-cloud-python#4227

@dhermes dhermes self-assigned this Oct 16, 2017
@dhermes
Contributor

dhermes commented Oct 16, 2017

Thanks for filing @danqing. I'll look into it. It might break in a way that we didn't expect (i.e. we made a programming error).

/cc @mfschwartz

@dhermes
Contributor

dhermes commented Oct 16, 2017

I'll be using the sdist for 0.3.0 as a test (google-resumable-media-0.3.0.tar.gz)

@dhermes
Contributor

dhermes commented Oct 16, 2017

OK I just tried to reproduce and could not. @danqing could you provide a gzip-ed file? Is this happening in a reproducible fashion for you? Do you have a stacktrace you could share?


Here is what I used to try to reproduce:

$ virtualenv venv
$ venv/bin/pip install ipython requests google-auth google-resumable-media
$ gsutil cp google-resumable-media-0.3.0.tar.gz gs://${BUCKET}
$ venv/bin/ipython
In [1]: import google.auth

In [2]: import google.auth.transport.requests as tr_requests

In [3]: from google.resumable_media.requests import Download

In [4]: ro_scope = u'https://www.googleapis.com/auth/devstorage.read_only'

In [5]: credentials, _ = google.auth.default(scopes=(ro_scope,))

In [6]: transport = tr_requests.AuthorizedSession(credentials)

In [7]: blob_name = 'google-resumable-media-0.3.0.tar.gz'

In [8]: url_template = (
   ...:     'https://www.googleapis.com/download/storage/v1/b/'
   ...:     '{bucket}/o/{blob_name}?alt=media')
   ...: 

In [9]: media_url = url_template.format(
   ...:     bucket='${BUCKET}', blob_name=blob_name)
   ...: 

In [10]: with open('local.tar.gz', 'wb') as stream:
    ...:     download = Download(media_url, stream=stream)
    ...:     download.consume(transport)
    ...: 
$
$ diff -s local.tar.gz google-resumable-media-0.3.0.tar.gz
Files local.tar.gz and google-resumable-media-0.3.0.tar.gz are identical

@mfschwartz

Danny - I suspect this problem may happen when the content-encoding is set to gzip.
Try this:
(create a text file, x, then do:)
% gsutil cp -Z x gs://your-bucket/x

You can see the content-encoding using:
% gsutil ls -L gs://your-bucket/x

Then try to download it using the updated library.
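
One quick way to tell whether a client handed back the stored gzip bytes or decoded them on the fly is to look at the first two bytes of the downloaded payload. This is a minimal sketch, assuming the payload is already in memory; `looks_gzipped` is a hypothetical helper name, not part of gsutil or google-resumable-media, and `gzip.compress` only approximates what `gsutil cp -Z` stores.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip member starts with these two bytes


def looks_gzipped(payload: bytes) -> bool:
    """Return True if the payload starts with the gzip magic number."""
    return payload[:2] == GZIP_MAGIC


stored = gzip.compress(b"Stuff\n")  # stand-in for the object as stored
decoded = gzip.decompress(stored)   # what a transparently decoding client returns

print(looks_gzipped(stored))
print(looks_gzipped(decoded))
```

If the downloaded bytes do not start with the magic number, something between the server and the caller has already decompressed them.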

@mfschwartz

I confirmed that having content-encoding:gzip causes this failure.
The problem is that the local MD5 is being computed against the un-gzipped content.

Is google-cloud-python uncompressing the object on the fly, or is it not setting 'accept-encoding': 'gzip, deflate'?
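
The mismatch described above can be illustrated without any network access. The following is a rough sketch, not the library's code: it hashes the same object once in its gzipped form (what the server's MD5 covers) and once after decompression (what the client hashed). `md5_b64` is a hypothetical helper, and `gzip.compress` stands in for the upload step.

```python
import base64
import gzip
import hashlib


def md5_b64(data: bytes) -> str:
    """Base64-encoded MD5 digest, the form that appears in X-Goog-Hash."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


plain = b"Stuff\n"
stored = gzip.compress(plain)  # stand-in for the bytes stored on the server

hash_of_stored = md5_b64(stored)                    # what the server advertises
hash_of_decoded = md5_b64(gzip.decompress(stored))  # what the client computed

print(hash_of_stored == hash_of_decoded)
```

The two digests differ, which is the same shape of failure as the DataCorruption error in this thread. Note that the gzip header embeds a timestamp, so the digest of locally compressed bytes will generally not equal the value that gsutil reports for an uploaded object.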

@gaetano-guerriero

Confirmed here. When content_encoding is gzip, the x-goog-hash header on download refers to the compressed object, but the library checks it against the uncompressed byte stream.

@dhermes
Contributor

dhermes commented Oct 17, 2017

@mfschwartz I have also reproduced (see below). I'd love to add a system test to this library (google-resumable-media). Is it as simple as gzip-ing the bytes locally and then sending Content-Encoding: gzip and Content-Type: text/plain as headers? (UPDATE: Sorry, just re-read that last sentence. I meant "is emulating gsutil cp -Z as simple as...".)

UPDATE: I "figured" it out:

>>> import gzip
>>> from google.resumable_media.requests import MultipartUpload
>>> # `transport`, `bucket`, and `blob_name` are the same objects used in the download session below.
>>> url_template = (
...     u'https://www.googleapis.com/upload/storage/v1/b/{bucket}/o?'
...     u'uploadType=multipart')
>>> upload_url = url_template.format(bucket=bucket)
>>>
>>> upload = MultipartUpload(upload_url)
>>> metadata = {
...     u'name': blob_name,
...     u'contentEncoding': u'gzip',
... }
>>> data = gzip.compress(b'Stuff\n')
>>> content_type = u'text/plain'
>>> response = upload.transmit(transport, data, metadata, content_type)

Here is what I did:

$ echo "Stuff" > stuff.txt
$ gsutil cp -Z stuff.txt gs://${BUCKET}
$ gsutil ls -L gs://${BUCKET}/stuff.txt
gs://${BUCKET}/stuff.txt:
    Creation time:          Tue, 17 Oct 2017 15:42:27 GMT
    Update time:            Tue, 17 Oct 2017 15:42:27 GMT
    Storage class:          STANDARD
    Cache-Control:          no-transform
    Content-Encoding:       gzip
    Content-Length:         36
    Content-Type:           text/plain
    Hash (crc32c):          1RPvTA==
    Hash (md5):             matKHCUjwpcS/fYLXwgV3Q==
...
$
$ virtualenv venv
$ venv/bin/pip install ipython requests google-auth google-resumable-media
$ venv/bin/ipython
In [1]: import io

In [2]: import google.auth

In [3]: import google.auth.transport.requests as tr_requests

In [4]: from google.resumable_media.requests import Download

In [5]: ro_scope = u'https://www.googleapis.com/auth/devstorage.read_only'

In [6]: credentials, _ = google.auth.default(scopes=(ro_scope,))

In [7]: transport = tr_requests.AuthorizedSession(credentials)

In [8]: bucket = '${BUCKET}'

In [9]: blob_name = 'stuff.txt'

In [10]: url_template = (
    ...:     u'https://www.googleapis.com/download/storage/v1/b/'
    ...:     u'{bucket}/o/{blob_name}?alt=media')
    ...: 

In [11]: media_url = url_template.format(
    ...:     bucket=bucket, blob_name=blob_name)
    ...: 

In [12]: stream = io.BytesIO()

In [13]: download = Download(media_url, stream=stream)

In [14]: response = download.consume(transport)
---------------------------------------------------------------------------
DataCorruption                            Traceback (most recent call last)
<ipython-input-14-74c38720271c> in <module>()
----> 1 response = download.consume(transport)

.../venv/lib/python3.6/site-packages/google/resumable_media/requests/download.py in consume(self, transport)
    167 
    168         if self._stream is not None:
--> 169             self._write_to_stream(result)
    170 
    171         return result

.../venv/lib/python3.6/site-packages/google/resumable_media/requests/download.py in _write_to_stream(self, response)
    130             msg = _CHECKSUM_MISMATCH.format(
    131                 self.media_url, expected_md5_hash, actual_md5_hash)
--> 132             raise common.DataCorruption(response, msg)
    133
    134     def consume(self, transport):

DataCorruption: Checksum mismatch while downloading:

  https://www.googleapis.com/download/storage/v1/b/${BUCKET}/o/stuff.txt?alt=media

The X-Goog-Hash header indicated an MD5 checksum of:

  matKHCUjwpcS/fYLXwgV3Q==

but the actual MD5 checksum of the downloaded contents was:

  Eypyl7eiV79Z1QkQqoXh4Q==


In [15]: 

@mfschwartz

Hi Danny - I'm working on a fix now, as well as a system test.

@dhermes
Contributor

dhermes commented Oct 17, 2017

@mfschwartz Thanks.

Relevant / related: https://github.com/GoogleCloudPlatform/google-cloud-python/pull/3380/files was just merged to google-cloud-storage (though has not been released just yet)

Questions for you:

  • Is using 'accept-encoding': 'gzip, deflate' going to be a performance drop?
  • Does the MD5 hash correspond to the decoded data or the gzipped data? (It seems like the gzip-ed data.) We might want to hook into whatever magic requests is using to decode under the hood? (I realize I can check this in part, and will do so right now.)

Somewhat of an answer to my question:

In [14]: response = download.consume(transport)
---------------------------------------------------------------------------
DataCorruption                            Traceback (most recent call last)
<ipython-input-14-74c38720271c> in <module>()
----> 1 response = download.consume(transport)

.../venv/lib/python3.6/site-packages/google/resumable_media/requests/download.py in consume(self, transport)
    167 
    168         if self._stream is not None:
--> 169             self._write_to_stream(result)
    170 
    171         return result

.../venv/lib/python3.6/site-packages/google/resumable_media/requests/download.py in _write_to_stream(self, response)
    130             msg = _CHECKSUM_MISMATCH.format(
    131                 self.media_url, expected_md5_hash, actual_md5_hash)
--> 132             raise common.DataCorruption(response, msg)
    133
    134     def consume(self, transport):

DataCorruption: Checksum mismatch while downloading:

  https://www.googleapis.com/download/storage/v1/b/${BUCKET}/o/stuff.txt?alt=media

The X-Goog-Hash header indicated an MD5 checksum of:

  matKHCUjwpcS/fYLXwgV3Q==

but the actual MD5 checksum of the downloaded contents was:

  Eypyl7eiV79Z1QkQqoXh4Q==


In [15]: stream.getvalue()
Out[15]: b'Stuff\n'

@mfschwartz

mfschwartz commented Oct 17, 2017

  • Using 'accept-encoding': 'gzip, deflate' will boost performance, because the server will send the object gzipped instead of gunzipping it on the fly (so fewer bytes need to be downloaded).
  • The MD5 hash corresponds to the data as stored on the server. So, when the object is stored with content-encoding: gzip, the MD5 covers the gzipped data.
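
Given those two points, one possible approach is to hash the bytes exactly as they arrive and only then decode them for the caller. The sketch below is an illustration under stated assumptions, not the fix that was merged: `parse_md5` and `verify_then_decode` are hypothetical names, the response is simulated with in-memory chunks, and the `crc32c` value in the header is a placeholder. With requests, the still-encoded bytes can typically be read from a streamed response via `response.raw.stream(chunk_size, decode_content=False)`.

```python
import base64
import gzip
import hashlib
import zlib


def parse_md5(x_goog_hash: str) -> str:
    """Extract the base64 MD5 from a value like 'crc32c=...,md5=...'."""
    for item in x_goog_hash.split(","):
        name, _, value = item.strip().partition("=")
        if name == "md5":
            return value
    raise ValueError("no md5 entry in header")


def verify_then_decode(raw_chunks, x_goog_hash: str) -> bytes:
    """Hash the bytes as received, then return the gunzipped content."""
    md5 = hashlib.md5()
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    decoded = []
    for chunk in raw_chunks:
        md5.update(chunk)  # the checksum covers the stored (gzipped) bytes
        decoded.append(decoder.decompress(chunk))
    decoded.append(decoder.flush())
    actual = base64.b64encode(md5.digest()).decode("ascii")
    expected = parse_md5(x_goog_hash)
    if actual != expected:
        raise ValueError("checksum mismatch: %s != %s" % (expected, actual))
    return b"".join(decoded)


# Simulated response: a gzip-encoded body split into small chunks.
body = gzip.compress(b"Stuff\n")
body_md5 = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
header = "crc32c=AAAAAA==,md5=" + body_md5  # crc32c value is a placeholder
chunks = [body[i:i + 8] for i in range(0, len(body), 8)]

print(verify_then_decode(chunks, header))
```

Hashing before decoding keeps the comparison on the same bytes the server hashed, while the caller still receives the decoded text.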

@dhermes
Contributor

dhermes commented Oct 17, 2017

Also relevant: https://stackoverflow.com/a/25811745/1068170.

@zmunro

zmunro commented Jul 26, 2019

This isn't fixed for me with google-resumable-media 0.3.2 :(

@tseaver
Contributor

tseaver commented Jul 26, 2019

@zmunro Can you provide more information? E.g., is it broken for you just for gzipped content?

@yoshi-automation yoshi-automation added 🚨 This issue needs some love. triage me I really want to be triaged. labels Apr 6, 2020
@david-alexander-white

I'm now getting this error today, after months of no problems, on some gzipped files from GCS. The error seems to happen at random: I can do the same thing, and maybe 50% of the time it works and 50% of the time it crashes.
