gsutil cp -Z always force-adds Cache-Control: no-transform and Content-Encoding: gzip. Breaks HTTP protocol #480
Comments
@houglum ^
This behavior (ignoring the Accept-Encoding header) is documented here: https://cloud.google.com/storage/docs/transcoding
...although it also seems like it would be helpful for us to mention this (along with the fact that we apply the no-transform Cache-Control directive) in the docs for the cp command's -z/-Z options.
Basically what I am saying is, there should be a way to turn off the no-transform Cache-Control behavior.
If we remove no-transform, GCS may transcode the object on download and serve bytes that don't match the stored MD5, which breaks integrity checking. So if we add such an option to drop no-transform, users have to accept that trade-off. To put it differently, I cannot see a way to author a fully compatible solution with GCS's current behavior. As a workaround, you can remove no-transform from the object's Cache-Control metadata with setmeta after the upload.
> GCS may remove a layer of compression prior to sending the object *even when Accept-Encoding: gzip is provided*.

GCS may remove? I'm not sure I fully comprehend. Isn't this behavior deterministic? If I upload gzipped and ask for gzipped, GCS should always give me gzipped, right?
It seems you are adding a no-transform header to the object as a marker that the object was uploaded gzipped, when what you really want is a header that communicates that the MD5 hash is that of the compressed content and not the actual content, e.g. a header like 'store-format: gzip' vs 'store-format: raw'. no-transform seems to be a hack that gets the job done, but with serious side effects.
Anyway, we'll try to work around it with a setmeta call that removes the no-transform Cache-Control directive after doing the uploads.
I do hope that you won't close this as "won't fix" but work towards a better abstraction.
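A minimal sketch of that workaround, assuming a hypothetical bucket and asset paths (not from this thread): upload with gzip compression, then overwrite the Cache-Control value that gsutil set so no-transform is gone.

```sh
gsutil cp -z js,css,html ./assets/* gs://my-bucket/assets/

# Replaces the whole Cache-Control value (including the no-transform that
# gsutil cp -z added), re-enabling decompressive transcoding for clients
# that don't send Accept-Encoding: gzip.
gsutil setmeta -h "Cache-Control:public, max-age=3600" "gs://my-bucket/assets/*"
```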
Take a look at the second paragraph of Using gzip on compressed objects; if you upload gzipped and ask for gzipped, and GCS considers the content type to be incompressible, it will remove the encoding regardless of your request. Then it will serve bytes that will not match the MD5 stored in the object's metadata. I think there is a core issue with the service in that GCS does not publish the content types that it considers to be incompressible; as such, that list is also subject to change. I agree there are serious side effects to using no-transform as an approach; we decided on it as a compromise.
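For illustration, a hedged way to observe the mismatch described above (the bucket, object name, and public readability are assumptions, not from this thread): the MD5 that GCS stores is computed over the gzipped bytes, so hashing what a client actually receives after transcoding will not match it.

```sh
# Stored hash (base64 MD5 of the compressed upload), as reported by gsutil.
gsutil ls -L gs://my-bucket/app.js | grep "Hash (md5)"

# Hash of the bytes a gzip-capable client actually receives; if GCS stripped
# the encoding in transit, this will differ from the stored hash.
curl -s -H "Accept-Encoding: gzip" \
  "https://storage.googleapis.com/my-bucket/app.js" \
  | openssl dgst -md5 -binary | base64
```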
So you're essentially saying the behavior depends on an unpublished list of content types that GCS considers incompressible. Do you know where I can file an issue for the root cause? This seems like bad design on so many levels. I would expect GCS to just be a dumb store of bytes and follow the Content-Encoding: gzip HTTP spec.
Just to be clear: GCS will never touch the stored data; this is exclusively about the encoding when sending it over the wire.
@thobrla For my use case, the behavior of always getting gzipped content when the object's Cache-Control is set to 'no-transform' works fine for optimized web serving. However, I did some extra tinkering and removed "no-transform" from Cache-Control as you suggested, but that causes another issue. The expected behavior should be that the server respects the "Accept-Encoding" header: if 'gzip' is included, it should return gzipped content (no decompressive transcoding, serve the file as stored); if no "Accept-Encoding: gzip" request header is included, it should do decompressive transcoding as documented (right?); and in both cases it should respond with the corresponding "Content-Encoding" header. BUT it appears that GCS ignores the "Accept-Encoding" request header and always does decompressive transcoding.
For example, take a file stored in GCS with Content-Encoding: gzip and with no-transform removed from its Cache-Control. When requesting the file with the request header 'Accept-Encoding: gzip', the server doesn't respond with a "Content-Encoding: gzip" header and the image is NOT compressed/gzipped; it forces decompressive transcoding incorrectly. Notice the header 'Warning: 214 UploadServer gunzipped', which I suppose is how Google informs clients that it actually did the decompression.
With -H "Accept-Encoding: gzip":
Without -H "Accept-Encoding: gzip":
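A sketch of that comparison, with an invented object URL and only the interesting headers filtered out; the object here is assumed to be publicly readable and to have had no-transform removed.

```sh
# Request advertising gzip support: per the report, the body still arrives
# gunzipped and a "Warning: 214 UploadServer gunzipped" header shows up.
curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" \
  "https://storage.googleapis.com/my-bucket/image.jpg" | grep -iE 'content-encoding|warning'

# Request without Accept-Encoding: decompressive transcoding is expected here.
curl -s -D - -o /dev/null \
  "https://storage.googleapis.com/my-bucket/image.jpg" | grep -iE 'content-encoding|warning'
```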
Thanks for the detailed reproduction of the issue. I'm discussing with the GCS team internally and we may be able to stop the service from removing a layer of compression even when Accept-Encoding: gzip is provided. I'll let you know when I have more details.
Thanks @thobrla. Really appreciate this getting fixed at the root cause.
awesome @thobrla! thanks!
@thobrla any updates on this?
Update: the work to stop the GCS service from unnecessarily removing a layer of compression is understood, but it is a larger effort than the GCS team originally thought. Part of that work is complete, but finishing the remainder isn't on the team's priorities for the near future. Until that changes, we'll have to live with this behavior in clients. I think the Cache-Control behavior of gsutil is the best default (given that it can be disabled with a subsequent setmeta call). Leaving this issue open to track fixing this if the GCS service implements the fix.
Not sure if the GCS team also has a GitHub repo or a public bug tracker. If possible, it would be nice to have a link to the underlying tracking bug.
Thanks for the follow-up though @thobrla.
When I remove the header with 'setmeta' by using only …
Hi, I have discovered this bug report by way of googleapis/google-cloud-python#4227 and googleapis/google-resumable-media-python#34. Retrieving a blob with the Python library fails because the bytes it receives don't match the blob's stored checksum.
See that the object's metadata has contentEncoding: gzip and cacheControl: no-transform:
{
"kind": "storage#object",
"contentType": "image/jpeg",
"name": "binary/00da00d2ddc203a245753a8c1276c0d398341abd",
"timeCreated": "2017-12-18T09:09:33.635Z",
"generation": "1513588173639529",
"md5Hash": "Lbe8pGpkq2fctqveModTlw==",
"bucket": "xxx",
"updated": "2017-12-18T09:09:33.635Z",
"contentEncoding": "gzip",
"crc32c": "fkoHfw==",
"metageneration": "1",
"mediaLink": "https://www.googleapis.com/download/storage/v1/b/xxx/o/binary%2F00da00d2ddc203a245753a8c1276c0d398341abd?generation=1513588173639529&alt=media",
"storageClass": "REGIONAL",
"timeStorageClassUpdated": "2017-12-18T09:09:33.635Z",
"cacheControl": "no-transform",
"etag": "COmetKubk9gCEAE=",
"id": "xxx/binary/00da00d2ddc203a245753a8c1276c0d398341abd/1513588173639529",
"selfLink": "https://www.googleapis.com/storage/v1/b/xxx/o/binary%2F00da00d2ddc203a245753a8c1276c0d398341abd",
"size": "368849"
}
And, sure enough, retrieving the blob using the public URL works as expected according to the documentation:
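(A sketch of that check; "xxx" is the placeholder bucket name from the metadata above, and the object is assumed to be publicly readable. With Cache-Control: no-transform, the documented behavior is to serve the object as stored, i.e. still gzipped, regardless of Accept-Encoding.)

```sh
# Expect Content-Encoding: gzip and roughly the stored size of 368849 bytes.
curl -s -D - -o body.gz \
  "https://storage.googleapis.com/xxx/binary/00da00d2ddc203a245753a8c1276c0d398341abd" \
  | grep -iE 'content-encoding|content-length'
wc -c body.gz
```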
Out of curiosity, I tried playing with the request headers. So is there some black magic involved?
Black magic seems to be the appropriate term, because it looks like the content type of the uploaded blob has an effect on this unexpected decompression.
Could someone from Google comment on that? Thanks.
@dalbani : see the documentation at https://cloud.google.com/storage/docs/transcoding#gzip-gzip on Google Cloud Storage's current behavior regarding compressible content types. Per my comments above, the work to stop GCS from removing a layer of compression isn't currently prioritized.
I believe it's not about "removing a layer of compression". It's about making compression deterministic so it behaves well with web clients.
@thobrla: thanks for your response, but I have already had a look at this documentation.
@dalbani Thanks for the report. I can't reproduce this issue, though - I tried out your scenario with a Content-Type: image/jpeg, Content-Encoding: gzip, Cache-Control: no-transform object and did not see an unzipped response. Can you construct a reproducible example?
I am seeing buggy behaviour too, where setting Cache-Control with setmeta overrides the gzipping functionality.
After running setmeta, the content is no longer served gzipped as it was before.
So @thobrla it seems your recommendation of removing no-transform via setmeta has problems of its own.
@thobrla TL;DR: the so-called "UploadServer" treats some content types differently than others when downloading resources. For example, let's say I run the script with an empty, gzip'ed 32x32 JPEG file:
If I run my script against that file, uploaded with Content-Type: image/jpeg, here is what I get:
See that both requests return gunzipped content, even the one that sent Accept-Encoding: gzip. This transparent gunzip causes problems with, for example, the Python library, which sends an Accept-Encoding: gzip header and then fails checksum validation because the bytes it receives no longer match the stored MD5.
Now, let's compare with the output of the same command but with a different content type:
Summary: the responses are 165, 137 (not 165 as above!) and 137 bytes respectively. And here the Python library has no problem downloading the blob and checking the MD5 checksum. Although very long, I hope this post is clear enough so you can pinpoint and eventually fix the issue.
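A hedged re-creation of that comparison (the file name, bucket, second content type, and public readability are all assumptions): upload the same gzipped payload twice with different Content-Type values and compare what a gzip-capable client receives.

```sh
gzip -k tiny.jpg   # produces tiny.jpg.gz, the payload used for both uploads

# Same bytes, two content types.
gsutil -h "Content-Encoding:gzip" -h "Content-Type:image/jpeg" \
  cp tiny.jpg.gz gs://my-bucket/as-jpeg
gsutil -h "Content-Encoding:gzip" -h "Content-Type:application/octet-stream" \
  cp tiny.jpg.gz gs://my-bucket/as-octet-stream

# Per the report, the image/jpeg object comes back gunzipped (larger byte
# count) while the other one is served as stored.
for obj in as-jpeg as-octet-stream; do
  curl -s -H "Accept-Encoding: gzip" \
    "https://storage.googleapis.com/my-bucket/$obj" | wc -c
done
```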
Thanks @dalbani - I think at this point the problem is well understood and we're waiting for the Cloud Storage team to prioritize a fix (but to my knowledge it's not currently prioritized).
The Object Transcoding documentation and the gsutil cp documentation should both mention this behavior.
I would go further than @yonran and say the documentation should definitely be modified: this is a really frustrating omission. Also, for publishing static web assets, it's really frustrating to have no flag/option to disable this behaviour: I have no need to ever download these again with gsutil, so the checksum thing isn't an issue - I just want to gzip them on the way up and have GCS then serve them to clients in line with the documentation...
Closing this issue - documentation is now updated on cloud.google.com, and I'm backfilling the source files here in GitHub to match.
@starsandskies great that the docs have been updated - thanks for that. I'm not sure it's valid to close this issue though. When uploading more than a few files - e.g. for web assets / static sites - it is extremely inefficient to have to run a second setmeta command after every upload just to strip no-transform. That adds a fair time overhead, and more importantly creates the risk of objects existing in the bucket with an inconsistent / unexpected state if the second command fails partway through. If we specify our own Cache-Control on the upload, gsutil still forces no-transform onto it. FWIW although the docs are now clearer, I think it's still not obvious that at present the -z/-Z options force no-transform with no way to opt out.
I definitely think there are improvements to the tool that could be made (and, FWIW, the push to fix the underlying behavior that necessitates the -z behavior had renewed interest at the end of 2020). I've no objection to re-opening this (I assume you're able to, though let me know if not - I'm not a GitHub expert by any stretch of the imagination), but this thread has gotten quite long and meander-y. I'd recommend taking the relevant points and making a fresh issue that cuts away the excess.
@starsandskies thanks for the response - no, I can't reopen; only core contributors/admins can reopen on GitHub. I couldn't see a branch / pull request relevant to the documentation backfill you mentioned. I'm happy to make a new issue, though I think the issue description and first couple of comments here (e.g. #480 (comment)) capture the problem, and IMO there's an advantage to keeping this issue alive since there are already people watching it. But if you'd prefer a new issue I'll open one and reference this.
Ah, in that case, I'll reopen this. To answer your question, my understanding is that what blocks a true fix is on the server side and that it affects other tools as well, such as the client libraries (see, for example, googleapis/nodejs-storage#709).
@starsandskies thanks :) Yes, I see the problem on that nodejs-storage issue. I think though it breaks into two use cases:
1. Uploading objects that will be downloaded again with gsutil or the client libraries, where integrity checking against the stored hashes has to keep working.
2. Uploading gzipped static web assets that GCS then serves directly to browsers, where the normal Accept-Encoding / Content-Encoding semantics should apply.
AFAICS the second use case was working without any problems, until the gsutil behaviour was changed to fix the first one. The key thing is that it's obviously still valid to have gzipped files in the bucket with decompressive transcoding enabled - nothing stops you setting your own Cache-Control header after the initial upload. And that obviously fixes use case 2 but breaks use case 1. That being the case, I don't think there's any good reason why gsutil should silently prevent you from doing that in a single call, even if you want to keep the default behaviour as it is now.
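For concreteness, a sketch of the "single call" being asked for here (bucket and paths are invented): supplying an explicit Cache-Control alongside -z, which - per this issue - gsutil currently still overrides with no-transform, forcing the separate setmeta pass.

```sh
# What the commenter would like to work in one step: gzip on upload AND keep
# the caller's Cache-Control, without no-transform being forced in.
gsutil -h "Cache-Control:public, max-age=3600" \
  cp -z html,css,js ./public/* gs://my-bucket/site/
```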
Since we just stumbled upon this issue when trying to move to GCP/GCS for our CDN assets, and this thread was very helpful in figuring out why, I wanted to leave a piece of feedback from the user side for this topic. There are many responses (I assume from maintainers/developers at GCP/gsutil) that suggest that adding the no-transform Cache-Control directive by default is the right behaviour.
I just want to say that, as a user of GCP, I harshly disagree with that assessment. From my perspective, the actual state of things is that -z / -Z silently break serving of gzipped content to ordinary web clients. For an outside user of GCP, it seems like the respective team isn't interested in fixing the bug, so it's hiding the issue behind a different and harder-to-notice issue, just so it's technically not broken on their end per their definition. As a user of GCP, I don't care whether the fault lies in gsutil or in the GCS service. To be clear: the default behaviour of -z / -Z surprised us and cost us real debugging time. In my personal view, the default should change, or at the very least there should be an option to opt out.
Rolled back the change, so we will need to follow up again when we have an update. Apologies, I jinxed it.
A curl client should be able to receive the unzipped version, but GCS always returns Content-Encoding: gzip. This breaks the HTTP/1.1 protocol, since the client didn't send an
"Accept-Encoding: gzip, deflate, br"
header. Seems to be happening here:
gsutil/gslib/copy_helper.py
Lines 1741 to 1759 in e8154ba
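A hedged illustration of the reported behaviour (object URL invented, object assumed public and uploaded with gsutil cp -Z): curl sends no Accept-Encoding header by default, yet the response reportedly still carries Content-Encoding: gzip because of the forced no-transform.

```sh
# No Accept-Encoding is sent here; per HTTP/1.1 the body should arrive
# identity-encoded, but the report is that gzip comes back anyway.
curl -s -D - -o /dev/null \
  "https://storage.googleapis.com/my-bucket/index.html" | grep -i content-encoding
```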