GCS decompressive transcoding not supported when reading #422
Comments
Thanks for reporting this and providing clear descriptions and solutions! Your possible solution seems like a good start, but how will we be able to handle compression formats other than gzip? Could we just create a transport_param to toggle this? I think we will need to come up with a way to tag compressed files with their appropriate content_type metadata on upload as well.
I misunderstood; other compression formats would not have this problem, as they are not transparently decompressed by Google. I think we can get away with just creating a transport_param for this.
It'd be good if we could avoid adding more options. Can you think of a logical way to handle this? Why don't we just perform all the compression/decompression on our side?
OK, @gdmachado's suggestion is probably what we want to do then.
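For reference, tagging a gzip-compressed upload with the appropriate metadata, as discussed above, might look roughly like this with the `google-cloud-storage` client (a sketch only; the bucket and file names are placeholders, and this is not part of smart_open's API):

```python
# Rough sketch: compress a file locally and upload it with the metadata
# GCS needs for decompressive transcoding (Content-Encoding: gzip).
import gzip
import shutil

from google.cloud import storage

with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

blob = storage.Client().bucket("my-bucket").blob("data.csv")
blob.content_encoding = "gzip"  # stored bytes are gzipped; GCS transcodes on download
blob.upload_from_filename("data.csv.gz", content_type="text/csv")
```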
I can confirm this problem, though I noticed it in a different manner. I tested with a file similar to the one above, but used `transport_params=dict(buffer_size=1024)` to force it to stream in parts. I have four cases to share with you, testing with GCS files named file.txt and file.txt.gz, and ignore_ext=True and False. Data was uploaded to GCS.
I modified the smart_open code to add `raw_download=True` to the `download_as_string()` call, and here are the results.
Hope this helps y'all to find a good solution. Included below is the quick and dirty script I used to demo this.
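A minimal sketch of that kind of test (an approximation, not the original script; it assumes a bucket named `my-bucket` and already-configured credentials):

```python
# Approximate reconstruction: read each file with a small buffer_size so the
# object is streamed in parts, and compare how many bytes come back.
from smart_open import open as smart_open_open

CASES = [
    ("gs://my-bucket/file.txt", True),
    ("gs://my-bucket/file.txt", False),
    ("gs://my-bucket/file.txt.gz", True),
    ("gs://my-bucket/file.txt.gz", False),
]

for url, ignore_ext in CASES:
    with smart_open_open(url, "rb", ignore_ext=ignore_ext,
                         transport_params=dict(buffer_size=1024)) as fin:
        data = fin.read()
    print(url, "ignore_ext=%r" % ignore_ext, "-> %d bytes" % len(data))
```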
Problem description
Using `smart_open` (unreleased version with GCS support) to download files from GCS with transparent decompressive transcoding enabled may lead to incomplete files being downloaded, depending on the compressed file size.

With Google Cloud Storage there is the option to store gzip-compressed files and use decompressive transcoding to transparently decompress them when downloading; decompression is then handled by Google's servers. In this case the filename wouldn't have any compression extension (e.g. `file.csv`), however when inspecting its metadata it would contain a `Content-Encoding: gzip` entry.

This would be fine if it weren't for the fact that in such cases `Blob()._size` will return the compressed size. Since `smart_open` uses this to understand when to stop reading, it results in incomplete files.

Steps/code to reproduce the problem
write ~400KB file (larger than smart_open's default buffer size)
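One possible way to produce such a file (the command actually used is not shown in the report; `rand.txt` matches the name used in the next step):

```python
# Hypothetical: write ~400 KB of compressible random text to rand.txt.
import random
import string

with open("rand.txt", "w") as f:
    for _ in range(5000):
        f.write("".join(random.choices(string.ascii_letters + " ", k=80)) + "\n")
```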
upload file to GCS
```
$ gsutil cp -Z ./rand.txt gs://my-bucket/
Copying file://./rand.txt [Content-Type=text/plain]...
- [1 files][293.8 KiB/293.8 KiB]
Operation completed over 1 objects/293.8 KiB.
```
resulting (compressed) file is 293.8 KiB.
check file metadata
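One way to do this with the `google-cloud-storage` client (a sketch, using the same bucket and object names as above):

```python
# Sketch: inspect the uploaded object's metadata. With `gsutil cp -Z`,
# content_encoding is gzip and blob.size is the *compressed* size.
from google.cloud import storage

blob = storage.Client().bucket("my-bucket").get_blob("rand.txt")
print(blob.content_type)      # text/plain
print(blob.content_encoding)  # gzip
print(blob.size)              # compressed size, e.g. 300842
```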
download file using `smart_open` (gcloud credentials already set)

check resulting file size
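A sketch covering these last two steps (the exact code from the report is not reproduced; it assumes the unreleased GCS support is installed):

```python
# Sketch: download via smart_open, write it out locally, and check the size.
import os

from smart_open import open as smart_open_open

with smart_open_open("gs://my-bucket/rand.txt", "rb") as fin, \
        open("rand-downloaded.txt", "wb") as fout:
    fout.write(fin.read())

print(os.path.getsize("rand-downloaded.txt"))  # ~348KB here instead of the original 400KB
```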
original file is 400KB, however the downloaded file is 348KB. Not sure why it's still bigger than the 300842 reported by Google, though.

Versions
Please provide the output of:

`smart_open` has been pinned to 72818ca, installed with

Checklist
Before you create the issue, please make sure you have:
Possible solutions
Setting `buffer_size` to a value larger than the compressed file size will of course download it in its entirety, but for large files that would mean loading the entire file into memory.

A reasonable option would be to check `Blob().content_type`, and if it is equal to `'gzip'`, call `Blob().download_as_string` with `raw_download=True`, and then somehow handle decompression internally with the already-existing decompression mechanisms.

If the maintainers agree this would be a viable solution, I'll be happy to provide a PR implementing it.