[C++][Python] S3FileSystem write_table can lose data #30078

Closed
asfimport opened this issue Oct 29, 2021 · 10 comments

@asfimport

We have seen odd behavior on very rare occasions when writing a Parquet table to S3 using the S3FileSystem (from pyarrow.fs import S3FileSystem). Even though the application returns without errors, data is missing from the bucket. Internally it is doing an S3 multipart upload, but it is not handling a special error condition: per the AWS docs, CompleteMultipartUpload (which is being called) can return a 200 response whose body contains an InternalError payload, and that response needs to be treated like a 5XX. It appears this isn't happening with pyarrow; instead the call is treated as a success, which causes the caller to think their data was uploaded when it actually was not.

Doing an s3 list-parts call for the InternalError request shows the parts are still there and the upload was never completed.

From our S3 access logs, sanitized for security:

| operation | key | requesturi_operation | requesturi_key | requesturi_httpprotoversion | httpstatus | errorcode |
|-|-|-|-|-|-|-|
| REST.PUT.PART | -SNAPPY.parquet | PUT | /-SNAPPY.parquet?partNumber=1&uploadId= | HTTP/1.1 | 200 | - |
| REST.POST.UPLOAD | -SNAPPY.parquet | POST | /-SNAPPY.parquet?uploadId= | HTTP/1.1 | 200 | InternalError |
| REST.POST.UPLOADS | -SNAPPY.parquet | POST | /-SNAPPY.parquet?uploads | HTTP/1.1 | 200 | - |

Reporter: Mark Seitter
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

Note: This issue was originally created as ARROW-14523. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
How is this supposed to be done with the AWS C++ API? We're using this API:

https://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_s3_client.html#ac4b5211a2b2e499d8094e4f13bb3c29c

As you can see, either the API returns an error (and we propagate it to the caller), or it returns a CompleteMultipartUploadResult which doesn't have any attribute to check for errors in the response body.
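
For illustration, here is a rough sketch of that call pattern with the AWS C++ SDK (the helper name and wiring are made up for this example, not Arrow's actual code); the comments mark where the body-level error gets lost:

```cpp
#include <aws/s3/S3Client.h>
#include <aws/s3/model/CompleteMultipartUploadRequest.h>
#include <aws/s3/model/CompletedMultipartUpload.h>

// Illustrative helper: finish a multipart upload and report success/failure.
bool FinishUpload(const Aws::S3::S3Client& client, const Aws::String& bucket,
                  const Aws::String& key, const Aws::String& upload_id,
                  const Aws::S3::Model::CompletedMultipartUpload& parts) {
  Aws::S3::Model::CompleteMultipartUploadRequest req;
  req.SetBucket(bucket);
  req.SetKey(key);
  req.SetUploadId(upload_id);
  req.SetMultipartUpload(parts);

  auto outcome = client.CompleteMultipartUpload(req);
  if (!outcome.IsSuccess()) {
    // Normal error path: the SDK reports a failure that can be propagated.
    return false;
  }
  // Problem: a "200 OK" whose body is actually an <Error>InternalError</Error>
  // document also lands here, and CompleteMultipartUploadResult exposes no
  // accessor for that payload, so the failure is silently treated as success.
  return true;
}
```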

@asfimport

Antoine Pitrou / @pitrou:
And it seems to be a bug in the AWS C++ SDK:

aws/aws-sdk-cpp#658

Yikes.

@asfimport

Antoine Pitrou / @pitrou:
Workaround will probably be to reimplement CompleteMultipartUpload in our own S3Client subclass:

https://github.com/aws/aws-sdk-cpp/blob/main/aws-cpp-sdk-s3/source/S3Client.cpp#L250-L275
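
As an illustration only (this is not Arrow's actual fix, which goes in the direction described above of reimplementing the call so the raw response body can be inspected), one cruder defensive shape would be to double-check a nominally successful CompleteMultipartUpload with a HeadObject call; the helper below is a sketch under that assumption:

```cpp
#include <aws/s3/S3Client.h>
#include <aws/s3/model/CompleteMultipartUploadRequest.h>
#include <aws/s3/model/HeadObjectRequest.h>

// Illustrative defensive wrapper (NOT the actual Arrow workaround): after a
// nominally successful CompleteMultipartUpload, confirm the assembled object
// is actually visible before reporting success to the caller.
bool CompleteAndVerify(const Aws::S3::S3Client& client,
                       const Aws::S3::Model::CompleteMultipartUploadRequest& req) {
  auto outcome = client.CompleteMultipartUpload(req);
  if (!outcome.IsSuccess()) {
    return false;  // genuine error surfaced by the SDK
  }
  // Guard against the "200 OK with InternalError body" case: if the upload
  // really completed, HeadObject on the same bucket/key should succeed.
  Aws::S3::Model::HeadObjectRequest head;
  head.SetBucket(req.GetBucket());
  head.SetKey(req.GetKey());
  return client.HeadObject(head).IsSuccess();
}
```

Note this heuristic can be fooled if an older object already exists at the same key, which is why inspecting the response payload itself is the more robust direction.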

 

@asfimport

Mark Seitter:
@pitrou Sorry, I wasn't sure exactly how those calls are made (I will admit I'm new in this space). Nice find on the SDK bug, but like you said, yikes! They have never resolved it in the C++ SDK. Out of curiosity, is there a reason pyarrow always uses multipart uploads instead of PutObject? It seems wasteful to upload via multipart if you only have one part (i.e. the object size is small), since it makes 3 API calls instead of just one (more $$$ and latency too). I understand using multipart for very large files to break them into parts, but analyzing all our calls it seems we never have more than one part number on an upload.
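
For context, a rough sketch of the three round trips behind even a one-part multipart upload with the AWS C++ SDK (function and variable names are illustrative); they map onto the REST.POST.UPLOADS, REST.PUT.PART and REST.POST.UPLOAD lines in the access log above:

```cpp
#include <aws/s3/S3Client.h>
#include <aws/s3/model/CreateMultipartUploadRequest.h>
#include <aws/s3/model/UploadPartRequest.h>
#include <aws/s3/model/CompleteMultipartUploadRequest.h>
#include <aws/s3/model/CompletedMultipartUpload.h>
#include <aws/s3/model/CompletedPart.h>
#include <aws/core/utils/memory/stl/AWSStringStream.h>

// Illustrative: the three S3 calls behind a single-part multipart upload.
bool MultipartPutSmallObject(const Aws::S3::S3Client& client,
                             const Aws::String& bucket, const Aws::String& key,
                             const Aws::String& data) {
  // 1. POST ?uploads  (REST.POST.UPLOADS)
  Aws::S3::Model::CreateMultipartUploadRequest create;
  create.SetBucket(bucket);
  create.SetKey(key);
  auto created = client.CreateMultipartUpload(create);
  if (!created.IsSuccess()) return false;
  const auto upload_id = created.GetResult().GetUploadId();

  // 2. PUT ?partNumber=1  (REST.PUT.PART)
  Aws::S3::Model::UploadPartRequest part;
  part.SetBucket(bucket);
  part.SetKey(key);
  part.SetUploadId(upload_id);
  part.SetPartNumber(1);
  auto body = Aws::MakeShared<Aws::StringStream>("part");
  *body << data;
  part.SetBody(body);
  auto uploaded = client.UploadPart(part);
  if (!uploaded.IsSuccess()) return false;

  // 3. POST ?uploadId=...  (REST.POST.UPLOAD, the call that can "succeed"
  //    with an InternalError body)
  Aws::S3::Model::CompletedPart completed_part;
  completed_part.SetPartNumber(1);
  completed_part.SetETag(uploaded.GetResult().GetETag());
  Aws::S3::Model::CompletedMultipartUpload completed;
  completed.AddParts(completed_part);
  Aws::S3::Model::CompleteMultipartUploadRequest finish;
  finish.SetBucket(bucket);
  finish.SetKey(key);
  finish.SetUploadId(upload_id);
  finish.SetMultipartUpload(completed);
  return client.CompleteMultipartUpload(finish).IsSuccess();
}
```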

@asfimport

Antoine Pitrou / @pitrou:
The filesystem API doesn't know up front what the total file size will be, which is why it always uses multipart upload. Perhaps we could have a one-shot file write API that would use another API for shorter uploads. That said, that API probably wouldn't be used by the Parquet writer...
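
For comparison, a one-shot upload would be a single PutObject call; a minimal sketch with the AWS C++ SDK (illustrative only, not an existing Arrow API):

```cpp
#include <aws/s3/S3Client.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <aws/core/utils/memory/stl/AWSStringStream.h>

// Illustrative: uploading a small object in one round trip with PutObject
// (only possible when the whole payload is known up front and under 5 GB).
bool PutSmallObject(const Aws::S3::S3Client& client, const Aws::String& bucket,
                    const Aws::String& key, const Aws::String& data) {
  Aws::S3::Model::PutObjectRequest req;
  req.SetBucket(bucket);
  req.SetKey(key);
  auto body = Aws::MakeShared<Aws::StringStream>("put");
  *body << data;
  req.SetBody(body);
  return client.PutObject(req).IsSuccess();
}
```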

@asfimport

Mark Seitter:
Ahh ok, that makes sense if you don't know the file size. Agreed; I think the only thing you could maybe do is add a flag/option that tells the filesystem API to use a single-shot upload, which a user could pass in if they know the file is less than 5 GB (the max a single PUT can do). That would let someone like us, with a max file size of 30 MB, reduce our cost and improve performance by skipping the multipart calls (and bypass this current bug too :) ). But ultimately I'm very surprised AWS hasn't fixed this bug in their SDK yet.

@asfimport

Antoine Pitrou / @pitrou:
Okay, I think a slightly gory workaround will be possible. Will continue tomorrow.

@asfimport

Antoine Pitrou / @pitrou:
Which Python version do you use? Are you using binary wheel or conda packages? We may be able to produce a test package for you with the candidate fix.

@asfimport

Mark Seitter:
@pitrou We are using Python 3.9 and we install via pip today. Unfortunately I'm not entirely sure what we can test on our side, since this is an edge case we have not yet been able to reproduce on demand. We are in talks with AWS about whether there is a way to provoke this issue for testing across our code (we use multipart in many other places as well).

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 11594
#11594

@asfimport asfimport added this to the 6.0.1 milestone Jan 11, 2023