[C++][Python] S3FileSystem write_table can lose data #30078
Comments
Antoine Pitrou / @pitrou: As you can see, either the API returns an error (and we propagate it to the caller), or it returns a |
Antoine Pitrou / @pitrou: Yikes. |
Antoine Pitrou / @pitrou: https://github.com/aws/aws-sdk-cpp/blob/main/aws-cpp-sdk-s3/source/S3Client.cpp#L250-L275
We have seen odd behavior on very rare occasions when writing a Parquet table to S3 using the S3FileSystem (from pyarrow.fs import S3FileSystem). Even though the application returns without errors, data is missing from the bucket. Internally, the write appears to be done as an S3 multipart upload, and a special error condition in which S3 returns a 200 is not being handled. Per the AWS docs, CompleteMultipartUpload (which is being called) can return a 200 response with an InternalError payload, and that response needs to be treated as a 5XX. This does not appear to happen in pyarrow; the response is treated as a success, so the caller believes the data was uploaded when it was not.
Doing an S3 list-parts call for the upload that received the InternalError response shows the parts are still there and the upload was never completed.
From our S3 access logs, sanitized for security:
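To make the failure mode concrete, here is a minimal sketch of the check the AWS docs call for: parse the body of a CompleteMultipartUpload response even when the status is 200, and treat an `<Error>` root element as a failure. This is not the Arrow or AWS SDK code; `S3CompleteUploadError` and `check_complete_multipart_response` are hypothetical names, and the sketch assumes the raw status code and XML body are available to the caller.

```python
import xml.etree.ElementTree as ET


class S3CompleteUploadError(Exception):
    """Hypothetical exception for a failed CompleteMultipartUpload call."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def _local_name(tag):
    # Drop an optional "{namespace}" prefix from an ElementTree tag.
    return tag.rsplit("}", 1)[-1]


def check_complete_multipart_response(status_code, body):
    """Raise unless the response is a genuine success (hypothetical helper)."""
    if status_code >= 300:
        raise S3CompleteUploadError(f"HTTP{status_code}", "request failed")
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        # Assumption: an unparseable 200 body is treated as a failure too.
        raise S3CompleteUploadError("MalformedResponse", str(exc)) from exc
    if _local_name(root.tag) == "Error":
        # A 200 whose body is an <Error> document must be handled like a 5XX.
        fields = {_local_name(child.tag): (child.text or "") for child in root}
        raise S3CompleteUploadError(
            fields.get("Code", "Unknown"), fields.get("Message", "")
        )
```

A caller would normally retry the completion request when the raised code is `InternalError`, since the uploaded parts remain on the server until the upload is completed or aborted.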
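For anyone wanting to look for affected objects, a rough sketch of a related check follows. It uses a different call from the list-parts check above: it enumerates multipart uploads that were started but never completed under a prefix. The `client` argument is assumed to be a boto3 S3 client (or anything exposing a compatible `list_multipart_uploads` method); `find_incomplete_uploads` is a hypothetical helper name, and the bucket and prefix are placeholders.

```python
def find_incomplete_uploads(client, bucket, prefix=""):
    """Return (key, upload_id) pairs for multipart uploads still in progress.

    `client` is assumed to behave like a boto3 S3 client.
    """
    found = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_multipart_uploads(**kwargs)
        for upload in page.get("Uploads", []):
            found.append((upload["Key"], upload["UploadId"]))
        if not page.get("IsTruncated"):
            return found
        # Continue from where the previous page stopped.
        kwargs["KeyMarker"] = page["NextKeyMarker"]
        kwargs["UploadIdMarker"] = page["NextUploadIdMarker"]
```

An upload that still shows up here long after the writer returned successfully is a candidate for the silent failure described in this report.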
|operation|key|requesturi_operation|requesturi_key|requesturi_httpprotoversion|httpstatus|errorcode|
|-|-|-|-|-|-|-|
|REST.PUT.PART|-SNAPPY.parquet|PUT|/-SNAPPY.parquet?partNumber=1&uploadId=|HTTP/1.1|200|-|
|REST.POST.UPLOAD|-SNAPPY.parquet|POST|/-SNAPPY.parquet?uploadId=|HTTP/1.1|200|InternalError|
|REST.POST.UPLOADS|-SNAPPY.parquet|POST|/-SNAPPY.parquet?uploads|HTTP/1.1|200|-|
Reporter: Mark Seitter
Assignee: Antoine Pitrou / @pitrou
PRs and other links:
Note: This issue was originally created as ARROW-14523. Please see the migration documentation for further details.