Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(writer): retry storage.put on temporary network errors #2207

Merged
merged 3 commits into from
Mar 1, 2024

Conversation

qinix
Copy link
Contributor

@qinix qinix commented Feb 24, 2024

Description

Read-only operations are retried by [ObjectStore] internally. However, PUT/DELETE operations
are not retried even thought they are technically idempotent. [ObjectStore] does not retry
those operations because having preconditions may produce different results for the same
request. PUT/DELETE operations without preconditions are idempotent and can be retried.
Unfortunately, [ObjectStore]'s retry mechanism only works on HTTP request level, thus there
is no way to distinguish whether a request has preconditions or not.

This PR provides additional methods for working with [ObjectStore] that automatically retry
unconditional operations when they fail, and use these methods in writer to retry on temporary
network errors.

Related Issue(s)

See

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Feb 24, 2024
@rtyler rtyler self-assigned this Feb 24, 2024
@rtyler rtyler added this to the Rust v0.18 milestone Feb 24, 2024
@rtyler
Copy link
Member

rtyler commented Feb 24, 2024

The code largely looks good to me @qinix. I have reviewed what object_store::Error has , and since this is only retrying on the Generic variant, I'm not sure what types of failure conditions you're seeing in the real world?

@qinix
Copy link
Contributor Author

qinix commented Feb 25, 2024

I encountered the following error (reformatted for readability):

called `Result::unwrap()` on an `Err` value: ObjectStore { 
    source: Generic { 
        store: "S3", 
        source: Reqwest { 
            retries: 0, 
            max_retries: 10, 
            elapsed: 30.000858021s, 
            retry_timeout: 180s, 
            source: reqwest::Error {
                kind: Request,
                url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("REDACTED")), port: None, path: "/table/REDACTED.zstd.parquet", query: None, fragment: None },
                source: TimedOut
            }
        }
    }
}

Inside the Generic error it was object_store::client::retry::Error. I can't downcast to this concrete type as it's not publicly exported (the mod object_store::client is private).

I dislike the way of retrying Generic error, however it's impossible to know the concrete error without modifying object_store to export the concrete error or doing the err.to_string().contains(...) trick.

Besides of Generic error, other variants are returned from server side in normal logic, thus those should not be retried.

@rtyler rtyler enabled auto-merge (rebase) March 1, 2024 23:04
@rtyler rtyler merged commit 3d13238 into delta-io:main Mar 1, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants