-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable retries on wider classes of errors in object_store #5383
Conversation
@@ -263,7 +262,7 @@ impl RetryExt for reqwest::RequestBuilder { | |||
do_retry = true | |||
} else if let Some(source) = e.source() { | |||
if let Some(e) = source.downcast_ref::<hyper::Error>() { | |||
if e.is_connect() || e.is_closed() || e.is_incomplete_message() { | |||
if !(e.is_parse() || e.is_parse_status() || e.is_parse_too_large() || e.is_user() || e.is_canceled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be made an allowlist, we should only retry errors we know are safe to retry, otherwise we could end up retrying until we time out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean like it was previously? The problem I'm encountering is that there are no conditions exposed in hyper for https://docs.rs/hyper/latest/src/hyper/error.rs.html#478
.
I totally agree with you that this code is much more aggressive, though there seems to be no other way.
I think what can possibly justify this code is that we can't retry indefinitely since the retry limit has to be set by the caller of this code? I agree it's not ideal, though it seems like it's going to be quite a long time away before hyper will expose any more fine grained information about errors.
I'm happy to hear any other suggestions though.
Does it make sense having two tiers of errors, one with a lower retry limit for unknown sorts of errors that get caught by the conditions I inserted, and one with a higher limit (which was what the original code caught)? It's definitely uglier for consumers though.
Otherwise it might make sense having polars implement RetryExt
and then perhaps gating this RetryExt
implementation behind an optional feature? @ritchie46 curious what your thoughts are, I have this PR open to resolve pola-rs/polars#14384 which I almost invariably encounter when querying a large enough dataset in s3 (from an ec2 instance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't help feeling if hyper doesn't expose this error variant, it perhaps isn't one we should be handling... I do wonder if the issue is that polars is being too aggressive with IO, have you tried using LimitStore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use tokio semaphore with a budget of 10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you provide a link to how this is wired up, is it possible the semaphore is stalling in-flight IO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wait for the potential fix in polars, so that we have some more information on the nature of the issue you are running into. If the fix works, the issue is contention related, and retrying is entirely the wrong thing to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the PR upstream that should restrict more concurrency: pola-rs/polars#14435
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why you think it's a rate limit issue, I thought s3 should 503 on rate limit as documented here?
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
Based off:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-timeouts-retries (Though this might also imply that object_store should have a configurable retry strategy on a per cloud provider basis?)
It seems like these errors are probably caused by an underlying issue in object_store/hyper/reqwest/polars usage of object_store?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's only if your request actually gets to S3 and you trip then trip its limits, if you flood the network, e.g. by creating hundreds of TCP connections to handle hundreds of concurrent HTTP/1.1 requests, there are various components in the network that may just start dropping these connections on the floor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't help feeling if hyper doesn't expose this error variant, it perhaps isn't one we should be handling.
Perhaps @seanmonstar could clarify why those error kind types aren't exposed?
I'd agree with that feeling. Anything accessed through .source()
of a hyper error is not guaranteed (we may change internal dependencies). Exposing specific error classes is just something we haven't really designed (help welcome).
(No opinion otherwise on this PR, other than the general be careful with retrying unknown errors, especially if it could have meant side effects on the server.)
Marking this as a draft as pending further investigation |
The conclusion of pola-rs/polars#14598 appears to be that this was an upstream issue, so closing this. Feel free to reopen if I am mistaken |
Which issue does this PR close?
Closes #.
Rationale for this change
When connection is reset by peer, retries are ignored and your request fails. We should probably enable retry on these errors so that there is more robustness when using object_store for OLAP queries. We do not have access to the underlying
Kind
of the hyper error (https://github.com/hyperium/hyper/blob/0.14.x/src/error.rs#L19), so this is the best way of increasing the cases on which to retry on. This might be overly aggressive, though shouldn't be unsafe.What changes are included in this PR?
Now we retry by default unless it's one of:
Are there any user-facing changes?
No api changes, but will retry on a strict superset of errors that it used to retry on.