Retry after all S3 get failures that made progress #88015
Conversation
S3 sometimes enters a state where blob downloads repeatedly fail but with nontrivial progress between failures. Often each attempt yields tens or hundreds of MBs of data. Today we abort a download after three (by default) such failures, but this may not be enough to completely retrieve a large blob during one of these flaky patches. With this commit we no longer count download attempts that retrieved at least 1% of the configured `buffer_size` (typically 1MB) towards the maximum number of retries. Closes #87243
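A minimal sketch of the accounting described above (the `RetryAccounting` class and `onFailure` method are hypothetical names for illustration; the actual change lives in `S3RetryingInputStream`):

```java
import java.io.IOException;

// Illustrative sketch only, not the actual S3RetryingInputStream implementation.
class RetryAccounting {
    private final int maxRetries;
    private final long bufferSize;
    private int failuresWithoutProgress;

    RetryAccounting(int maxRetries, long bufferSize) {
        this.maxRetries = maxRetries;
        this.bufferSize = bufferSize;
    }

    // Called after a failed attempt that read bytesRead bytes of the blob.
    void onFailure(long bytesRead, IOException cause) throws IOException {
        // An attempt counts against the retry limit only if it delivered less
        // than 1% of buffer_size, i.e. it made no meaningful progress.
        if (bytesRead < Math.max(1L, bufferSize / 100)) {
            failuresWithoutProgress++;
        }
        if (failuresWithoutProgress > maxRetries) {
            throw cause; // give up after too many attempts without meaningful progress
        }
        // otherwise the caller reopens the download at the current offset and continues
    }
}
```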
Pinging @elastic/es-distributed (Team:Distributed)
Hi @DaveCTurner, I've created a changelog YAML for you.
LGTM. Left one optional comment.
Of course please do not count my sole review for now -- it's my first review. Thanks for adding me!
...les/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3RetryingInputStream.java
```java
private int failuresWithoutProgress;
```

```java
@Override
public void handle(HttpExchange exchange) throws IOException {
```
So here basically you first exhaust the "meaningless" failures, then have a series of failures with meaningful progress; these should not ultimately throw an exception, and the blob should be read successfully (either in its entirety or partly).
Yep, pretty much, although note that `sendIncompleteContent` will sometimes send a meaningful amount of data too.
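For illustration, a handler along these lines would produce failures that still make progress (purely a hypothetical sketch; `FlakyBlobHandler` is invented here, and the real suite's `sendIncompleteContent` helper and handlers differ):

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: serve a random prefix of the blob and then drop the
// connection, simulating a failure that may still have made meaningful progress.
class FlakyBlobHandler implements HttpHandler {
    private final byte[] blob;

    FlakyBlobHandler(byte[] blob) {
        this.blob = blob;
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        final int sent = ThreadLocalRandom.current().nextInt(1, blob.length);
        exchange.sendResponseHeaders(200, blob.length); // promise the whole blob
        exchange.getResponseBody().write(blob, 0, sent); // deliver only a prefix
        exchange.close(); // then cut the connection so the client must retry
    }
}
```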
@elasticmachine please run elasticsearch-ci/bwc - failure tracked at #87959
LGTM
In #88015 we made it so that downloads from S3 would sometimes retry more than the configured limit, if each attempt seemed to be making meaningful progress. This causes the failure of some assertions that the number of retries was exactly as expected. This commit weakens those assertions for S3 repositories. Closes #88784 Closes #88666
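Concretely, the weakened check might look something like this sketch (the `RetryAssertions` helper and `isS3Repository` flag are hypothetical, not the actual test code):

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.greaterThanOrEqualTo;

// Hypothetical helper showing the weakened assertion; the real tests and
// their surrounding scaffolding differ.
final class RetryAssertions {
    static void assertRetryCount(boolean isS3Repository, int attempts, int maxRetries) {
        if (isS3Repository) {
            // S3 may legitimately retry beyond the configured limit when
            // attempts make meaningful progress, so only assert a lower bound.
            assertThat(attempts, greaterThanOrEqualTo(maxRetries + 1));
        } else {
            assertThat(attempts, equalTo(maxRetries + 1));
        }
    }
}
```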