Retry after all S3 get failures that made progress #88015

Conversation

DaveCTurner
Contributor

S3 sometimes enters a state where blob downloads repeatedly fail, but with nontrivial progress between failures: often each attempt yields tens or hundreds of MBs of data. Today we abort a download after three (by default) such failures, but this may not be enough to completely retrieve a large blob during one of these flaky patches.

With this commit we no longer count download attempts that retrieved at least 1% of the configured `buffer_size` (typically 1MB) towards the maximum number of retries.

Closes #87243
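
For illustration only, here is a minimal sketch of the progress-aware retry accounting described above, assuming a hypothetical `BlobSource` abstraction and the 1%-of-`buffer_size` threshold; the real logic lives in `S3RetryingInputStream` and differs in detail:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/**
 * Minimal sketch (not the actual S3RetryingInputStream): copies a blob from a
 * flaky source, retrying on failure. Attempts that made meaningful progress,
 * i.e. at least 1% of bufferSize, do not count towards maxRetries.
 */
class ProgressAwareRetrySketch {

    interface BlobSource {
        // Hypothetical: open a stream starting at the given byte offset.
        InputStream openAt(long offset) throws IOException;
    }

    static long copyWithRetries(BlobSource source, OutputStream out, long bufferSize, int maxRetries) throws IOException {
        final long meaningfulProgress = Math.max(1L, bufferSize / 100); // 1% of buffer_size
        long offset = 0;
        int failuresWithoutProgress = 0;
        while (true) {
            long bytesThisAttempt = 0;
            try (InputStream in = source.openAt(offset)) {
                final byte[] buf = new byte[8192];
                int read;
                while ((read = in.read(buf)) != -1) {
                    out.write(buf, 0, read);
                    offset += read;
                    bytesThisAttempt += read;
                }
                return offset; // download complete
            } catch (IOException e) {
                if (bytesThisAttempt < meaningfulProgress) {
                    // no meaningful progress: this failure counts towards the limit
                    if (++failuresWithoutProgress > maxRetries) {
                        throw e;
                    }
                }
                // else: meaningful progress was made, so retry without consuming a retry
            }
        }
    }
}
```

The key point is that only attempts below the progress threshold increment `failuresWithoutProgress`, so a download that keeps making headway is never abandoned.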

@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.4.0 labels Jun 24, 2022
@DaveCTurner DaveCTurner requested review from tlrx and kingherc June 24, 2022 13:43
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jun 24, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Contributor

@kingherc kingherc left a comment

LGTM. Left one optional comment.

Of course please do not count my sole review for now -- it's my first review. Thanks for adding me!

private int failuresWithoutProgress;

@Override
public void handle(HttpExchange exchange) throws IOException {
Contributor

So here you basically first exhaust the "meaningless" failures, and then have a series of failures with meaningful progress, which should not ultimately throw an exception, so the blob should be successfully read (either in its entirety or partly).

Contributor Author

Yep, pretty much, although note that `sendIncompleteContent` will sometimes send a meaningful amount of data too.
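
To make the shape of that test concrete, here is a rough, hypothetical sketch of such a handler (not the PR's actual test code; all names here are invented): it first exhausts the failures that make no progress, then keeps failing while serving meaningful chunks of the blob, honouring `Range` requests so the client can resume.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical flaky-blob handler sketch for illustrating the discussion above. */
class FlakyBlobHandlerSketch implements HttpHandler {

    private final byte[] blob;
    private final int maxRetries;
    private final int chunkSize; // comfortably above the 1%-of-buffer_size threshold
    private int failuresWithoutProgress;

    FlakyBlobHandlerSketch(byte[] blob, int maxRetries, int chunkSize) {
        this.blob = blob;
        this.maxRetries = maxRetries;
        this.chunkSize = chunkSize;
    }

    @Override
    public synchronized void handle(HttpExchange exchange) throws IOException {
        if (failuresWithoutProgress < maxRetries) {
            // first exhaust the "meaningless" failures: fail before sending any body
            failuresWithoutProgress++;
            exchange.sendResponseHeaders(500, -1);
            exchange.close();
            return;
        }
        // then fail with meaningful progress: send a chunk from the requested offset
        // and close early, so each attempt is truncated but advances the download
        final int start = rangeStart(exchange);
        final int remaining = blob.length - start;
        exchange.sendResponseHeaders(start == 0 ? 200 : 206, remaining);
        try (OutputStream body = exchange.getResponseBody()) {
            body.write(blob, start, Math.min(chunkSize, remaining));
        } // writing fewer than the declared bytes makes the client see a truncated download
    }

    private static int rangeStart(HttpExchange exchange) {
        final String range = exchange.getRequestHeaders().getFirst("Range"); // e.g. "bytes=1048576-"
        if (range == null) {
            return 0;
        }
        return Integer.parseInt(range.substring("bytes=".length()).split("-", 2)[0]);
    }
}
```

This exercises exactly the reading above: only the initial zero-progress failures consume retries, so the later failures with progress should not abort the read.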

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/bwc - failure tracked at #87959

Member

@tlrx tlrx left a comment

LGTM

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 30, 2022
@elasticsearchmachine elasticsearchmachine merged commit 71aeebe into elastic:master Jun 30, 2022
@DaveCTurner DaveCTurner deleted the 2022-06-24-S3RetryingInputStream-retry-on-progress branch June 30, 2022 10:41
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 26, 2022
In elastic#88015 we made it so that downloads from S3 would sometimes retry
more than the configured limit, if each attempt seemed to be making
meaningful progress. This causes the failure of some assertions that the
number of retries was exactly as expected. This commit weakens those
assertions for S3 repositories.

Closes elastic#88784
Closes elastic#88666
elasticsearchmachine pushed a commit that referenced this pull request Jul 26, 2022
In #88015 we made it so that downloads from S3 would sometimes retry
more than the configured limit, if each attempt seemed to be making
meaningful progress. This causes the failure of some assertions that the
number of retries was exactly as expected. This commit weakens those
assertions for S3 repositories.

Closes #88784
Closes #88666
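
For illustration, a hypothetical sketch of what weakening such an assertion can look like (the actual tests in the linked commits differ in structure and naming): for S3 repositories the exact-count check becomes a lower bound.

```java
/** Hypothetical sketch of relaxing a retry-count assertion for S3 repositories. */
class RetryCountAssertionSketch {
    static void assertRetryCount(String repositoryType, int expectedRetries, int observedRetries) {
        if ("s3".equals(repositoryType)) {
            // S3 downloads may now retry beyond the configured limit while making progress,
            // so only require that at least the expected number of retries happened
            if (observedRetries < expectedRetries) {
                throw new AssertionError("expected at least " + expectedRetries + " retries but saw " + observedRetries);
            }
        } else if (observedRetries != expectedRetries) {
            throw new AssertionError("expected exactly " + expectedRetries + " retries but saw " + observedRetries);
        }
    }
}
```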