Retry batch failures #3930
Conversation
Force-pushed from 1144591 to 62394ab
This seems reasonable to me, after a brief detour through (some of) the relevant history here -- thank you for the detailed notes!
As far as I can tell, the comments above enqueueAndCollectRetriesFor() are still accurate as to the intended functionality, and the evolution of the serious error check in collectBatches() through #3634 and #3800 makes sense.
My only question would be one of those tedious irritating ones as to whether we should implement any additional tests -- and I realize that might be a tall order, as on a first quick pass I don't see anything in the current suite which could be easily adapted (but I would be happy to be wrong about that). On the assumption it's not straightforward, though, I think it's fine to merge this as it stands.
Thanks as always!! 👍
The question about the tests is a good one. Unfortunately, the error case we're handling here is when we get unexpected errors; that is, where we don't get an HTTP response. Go's HTTP library doesn't really let me close the connection without a response (and this is not the first time I've wanted to do this in a test). Let me see if I can come up with a test that can emulate something with an unexpected HTTP status code, which if possible would be better than nothing.
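For illustration, a minimal, self-contained Go sketch of the "unexpected HTTP status code" idea mentioned above might look like the following; the handler, endpoint path, and status code here are hypothetical and not part of the actual Git LFS test suite:

```go
// Hypothetical sketch only: an httptest server that answers batch requests
// with an unexpected status code, as one way to emulate a failure without
// being able to drop the connection mid-request. Not from the Git LFS suite.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Send a status code the LFS client does not expect from the batch API.
		w.WriteHeader(http.StatusTeapot) // 418
	}))
	defer srv.Close()

	resp, err := http.Post(srv.URL+"/objects/batch", "application/vnd.git-lfs+json", nil)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("unexpected status:", resp.StatusCode) // unexpected status: 418
}
```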
I've just checked, and I don't seem able to synthesize this with a 500. That error is flagged as non-retriable, which means the process aborts.
Fair enough -- thank you for checking! This did seem to fall into the category of tests which would require more of an integration test framework, and possibly some low-level plumbing. As I said, this looks fine to merge as-is.
Force-pushed from 62394ab to e967e9a
Force-pushed from e967e9a to 1157c0d
A prior commit in this PR resolves a bug where a 429 response to an upload or download request causes a Go panic in the client if the response lacks a Retry-After header. The same condition, when it occurs in the response to a batch API request, does not trigger a Go panic; instead, we simply fail without retrying the batch API request at all. This stands in contrast to how we now handle 429 responses for object uploads and downloads when no Retry-After header is provided, because in that case we perform multiple retries, following the exponential backoff logic introduced in PR git-lfs#4097.

This difference stems in part from the fact that the download() function of the basicDownloadAdapter structure and the DoTransfer() function of the basicUploadAdapter structure both handle 429 responses by first calling the NewRetriableLaterError() function of the "errors" package to try to parse any Retry-After header, and, if that returns nil, then calling the NewRetriableError() function, so they always return some form of retriable error after a 429 status code is received.

We therefore modify the handleResponse() method of the Client structure in the "lfshttp" package to likewise always return a retriable error of some kind after a 429 response. If a Retry-After header is found and can be parsed, a retriableLaterError (from the "errors" package) is returned; otherwise, a generic retriableError is returned.

This change is not sufficient on its own, however. When the batch API returns 429 responses without a Retry-After header, the transfer queue now retries its requests following the exponential backoff logic, as we expect. If one of those retries eventually succeeds, though, the batch is still processed as if it encountered an unrecoverable failure, and the Git LFS client ultimately returns a non-zero exit code.

The reason this occurs is that the enqueueAndCollectRetriesFor() method of the TransferQueue structure in the "tq" package sets the flag which causes it to return an error for the batch both when an object in the batch cannot be retried (because it has reached its retry limit) and when an object in the batch can be retried but no specific retry wait time was provided by a retriableLaterError. The latter of these two cases is what is now triggered when the batch API returns a 429 status code and no Retry-After header.

In commit a3ecbcc of PR git-lfs#4573 this code was updated to improve how batch API 429 responses with Retry-After headers are handled, building on the original code introduced in PR git-lfs#3449 and some fixes in PR git-lfs#3930. That commit added the flag, named hasNonScheduledErrors, which is set if any object in a batch which experiences an error either can not be retried, or can be retried but does not have a specific wait time as provided by a Retry-After header. If the flag is set, then the error encountered during the processing of the batch is returned by the enqueueAndCollectRetriesFor() method, and although it is wrapped by the NewRetriableError() function, because the error is returned instead of nil, it is collected into the errors channel of the queue by the collectBatches() caller method, and this ultimately causes the client to report the error and return a non-zero exit code.

By contrast, the handleTransferResult() method of the TransferQueue structure treats retriable errors from individual object uploads and downloads in the same way for both errors with a specified wait time and those without.
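As a rough illustration of the handleResponse() behavior described above, the following self-contained sketch shows the intended decision: schedule a specific wait when a Retry-After header can be parsed, and otherwise fall back to a generic retriable error. The error types and helper here are illustrative stand-ins, not the real "errors" or "lfshttp" APIs:

```go
// Self-contained sketch of the 429 handling described above. The error types
// and helper here are illustrative stand-ins, not the real "errors" package
// or the actual handleResponse() method in "lfshttp".
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

type retriableError struct{ err error }

func (e retriableError) Error() string { return "retriable: " + e.err.Error() }

type retriableLaterError struct {
	err  error
	wait time.Duration
}

func (e retriableLaterError) Error() string {
	return fmt.Sprintf("retriable after %s: %s", e.wait, e.err.Error())
}

// handle429 mirrors the intent described above: always return some kind of
// retriable error for a 429, scheduling a specific wait only when a usable
// Retry-After header is present.
func handle429(res *http.Response) error {
	base := errors.New("rate limited by server")
	if header := res.Header.Get("Retry-After"); header != "" {
		if secs, err := strconv.Atoi(header); err == nil {
			return retriableLaterError{err: base, wait: time.Duration(secs) * time.Second}
		}
	}
	// No usable Retry-After header: fall back to a generic retriable error
	// so the caller still retries with exponential backoff.
	return retriableError{err: base}
}

func main() {
	res := &http.Response{StatusCode: http.StatusTooManyRequests, Header: http.Header{}}
	fmt.Println(handle429(res)) // retriable: rate limited by server

	res.Header.Set("Retry-After", "30")
	fmt.Println(handle429(res)) // retriable after 30s: rate limited by server
}
```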
To bring our handling of batch API requests into alignment with this approach, we can simply avoid setting the flag variable when a batch encounters an error and an object can be retried but without a specified wait time. We also rename the flag variable to hasNonRetriableObjects, which better reflects its meaning, as it signals that at least one object in the batch can not be retried, and we update some related comments to clarify the current actions and intent of this section of code in the enqueueAndCollectRetriesFor() method.

We then add a test to the t/t-batch-retries-ratelimit.sh test suite like the ones we added to the t/t-batch-storage-retries-ratelimit.sh script in a previous commit in this PR. The test relies on a new sentinel value in the test repository name, which we now recognize in our lfstest-gitserver test server and which causes the test server to return a 429 response to batch API requests without a Retry-After header. This test fails without both of the changes made in this commit, which together ensure we handle 429 batch API responses that lack Retry-After headers.
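A self-contained sketch of the hasNonRetriableObjects flag logic described above; the object type and helper are hypothetical and not the actual "tq" package code:

```go
// Illustrative sketch of the renamed hasNonRetriableObjects flag described
// above; the object type and helper are hypothetical, not the actual "tq"
// package code.
package main

import (
	"fmt"
	"time"
)

type object struct {
	oid        string
	canRetry   bool           // object still has retries remaining
	retryAfter *time.Duration // wait parsed from a Retry-After header, if any
}

// collectRetries decides which objects to re-enqueue and whether the batch
// as a whole should be reported as failed.
func collectRetries(batch []object) (retries []object, hasNonRetriableObjects bool) {
	for _, o := range batch {
		switch {
		case !o.canRetry:
			// Out of retries: only this case should fail the batch.
			hasNonRetriableObjects = true
		case o.retryAfter != nil:
			// Retriable with a scheduled wait from a Retry-After header.
			retries = append(retries, o)
		default:
			// Retriable with no scheduled wait: previously this also marked
			// the batch as failed; now it is simply retried with backoff.
			retries = append(retries, o)
		}
	}
	return retries, hasNonRetriableObjects
}

func main() {
	wait := 30 * time.Second
	batch := []object{
		{oid: "aaa", canRetry: true, retryAfter: &wait},
		{oid: "bbb", canRetry: true}, // retriable, but no Retry-After given
	}
	retries, failed := collectRetries(batch)
	fmt.Println(len(retries), failed) // 2 false
}
```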
If the batch operation fails due to an error instead of a bad HTTP status code, we'll abort the batch operation and retry. This appears to be a regression from 1412d6e ("Don't fail if we lack objects the server has", 2019-04-30), which caused us to handle errors differently.
Since there are two error returns from enqueueAndCollectRetriesFor, let's wrap the batch error case as a retriable error and not abort if we find a retriable error later on. This lets us continue to abort if we get a missing object, which should be fatal, but retry in the more common network failure case.
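A minimal, self-contained sketch of that distinction -- retry on wrapped retriable errors, abort on fatal ones such as a missing object -- using illustrative names rather than the actual tq package API:

```go
// Minimal sketch of the retry-vs-abort distinction described above; the names
// here (retriable, processBatch, collectBatches) are illustrative, not the
// actual Git LFS tq package API.
package main

import (
	"errors"
	"fmt"
)

type retriable struct{ err error }

func (r retriable) Error() string { return r.err.Error() }

func isRetriable(err error) bool {
	var r retriable
	return errors.As(err, &r)
}

// processBatch wraps transport-level failures as retriable, while a missing
// object remains a plain (fatal) error.
func processBatch(missingObject bool) error {
	if missingObject {
		return errors.New("object does not exist on the server")
	}
	return retriable{err: errors.New("batch request failed: connection reset")}
}

// collectBatches retries on retriable errors and aborts otherwise.
func collectBatches(missingObject bool) {
	if err := processBatch(missingObject); err != nil {
		if isRetriable(err) {
			fmt.Println("retrying batch:", err)
			return
		}
		fmt.Println("aborting:", err)
	}
}

func main() {
	collectBatches(false) // retrying batch: batch request failed: connection reset
	collectBatches(true)  // aborting: object does not exist on the server
}
```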
/cc #3929
/cc @bluekeyes as reporter