Skip to content

Commit

Permalink
lfshttp,tq,t: don't fail on retriable batch errors
Browse files Browse the repository at this point in the history
A prior commit in this PR resolves a bug where a 429 response to an
upload or download request causes a Go panic in the client if the
response lacks a Retry-After header.

The same condition, when it occurs in the response to a batch API
request, does not trigger a Go panic; instead, though, we simply
fail without retrying the batch API request at all.  This stands
in constrast to how we now handle 429 responses for object uploads
and downloads when no Retry-After header is provided, because in
that case, we perform multiple retries, following the exponential
backoff logic introduced in PR git-lfs#4097.

This difference stems in part from the fact that the download()
function of the basicDownloadAdapter structure and the DoTransfer()
function of the basicUploadAdapter structure both handle 429 responses
by first calling the NewRetriableLaterError() function of the "errors"
package to try to parse any Retry-After header, and if that returns
nil, then calling the NewRetriableError() function, so they always
return some form of retriable error after a 429 status code is received.

We therefore modify the handleResponse() method of the Client structure
in the "lfshttp" package to likewise always return a retriable error
of some kind after a 429 response.  If a Retry-After header is found
and is able to be parsed, then a retriableLaterError (from the "errors"
package) is returned; otherwise, a generic retriableError is returned.

This change is not sufficient on its own, however.  When the batch
API returns 429 responses without a Retry-After header, the transfer
queue now retries its requests following the exponential backoff
logic, as we expect.  If one of those eventually succeeds, though,
the batch is still processed as if it encountered an unrecoverable
failure, and the Git LFS client ultimately returns a non-zero exit code.

The reason this occurs is because the enqueueAndCollectRetriesFor()
method of the TransferQueue structure in the "tq" package sets the
flag which causes it to return an error for the batch both when an
object in the batch cannot be retried (because it has reached its
retry limit) or when an object in the batch can be retried but no
specific retry wait time was provided by a retriableLaterError.

The latter of these two cases is what is now triggered when the batch
API returns a 429 status code and no Retry-After header.  In commit
a3ecbcc of PR git-lfs#4573 this code was
updated to improve how batch API 429 responses with Retry-After headers
are handled, building on the original code introduced in PR git-lfs#3449
and some fixes in PR git-lfs#3930.  This commit added the flag, named
hasNonScheduledErrors, which is set if any objects in a batch which
experiences an error either can not be retried, or can be retried but
don't have a specific wait time as provided by a Retry-After header.
If the flag is set, then the error encountered during the processing
of the batch is returned by the enqueueAndCollectRetriesFor() method,
and although it is wrapped by NewRetriableError function, because the
error is returned instead of just a nil, it is collected into the errors
channel of the queue by the collectBatches() caller method, and this
ultimately causes the client to report the error and return a non-zero
exit code.

By constrast, the handleTransferResult() method of the TransferQueue
structure treats retriable errors from individual object uploads and
downloads in the same way for both errors with a specified wait time
and those without.

To bring our handling of batch API requests into alignment with
this approach, we can simply avoid setting the flag variable when
a batch encounters an error and an object can be retried but without
a specified wait time.

We also rename the flag variable to hasNonRetriableObjects, which
better reflects its meaning, as it signals the fact that at least
one object in the batch can not be retried.  As well, we update
some related comments to clarify the current actions and intent of
this section of code in the enqueueAndCollectRetriesFor() method.

We then add a test to the t/t-batch-retries-ratelimit.sh test suite
like the ones we added to the t/t-batch-storage-retries-ratelimit.sh
script in a previous commit in this PR.  The test relies on a new
sentinel value in the test repository name which now recognize in
our lfstest-gitserver test server, and which causes the test server
to return a 429 response to batch API requests, but without a
Retry-After header.  This test fails without both of the changes
we make in this commit to ensure we handle 429 batch API responses
without Retry-After headers.
  • Loading branch information
chrisd8088 committed Jun 19, 2024
1 parent 242fe2c commit 0f82147
Show file tree
Hide file tree
Showing 4 changed files with 43 additions and 7 deletions.
1 change: 1 addition & 0 deletions lfshttp/errors.go
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ func (c *Client) handleResponse(res *http.Response) error {
if retLaterErr != nil {
return retLaterErr
}
return errors.NewRetriableError(err)
}

if res.StatusCode > 499 && res.StatusCode != 501 && res.StatusCode != 507 && res.StatusCode != 509 {
Expand Down
10 changes: 10 additions & 0 deletions t/cmd/lfstest-gitserver.go
Original file line number Diff line number Diff line change
Expand Up @@ -422,6 +422,16 @@ func lfsBatchHandler(w http.ResponseWriter, r *http.Request, id, repo string) {
}
}

if strings.HasSuffix(repo, "batch-retry-later-no-header") {
if _, isWaiting := checkRateLimit("batch", "", repo, ""); isWaiting {
w.WriteHeader(http.StatusTooManyRequests)

w.Write([]byte("rate limit reached"))
fmt.Println("Not setting Retry-After header")
return
}
}

res := []lfsObject{}
testingChunked := testingChunkedTransferEncoding(r)
testingTus := testingTusUploadInBatchReq(r)
Expand Down
26 changes: 26 additions & 0 deletions t/t-batch-retries-ratelimit.sh
Original file line number Diff line number Diff line change
Expand Up @@ -130,3 +130,29 @@ begin_test "batch clone with multiple files causes retries"
popd
)
end_test

begin_test "batch upload causes retries (missing header)"
(
set -e

reponame="upload-batch-retry-later-no-header"
setup_remote_repo "$reponame"
clone_repo "$reponame" batch-repo-upload-no-header

contents="content"
oid="$(calc_oid "$contents")"
printf "%s" "$contents" > a.dat

git lfs track "*.dat"
git add .gitattributes a.dat
git commit -m "initial commit"

GIT_TRACE=1 git push origin main 2>&1 | tee push.log
if [ "0" -ne "${PIPESTATUS[0]}" ]; then
echo >&2 "fatal: expected \`git push origin main\` to succeed ..."
exit 1
fi

assert_server_object "$reponame" "$oid"
)
end_test
13 changes: 6 additions & 7 deletions tq/transfer_queue.go
Original file line number Diff line number Diff line change
Expand Up @@ -580,26 +580,25 @@ func (q *TransferQueue) enqueueAndCollectRetriesFor(batch batch) (batch, error)
var err error
bRes, err = Batch(q.manifest, q.direction, q.remote, q.ref, batch.ToTransfers())
if err != nil {
var hasNonScheduledErrors = false
var hasNonRetriableObjects = false
// If there was an error making the batch API call, mark all of
// the objects for retry, and return them along with the error
// that was encountered. If any of the objects couldn't be
// retried, they will be marked as failed.
// the objects for retry if possible. If any should not be retried,
// they will be marked as failed.
for _, t := range batch {
if q.canRetryObject(t.Oid, err) {
hasNonScheduledErrors = true
enqueueRetry(t, err, nil)
} else if readyTime, canRetry := q.canRetryObjectLater(t.Oid, err); canRetry {
enqueueRetry(t, err, &readyTime)
} else {
hasNonScheduledErrors = true
hasNonRetriableObjects = true
q.wait.Done()
}
}

// Only return error and mark operation as failure if at least one object
// was not enqueued for retrial at a later point.
if hasNonScheduledErrors {
// Make sure to return an error which causes all other objects to be retried.
if hasNonRetriableObjects {
return next, errors.NewRetriableError(err)
} else {
return next, nil
Expand Down

0 comments on commit 0f82147

Please sign in to comment.