release-22.1: kvserver: always return NLHE on lease acquisition timeouts #85428
Backport 1/1 commits from #84865.
/cc @cockroachdb/release
Release justification: improved error handling during lease acquisition.
In ab74b97 we added internal timeouts for lease acquisitions. These
were wrapped in `RunWithTimeout()`, as mandated for context timeouts.
However, this would mask the returned `NotLeaseHolderError` as a
`TimeoutError`, preventing the DistSender from retrying it and instead
propagating it out to the client. Additionally, context cancellation
errors from the actual RPC call were never wrapped as a
`NotLeaseHolderError` in the first place.

This ended up only happening in a very specific scenario where the outer
timeout added to the client context did not trigger, but the inner
timeout for the coalesced request context did trigger while the lease
request was in flight. Accidentally, the outer `RunWithTimeout()` call
did not return the `roachpb.Error` from the closure but instead passed
it via a captured variable, bypassing the error wrapping.

This patch replaces the `RunWithTimeout()` calls with regular
`context.WithTimeout()` calls to avoid the error wrapping, and returns a
`NotLeaseHolderError` from `requestLease()` if the RPC request fails and
the context was cancelled (presumably causing the error). Another option
would be to extract an NLHE from the error chain, but this would require
correct propagation of the structured error chain across RPC boundaries,
so out of an abundance of caution and with an eye towards backports, we
instead choose to return a bare `NotLeaseHolderError`.

The empty lease in the returned error prevents the DistSender from
updating its caches on context cancellation.

Release note (bug fix): Fixed a bug where clients could sometimes
receive errors due to lease acquisition timeouts of the form
`operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s`.