
kvserver: always return NLHE on lease acquisition timeouts #84865

Merged 1 commit into cockroachdb:master on Aug 1, 2022

Conversation


@erikgrinaker (Contributor) commented Jul 21, 2022

In ab74b97 we added internal timeouts for lease acquisitions. These
were wrapped in RunWithTimeout(), as mandated for context timeouts.
However, this would mask the returned NotLeaseHolderError as a
TimeoutError, preventing the DistSender from retrying it and instead
propagating it out to the client. Additionally, context cancellation
errors from the actual RPC call were never wrapped as a
NotLeaseHolderError in the first place.

This ended up only happening in a very specific scenario where the outer
timeout added to the client context did not trigger, but the inner
timeout for the coalesced request context did trigger while the lease
request was in flight. Accidentally, the outer RunWithTimeout() call
did not return the roachpb.Error from the closure but instead passed
it via a captured variable, bypassing the error wrapping.

This patch replaces the RunWithTimeout() calls with regular
context.WithTimeout() calls to avoid the error wrapping, and returns a
NotLeaseHolderError from requestLease() if the RPC request fails and
the context was cancelled (presumably causing the error). Another option
would be to extract an NLHE from the error chain, but this would require
correct propagation of the structured error chain across RPC boundaries,
so out of an abundance of caution and with an eye towards backports, we
instead choose to return a bare NotLeaseHolderError.

The empty lease in the returned error prevents the DistSender from
updating its caches on context cancellation.
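As a rough illustration of the control flow described above, here is a minimal sketch (not the actual CockroachDB code; `acquireLeaseWithTimeout`, `sendLeaseRPC`, and the local `notLeaseHolderError` type are made-up stand-ins for `requestLease()` and `roachpb.NotLeaseHolderError`):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// notLeaseHolderError stands in for roachpb.NotLeaseHolderError; leaving the
// lease out of it mirrors the "bare" error described above, which keeps the
// DistSender from updating its caches on context cancellation.
type notLeaseHolderError struct{ msg string }

func (e *notLeaseHolderError) Error() string { return e.msg }

// acquireLeaseWithTimeout sketches the fixed control flow: a plain
// context.WithTimeout instead of RunWithTimeout (so no TimeoutError wrapping),
// and a bare notLeaseHolderError whenever the RPC fails while the context is
// cancelled (presumably causing the failure).
func acquireLeaseWithTimeout(
	ctx context.Context, timeout time.Duration, sendLeaseRPC func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := sendLeaseRPC(ctx); err != nil {
		if ctx.Err() != nil {
			// Return an NLHE that the DistSender knows how to retry against
			// another replica, rather than a TimeoutError that it would
			// propagate out to the client.
			return &notLeaseHolderError{msg: fmt.Sprintf("lease acquisition canceled: %v", err)}
		}
		return err
	}
	return nil
}

func main() {
	err := acquireLeaseWithTimeout(context.Background(), time.Millisecond,
		func(ctx context.Context) error {
			<-ctx.Done() // simulate a lease RPC that only fails once the timeout fires
			return ctx.Err()
		})
	var nlhe *notLeaseHolderError
	fmt.Println(errors.As(err, &nlhe)) // true: the caller sees an NLHE, not a timeout
}
```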

Resolves #84258.
Resolves #85115.

Release note (bug fix): Fixed a bug where clients could sometimes
receive errors due to lease acquisition timeouts of the form
operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s.

@erikgrinaker requested a review from a team as a code owner July 21, 2022 18:43
@erikgrinaker self-assigned this Jul 21, 2022
@cockroach-teamcity (Member): This change is Reviewable

@erikgrinaker force-pushed the nlhe-context-timeout branch 2 times, most recently from 40fb3d8 to 05c720c on July 22, 2022 10:52
@nvanbenschoten (Member) left a comment


Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

	timeout := 2 * r.store.cfg.RaftElectionTimeout()

	// Does not use RunWithTimeout(), because we do not want to mask the

Why do we have two different, nested timeouts? The first is here (`redirectOnOrAcquireLeaseForRequest`) and the second is in `requestLeaseAsync`, which is called by this function. Isn't the timeout here sufficient to accomplish what you wanted in ab74b97? The context deadline handling in `redirectOnOrAcquireLeaseForRequestWithoutTimeout` already correctly promotes a ctx err to an NLHE.

I'm sure there's a good reason, I just can't find it in #81136.

@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why do we have two different, nested timeouts? The first is here (`redirectOnOrAcquireLeaseForRequest`) and the second is in `requestLeaseAsync`, which is called by this function. Isn't the timeout here sufficient to accomplish what you wanted in ab74b97? The context deadline handling in `redirectOnOrAcquireLeaseForRequestWithoutTimeout` already correctly promotes a ctx err to an NLHE.

I'm sure there's a good reason, I just can't find it in #81136.

requestLeaseAsync (with the inner timeout) uses a base context that's tied to the number of waiting callers, cancelling when it drops to 0. Even if every caller sets its own timeout, if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return.

You might ask why the inner timeout isn't sufficient then, since all callers will receive an error when it triggers. The outer one was added out of an abundance of caution, in case any of the other operations in redirectOnOrAcquireLeaseForRequest stall (I haven't found any likely culprits), or in case the coalesced context handling in requestLeaseAsync fails to return an error for whatever reason. We should be able to remove it if you have a strong opinion on it, but I figured better safe than sorry given the consequences of a stall here.
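For illustration only, a toy sketch of the coalescing behaviour described above (this is not the actual requestLeaseAsync code; the names and structure are made up): the shared lease-request context is cancelled only when the number of waiting callers drops to zero, so a steady stream of callers can keep it alive indefinitely even if each caller has its own timeout.

```go
package leasesketch

import (
	"context"
	"sync"
)

// coalescedLeaseRequest is a hypothetical stand-in for the shared state behind
// requestLeaseAsync: all callers wait on one in-flight lease request, and its
// context is cancelled only when the last waiting caller gives up.
type coalescedLeaseRequest struct {
	mu      sync.Mutex
	waiters int
	cancel  context.CancelFunc // cancels the shared lease-request context
}

// join registers a caller that is waiting on the in-flight lease request.
func (c *coalescedLeaseRequest) join() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.waiters++
}

// leave deregisters a caller. Only when the waiter count reaches zero is the
// shared context cancelled, which is why a steady influx of callers (each with
// its own client-side timeout) may never cancel the lease request itself.
func (c *coalescedLeaseRequest) leave() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.waiters--
	if c.waiters == 0 {
		c.cancel()
	}
}
```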

@erikgrinaker force-pushed the nlhe-context-timeout branch from 05c720c to 8c690ca on July 27, 2022 09:43
@erikgrinaker requested a review from pav-kv July 27, 2022 10:27
// NotLeaseHolderError on context cancellation.
requestLeaseCtx, requestLeaseCancel := context.WithTimeout(ctx, timeout) // nolint:context
defer requestLeaseCancel()
err := p.requestLease(requestLeaseCtx, nextLeaseHolder, reqLease, status, leaseReq)
@pav-kv (Collaborator) commented Jul 27, 2022


Did the outer code rely on this err being a TimeoutError in case of a timeout? Cause now it won't be anymore. Do you need to wrap it somewhere down from this line, similarly to how RunWithTimeout did?

@erikgrinaker (Contributor, Author) replied:


No, rather the opposite -- the upper-level code relied on it being a NotLeaseHolderError, which got masked by the TimeoutError.

@pav-kv (Collaborator) replied:


So, in the scenario where it's just a pure timeout (with no NotLeaseHolderError racing with it), or a timeout with a non-NLHE error, is it not necessary to wrap it in a TimeoutError?

@erikgrinaker (Contributor, Author) replied:


We want a lease acquisition timeout to result in an NLHE, so that the DistSender will go try some other replica that may be able to acquire the lease instead. We therefore explicitly construct an NLHE on context cancellation. In the case where the client context was cancelled (e.g. by a disconnect), the DistSender will bail out anyway, around here:

	// Has the caller given up?
	if ctx.Err() != nil {
		// Don't consider this a sendError, because sendErrors indicate that we
		// were unable to reach a replica that could serve the request, and they
		// cause range cache evictions. Context cancellations just mean the
		// sender changed its mind or the request timed out.
		if ambiguousError != nil {
			err = roachpb.NewAmbiguousResultError(errors.Wrapf(ambiguousError, "context done during DistSender.Send"))
		} else {
			err = errors.Wrap(ctx.Err(), "aborted during DistSender.Send")
		}
		log.Eventf(ctx, "%v", err)
		return nil, err
	}

For reference, the DistSender is a fat KV RPC client that's responsible for e.g. routing KV requests to the correct ranges and leaseholders.

@nvanbenschoten (Member) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return

Is this a problem? Is there a reason why we believe a canceled and frequently retried lease acquisition attempt has a better chance of succeeding on a given replica than a lease request that is allowed to run for longer than any single caller is willing to wait?

@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return

Is this a problem? Is there a reason why we believe a canceled and frequently retried lease acquisition attempt has a better chance of succeeding on a given replica than a lease request that is allowed to run for longer than any single caller is willing to wait?

Really depends on the specific failure mode, I think. Seems worthwhile to at least give it a shot. Is there a reason to believe it would never help? Is there a downside to it? We would only be able to launch a new request if the previous cancellation took effect anyway, so we shouldn't be risking any pileups.

@nvanbenschoten (Member) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Really depends on the specific failure mode, I think. Seems worthwhile to at least give it a shot. Is there a reason to believe it would never help? Is there a downside to it? We would only be able to launch a new request if the previous cancellation took effect anyway, so we shouldn't be risking any pileups.

I'm just questioning the added complexity (two layers of timeouts, with custom timeout handling at each level that we need to keep in sync), which doesn't appear to be necessary. It's the kind of thing we should be clear about the benefit of (or lack thereof) when adding it, or we'll never be able to remove it in the future.

@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm just questioning the added complexity (two layers of timeouts, with custom timeout handling at each level that we need to keep in sync), which doesn't appear to be necessary. It's the kind of thing we should be clear about the benefit of (or lack thereof) when adding it, or we'll never be able to remove it in the future.

I hear you.

The outer timeout sufficiently addresses the primary problem we're trying to solve, in that it will ensure callers eventually receive an NLHE and go try another replica regardless of what happens to the lease request itself.

The inner timeout attempts to address a separate, secondary concern that a lease request could get stuck, which could prevent that replica from ever acquiring the lease. You could argue that we have sufficient downstream error handling and retry mechanisms to prevent this, but I still think it would be prudent. It isn't really that different from having separate client-side and server-side timeouts, and shouldn't require keeping anything in sync (if either fires, it independently returns an NLHE).

That said, this is a secondary problem, so I'm fine with dropping it for now and revisiting if we see it happening in the wild. Will update the PR shortly.
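For what it's worth, a toy sketch of that independence (hypothetical names, not the real code, and the inner layer was ultimately dropped as described below): each layer wraps its own context.WithTimeout and converts its own cancellation into a retriable error, so there is nothing to keep in sync between them.

```go
package leasesketch

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotLeaseHolder is a stand-in for roachpb.NotLeaseHolderError in this sketch.
var errNotLeaseHolder = errors.New("not lease holder")

// withLeaseTimeout wraps op in its own timeout and turns a cancellation of that
// timeout into the retriable stand-in error. Each layer that uses it is
// self-sufficient: whichever timeout fires first produces the error on its own,
// with no coordination between layers.
func withLeaseTimeout(
	ctx context.Context, timeout time.Duration, op func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := op(ctx); err != nil {
		if ctx.Err() != nil {
			return fmt.Errorf("%w: lease acquisition canceled: %v", errNotLeaseHolder, err)
		}
		return err
	}
	return nil
}

// acquireNested nests a per-caller (outer) and a per-lease-request (inner)
// timeout; the durations here are arbitrary placeholders.
func acquireNested(ctx context.Context, sendLeaseRPC func(context.Context) error) error {
	return withLeaseTimeout(ctx, 6*time.Second, func(ctx context.Context) error {
		return withLeaseTimeout(ctx, 3*time.Second, sendLeaseRPC)
	})
}
```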

In ab74b97 we added internal timeouts for lease acquisitions. These
were wrapped in `RunWithTimeout()`, as mandated for context timeouts.
However, this would mask the returned `NotLeaseHolderError` as a
`TimeoutError`, preventing the DistSender from retrying it and instead
propagating it out to the client. Additionally, context cancellation
errors from the actual RPC call were never wrapped as a
`NotLeaseHolderError` in the first place.

This ended up only happening in a very specific scenario where the outer
timeout added to the client context did not trigger, but the inner
timeout for the coalesced request context did trigger while the lease
request was in flight. Accidentally, the outer `RunWithTimeout()` call
did not return the `roachpb.Error` from the closure but instead passed
it via a captured variable, bypassing the error wrapping.

This patch replaces the `RunWithTimeout()` calls with regular
`context.WithTimeout()` calls to avoid the error wrapping. Another
option would be to extract an NLHE from the error chain, but this would
require correct propagation of the structured error chain across RPC
boundaries, so with an eye towards backports we instead choose to return
a bare `NotLeaseHolderError`.

The patch also removes the inner lease request timeout, since it has
questionable benefits and the outer timeout is sufficient to prevent
leases getting stuck for the overall range.

Release note (bug fix): Fixed a bug where clients could sometimes
receive errors due to lease acquisition timeouts of the form
`operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s`.
@erikgrinaker force-pushed the nlhe-context-timeout branch from 8c690ca to 067e740 on August 1, 2022 15:16
@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I hear you.

The outer timeout sufficiently addresses the primary problem we're trying to solve, in that it will ensure callers eventually receive an NLHE and go try another replica regardless of what happens to the lease request itself.

The inner timeout attempts to address a separate, secondary concern that a lease request could get stuck, which could prevent that replica from ever acquiring the lease. You could argue that we have sufficient downstream error handling and retry mechanisms to prevent this, but I still think it would be prudent. It isn't really that different from having separate client-side and server-side timeouts, and shouldn't require keeping anything in sync (if either fires, it independently returns an NLHE).

That said, this is a secondary problem, so I'm fine with dropping it for now and revisiting if we see it happening in the wild. Will update the PR shortly.

Updated, PTAL.

@nvanbenschoten (Member) left a comment


:lgtm:

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pavelkalinnikov)

@erikgrinaker (Contributor, Author) commented:

TFTR!

bors r=nvanbenschoten

@craig (craig bot) merged commit 314baa5 into cockroachdb:master on Aug 1, 2022

craig bot commented Aug 1, 2022

Build succeeded:


blathers-crl bot commented Aug 1, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 067e740 to blathers/backport-release-21.2-84865: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from 067e740 to blathers/backport-release-22.1-84865: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
