
kvserver: always return NLHE on lease acquisition timeouts #84865

Merged 1 commit into cockroachdb:master on Aug 1, 2022

Conversation


@erikgrinaker (Contributor) commented Jul 21, 2022

In ab74b97 we added internal timeouts for lease acquisitions. These
were wrapped in RunWithTimeout(), as mandated for context timeouts.
However, this would mask the returned NotLeaseHolderError as a
TimeoutError, preventing the DistSender from retrying it and instead
propagating it out to the client. Additionally, context cancellation
errors from the actual RPC call were never wrapped as a
NotLeaseHolderError in the first place.

This ended up only happening in a very specific scenario where the outer
timeout added to the client context did not trigger, but the inner
timeout for the coalesced request context did trigger while the lease
request was in flight. Accidentally, the outer RunWithTimeout() call
did not return the roachpb.Error from the closure but instead passed
it via a captured variable, bypassing the error wrapping.

This patch replaces the RunWithTimeout() calls with regular
context.WithTimeout() calls to avoid the error wrapping, and returns a
NotLeaseHolderError from requestLease() if the RPC request fails and
the context was cancelled (presumably causing the error). Another option
would be to extract an NLHE from the error chain, but this would require
correct propagation of the structured error chain across RPC boundaries,
so out of an abundance of caution and with an eye towards backports, we
instead choose to return a bare NotLeaseHolderError.

The empty lease in the returned error prevents the DistSender from
updating its caches on context cancellation.
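As a rough illustration of the control flow described above, here is a minimal sketch (not the actual CockroachDB code; `acquireLeaseWithTimeout`, `sendLeaseRPC`, and the local `notLeaseHolderError` type are made-up stand-ins for `requestLease()` and `roachpb.NotLeaseHolderError`):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// notLeaseHolderError stands in for roachpb.NotLeaseHolderError; leaving the
// lease out of it mirrors the "bare" error described above, which keeps the
// DistSender from updating its caches on context cancellation.
type notLeaseHolderError struct{ msg string }

func (e *notLeaseHolderError) Error() string { return e.msg }

// acquireLeaseWithTimeout sketches the fixed control flow: a plain
// context.WithTimeout instead of RunWithTimeout (so no TimeoutError wrapping),
// and a bare notLeaseHolderError whenever the RPC fails while the context is
// cancelled (presumably causing the failure).
func acquireLeaseWithTimeout(
	ctx context.Context, timeout time.Duration, sendLeaseRPC func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := sendLeaseRPC(ctx); err != nil {
		if ctx.Err() != nil {
			// Return an NLHE that the DistSender knows how to retry against
			// another replica, rather than a TimeoutError that it would
			// propagate out to the client.
			return &notLeaseHolderError{msg: fmt.Sprintf("lease acquisition canceled: %v", err)}
		}
		return err
	}
	return nil
}

func main() {
	err := acquireLeaseWithTimeout(context.Background(), time.Millisecond,
		func(ctx context.Context) error {
			<-ctx.Done() // simulate a lease RPC that only fails once the timeout fires
			return ctx.Err()
		})
	var nlhe *notLeaseHolderError
	fmt.Println(errors.As(err, &nlhe)) // true: the caller sees an NLHE, not a timeout
}
```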

Resolves #84258.
Resolves #85115.

Release note (bug fix): Fixed a bug where clients could sometimes
receive errors due to lease acquisition timeouts of the form
operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s.

@erikgrinaker requested a review from a team as a code owner July 21, 2022 18:43
@erikgrinaker self-assigned this Jul 21, 2022
@cockroach-teamcity (Member): This change is Reviewable

@erikgrinaker force-pushed the nlhe-context-timeout branch 2 times, most recently from 40fb3d8 to 05c720c on July 22, 2022 10:52
@nvanbenschoten (Member) left a comment


Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

	timeout := 2 * r.store.cfg.RaftElectionTimeout()

	// Does not use RunWithTimeout(), because we do not want to mask the

Why do we have two different, nested timeouts? The first is here (`redirectOnOrAcquireLeaseForRequest`) and the second is in `requestLeaseAsync`, which is called by this function. Isn't the timeout here sufficient to accomplish what you wanted in ab74b97? The context deadline handling in `redirectOnOrAcquireLeaseForRequestWithoutTimeout` already correctly promotes a ctx err to an NLHE.

I'm sure there's a good reason, I just can't find it in #81136.

@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why do we have two different, nested timeouts? The first is here (`redirectOnOrAcquireLeaseForRequest`) and the second is in `requestLeaseAsync`, which is called by this function. Isn't the timeout here sufficient to accomplish what you wanted in ab74b97? The context deadline handling in `redirectOnOrAcquireLeaseForRequestWithoutTimeout` already correctly promotes a ctx err to an NLHE.

I'm sure there's a good reason, I just can't find it in #81136.

requestLeaseAsync (with the inner timeout) uses a base context that's tied to the number of waiting callers, cancelling when it drops to 0. Even if every caller sets its own timeout, if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return.

You might ask why the inner timeout isn't sufficient then, since all callers will receive an error when it triggers. The outer one was added out of an abundance of caution, in case any of the other operations in redirectOnOrAcquireLeaseForRequest stall (I haven't found any likely culprits), or in case the coalesced context handling in requestLeaseAsync fails to return an error for whatever reason. We should be able to remove it if you have a strong opinion on it, but I figured better safe than sorry given the consequences of a stall here.
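For illustration only, a toy sketch of the coalescing behaviour described above (this is not the actual requestLeaseAsync code; the names and structure are made up): the shared lease-request context is cancelled only when the number of waiting callers drops to zero, so a steady stream of callers can keep it alive indefinitely even if each caller has its own timeout.

```go
package leasesketch

import (
	"context"
	"sync"
)

// coalescedLeaseRequest is a hypothetical stand-in for the shared state behind
// requestLeaseAsync: all callers wait on one in-flight lease request, and its
// context is cancelled only when the last waiting caller gives up.
type coalescedLeaseRequest struct {
	mu      sync.Mutex
	waiters int
	cancel  context.CancelFunc // cancels the shared lease-request context
}

// join registers a caller that is waiting on the in-flight lease request.
func (c *coalescedLeaseRequest) join() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.waiters++
}

// leave deregisters a caller. Only when the waiter count reaches zero is the
// shared context cancelled, which is why a steady influx of callers (each with
// its own client-side timeout) may never cancel the lease request itself.
func (c *coalescedLeaseRequest) leave() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.waiters--
	if c.waiters == 0 {
		c.cancel()
	}
}
```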

@erikgrinaker force-pushed the nlhe-context-timeout branch from 05c720c to 8c690ca on July 27, 2022 09:43
@erikgrinaker requested a review from pav-kv July 27, 2022 10:27
// NotLeaseHolderError on context cancellation.
requestLeaseCtx, requestLeaseCancel := context.WithTimeout(ctx, timeout) // nolint:context
defer requestLeaseCancel()
err := p.requestLease(requestLeaseCtx, nextLeaseHolder, reqLease, status, leaseReq)
@pav-kv (Collaborator) commented Jul 27, 2022


Did the outer code rely on this err being a TimeoutError in case of a timeout? Cause now it won't be anymore. Do you need to wrap it somewhere down from this line, similarly to how RunWithTimeout did?

@erikgrinaker (Contributor, Author) replied:


No, rather the opposite -- the upper-level code relied on it being a NotLeaseHolderError, which got masked by the TimeoutError.

@pav-kv (Collaborator) replied:


So, in the scenario where it's just a pure timeout (with no NotLeaseHolderError racing with it), or a timeout with a non-NLHE error, is it not necessary to wrap it in a TimeoutError?

@erikgrinaker (Contributor, Author) replied:


We want a lease acquisition timeout to result in an NLHE, so that the DistSender will go try some other replica that may be able to acquire the lease instead. We therefore explicitly construct an NLHE on context cancellation. In the case where the client context was cancelled (e.g. by a disconnect), the DistSender will bail out anyway, around here:

	// Has the caller given up?
	if ctx.Err() != nil {
		// Don't consider this a sendError, because sendErrors indicate that we
		// were unable to reach a replica that could serve the request, and they
		// cause range cache evictions. Context cancellations just mean the
		// sender changed its mind or the request timed out.
		if ambiguousError != nil {
			err = roachpb.NewAmbiguousResultError(errors.Wrapf(ambiguousError, "context done during DistSender.Send"))
		} else {
			err = errors.Wrap(ctx.Err(), "aborted during DistSender.Send")
		}
		log.Eventf(ctx, "%v", err)
		return nil, err
	}

For reference, the DistSender is a fat KV RPC client that's responsible for e.g. routing KV requests to the correct ranges and leaseholders.

@nvanbenschoten (Member) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return

Is this a problem? Is there a reason why we believe a canceled and frequently retried lease acquisition attempt has a better chance of succeeding on a given replica than a lease request that is allowed to run for longer than any single caller is willing to wait?

@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return

Is this a problem? Is there a reason why we believe a canceled and frequently retried lease acquisition attempt has a better chance of succeeding on a given replica than a lease request that is allowed to run for longer than any single caller is willing to wait?

Really depends on the specific failure mode, I think. Seems worthwhile to at least give it a shot. Is there a reason to believe it would never help? Is there a downside to it? We would only be able to launch a new request if the previous cancellation took effect anyway, so we shouldn't be risking any pileups.

@nvanbenschoten (Member) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Really depends on the specific failure mode, I think. Seems worthwhile to at least give it a shot. Is there a reason to believe it would never help? Is there a downside to it? We would only be able to launch a new request if the previous cancellation took effect anyway, so we shouldn't be risking any pileups.

I'm just questioning the added complexity (two layers of timeouts, with custom timeout handling at each level that we need to keep in sync), which doesn't appear to be necessary. It's the kind of thing we should be clear about the benefit of (or lack thereof) when adding it, or we'll never be able to remove it in the future.

@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm just questioning the added complexity (two layers of timeouts, with custom timeout handling at each level that we need to keep in sync), which doesn't appear to be necessary. It's the kind of thing we should be clear about the benefit of (or lack thereof) when adding it, or we'll never be able to remove it in the future.

I hear you.

The outer timeout sufficiently addresses the primary problem we're trying to solve, in that it will ensure callers eventually receive an NLHE and go try another replica regardless of what happens to the lease request itself.

The inner timeout attempts to address a separate, secondary concern that a lease request could get stuck, which could prevent that replica from ever acquiring the lease. You could argue that we have sufficient downstream error handling and retry mechanisms to prevent this, but I still think it would be prudent. It isn't really that different from having separate client-side and server-side timeouts, and shouldn't require keeping anything in sync (if either fires, it independently returns an NLHE).

That said, this is a secondary problem, so I'm fine with dropping it for now and revisiting if we see it happening in the wild. Will update the PR shortly.
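For what it's worth, a toy sketch of that independence (hypothetical names, not the real code, and the inner layer was ultimately dropped as described below): each layer wraps its own context.WithTimeout and converts its own cancellation into a retriable error, so there is nothing to keep in sync between them.

```go
package leasesketch

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotLeaseHolder is a stand-in for roachpb.NotLeaseHolderError in this sketch.
var errNotLeaseHolder = errors.New("not lease holder")

// withLeaseTimeout wraps op in its own timeout and turns a cancellation of that
// timeout into the retriable stand-in error. Each layer that uses it is
// self-sufficient: whichever timeout fires first produces the error on its own,
// with no coordination between layers.
func withLeaseTimeout(
	ctx context.Context, timeout time.Duration, op func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := op(ctx); err != nil {
		if ctx.Err() != nil {
			return fmt.Errorf("%w: lease acquisition canceled: %v", errNotLeaseHolder, err)
		}
		return err
	}
	return nil
}

// acquireNested nests a per-caller (outer) and a per-lease-request (inner)
// timeout; the durations here are arbitrary placeholders.
func acquireNested(ctx context.Context, sendLeaseRPC func(context.Context) error) error {
	return withLeaseTimeout(ctx, 6*time.Second, func(ctx context.Context) error {
		return withLeaseTimeout(ctx, 3*time.Second, sendLeaseRPC)
	})
}
```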

In ab74b97 we added internal timeouts for lease acquisitions. These
were wrapped in `RunWithTimeout()`, as mandated for context timeouts.
However, this would mask the returned `NotLeaseHolderError` as a
`TimeoutError`, preventing the DistSender from retrying it and instead
propagating it out to the client. Additionally, context cancellation
errors from the actual RPC call were never wrapped as a
`NotLeaseHolderError` in the first place.

This ended up only happening in a very specific scenario where the outer
timeout added to the client context did not trigger, but the inner
timeout for the coalesced request context did trigger while the lease
request was in flight. Accidentally, the outer `RunWithTimeout()` call
did not return the `roachpb.Error` from the closure but instead passed
it via a captured variable, bypassing the error wrapping.

This patch replaces the `RunWithTimeout()` calls with regular
`context.WithTimeout()` calls to avoid the error wrapping. Another
option would be to extract an NLHE from the error chain, but this would
require correct propagation of the structured error chain across RPC
boundaries, so with an eye towards backports we instead choose to return
a bare `NotLeaseHolderError`.

The patch also removes the inner lease request timeout, since it has
questionable benefits and the outer timeout is sufficient to prevent
leases getting stuck for the overall range.

Release note (bug fix): Fixed a bug where clients could sometimes
receive errors due to lease acquisition timeouts of the form
`operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s`.
@erikgrinaker force-pushed the nlhe-context-timeout branch from 8c690ca to 067e740 on August 1, 2022 15:16
@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)


pkg/kv/kvserver/replica_range_lease.go line 1241 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I hear you.

The outer timeout sufficiently addresses the primary problem we're trying to solve, in that it will ensure callers eventually receive an NLHE and go try another replica regardless of what happens to the lease request itself.

The inner timeout attempts to address a separate, secondary concern that a lease request could get stuck, which could prevent that replica from ever acquiring the lease. You could argue that we have sufficient downstream error handling and retry mechanisms to prevent this, but I still think it would be prudent. It isn't really that different from having separate client-side and server-side timeouts, and shouldn't require keeping anything in sync (if either fires, it independently returns an NLHE).

That said, this is a secondary problem, so I'm fine with dropping it for now and revisiting if we see it happening in the wild. Will update the PR shortly.

Updated, PTAL.

@nvanbenschoten (Member) left a comment


:lgtm:

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pavelkalinnikov)

@erikgrinaker (Contributor, Author) commented:

TFTR!

bors r=nvanbenschoten

@craig (craig bot) merged commit 314baa5 into cockroachdb:master on Aug 1, 2022

craig bot commented Aug 1, 2022

Build succeeded:


blathers-crl bot commented Aug 1, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 067e740 to blathers/backport-release-21.2-84865: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from 067e740 to blathers/backport-release-22.1-84865: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
