kvserver: always return NLHE on lease acquisition timeouts #84865
Conversation
Force-pushed from 40fb3d8 to 05c720c.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/replica_range_lease.go
line 1241 at r1 (raw file):
timeout := 2 * r.store.cfg.RaftElectionTimeout() // Does not use RunWithTimeout(), because we do not want to mask the
Why do we have two different, nested timeouts? The first is here (`redirectOnOrAcquireLeaseForRequest`) and the second is in `requestLeaseAsync`, which is called by this function. Isn't the timeout here sufficient to accomplish what you wanted in ab74b97? The context deadline handling in `redirectOnOrAcquireLeaseForRequestWithoutTimeout` already correctly promotes a ctx err to a NLHE.

I'm sure there's a good reason, I just can't find it in #81136.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go
line 1241 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why do we have two different, nested timeouts? The first is here (`redirectOnOrAcquireLeaseForRequest`) and the second is in `requestLeaseAsync`, which is called by this function. Isn't the timeout here sufficient to accomplish what you wanted in ab74b97? The context deadline handling in `redirectOnOrAcquireLeaseForRequestWithoutTimeout` already correctly promotes a ctx err to a NLHE. I'm sure there's a good reason, I just can't find it in #81136.
`requestLeaseAsync` (with the inner timeout) uses a base context that's tied to the number of waiting callers, cancelling when it drops to 0. Even if every caller sets its own timeout, if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return.

You might ask why the inner timeout isn't sufficient then, since all callers will receive an error when it triggers. This was done out of an abundance of caution, in case any of the other operations in `redirectOnOrAcquireLeaseForRequest` might stall (haven't found any likely culprits), or if the coalesced context handling in `requestLeaseAsync` might fail to return an error for whatever reason. We should be able to remove it if you have a strong opinion on it, but I figured better safe than sorry given the consequences of a stall here.
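For intuition, here is a minimal sketch of the coalesced-request pattern described above: one shared lease request whose context is cancelled only once the last waiting caller gives up. The names (`pendingLeaseRequest`, `join`, `leave`) are hypothetical stand-ins, not the actual CockroachDB code:

```go
package lease

import (
	"context"
	"sync"
)

// pendingLeaseRequest sketches a single in-flight lease request shared by
// many callers. Its context derives from a long-lived base context and is
// cancelled only when the number of waiting callers drops to zero, so a
// steady influx of callers can keep the shared request alive indefinitely,
// even though each caller applies its own timeout to its own wait.
type pendingLeaseRequest struct {
	mu      sync.Mutex
	callers int
	cancel  context.CancelFunc // cancels the shared request context
}

// join registers a waiting caller.
func (p *pendingLeaseRequest) join() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.callers++
}

// leave deregisters a caller and cancels the shared request once nobody is
// waiting for its result anymore.
func (p *pendingLeaseRequest) leave() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.callers--
	if p.callers == 0 && p.cancel != nil {
		p.cancel()
	}
}
```

Under this pattern, only an inner timeout on the shared request bounds the request itself; the per-caller timeouts bound only how long each caller waits.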
Force-pushed from 05c720c to 8c690ca.
```go
// NotLeaseHolderError on context cancellation.
requestLeaseCtx, requestLeaseCancel := context.WithTimeout(ctx, timeout) // nolint:context
defer requestLeaseCancel()
err := p.requestLease(requestLeaseCtx, nextLeaseHolder, reqLease, status, leaseReq)
```
Did the outer code rely on this `err` being a `TimeoutError` in case of a timeout? Because now it won't be anymore. Do you need to wrap it somewhere down from this line, similarly to how `RunWithTimeout` did?
No, rather the opposite -- the upper-level code relied on it being a `NotLeaseHolderError`, which got masked by the `TimeoutError`.
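To illustrate the masking, here is a minimal, hypothetical sketch of the pattern under discussion: a `RunWithTimeout`-style helper that wraps whatever the closure returns in a generic timeout error, so a caller that dispatches on the concrete error type no longer recognizes the structured error. The helper and error type below are illustrative stand-ins, not the actual `contextutil`/`roachpb` APIs:

```go
package lease

import (
	"context"
	"fmt"
	"time"
)

// notLeaseHolderError stands in for roachpb.NotLeaseHolderError; it tells
// the client to retry the request on another replica.
type notLeaseHolderError struct{ msg string }

func (e *notLeaseHolderError) Error() string { return e.msg }

// runWithTimeout mimics a RunWithTimeout-style helper: it runs fn under a
// context timeout and, if the deadline fired, replaces fn's error with a
// generic "operation timed out" error.
func runWithTimeout(ctx context.Context, op string, timeout time.Duration,
	fn func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := fn(ctx); err != nil {
		if ctx.Err() != nil {
			// The structured error is now hidden behind the timeout error.
			return fmt.Errorf("operation %q timed out after %s: %v", op, timeout, err)
		}
		return err
	}
	return nil
}

// A caller that checks the concrete error type (as the client effectively
// does once the error has crossed an RPC boundary) no longer sees the
// NotLeaseHolderError and gives up instead of retrying elsewhere:
//
//	err := runWithTimeout(ctx, "requesting lease", 6*time.Second, acquire)
//	if _, ok := err.(*notLeaseHolderError); ok {
//	    // retry on another replica -- never reached when wrapped above
//	}
```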
So, in the scenario when it's just a pure timeout (with no `NotLeaseHolderError` racing with it), or a timeout with a non-`NLHE` error, is it not necessary to wrap into `TimeoutError`?
We want a lease acquisition timeout to result in a NLHE, so that the DistSender will go try some other replica that may be able to acquire the lease instead. We therefore explicitly construct a NLHE on context cancellation. In the case where the client context was cancelled (e.g. by a disconnect), the DistSender will bail out anyway, around here:
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go, lines 2248 to 2262 at f77e393:

```go
// Has the caller given up?
if ctx.Err() != nil {
	// Don't consider this a sendError, because sendErrors indicate that we
	// were unable to reach a replica that could serve the request, and they
	// cause range cache evictions. Context cancellations just mean the
	// sender changed its mind or the request timed out.
	if ambiguousError != nil {
		err = roachpb.NewAmbiguousResultError(errors.Wrapf(ambiguousError, "context done during DistSender.Send"))
	} else {
		err = errors.Wrap(ctx.Err(), "aborted during DistSender.Send")
	}
	log.Eventf(ctx, "%v", err)
	return nil, err
}
```
For reference, the DistSender is a fat KV RPC client that's responsible for e.g. routing KV requests to the correct ranges and leaseholders.
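Continuing the stand-in types from the sketch above, a minimal illustration of the behavior described here: the lease request path converts a context-cancellation failure into a NotLeaseHolderError-style error so the DistSender tries another replica, while a cancelled client context is caught by the DistSender itself (as in the quoted code). This helper is hypothetical, not the actual `requestLease` implementation:

```go
// requestLeaseSketch issues the lease RPC under a context timeout and, if it
// fails because that context was cancelled or timed out, reports a
// NotLeaseHolderError-style error (reusing the notLeaseHolderError stand-in
// above) rather than the raw context error, so the DistSender retries on
// another replica instead of giving up.
func requestLeaseSketch(ctx context.Context, timeout time.Duration,
	send func(context.Context) error) error {
	reqCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := send(reqCtx); err != nil {
		if reqCtx.Err() != nil {
			return &notLeaseHolderError{msg: fmt.Sprintf("lease request canceled: %v", err)}
		}
		return err
	}
	return nil
}
```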
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)
pkg/kv/kvserver/replica_range_lease.go
line 1241 at r1 (raw file):
if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return
Is this a problem? Is there a reason why we believe a canceled and frequently retried lease acquisition attempt has a better chance of succeeding on a given replica than a lease request that is allowed to run for longer than any single caller is willing to wait?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)
pkg/kv/kvserver/replica_range_lease.go
line 1241 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
if there is a steady influx of callers then the inner context may never get cancelled and the lease attempt would never return
Is this a problem? Is there a reason why we believe a canceled and frequently retried lease acquisition attempt has a better chance of succeeding on a given replica than a lease request that is allowed to run for longer than any single caller is willing to wait?
Really depends on the specific failure mode, I think. Seems worthwhile to at least give it a shot. Is there a reason to believe it would never help? Is there a downside to it? We would only be able to launch a new request if the previous cancellation took effect anyway, so we shouldn't be risking any pileups.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)
pkg/kv/kvserver/replica_range_lease.go
line 1241 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Really depends on the specific failure mode, I think. Seems worthwhile to at least give it a shot. Is there a reason to believe it would never help? Is there a downside to it? We would only be able to launch a new request if the previous cancellation took effect anyway, so we shouldn't be risking any pileups.
I'm just questioning added complexity (two layers of timeouts and custom timeout handling at each level that we need to keep in sync) which doesn't appear to be necessary. It's the kind of thing that we should be clear about the benefit of (or lack thereof) when adding or we'll never be able to remove it in the future.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)
pkg/kv/kvserver/replica_range_lease.go
line 1241 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm just questioning added complexity (two layers of timeouts and custom timeout handling at each level that we need to keep in sync) which doesn't appear to be necessary. It's the kind of thing that we should be clear about the benefit of (or lack thereof) when adding or we'll never be able to remove it in the future.
I hear you.
The outer timeout sufficiently addresses the primary problem we're trying to solve, in that it will ensure callers eventually receive an NLHE and go try another replica regardless of what happens to the lease request itself.
The inner timeout attempts to address a separate, secondary concern that a lease request could get stuck, which could prevent that replica from ever acquiring the lease. You could argue that we have sufficient downstream error handling and retry mechanisms to prevent this, but I still think it would be prudent. It isn't really that different from having separate client-side and server-side timeouts, and shouldn't require keeping anything in sync (if either fires, it independently returns a NLHE).
That said, this is a secondary problem, so I'm fine with dropping it for now and revisiting if we see it happening in the wild. Will update the PR shortly.
In ab74b97 we added internal timeouts for lease acquisitions. These were wrapped in `RunWithTimeout()`, as mandated for context timeouts. However, this would mask the returned `NotLeaseHolderError` as a `TimeoutError`, preventing the DistSender from retrying it and instead propagating it out to the client. Additionally, context cancellation errors from the actual RPC call were never wrapped as a `NotLeaseHolderError` in the first place.

This ended up only happening in a very specific scenario where the outer timeout added to the client context did not trigger, but the inner timeout for the coalesced request context did trigger while the lease request was in flight. Accidentally, the outer `RunWithTimeout()` call did not return the `roachpb.Error` from the closure but instead passed it via a captured variable, bypassing the error wrapping.

This patch replaces the `RunWithTimeout()` calls with regular `context.WithTimeout()` calls to avoid the error wrapping. Another option would be to extract an NLHE from the error chain, but this would require correct propagation of the structured error chain across RPC boundaries, so with an eye towards backports we instead choose to return a bare `NotLeaseHolderError`.

The patch also removes the inner lease request timeout, since it has questionable benefits and the outer timeout is sufficient to prevent leases getting stuck for the overall range.

Release note (bug fix): Fixed a bug where clients could sometimes receive errors due to lease acquisition timeouts of the form `operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s`.
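The "other option" mentioned above would look roughly like the sketch below in plain Go: walk the error chain with `errors.As` and pull out the structured error. It only works if the chain survives end to end, which is precisely what isn't guaranteed once the error has crossed an RPC boundary, hence the choice of a bare `NotLeaseHolderError`. The type and helper here are illustrative only:

```go
package lease

import "errors"

// notLeaseHolderErr stands in for roachpb.NotLeaseHolderError.
type notLeaseHolderErr struct{ msg string }

func (e *notLeaseHolderErr) Error() string { return e.msg }

// extractNLHE walks the error chain looking for a NotLeaseHolderError-style
// error. This relies on the structured chain being preserved (e.g. via
// fmt.Errorf("...: %w", err)); once an RPC boundary has flattened the error
// to a string, errors.As finds nothing and a fresh error must be
// constructed anyway.
func extractNLHE(err error) (*notLeaseHolderErr, bool) {
	var nlhe *notLeaseHolderErr
	if errors.As(err, &nlhe) {
		return nlhe, true
	}
	return nil, false
}
```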
Force-pushed from 8c690ca to 067e740.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)
pkg/kv/kvserver/replica_range_lease.go
line 1241 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I hear you.
The outer timeout sufficiently addresses the primary problem we're trying to solve, in that it will ensure callers eventually receive an NLHE and go try another replica regardless of what happens to the lease request itself.
The inner timeout attempts to address a separate, secondary concern that a lease request could get stuck, which could prevent that replica from ever acquiring the lease. You could argue that we have sufficient downstream error handling and retry mechanisms to prevent this, but I still think it would be prudent. It isn't really that different from having separate client-side and server-side timeouts, and shouldn't require keeping anything in sync (if either fires, it independently returns a NLHE).
That said, this is a secondary problem, so I'm fine with dropping it for now and revisiting if we see it happening in the wild. Will update the PR shortly.
Updated, PTAL.
Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @pavelkalinnikov)
TFTR! bors r=nvanbenschoten

Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:

You might need to create your backport manually using the backport tool.

error creating merge commit from 067e740 to blathers/backport-release-21.2-84865: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.

error creating merge commit from 067e740 to blathers/backport-release-22.1-84865: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
In ab74b97 we added internal timeouts for lease acquisitions. These were wrapped in `RunWithTimeout()`, as mandated for context timeouts. However, this would mask the returned `NotLeaseHolderError` as a `TimeoutError`, preventing the DistSender from retrying it and instead propagating it out to the client. Additionally, context cancellation errors from the actual RPC call were never wrapped as a `NotLeaseHolderError` in the first place.

This ended up only happening in a very specific scenario where the outer timeout added to the client context did not trigger, but the inner timeout for the coalesced request context did trigger while the lease request was in flight. Accidentally, the outer `RunWithTimeout()` call did not return the `roachpb.Error` from the closure but instead passed it via a captured variable, bypassing the error wrapping.

This patch replaces the `RunWithTimeout()` calls with regular `context.WithTimeout()` calls to avoid the error wrapping, and returns a `NotLeaseHolderError` from `requestLease()` if the RPC request fails and the context was cancelled (presumably causing the error). Another option would be to extract an NLHE from the error chain, but this would require correct propagation of the structured error chain across RPC boundaries, so out of an abundance of caution and with an eye towards backports, we instead choose to return a bare `NotLeaseHolderError`.

The empty lease in the returned error prevents the DistSender from updating its caches on context cancellation.

Resolves #84258.
Resolves #85115.

Release note (bug fix): Fixed a bug where clients could sometimes receive errors due to lease acquisition timeouts of the form `operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s`.