kvcoord: optionally update range cache with descriptor returned in NLHE
We recently started shipping the range descriptors back on
NotLeaseHolderError. Prior to this patch, we only ever used their
generation to elide certain lease updates if they originated from a
replica that had an older (stale) view of the world.

This patch goes a step further in making use of the returned range
descriptor -- we now update the client's range cache if the returned
range descriptor is newer than what existed on the client. We do this
by picking the freshest range descriptor and lease, independently,
between what exists in the client's range cache and what was returned
in the NLHE. Though unlikely, independently choosing
the freshest lease/range descriptor can lead to cases where the lease is
not compatible with the range descriptor. If we detect this to be the
case, we empty out the lease and simply cache the freshest range
descriptor.

Previously, we always accepted speculative leases to be more recent than
anything already in the cache. Now, we discard speculative leases if
they're coming from a replica that has an older view of the world.
Morally, these semantics are the right way to conceptualize #72772,
given that hazard could only ever exist for speculative leases. #82802
also falls out as a special case of this conceptualization, and this
patch adds a regression test for the scenario where an uninitialized
replica returns a NLHE with a speculative lease pointing to a replica
that isn't part of the range.

We also get rid of optimizations where we would try to identify "stale"
range descriptors when updating lease information. These no longer make
sense given we don't just update the lease, we also update the range
descriptor. These optimizations also didn't account for all possible
cases, such as when the leaseholder was present on the range
descriptor, but as a LEARNER. See #75742 for more details about how
this hazard manifests.

With this new approach we also address #75742. Instead of invalidating
the routing when a descriptor is identified as stale and using that as
a proxy to bail early when routing to replicas, we instead switch to a
more direct approach -- if at any point we detect the leaseholder isn't
included on the transport, we exit the routing logic early, and retry
at a layer above. Given our new range cache update semantics using
descriptors on NLHE errors, it's quite likely this retry circumvents a
range descriptor lookup from meta2 -- instead, we'd expect this to
simply amount to trying again with a newly constructed transport. The
rationale is that even though the client may have an arbitrarily stale
view of the range, we never expect it to regress. Thus, if the
leaseholder was updated on the client and it doesn't exist on the
transport, it must be the case that the transport was constructed using
a stale range descriptor and stale leaseholder information. As such,
there is likely not much value in trying to exhaust the transport
(and potentially getting caught in backoff loops). We're much better
served by bailing early and trying again with a fresh transport.

Fixes #82802
Fixes #75742

Release note: None
Release justification: Fixes for high priority bugs in existing
functionality.
arulajmani committed Aug 15, 2022
1 parent e84026c commit 5fb48d6
Showing 7 changed files with 1,006 additions and 390 deletions.
41 changes: 30 additions & 11 deletions pkg/kv/kvclient/kvcoord/dist_sender.go
@@ -2014,12 +2014,6 @@ func (ds *DistSender) sendToReplicas(
// part of routing.entry.Desc. The transport starts up initialized with
// routing's replica info, but routing can be updated as we go through the
// replicas, whereas transport isn't.
- //
- // TODO(andrei): The structure around here is no good; we're potentially
- // updating routing with replicas that are not part of transport, and so
- // those replicas will never be tried. Instead, we'll exhaust the transport
- // and bubble up a SendError, which will cause a cache eviction and a new
- // descriptor lookup potentially unnecessarily.
lastErr := err
if lastErr == nil && br != nil {
lastErr = br.Error.GoError()
@@ -2201,11 +2195,9 @@ func (ds *DistSender) sendToReplicas(

var updatedLeaseholder bool
if tErr.Lease != nil {
- updatedLeaseholder = routing.UpdateLease(ctx, tErr.Lease, tErr.RangeDesc.Generation)
+ updatedLeaseholder = routing.SyncTokenAndMaybeUpdateCache(ctx, tErr.Lease, &tErr.RangeDesc)
} else if tErr.LeaseHolder != nil {
- // tErr.LeaseHolder might be set when tErr.Lease isn't.
- routing.UpdateLeaseholder(ctx, *tErr.LeaseHolder, tErr.RangeDesc.Generation)
- updatedLeaseholder = true
+ updatedLeaseholder = routing.SyncTokenAndMaybeUpdateCacheWithSpeculativeLease(ctx, *tErr.LeaseHolder, &tErr.RangeDesc)
}
// Move the new leaseholder to the head of the queue for the next
// retry. Note that the leaseholder might not be the one indicated by
@@ -2219,7 +2211,25 @@ func (ds *DistSender) sendToReplicas(
// lease expires and someone else gets a new one, so by moving on we
// get out of possibly infinite loops.
if !lh.IsSame(curReplica) || sameReplicaRetries < sameReplicaRetryLimit {
- transport.MoveToFront(*lh)
+ moved := transport.MoveToFront(*lh)
+ if !moved {
+ // The transport always includes the client's view of the
+ // leaseholder when it's constructed. If the leaseholder can't
+ // be found on the transport then it must be the case that the
+ // routing was updated with lease information that is not
+ // compatible with the range descriptor that was used to
+ // construct the transport. A client may have an arbitrarily
+ // stale view of the leaseholder, but it is never expected to
+ // regress. As such, advancing through each replica on the
+ // transport until it's exhausted is unlikely to achieve much.
+ //
+ // We bail early by returning a SendError. The expectation is
+ // for the client to retry with a fresher eviction token.
+ log.VEventf(
+ ctx, 2, "transport incompatible with updated routing; bailing early",
+ )
+ return nil, newSendError(fmt.Sprintf("leaseholder not found in transport; last error: %s", tErr.Error()))
+ }
}
}
// Check whether the request was intentionally sent to a follower
@@ -2234,6 +2244,15 @@ func (ds *DistSender) sendToReplicas(
// the leaseholder, we backoff because it might be the case that
// there's a lease transfer in progress and the would-be leaseholder
// has not yet applied the new lease.
+ //
+ // TODO(arul): The idea here is for the client to not keep sending
+ // the would-be leaseholder multiple requests and backoff a bit to let
+ // it apply its lease. Instead of deriving this information like
+ // we do above, we could instead check if we're retrying the same
+ // leaseholder (i.e, if the leaseholder on the routing is the same as
+ // the replica we just tried), in which case we should backoff. With
+ // this scheme we'd no longer have to track "updatedLeaseholder" state
+ // when syncing the NLHE with the range cache.
shouldBackoff := !updatedLeaseholder && !intentionallySentToFollower
if shouldBackoff {
ds.metrics.InLeaseTransferBackoffs.Inc(1)
