kvclient: ignore stale lease information from lagging replicas #73697

Conversation


@aayushshah15 aayushshah15 commented Dec 10, 2021

This commit makes it such that the DistSender ignores the lease information
returned in NotLeaseHolderErrors coming from replicas that have a stale view of
the range descriptor (characterized by an older DescriptorGeneration on the
replica).

Not doing so before was hazardous because, if we received an NLHE that pointed
to a replica that did not belong in the cached descriptor, we'd trigger a cache
eviction. This assumed that the replica returning the error had a fresher view
of the range than what we had in the cache, which is not always true. This
meant that we'd keep doing range lookups and subsequent evictions until this
lagging replica caught up to the current state of the range.

Fixes #72772

Release note (bug fix): A bug that caused high SQL tail latencies during
background rebalancing in the cluster has been fixed.

@aayushshah15 aayushshah15 requested a review from a team as a code owner December 10, 2021 21:25
@cockroach-teamcity

This change is Reviewable

@aayushshah15

Note: In this commit, we're always ignoring the lease information coming from replicas that have a stale view of the range descriptor, even if the lease information is compatible with the cached range descriptor. I wasn't sure if using the lease information from a lagging replica is better than simply trying the next replica in the cached range descriptor. Do y'all have thoughts on that?

@aayushshah15 aayushshah15 force-pushed the 20211209_fixRangeCacheThrashingDuringRebalance branch 3 times, most recently from 0b09afd to d31a069 Compare December 12, 2021 06:56

@nvanbenschoten nvanbenschoten left a comment

Note: In this commit, we're always ignoring the lease information coming from replicas that have a stale view of the range descriptor, even if the lease information is compatible with the cached range descriptor. I wasn't sure if using the lease information from a lagging replica is better than simply trying the next replica in the cached range descriptor. Do y'all have thoughts on that?

This is an interesting observation. It relates to a comment I was going to leave down below, but which I'm going to pull up here instead.

We're adding conditional logic to compare the information in the NotLeaseHolderError with the information in the eviction token up in sendToReplicas. This isn't quite how things work elsewhere. In most places, this kind of logic lives in the EvictionToken or CacheEntry. The methods on EvictionToken like UpdateLease and EvictAndReplace deal with the messy job of trying to reconcile partial and inconsistent information into mostly-monotonic state transitions in the cache.

If we follow the model laid out there, then I think the approach we'd want to take with this change is to plumb the descriptor generation through EvictionToken.UpdateLease and EvictionToken.UpdateLeaseholder into CacheEntry.updateLease. Inside CacheEntry.updateLease, we could then update the logic that consults the range descriptor to look something like:

_, ok := e.desc.GetReplicaDescriptorByID(l.Replica.ReplicaID)
if !ok {
    // ...
    if gen != 0 && gen < e.desc.Generation {
        return false, e
    }
    // ...
    return true, nil
}

One major benefit of this is that it will be easier to test (in addition to the high-level test you already have).
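
To make the shape of that concrete, here is a small self-contained sketch of the check being proposed. The types and names below are illustrative stand-ins, not the actual rangecache code, and the real method signatures may differ:

package main

import "fmt"

// Stand-ins for the real descriptor, lease, and cache-entry types.
type rangeDesc struct {
    generation int64
    replicas   map[int64]bool // replica IDs present in the descriptor
}

type leaseInfo struct {
    replicaID int64
}

type cacheEntry struct {
    desc  rangeDesc
    lease leaseInfo
}

// updateLease mirrors the sketch above: when the lease points at a replica
// that is not in the cached descriptor, only treat the cached entry as stale
// if the error's descriptor generation is not older than the cached one.
func (e *cacheEntry) updateLease(l leaseInfo, errDescGen int64) (bool, *cacheEntry) {
    if !e.desc.replicas[l.replicaID] {
        if errDescGen != 0 && errDescGen < e.desc.generation {
            // The replica that sent the error lags our cached view; ignore
            // its lease information and keep the cached entry.
            return false, e
        }
        // The sender's view is at least as fresh as ours; signal that the
        // cached entry should be evicted.
        return true, nil
    }
    updated := *e
    updated.lease = l
    return true, &updated
}

func main() {
    e := &cacheEntry{desc: rangeDesc{generation: 7, replicas: map[int64]bool{1: true, 2: true}}}
    // The error names replica 3 (unknown to our gen-7 descriptor) and came
    // from a replica whose descriptor generation is only 5: ignore it.
    ok, entry := e.updateLease(leaseInfo{replicaID: 3}, 5)
    fmt.Println(ok, entry == e) // false true
}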

Reviewed 3 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvclient/kvcoord/dist_sender.go, line 2086 at r1 (raw file):

				// We cannot trust the lease information in the error if it is coming
				// from a replica that has a stale view of the range (compared to what
				// we've got in the cache).

Could you add to this comment and explain how we can get in this situation, what it means regarding the state of the replica that returned this error, and why we're careful not to evict from the cache in this case?


pkg/kv/kvclient/kvcoord/dist_sender.go, line 2094 at r1 (raw file):

					)
				}
				if (tErr.Lease != nil || tErr.LeaseHolder != nil) && !withStaleRangeDesc {

nit: consider avoiding the nesting and instead splitting this into multiple conditions with break statements and explanations in each. Something like:

case *roachpb.NotLeaseHolderError:
    ds.metrics.NotLeaseHolderErrCount.Inc(1)

    if tErr.Lease == nil && tErr.LeaseHolder == nil {
        // The replica knew it did not hold the lease, but did not provide us
        // with a better alternative. Loop around and try the next replica.
        break
    }

    if tErr.DescriptorGeneration != 0 && tErr.DescriptorGeneration < routing.Desc().Generation {
        // The replica had a stale view of the range ...
        break
    }

    // If we got some lease information, we use it to update our cache.
    // Naively this would also happen when the next RPC comes back, but we
    // don't want to wait out the additional RPC latency.
    var updatedLeaseholder bool

pkg/kv/kvclient/kvcoord/dist_sender_test.go, line 1023 at r1 (raw file):

			return reply, nil
		}
		require.Equal(t, ba.Replica.NodeID, roachpb.NodeID(1))

nit: instead of asserting the node ID here and then maintaining a callsToNode2 variable, consider creating a var calls []int variable, appending the node ID on each call to sendFn, and then asserting that the slice looks like []int{2, 1} at the end.
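
The suggested pattern could look roughly like this. This is a self-contained sketch (the real test would record the node ID inside sendFn rather than a plain helper):

package example

import (
    "testing"

    "github.com/stretchr/testify/require"
)

// TestCallOrder sketches the pattern: record the node ID seen on every send,
// then assert the whole sequence at the end instead of asserting inside the
// send function and counting calls to one node.
func TestCallOrder(t *testing.T) {
    var calls []int
    send := func(nodeID int) {
        calls = append(calls, nodeID)
    }

    // Simulate the DistSender trying the stale leaseholder (node 2) first and
    // then falling back to node 1.
    send(2)
    send(1)

    require.Equal(t, []int{2, 1}, calls)
}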


pkg/kv/kvserver/replica_range_lease.go, line 908 at r1 (raw file):

		RangeID:              rangeDesc.RangeID,
		CustomMsg:            msg,
		DescriptorGeneration: rangeDesc.Generation,

nit: add this above CustomMsg to keep the fields in the same order as their declaration.


pkg/roachpb/errors.proto, line 60 at r1 (raw file):

      (gogoproto.customname) = "RangeID", (gogoproto.casttype) = "RangeID"];
  // The range descriptor generation of the replica the error originated from.
  // Used to determine whether the error was returned because the replica had a

"Used by the DistSender to determine"...

@aayushshah15 aayushshah15 force-pushed the 20211209_fixRangeCacheThrashingDuringRebalance branch 3 times, most recently from 022624f to f0bcfaa Compare January 17, 2022 10:59

@aayushshah15 aayushshah15 left a comment

I changed this around to do this reconciliation inside CacheEntry.updateLease() and added a lower-level test check for it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go, line 2086 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add to this comment and explain how we can get in this situation, what it means regarding the state of the replica that returned this error, and why we're careful not to evict from the cache in this case?

Added this explanation inside updateLease.


pkg/kv/kvserver/replica_range_lease.go, line 908 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: add this above CustomMsg to keep the fields in the same order as their declaration.

Done.


pkg/roachpb/errors.proto, line 60 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"Used by the DistSender to determine"...

Done.


pkg/kv/kvclient/kvcoord/dist_sender_test.go, line 1023 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: instead of asserting the node ID here and then maintaining a callsToNode2 variable, consider creating a var calls []int variable, append the node ID on each call to sendFn, and then asserting that the slice looks like []int{2, 1} at the end.

Done.

@aayushshah15 aayushshah15 force-pushed the 20211209_fixRangeCacheThrashingDuringRebalance branch 2 times, most recently from 8fd22d8 to dbf7572 Compare January 18, 2022 18:34

@nvanbenschoten nvanbenschoten left a comment

This all :lgtm:, but before merging it I think @arulajmani should take a look.

Arul has been investigating a similar issue where the NotLeaseHolderError comes from a replica with a newer view of the RangeDescriptor than that which is in the RangeCache. In other words, he's been exploring an issue that looks to be caused by the inverse gen > e.desc.Generation variant of this "the lease is incompatible with the cached descriptor" check. The solution to avoiding a client-side backoff in that case may be very similar to the one here. However, this is a case where the newer RangeDescriptor would be valuable to the range cache. So his issue may motivate a small change in this PR where we add the full RangeDescriptor to the NotLeaseHolderError, instead of just the RangeGeneration. We could then use this updated RangeDescriptor to directly update the RangeCache, avoiding a cache eviction or meta lookup.
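
As a rough illustration of that alternative (hypothetical helper and stand-in types, not the actual DistSender code), carrying the full descriptor on the error would let the client adopt it directly when it is newer:

package sketch

// rangeDescriptor is an illustrative stand-in for roachpb.RangeDescriptor;
// only the generation matters for this sketch.
type rangeDescriptor struct {
    generation int64
}

// maybeAdoptDescriptor returns the descriptor the cache should keep going
// forward: if the NotLeaseHolderError carried a descriptor with a higher
// generation than the cached one, adopt it directly instead of evicting the
// cache entry and performing a meta range lookup.
func maybeAdoptDescriptor(cached, fromErr *rangeDescriptor) *rangeDescriptor {
    if fromErr != nil && fromErr.generation > cached.generation {
        return fromErr
    }
    return cached
}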

Reviewed 4 of 7 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)

@aayushshah15

In other words, he's been exploring an issue that looks to be caused by the inverse gen > e.desc.Generation variant of this "the lease is incompatible with the cached descriptor" check

To check my understanding, in that instance, the range cache currently thinks that the lease is compatible with the cached descriptor even when the lease is on a replica that's a LEARNER in the cached desc.

// TODO(andrei): If the leaseholder is present, but the descriptor lists the
// replica as a learner, this is a sign of a stale descriptor. I'm not sure
// what to do about it, though.

So, despite the fact that we're returning true from EvictionToken.UpdateLease(), since the LEARNER replica isn't part of the transport, we end up trying all the other replicas, and then finally do a cache eviction.

//
// TODO(andrei): The structure around here is no good; we're potentially
// updating routing with replicas that are not part of transport, and so
// those replicas will never be tried. Instead, we'll exhaust the transport
// and bubble up a SendError, which will cause a cache eviction and a new
// descriptor lookup potentially unnecessarily.

Is this what's happening in that issue?

@arulajmani

So, despite the fact that we're returning true from EvictionToken.UpdateLease(), since the LEARNER replica isn't part of the transport, we end up trying all the other replicas, and then finally do a cache eviction.

In that issue we see backoffs for every replica that we try, and as backoffs are predicated on EvictionToken.UpdateLease() returning false, I don't think we end up reaching the first TODO you've linked above. I think what's happening here is that the RangeCache has been updated with the correct lease information between the time we constructed the transport and got the NLHE error. When we sync the eviction token, we update it with the new lease information. Then, when we try to update the lease with the one returned by the NLHE, we think they're the same and return false (in updateLease).

stillValid, cachedEntry, rawEntry := et.syncRLocked(ctx)
if !stillValid {
    return false
}
ok, newEntry := cachedEntry.updateLease(l)
if !ok {
    return false
}

We end up trying all replicas because the leaseholder on the routing is not a part of the transport, so I think the second TODO you linked above does partially apply. It's this logic here:

// Move the new leaseholder to the head of the queue for the next
// retry. Note that the leaseholder might not be the one indicated by
// the NLHE we just received, in case that error carried stale info.
if lh := routing.Leaseholder(); lh != nil {
    // If the leaseholder is the replica that we've just tried, and
    // we've tried this replica a bunch of times already, let's move on
    // and not try it again. This prevents us getting stuck on a replica
    // that we think has the lease but keeps returning redirects to us
    // (possibly because it hasn't applied its lease yet). Perhaps that
    // lease expires and someone else gets a new one, so by moving on we
    // get out of possibly infinite loops.
    if *lh != curReplica || sameReplicaRetries < sameReplicaRetryLimit {
        transport.MoveToFront(*lh)

We can trace this through together if you want. If we detect this gen > e.desc.Generation issue, we can short-circuit trying all the replicas in the transport, update the range descriptor in the range cache, and retry with a new transport constructed off the new lease information + descriptor.

I'll review this in a bit.


@arulajmani arulajmani left a comment

The change looks good to me as well. I was thinking a bit more about the gen > e.desc.Generation case, and we may not need to ship the entire RangeDescriptor on the NLHE to make it work. As long as we can identify that we're in that scenario, avoid backing off on the client, and retry, we might be in the clear -- this is because of how we construct the slice of replicas to create the transport with:

// Generally, learners are not returned. However, if a non-nil leaseholder is
// passed in, it will be included in the result even if the descriptor has it as
// a learner (we assert that the leaseholder is part of the descriptor). The
// idea is that the descriptor might be stale and list the leaseholder as a
// learner erroneously, and lease info is a strong signal in that direction.
// Note that the returned ReplicaSlice might still not include the leaseholder
// if info for the respective node is missing from the NodeDescStore.

The targeted approach there would be to identify this scenario and make sure the lease information in the RangeCache is up-to-date (which is already happening) before we retry. That being said, it might still be beneficial to ship the entire RangeDescriptor and opportunistically update the RangeCache when we can. I'll let you guys decide if we should do this as part of this patch or not.
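
For reference, the filtering rule described in that quoted comment amounts to something like the following. This is an illustrative stand-in, not the actual ReplicaSlice constructor:

package sketch

type replicaType int

const (
    voter replicaType = iota
    learner
)

type replicaInfo struct {
    nodeID int64
    typ    replicaType
}

// filterReplicas keeps voters and drops learners, except that a learner we
// believe to be the leaseholder is kept: the descriptor may be stale, and the
// lease is a strong signal that the replica has since been promoted.
func filterReplicas(replicas []replicaInfo, leaseholderNodeID int64) []replicaInfo {
    var out []replicaInfo
    for _, r := range replicas {
        if r.typ == learner && r.nodeID != leaseholderNodeID {
            continue
        }
        out = append(out, r)
    }
    return out
}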

Reviewed 1 of 7 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvclient/rangecache/range_cache.go, line 1310 at r3 (raw file):

// entry.
//
// If the new leaseholder is not a replica in the descriptor, we assume the

Should we re-word this too given the change?


pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):

	// If the lease is incompatible with the cached descriptor and the error is
	// coming from a replica that has a non-stale descriptor, the cached

nit: s/non-stale/newer/

This commit makes it such that the `DistSender`'s range descriptor cache
doesn't trigger a cache eviction based on incompatible lease information in a
`NotLeaseHolderError` when it is coming from a replica that has a stale view of
the range's descriptor (characterized by an older `DescriptorGeneration` on the
replica).

Not doing so before was hazardous because, if we received an NLHE that pointed
to a replica that did not belong in the cached descriptor, we'd trigger a cache
eviction. This assumed that the replica returning the error had a fresher view
of the range than what we had in the cache, which is not always true. This
meant that we'd keep doing range lookups and subsequent evictions until this
lagging replica caught up to the current state of the range.

Release note (bug fix): A bug that caused high SQL tail latencies during
background rebalancing in the cluster has been fixed.
@aayushshah15 aayushshah15 force-pushed the 20211209_fixRangeCacheThrashingDuringRebalance branch from dbf7572 to 92f0260 Compare January 27, 2022 18:44

@aayushshah15 aayushshah15 left a comment

@arulajmani and I were on a call to stare at the code and understand how things were going wrong in that issue he was investigating. We have a better understanding of the issue now (at least I think Arul does), but there's enough going on here that we think it should be discussed separately in its own issue / patch.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, @arulajmani, and @nvanbenschoten)


pkg/kv/kvclient/rangecache/range_cache.go, line 1310 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we re-word this too given the change?

Done.


pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: s/non-stale/newer/

I do mean non-stale here. We don't need the error to come from a replica that has a newer descriptor; it just needs to not be older than the one in the cache.


@arulajmani arulajmani left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @andreimatei, @arulajmani, and @nvanbenschoten)


pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I do mean non-stale here. We don't need the error to come from a replica that has a newer descriptor, it just needs to not be older than the one in the cache.

I think we're saying the same thing, but I was reading non-stale as "newest" in absolute terms.

@aayushshah15

TFTR!!

bors r+

@craig

craig bot commented Jan 27, 2022

Build failed (retrying...):

@craig

craig bot commented Jan 28, 2022

Build succeeded:

@craig craig bot merged commit ba04380 into cockroachdb:master Jan 28, 2022
@aayushshah15

aayushshah15 commented Jan 31, 2022

Should we backport this to 21.1 and 21.2, or should we wait until we have a fix for the other issue we were discussing on this PR?

@nvanbenschoten

We don't need to wait for a fix to the other issue, but we should wait until we decide whether we plan to change the NotLeaseHolderError proto to carry a RangeDescriptor instead of a DescriptorGeneration. If so, let's at least land that change before backporting. It would be a shame to introduce a compatibility burden just because of this intermediate state. In fact, let's make sure we're not running afoul of any v22.1 alphas while this sits on master in the interim period between changes.

Successfully merging this pull request may close these issues.

kvclient: range descriptor cache can thrash during range rebalancing