kvclient: ignore stale lease information from lagging replicas #73697
Conversation
Note: In this commit, we're always ignoring the lease information coming from replicas that have a stale view of the range descriptor, even if the lease information is compatible with the cached range descriptor. I wasn't sure if using the lease information from a lagging replica is better than simply trying the next replica in the cached range descriptor. Do y'all have thoughts on that?
Force-pushed from 0b09afd to d31a069.
Note: In this commit, we're always ignoring the lease information coming from replicas that have a stale view of the range descriptor, even if the lease information is compatible with the cached range descriptor. I wasn't sure if using the lease information from a lagging replica is better than simply trying the next replica in the cached range descriptor. Do y'all have thoughts on that?
This is an interesting observation. It relates to a comment I was going to leave down below, but which I'm going to pull up here instead.
We're adding conditional logic to compare the information in the NotLeaseHolderError with the information in the eviction token up in sendToReplicas. This isn't quite how things work elsewhere. In most places, this kind of logic lives in the EvictionToken or CacheEntry. The methods on EvictionToken like UpdateLease and EvictAndReplace deal with the messy job of trying to reconcile partial and inconsistent information into mostly-monotonic state transitions in the cache.
If we follow the model laid out there, then I think the approach we'd want to take with this change is to plumb the descriptor generation through EvictionToken.UpdateLease and EvictionToken.UpdateLeaseholder into CacheEntry.updateLease. Inside CacheEntry.updateLease, we could then update the logic that consults the range descriptor to look something like:
_, ok := e.desc.GetReplicaDescriptorByID(l.Replica.ReplicaID)
if !ok {
	// ...
	if gen != 0 && gen < e.desc.Generation {
		return false, e
	}
	// ...
	return true, nil
}
One major benefit of this is that it will be easier to test (in addition to the high-level test you already have).
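To make the proposed check concrete, here's a minimal self-contained sketch of the decision (hypothetical simplified types and names, not the actual CacheEntry/EvictionToken code): an incompatible lease only justifies evicting the cached descriptor when the replica reporting it had a descriptor at least as new as the one in the cache.

package main

import "fmt"

// Simplified stand-ins for the real RangeDescriptor and lease types; this is
// an illustration of the suggested CacheEntry.updateLease logic, not the
// actual CockroachDB implementation.
type rangeDesc struct {
	generation int64
	replicas   map[int]bool // replica IDs present in the descriptor
}

// shouldEvict reports whether an incompatible lease (one naming a replica
// that isn't in the cached descriptor) should invalidate the cached entry.
// senderGen is the descriptor generation reported by the replica that
// returned the NotLeaseHolderError; 0 means the field wasn't populated.
func shouldEvict(cached rangeDesc, leaseholderID int, senderGen int64) bool {
	if cached.replicas[leaseholderID] {
		// The lease is compatible with the cached descriptor; no eviction.
		return false
	}
	if senderGen != 0 && senderGen < cached.generation {
		// The sender has a stale view of the range; ignore its lease info
		// rather than throwing away our newer cached descriptor.
		return false
	}
	// The sender's view is at least as new as ours, so the cached descriptor
	// really is missing the leaseholder: evict and re-lookup.
	return true
}

func main() {
	cached := rangeDesc{generation: 5, replicas: map[int]bool{1: true, 2: true, 3: true}}
	fmt.Println(shouldEvict(cached, 4, 3)) // false: lagging replica, keep the cache
	fmt.Println(shouldEvict(cached, 4, 7)) // true: sender is newer, evict
}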
pkg/kv/kvclient/kvcoord/dist_sender.go, line 2086 at r1 (raw file):
// We cannot trust the lease information in the error if it is coming
// from a replica that has a stale view of the range (compared to what
// we've got in the cache).
Could you add to this comment and explain how we can get in this situation, what it means regarding the state of the replica that returned this error, and why we're careful not to evict from the cache in this case?
pkg/kv/kvclient/kvcoord/dist_sender.go, line 2094 at r1 (raw file):
	)
}
if (tErr.Lease != nil || tErr.LeaseHolder != nil) && !withStaleRangeDesc {
nit: consider avoiding the nesting and instead splitting this into multiple conditions with break statements and explanations in each. Something like:
case *roachpb.NotLeaseHolderError:
	ds.metrics.NotLeaseHolderErrCount.Inc(1)
	if tErr.Lease == nil && tErr.LeaseHolder == nil {
		// The replica knew it did not hold the lease, but did not provide us
		// with a better alternative. Loop around and try the next replica.
		break
	}
	if tErr.DescriptorGeneration != 0 && tErr.DescriptorGeneration < routing.Desc().Generation {
		// The replica had a stale view of the range ...
		break
	}
	// If we got some lease information, we use it to update our cache.
	// Naively this would also happen when the next RPC comes back, but we
	// don't want to wait out the additional RPC latency.
	var updatedLeaseholder bool
pkg/kv/kvclient/kvcoord/dist_sender_test.go, line 1023 at r1 (raw file):
	return reply, nil
}
require.Equal(t, ba.Replica.NodeID, roachpb.NodeID(1))
nit: instead of asserting the node ID here and then maintaining a callsToNode2 variable, consider creating a var calls []int variable, appending the node ID on each call to sendFn, and then asserting that the slice looks like []int{2, 1} at the end.
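A rough sketch of that pattern, with a simplified stand-in for the real sendFn signature and test harness (not the actual dist_sender_test.go code):

package kvcoord_test

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// TestCallOrder sketches the suggested pattern: record the node ID of every
// routed call and assert on the whole sequence once, instead of asserting
// node IDs inline and keeping a callsToNode2 counter.
func TestCallOrder(t *testing.T) {
	var calls []int
	sendFn := func(nodeID int) { // stand-in for the test's real send function
		calls = append(calls, nodeID)
	}

	// Scenario under test: the first attempt is routed to node 2 (the cached
	// leaseholder), the retry goes to node 1.
	sendFn(2)
	sendFn(1)

	require.Equal(t, []int{2, 1}, calls)
}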
pkg/kv/kvserver/replica_range_lease.go, line 908 at r1 (raw file):
RangeID:              rangeDesc.RangeID,
CustomMsg:            msg,
DescriptorGeneration: rangeDesc.Generation,
nit: add this above CustomMsg to keep the fields in the same order as their declaration.
pkg/roachpb/errors.proto, line 60 at r1 (raw file):
(gogoproto.customname) = "RangeID", (gogoproto.casttype) = "RangeID"];
// The range descriptor generation of the replica the error originated from.
// Used to determine whether the error was returned because the replica had a
"Used by the DistSender to determine"...
Force-pushed from 022624f to f0bcfaa.
I changed this around to do this reconciliation inside CacheEntry.updateLease() and added a lower-level test check for it.
pkg/kv/kvclient/kvcoord/dist_sender.go, line 2086 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add to this comment and explain how we can get in this situation, what it means regarding the state of the replica that returned this error, and why we're careful not to evict from the cache in this case?
Added this explanation inside updateLease.
pkg/kv/kvserver/replica_range_lease.go, line 908 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: add this above CustomMsg to keep the fields in the same order as their declaration.
Done.
pkg/roachpb/errors.proto, line 60 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"Used by the DistSender to determine"...
Done.
pkg/kv/kvclient/kvcoord/dist_sender_test.go, line 1023 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: instead of asserting the node ID here and then maintaining a callsToNode2 variable, consider creating a var calls []int variable, appending the node ID on each call to sendFn, and then asserting that the slice looks like []int{2, 1} at the end.
Done.
Force-pushed from 8fd22d8 to dbf7572.
This all looks good, but before merging it I think @arulajmani should take a look.
Arul has been investigating a similar issue where the NotLeaseHolderError comes from a replica with a newer view of the RangeDescriptor than the one in the RangeCache. In other words, he's been exploring an issue that looks to be caused by the inverse gen > e.desc.Generation variant of this "the lease is incompatible with the cached descriptor" check. The solution to avoiding a client-side backoff in that case may be very similar to the one here. However, this is a case where the newer RangeDescriptor would be valuable to the range cache. So his issue may motivate a small change in this PR where we add the full RangeDescriptor to the NotLeaseHolderError, instead of just the RangeGeneration. We could then use this updated RangeDescriptor to directly update the RangeCache, avoiding a cache eviction or meta lookup.
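For illustration, a minimal sketch (hypothetical names, simplified types, not a CockroachDB API) of how a descriptor carried on the error could be folded into the cache without an eviction or meta lookup:

package main

import "fmt"

// rangeDesc and cacheEntry are simplified stand-ins for the real
// RangeDescriptor and RangeCache entry types.
type rangeDesc struct {
	generation int64
	replicas   []int
}

type cacheEntry struct {
	desc rangeDesc
}

// applyDescriptorFromError adopts the descriptor carried on a
// NotLeaseHolderError-like response if it is strictly newer than the cached
// one; otherwise it leaves the cache untouched. It returns whether the cache
// was updated.
func applyDescriptorFromError(entry *cacheEntry, fromErr rangeDesc) bool {
	if fromErr.generation <= entry.desc.generation {
		return false // same or older view; nothing new to learn
	}
	entry.desc = fromErr // replace in place: no eviction, no meta lookup
	return true
}

func main() {
	entry := &cacheEntry{desc: rangeDesc{generation: 5, replicas: []int{1, 2, 3}}}
	fmt.Println(applyDescriptorFromError(entry, rangeDesc{generation: 7, replicas: []int{2, 3, 4}})) // true
	fmt.Println(entry.desc)                                                                          // {7 [2 3 4]}
}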
To check my understanding, in that instance, the range cache currently thinks that the lease is compatible with the cached descriptor even when the lease is on a replica that's a
cockroach/pkg/kv/kvclient/rangecache/range_cache.go, lines 1352 to 1356 in dbf7572
So, despite the fact that we're returning
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go, lines 1933 to 1938 in dbf7572
Is this what's happening in that issue?
In that issue we see backoffs for every replica that we try, and as backoffs are predicated on
cockroach/pkg/kv/kvclient/rangecache/range_cache.go, lines 387 to 394 in 01cb847
We end up trying all replicas because the leaseholder on the routing is not a part of the transport, so I think the second TODO you linked above does partially apply. It's this logic here:
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go, lines 2114 to 2126 in d0b0d45
We can trace this through together if you want. If we detect this …
I'll review this in a bit.
The change looks good to me as well. I was thinking a bit more about the gen > e.desc.Generation case and we may not need to ship the entire RangeDescriptor on the NLHE to make it work. As long as we can identify that we're in that scenario, avoid backing off on the client, and retry, we might be in the clear -- this is because of how we construct the slice of replicas to create the transport with:
cockroach/pkg/kv/kvclient/kvcoord/replica_slice.go, lines 61 to 67 in acdda53:
// Generally, learners are not returned. However, if a non-nil leaseholder is
// passed in, it will be included in the result even if the descriptor has it as
// a learner (we assert that the leaseholder is part of the descriptor). The
// idea is that the descriptor might be stale and list the leaseholder as a
// learner erroneously, and lease info is a strong signal in that direction.
// Note that the returned ReplicaSlice might still not include the leaseholder
// if info for the respective node is missing from the NodeDescStore.
The targeted approach there would be to identify this scenario and make sure the lease information in the RangeCache is up-to-date (which is already happening) before we retry. That being said, it might still be beneficial to ship the entire RangeDescriptor and opportunistically update the RangeCache when we can. I'll let you guys decide if we should do this as part of this patch or not.
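A toy model of the behavior that quoted comment describes (simplified types, not the real NewReplicaSlice): learners are filtered out of the transport unless the cache's lease information names one of them as the leaseholder.

package main

import "fmt"

// replica is a simplified stand-in for a replica descriptor; only the
// learner/voter distinction matters for this sketch.
type replica struct {
	id      int
	learner bool
}

// buildReplicaSlice models the filtering described above: learners are
// normally dropped, but a learner that the cache believes holds the lease is
// kept so the transport can still reach it even with a stale descriptor.
func buildReplicaSlice(desc []replica, leaseholderID int) []replica {
	var out []replica
	for _, r := range desc {
		if r.learner && r.id != leaseholderID {
			continue // drop learners unless they're the known leaseholder
		}
		out = append(out, r)
	}
	return out
}

func main() {
	desc := []replica{{id: 1}, {id: 2}, {id: 3, learner: true}}
	fmt.Println(buildReplicaSlice(desc, 3)) // learner 3 kept: believed leaseholder
	fmt.Println(buildReplicaSlice(desc, 1)) // learner 3 dropped
}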
pkg/kv/kvclient/rangecache/range_cache.go, line 1310 at r3 (raw file):
// entry.
//
// If the new leaseholder is not a replica in the descriptor, we assume the
Should we re-word this too given the change?
pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):
// If the lease is incompatible with the cached descriptor and the error is
// coming from a replica that has a non-stale descriptor, the cached
nit: s/non-stale/newer/
This commit makes it such that the `DistSender`'s range descriptor cache doesn't trigger a cache eviction based on incompatible lease information in a `NotLeaseHolderError` when it is coming from a replica that has a stale view of the range's descriptor (characterized by an older `DescriptorGeneration` on the replica).

Not doing so before was hazardous because, if we received an NLHE that pointed to a replica that did not belong in the cached descriptor, we'd trigger a cache eviction. This assumed that the replica returning the error had a fresher view of the range than what we had in the cache, which is not always true. This meant that we'd keep doing range lookups and subsequent evictions until this lagging replica caught up to the current state of the range.

Release note (bug fix): A bug that caused high SQL tail latencies during background rebalancing in the cluster has been fixed.
Force-pushed from dbf7572 to 92f0260.
@arulajmani and I were on a call to stare at the code and understand how things were going wrong in that issue he was investigating. We have a better understanding of the issue now (at least I think Arul does), but there's enough going on here that we think it should be discussed separately in its own issue / patch.
pkg/kv/kvclient/rangecache/range_cache.go, line 1310 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should we re-word this too given the change?
Done.
pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: s/non-stale/newer/
I do mean non-stale here. We don't need the error to come from a replica that has a newer descriptor, it just needs to not be older than the one in the cache.
pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I do mean non-stale here. We don't need the error to come from a replica that has a newer descriptor, it just needs to not be older than the one in the cache.
I think we're saying the same thing, but I was reading non-stale as "newest" in absolute terms.
TFTR!! bors r+
Build failed (retrying...)
Build succeeded
Should we backport this to 21.1 and 21.2, or should we wait until we have a fix for the other issue we were discussing on this PR?
We don't need to wait for a fix to the other issue, but we should wait until we decide whether we plan to change the …
This commit makes it such that the DistSender ignores the lease information
returned in NotLeaseHolderErrors coming from replicas that have a stale view
of the range descriptor (characterized by an older DescriptorGeneration on
the replica).

Not doing so before was hazardous because, if we received an NLHE that pointed
to a replica that did not belong in the cached descriptor, we'd trigger a cache
eviction. This assumed that the replica returning the error had a fresher view
of the range than what we had in the cache, which is not always true. This
meant that we'd keep doing range lookups and subsequent evictions until this
lagging replica caught up to the current state of the range.
Fixes #72772
Release note (bug fix): A bug that caused high SQL tail latencies during
background rebalancing in the cluster has been fixed.