kvclient: ignore stale lease information from lagging replicas #73697

Conversation


@aayushshah15 aayushshah15 commented Dec 10, 2021

This commit makes it such that the DistSender ignores the lease information
returned in NotLeaseHolderErrors coming from replicas that have a stale view of
the range descriptor (characterized by an older DescriptorGeneration on the
replica).

Not doing so before was hazardous because, if we received an NLHE that pointed
to a replica that did not belong in the cached descriptor, we'd trigger a cache
eviction. This assumed that the replica returning the error had a fresher view
of the range than what we had in the cache, which is not always true. This
meant that we'd keep doing range lookups and subsequent evictions until this
lagging replica caught up to the current state of the range.

Fixes #72772

Release note (bug fix): A bug that caused high SQL tail latencies during
background rebalancing in the cluster has been fixed.

@aayushshah15 aayushshah15 requested a review from a team as a code owner December 10, 2021 21:25
@cockroach-teamcity

This change is Reviewable

@aayushshah15

Note: In this commit, we're always ignoring the lease information coming from replicas that have a stale view of the range descriptor, even if the lease information is compatible with the cached range descriptor. I wasn't sure if using the lease information from a lagging replica is better than simply trying the next replica in the cached range descriptor. Do y'all have thoughts on that?

@aayushshah15 aayushshah15 force-pushed the 20211209_fixRangeCacheThrashingDuringRebalance branch 3 times, most recently from 0b09afd to d31a069 Compare December 12, 2021 06:56

@nvanbenschoten nvanbenschoten left a comment

Note: In this commit, we're always ignoring the lease information coming from replicas that have a stale view of the range descriptor, even if the lease information is compatible with the cached range descriptor. I wasn't sure if using the lease information from a lagging replica is better than simply trying the next replica in the cached range descriptor. Do y'all have thoughts on that?

This is an interesting observation. It relates to a comment I was going to leave down below, but which I'm going to pull up here instead.

We're adding conditional logic to compare the information in the NotLeaseHolderError with the information in the eviction token up in sendToReplicas. This isn't quite how things work elsewhere. In most places, this kind of logic lives in the EvictionToken or CacheEntry. The methods on EvictionToken like UpdateLease and EvictAndReplace deal with the messy job of trying to reconcile partial and inconsistent information into mostly-monotonic state transitions in the cache.

If we follow the model laid out there, then I think the approach we'd want to take with this change is to plumb the descriptor generation through EvictionToken.UpdateLease and EvictionToken.UpdateLeaseholder into CacheEntry.updateLease. Inside CacheEntry.updateLease, we could then update the logic that consults the range descriptor to look something like:

_, ok := e.desc.GetReplicaDescriptorByID(l.Replica.ReplicaID)
if !ok {
    // ...
    if gen != 0 && gen < e.desc.Generation {
        return false, e
    }
    // ...
    return true, nil
}

One major benefit of this is that it will be easier to test (in addition to the high-level test you already have).
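
To make the shape of that concrete, here is a small self-contained sketch of the check being proposed. The types and names below are illustrative stand-ins, not the actual rangecache code, and the real method signatures may differ:

package main

import "fmt"

// Stand-ins for the real descriptor, lease, and cache-entry types.
type rangeDesc struct {
    generation int64
    replicas   map[int64]bool // replica IDs present in the descriptor
}

type leaseInfo struct {
    replicaID int64
}

type cacheEntry struct {
    desc  rangeDesc
    lease leaseInfo
}

// updateLease mirrors the sketch above: when the lease points at a replica
// that is not in the cached descriptor, only treat the cached entry as stale
// if the error's descriptor generation is not older than the cached one.
func (e *cacheEntry) updateLease(l leaseInfo, errDescGen int64) (bool, *cacheEntry) {
    if !e.desc.replicas[l.replicaID] {
        if errDescGen != 0 && errDescGen < e.desc.generation {
            // The replica that sent the error lags our cached view; ignore
            // its lease information and keep the cached entry.
            return false, e
        }
        // The sender's view is at least as fresh as ours; signal that the
        // cached entry should be evicted.
        return true, nil
    }
    updated := *e
    updated.lease = l
    return true, &updated
}

func main() {
    e := &cacheEntry{desc: rangeDesc{generation: 7, replicas: map[int64]bool{1: true, 2: true}}}
    // The error names replica 3 (unknown to our gen-7 descriptor) and came
    // from a replica whose descriptor generation is only 5: ignore it.
    ok, entry := e.updateLease(leaseInfo{replicaID: 3}, 5)
    fmt.Println(ok, entry == e) // false true
}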

Reviewed 3 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvclient/kvcoord/dist_sender.go, line 2086 at r1 (raw file):

				// We cannot trust the lease information in the error if it is coming
				// from a replica that has a stale view of the range (compared to what
				// we've got in the cache).

Could you add to this comment and explain how we can get in this situation, what it means regarding the state of the replica that returned this error, and why we're careful not to evict from the cache in this case?


pkg/kv/kvclient/kvcoord/dist_sender.go, line 2094 at r1 (raw file):

					)
				}
				if (tErr.Lease != nil || tErr.LeaseHolder != nil) && !withStaleRangeDesc {

nit: consider avoiding the nesting and instead splitting this into multiple conditions with break statements and explanations in each. Something like:

case *roachpb.NotLeaseHolderError:
    ds.metrics.NotLeaseHolderErrCount.Inc(1)

    if tErr.Lease == nil && tErr.LeaseHolder == nil {
        // The replica knew it did not hold the lease, but did not provide us
        // with a better alternative. Loop around and try the next replica.
        break
    }

    if tErr.DescriptorGeneration != 0 && tErr.DescriptorGeneration < routing.Desc().Generation {
        // The replica had a stale view of the range ...
        break
    }

    // If we got some lease information, we use it to update our cache.
    // Naively this would also happen when the next RPC comes back, but we
    // don't want to wait out the additional RPC latency.
    var updatedLeaseholder bool

pkg/kv/kvclient/kvcoord/dist_sender_test.go, line 1023 at r1 (raw file):

			return reply, nil
		}
		require.Equal(t, ba.Replica.NodeID, roachpb.NodeID(1))

nit: instead of asserting the node ID here and then maintaining a callsToNode2 variable, consider creating a var calls []int variable, appending the node ID on each call to sendFn, and then asserting that the slice looks like []int{2, 1} at the end.
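
The suggested pattern could look roughly like this. This is a self-contained sketch (the real test would record the node ID inside sendFn rather than a plain helper):

package example

import (
    "testing"

    "github.com/stretchr/testify/require"
)

// TestCallOrder sketches the pattern: record the node ID seen on every send,
// then assert the whole sequence at the end instead of asserting inside the
// send function and counting calls to one node.
func TestCallOrder(t *testing.T) {
    var calls []int
    send := func(nodeID int) {
        calls = append(calls, nodeID)
    }

    // Simulate the DistSender trying the stale leaseholder (node 2) first and
    // then falling back to node 1.
    send(2)
    send(1)

    require.Equal(t, []int{2, 1}, calls)
}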


pkg/kv/kvserver/replica_range_lease.go, line 908 at r1 (raw file):

		RangeID:              rangeDesc.RangeID,
		CustomMsg:            msg,
		DescriptorGeneration: rangeDesc.Generation,

nit: add this above CustomMsg to keep the fields in the same order as their declaration.


pkg/roachpb/errors.proto, line 60 at r1 (raw file):

      (gogoproto.customname) = "RangeID", (gogoproto.casttype) = "RangeID"];
  // The range descriptor generation of the replica the error originated from.
  // Used to determine whether the error was returned because the replica had a

"Used by the DistSender to determine"...

@aayushshah15 aayushshah15 force-pushed the 20211209_fixRangeCacheThrashingDuringRebalance branch 3 times, most recently from 022624f to f0bcfaa Compare January 17, 2022 10:59

@aayushshah15 aayushshah15 left a comment

I changed this around to do this reconciliation inside CacheEntry.updateLease() and added a lower-level test check for it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go, line 2086 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add to this comment and explain how we can get in this situation, what it means regarding the state of the replica that returned this error, and why we're careful not to evict from the cache in this case?

Added this explanation inside updateLease.


pkg/kv/kvserver/replica_range_lease.go, line 908 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: add this above CustomMsg to keep the fields in the same order as their declaration.

Done.


pkg/roachpb/errors.proto, line 60 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"Used by the DistSender to determine"...

Done.


pkg/kv/kvclient/kvcoord/dist_sender_test.go, line 1023 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: instead of asserting the node ID here and then maintaining a callsToNode2 variable, consider creating a var calls []int variable, append the node ID on each call to sendFn, and then asserting that the slice looks like []int{2, 1} at the end.

Done.

@aayushshah15 aayushshah15 force-pushed the 20211209_fixRangeCacheThrashingDuringRebalance branch 2 times, most recently from 8fd22d8 to dbf7572 Compare January 18, 2022 18:34

@nvanbenschoten nvanbenschoten left a comment

This all :lgtm:, but before merging it I think @arulajmani should take a look.

Arul has been investigating a similar issue where the NotLeaseHolderError comes from a replica with a newer view of the RangeDescriptor than that which is in the RangeCache. In other words, he's been exploring an issue that looks to be caused by the inverse gen > e.desc.Generation variant of this "the lease is incompatible with the cached descriptor" check. The solution to avoiding a client-side backoff in that case may be very similar to the one here. However, this is a case where the newer RangeDescriptor would be valuable to the range cache. So his issue may motivate a small change in this PR where we add the full RangeDescriptor to the NotLeaseHolderError, instead of just the RangeGeneration. We could then use this updated RangeDescriptor to directly update the RangeCache, avoiding a cache eviction or meta lookup.
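
As a rough illustration of that alternative (hypothetical helper and stand-in types, not the actual DistSender code), carrying the full descriptor on the error would let the client adopt it directly when it is newer:

package sketch

// rangeDescriptor is an illustrative stand-in for roachpb.RangeDescriptor;
// only the generation matters for this sketch.
type rangeDescriptor struct {
    generation int64
}

// maybeAdoptDescriptor returns the descriptor the cache should keep going
// forward: if the NotLeaseHolderError carried a descriptor with a higher
// generation than the cached one, adopt it directly instead of evicting the
// cache entry and performing a meta range lookup.
func maybeAdoptDescriptor(cached, fromErr *rangeDescriptor) *rangeDescriptor {
    if fromErr != nil && fromErr.generation > cached.generation {
        return fromErr
    }
    return cached
}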

Reviewed 4 of 7 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)

@aayushshah15

In other words, he's been exploring an issue that looks to be caused by the inverse gen > e.desc.Generation variant of this "the lease is incompatible with the cached descriptor" check

To check my understanding, in that instance, the range cache currently thinks that the lease is compatible with the cached descriptor even when the lease is on a replica that's a LEARNER in the cached desc.

// TODO(andrei): If the leaseholder is present, but the descriptor lists the
// replica as a learner, this is a sign of a stale descriptor. I'm not sure
// what to do about it, though.

So, despite the fact that we're returning true from EvictionToken.UpdateLease(), since the LEARNER replica isn't part of the transport, we end up trying all the other replicas, and then finally do a cache eviction.

//
// TODO(andrei): The structure around here is no good; we're potentially
// updating routing with replicas that are not part of transport, and so
// those replicas will never be tried. Instead, we'll exhaust the transport
// and bubble up a SendError, which will cause a cache eviction and a new
// descriptor lookup potentially unnecessarily.

Is this what's happening in that issue?

@arulajmani

So, despite the fact that we're returning true from EvictionToken.UpdateLease(), since the LEARNER replica isn't part of the transport, we end up trying all the other replicas, and then finally do a cache eviction.

In that issue we see backoffs for every replica that we try, and as backoffs are predicated on EvictionToken.UpdateLease() returning false, I don't think we end up reaching the first TODO you've linked above. I think what's happening here is that the RangeCache has been updated with the correct lease information between the time we constructed the transport and got the NLHE error. When we sync the eviction token, we update it with the new lease information. Then, when we try to update the lease with the one returned by the NLHE, we think they're the same and return false (in updateLease).

stillValid, cachedEntry, rawEntry := et.syncRLocked(ctx)
if !stillValid {
    return false
}
ok, newEntry := cachedEntry.updateLease(l)
if !ok {
    return false
}

We end up trying all replicas because the leaseholder on the routing is not a part of the transport, so I think the second TODO you linked above does partially apply. It's this logic here:

// Move the new leaseholder to the head of the queue for the next
// retry. Note that the leaseholder might not be the one indicated by
// the NLHE we just received, in case that error carried stale info.
if lh := routing.Leaseholder(); lh != nil {
    // If the leaseholder is the replica that we've just tried, and
    // we've tried this replica a bunch of times already, let's move on
    // and not try it again. This prevents us getting stuck on a replica
    // that we think has the lease but keeps returning redirects to us
    // (possibly because it hasn't applied its lease yet). Perhaps that
    // lease expires and someone else gets a new one, so by moving on we
    // get out of possibly infinite loops.
    if *lh != curReplica || sameReplicaRetries < sameReplicaRetryLimit {
        transport.MoveToFront(*lh)

We can trace this through together if you want. If we detect this gen > e.desc.Generation issue, we can short-circuit trying all the replicas in the transport, update the range descriptor in the range cache, and retry with a new transport constructed off the new lease information + descriptor.

I'll review this in a bit.


@arulajmani arulajmani left a comment

The change looks good to me as well. I was thinking a bit more about the gen > e.desc.Generation case, and we may not need to ship the entire RangeDescriptor on the NLHE to make it work. As long as we can identify that we're in that scenario, avoid backing off on the client, and retry, we might be in the clear -- this is because of how we construct the slice of replicas to create the transport with:

// Generally, learners are not returned. However, if a non-nil leaseholder is
// passed in, it will be included in the result even if the descriptor has it as
// a learner (we assert that the leaseholder is part of the descriptor). The
// idea is that the descriptor might be stale and list the leaseholder as a
// learner erroneously, and lease info is a strong signal in that direction.
// Note that the returned ReplicaSlice might still not include the leaseholder
// if info for the respective node is missing from the NodeDescStore.

The targeted approach there would be to identify this scenario and make sure the lease information in the RangeCache is up-to-date (which is already happening) before we retry. That being said, it might still be beneficial to ship the entire RangeDescriptor and opportunistically update the RangeCache when we can. I'll let you guys decide if we should do this as part of this patch or not.
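
For reference, the filtering rule described in that quoted comment amounts to something like the following. This is an illustrative stand-in, not the actual ReplicaSlice constructor:

package sketch

type replicaType int

const (
    voter replicaType = iota
    learner
)

type replicaInfo struct {
    nodeID int64
    typ    replicaType
}

// filterReplicas keeps voters and drops learners, except that a learner we
// believe to be the leaseholder is kept: the descriptor may be stale, and the
// lease is a strong signal that the replica has since been promoted.
func filterReplicas(replicas []replicaInfo, leaseholderNodeID int64) []replicaInfo {
    var out []replicaInfo
    for _, r := range replicas {
        if r.typ == learner && r.nodeID != leaseholderNodeID {
            continue
        }
        out = append(out, r)
    }
    return out
}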

Reviewed 1 of 7 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvclient/rangecache/range_cache.go, line 1310 at r3 (raw file):

// entry.
//
// If the new leaseholder is not a replica in the descriptor, we assume the

Should we re-word this too given the change?


pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):

	// If the lease is incompatible with the cached descriptor and the error is
	// coming from a replica that has a non-stale descriptor, the cached

nit: s/non-stale/newer/

This commit makes it such that the `DistSender`'s range descriptor cache
doesn't trigger a cache eviction based on incompatible lease information in a
`NotLeaseHolderError` when it is coming from a replica that has a stale view of
the range's descriptor (characterized by an older `DescriptorGeneration` on the
replica).

Not doing so before was hazardous because, if we received an NLHE that pointed
to a replica that did not belong in the cached descriptor, we'd trigger a cache
eviction. This assumed that the replica returning the error had a fresher view
of the range than what we had in the cache, which is not always true. This
meant that we'd keep doing range lookups and subsequent evictions until this
lagging replica caught up to the current state of the range.

Release note (bug fix): A bug that caused high SQL tail latencies during
background rebalancing in the cluster has been fixed.
@aayushshah15 aayushshah15 force-pushed the 20211209_fixRangeCacheThrashingDuringRebalance branch from dbf7572 to 92f0260 Compare January 27, 2022 18:44

@aayushshah15 aayushshah15 left a comment

@arulajmani and I were on a call to stare at the code and understand how things were going wrong in that issue he was investigating. We have a better understanding of the issue now (at least I think Arul does), but there's enough going on here that we think it should be discussed separately in its own issue / patch.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, @arulajmani, and @nvanbenschoten)


pkg/kv/kvclient/rangecache/range_cache.go, line 1310 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we re-word this too given the change?

Done.


pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: s/non-stale/newer/

I do mean non-stale here. We don't need the error to come from a replica that has a newer descriptor; it just needs to not be older than the one in the cache.


@arulajmani arulajmani left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @andreimatei, @arulajmani, and @nvanbenschoten)


pkg/kv/kvclient/rangecache/range_cache.go, line 1336 at r3 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I do mean non-stale here. We don't need the error to come from a replica that has a newer descriptor, it just needs to not be older than the one in the cache.

I think we're saying the same thing, but I was reading non-stale as "newest" in absolute terms.

@aayushshah15

TFTR!!

bors r+

@craig

craig bot commented Jan 27, 2022

Build failed (retrying...):

@craig

craig bot commented Jan 28, 2022

Build succeeded:

@craig craig bot merged commit ba04380 into cockroachdb:master Jan 28, 2022
@aayushshah15

aayushshah15 commented Jan 31, 2022

Should we backport this to 21.1 and 21.2, or should we wait until we have a fix for the other issue we were discussing on this PR?

@nvanbenschoten

We don't need to wait for a fix to the other issue, but we should wait until we decide whether we plan to change the NotLeaseHolderError proto to carry a RangeDescriptor instead of a DescriptorGeneration. If so, let's at least land that change before backporting. It would be a shame to introduce a compatibility burden just because of this intermediate state. In fact, let's make sure we're not running afoul of any v22.1 alphas while this sits on master in the interim period between changes.

Successfully merging this pull request may close these issues.

kvclient: range descriptor cache can thrash during range rebalancing