
kvserver: incorrect DistSender leaseholder processing for non-VOTERs #85060

Closed
erikgrinaker opened this issue Jul 26, 2022 · 6 comments · Fixed by #85140
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@erikgrinaker
Contributor

erikgrinaker commented Jul 26, 2022

ReplicaDescriptor.Type is a pointer rather than a scalar value:

optional ReplicaType type = 4 [(gogoproto.moretags) = "yaml:\"ReplicaType,omitempty\""];

We often use regular equality comparisons on replica descriptors, notably in the DistSender and gRPC transport:

if *lh != curReplica || sameReplicaRetries < sameReplicaRetryLimit {

if gt.replicas[i] == replica {

But recall that equality for pointers checks that the memory addresses are the same, not that the pointed-to values are equal. Thus two otherwise-identical replica descriptors with a type different from VOTER (the zero value, typically left as a nil pointer) will be considered unequal whenever their Type values are stored in different memory locations.
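
As a standalone illustration (a minimal sketch with simplified stand-in types, not the actual roachpb definitions), Go's == on a struct with a pointer field compares the pointer addresses:

package main

import "fmt"

type ReplicaType int

const (
    VOTER ReplicaType = iota
    VOTER_INCOMING
)

// Simplified stand-in for a replica descriptor whose Type is a pointer,
// mirroring the "optional" proto field above.
type ReplicaDescriptor struct {
    NodeID, StoreID, ReplicaID int
    Type                       *ReplicaType
}

func main() {
    t1, t2 := VOTER_INCOMING, VOTER_INCOMING
    a := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 1, Type: &t1}
    b := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 1, Type: &t2}
    fmt.Println(a == b)             // false: the Type pointers differ by address
    fmt.Println(*a.Type == *b.Type) // true: the pointed-to values are equal
}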

One immediate effect of this is that the DistSender will not prioritize a new leaseholder with a VOTER_INCOMING type. In #74546, VOTER_INCOMING replicas were allowed to receive the lease and are thus included in the DistSender's replica list under the RoutingPolicy_LEASEHOLDER policy:

// Filter the replicas to only those that are relevant to the routing policy.
var replicaFilter ReplicaSliceFilter
switch ba.RoutingPolicy {
case roachpb.RoutingPolicy_LEASEHOLDER:
replicaFilter = OnlyPotentialLeaseholders
case roachpb.RoutingPolicy_NEAREST:
replicaFilter = AllExtantReplicas
default:
log.Fatalf(ctx, "unknown routing policy: %s", ba.RoutingPolicy)
}
leaseholder := routing.Leaseholder()
replicas, err := NewReplicaSlice(ctx, ds.nodeDescs, desc, leaseholder, replicaFilter)

However, when it receives the NLHE (NotLeaseHolderError) it attempts to move the new leaseholder to the front of the transport list:

if lh := routing.Leaseholder(); lh != nil {
// If the leaseholder is the replica that we've just tried, and
// we've tried this replica a bunch of times already, let's move on
// and not try it again. This prevents us getting stuck on a replica
// that we think has the lease but keeps returning redirects to us
// (possibly because it hasn't applied its lease yet). Perhaps that
// lease expires and someone else gets a new one, so by moving on we
// get out of possibly infinite loops.
if *lh != curReplica || sameReplicaRetries < sameReplicaRetryLimit {
transport.MoveToFront(*lh)
}
}

But this won't happen because the non-nil Type doesn't satisfy the equality check in MoveToFront:

if gt.replicas[i] == replica {

Jira issue: CRDB-18022

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication labels Jul 26, 2022
@erikgrinaker erikgrinaker self-assigned this Jul 26, 2022
@blathers-crl

blathers-crl bot commented Jul 26, 2022

cc @cockroachdb/replication

@blathers-crl blathers-crl bot added the A-kv-replication Relating to Raft, consensus, and coordination. label Jul 26, 2022
@erikgrinaker erikgrinaker changed the title from "kvserver: ReplicaDescriptor.Type breaks equality comparisons" to "kvserver: ReplicaDescriptor.Type pointer breaks equality comparisons" Jul 26, 2022
@erikgrinaker
Contributor Author

There's an additional issue with the DistSender here: the transport will keep the initial range descriptor across all retries, regardless of whether we update the range descriptor in the meantime. This means it will also be vulnerable to changes in a replica's type during these retries, even with the equality comparison fixed. I'll keep this issue open to cover both.

@erikgrinaker erikgrinaker changed the title from "kvserver: ReplicaDescriptor.Type pointer breaks equality comparisons" to "kvserver: incorrect DistSender leaseholder processing for non-VOTERs" Jul 27, 2022
@arulajmani
Collaborator

There's an additional issue with the DistSender here in that the transport will keep the initial range descriptor across all retries, regardless of whether we update the range descriptor in the meanwhile.

@erikgrinaker I'm currently working on addressing exactly that with my patch for #75742, which is caused by what you describe. My current thinking is that we should recognize when the routing (range descriptor, lease) has been updated to the point where it's no longer compatible with the transport, in which case we should bail early and just retry.

@erikgrinaker
Contributor Author

erikgrinaker commented Jul 28, 2022

Cool! I have a minor fix for the immediate problem here as well in #85140, which ignores the replica type when comparing replicas.
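
(Roughly the shape of that fix, reusing the stand-in ReplicaDescriptor type from the sketch above; illustrative only, not the actual code from #85140:)

// equalIgnoringType compares only the identifying fields of two replica
// descriptors and deliberately skips the Type pointer, so that otherwise
// identical descriptors compare equal no matter where their Type values
// happen to be allocated.
func equalIgnoringType(a, b ReplicaDescriptor) bool {
    return a.NodeID == b.NodeID &&
        a.StoreID == b.StoreID &&
        a.ReplicaID == b.ReplicaID
}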

What you're saying makes sense. I suppose the alternative would be to update the transport on-the-fly, but I don't know if we really gain much performance-wise over just bailing and retrying.

What do you mean by compatible? I imagine it will bail if the given leaseholder isn't in the range descriptor. What about a partially matching range descriptor? Will it bail immediately, or keep trying the replicas the two descriptors have in common? Does it care about replica types?

I suppose the risk if it was too sensitive would be that it could start flapping between two replicas with very different views of the range descriptor. Although the generation would prevent that, I guess.

@arulajmani
Collaborator

I suppose the alternative would be to update the transport on-the-fly, but I don't know if we really gain much performance-wise over just bailing and retrying.

We recently started shipping back range descriptors inside of NotLeaseHolderErrors. We currently don't make use of them, but the idea is to update the RangeCache with a fresher range descriptor if one was returned in the NLHE. This means bailing and retrying shouldn't cause a cache eviction + lookup, which should help reduce any performance concerns.

What do you mean by compatible? I imagine it will bail if the given leaseholder isn't in the range descriptor. What about a partially matching range descriptor? Will it bail immediately, or keep trying the ones they have in common? Does it care about replica types?

I was imagining that we'd bail if the leaseholder wasn't part of the transport. Even though the routing can have a stale lease or range descriptor, we never expect it to regress. So, if the routing has been updated such that the leaseholder is not part of the transport, I don't think we gain much by exhausting the transport before retrying. We should be able to bail early and retry with a new transport constructed from a fresh(er) range descriptor.
I'm detecting this by having MoveToFront return whether the leaseholder was successfully moved to the front. So, given the changes in your PR, it'll be agnostic to the replica type.
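
(A rough sketch of that shape, with illustrative names rather than the actual DistSender code:)

if lh := routing.Leaseholder(); lh != nil {
    // Illustrative: MoveToFront now reports whether the leaseholder was
    // found in the transport's replica slice.
    if !transport.MoveToFront(*lh) {
        // The routing has moved past the descriptor this transport was built
        // from, so bail out and retry with a freshly constructed transport
        // instead of exhausting the stale one. The error constructor here is
        // hypothetical.
        return nil, newLeaseholderNotInTransportError()
    }
}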

I suppose the risk if it was too sensitive would be that it could start flapping between two replicas with very different views of the range descriptor. Although the generation would prevent that, I guess.

Yup, things mostly work as expected: the lease sequences and range descriptor generations make sure our cache never regresses. The only subtlety I found was around speculative leases -- whenever there's a speculative lease coming from a replica that has an older view of the range descriptor, we simply disregard it.
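
(A minimal sketch of that regression check, with simplified stand-in types rather than the real RangeCache/roachpb code:)

type rangeDesc struct{ Generation int64 }
type leaseInfo struct{ Sequence int64 } // Sequence == 0 models a speculative lease

// shouldUpdateCache accepts incoming routing information only if it does not
// regress the cached descriptor generation or lease sequence. A speculative
// lease accompanied by an older view of the range descriptor is disregarded,
// per the discussion above.
func shouldUpdateCache(cachedDesc, newDesc rangeDesc, cachedLease, newLease leaseInfo) bool {
    if newDesc.Generation < cachedDesc.Generation {
        return false // never regress the descriptor
    }
    if newLease.Sequence == 0 {
        return newDesc.Generation > cachedDesc.Generation
    }
    return newLease.Sequence >= cachedLease.Sequence // never regress the lease
}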

I'll add you to the PR as well once it's out.

@craig craig bot closed this as completed in 9351986 Jul 29, 2022
@erikgrinaker
Contributor Author

Sounds good, thanks!

craig bot pushed a commit that referenced this issue Aug 1, 2022
84420: sql/logictests: don't fail parent test when using retry r=knz a=stevendanna

testutils.SucceedsSoon calls Fatal() on the passed testing.T.  Here,
we were calling SucceedsSoon with the root T, which in the case of a
subtest resulted in this error showing up in our logs

    testing.go:1169: test executed panic(nil) or runtime.Goexit:
    subtest may have called FailNow on a parent test

This moves to using SucceedsSoonError so that we can process errors
using the same formatting that we use elsewhere.

Release note: None

85120: roachpb: make range/replica descriptor fields non-nullable r=pavelkalinnikov a=erikgrinaker

This patch makes all fields in range/replica descriptors non-nullable,
fixing a long-standing TODO.

Additionally, this fixes a set of bugs where code would use regular
comparison operators (e.g. `==`) to compare replica descriptors, which
with nullable pointer fields would compare the memory locations rather
than the values. One practical consequence of this was that the
DistSender would fail to use a new leaseholder with a non-`VOTER` type
(e.g.  `VOTER_INCOMING`), instead continuing to try other replicas
before backing off. However, there are further issues surrounding this
bug and it will be addressed separately in a way that is more amenable
to backporting.

The preparatory work for this was done in ea720e3.

Touches #85060.
Touches #38308.
Touches #38465.

Release note: None

85352: opt: revert data race fix r=mgartner a=mgartner

This commit reverts #37972. We no longer lazily build filter props and
share them across multiple threads.

Release note: None

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>