kvserver: consider replica GC when leader loses quorum #61977
Conversation
Force-pushed e121444 to 9b0058c
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @irfansharif)
pkg/kv/kvserver/replica_gc_queue.go, line 168 at r1 (raw file):
```go
		}
	}
} else if replDesc.GetType() != roachpb.VOTER_FULL {
```
@aayushshah15 this code (and the old code) marks non-voting replicas as suspect and checks them eagerly. I assume this was not the intention. We should change this to `typ := replDesc.GetType(); typ != roachpb.VOTER_FULL && typ != roachpb.NON_VOTER`, right?
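For reference, a minimal sketch of how the suggested condition would slot in; the surrounding `isSuspect` assignment is assumed from context and not part of the quote above:

```go
// Only replica types other than VOTER_FULL and NON_VOTER are treated as suspect.
if typ := replDesc.GetType(); typ != roachpb.VOTER_FULL && typ != roachpb.NON_VOTER {
	isSuspect = true
}
```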
Very nice. Could you add the other thing we discussed, namely forcing a consistency check on the range right after you've recovered? See ForceConsistencyQueueProcess
. This would be particularly powerful if you also disabled the replicaGCQueue before that call (so that the old replica is still around when the consistency check happens) but I won't push for that.
It is concerning that Irfan saw consistency failures in the original issue, I don't think he made that up.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @irfansharif)
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @irfansharif)
Sure. I did do that "manually" while debugging this, and ran a bunch of stress tests, but never saw any evidence of consistency issues -- apart from log warnings that seemed entirely reasonable (I think because replicas were unavailable or something like that). Will update the tests and dig a bit more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @irfansharif)
pkg/kv/kvserver/replica_gc_queue.go, line 168 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You're right. Would it be okay to make the change in this PR or should I be creating a new patch for it?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/replica_gc_queue.go, line 168 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
You're right. Would it be okay to make the change in this PR or should I be creating a new patch for it?
I can include it in this PR, as a separate commit.
Force-pushed 9b0058c to 8a427a6
Previously, a non-voting replica (with type `NON_VOTER`) would be considered suspect by the replica GC queue, and eagerly checked for GC every second. This patch changes the replica GC queue to not consider non-voting replicas suspect, since they aren't particularly suspect.

Release note: None
Force-pushed 8a427a6 to 8755b6b
Could you add the other thing we discussed, namely forcing a consistency check on the range right after you've recovered? See ForceConsistencyQueueProcess. This would be particularly powerful if you also disabled the replicaGCQueue before that call (so that the old replica is still around when the consistency check happens) but I won't push for that.
So I ended up doing this by adding a non-voting replica to the mix, which (now) isn't considered suspect. I could've done this via testing knobs as well, but figured it'd be interesting to see how a non-voter behaves here. Consistency checks always pass (so far, anyway).
The non-voter will non-deterministically (rarely) GC its replica, so we don't assert anything about it except that any consistent reads will hit the reset replica. Might be interesting to try an inconsistent read as well, but didn't look into it yet.
I'm not sure what we'd really want the behavior of a non-voter to be. In the sort of scenario where ResetQuorum comes up, I'd imagine someone using follower reads for their app would actually want the non-voter to hang onto the replica for as long as possible.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @irfansharif, and @tbg)
pkg/kv/kvserver/replica_gc_queue.go, line 174 at r3 (raw file):
```go
} else if t := replDesc.GetType(); t != roachpb.VOTER_FULL && t != roachpb.NON_VOTER {
	isSuspect = true
} else {
```
I wonder if there's another conditional missing here. If a replica doesn't know who the leader is, shouldn't it also be suspect? The NON_VOTER in this test should fall into that category - it's a learner, so it couldn't ever become candidate or leader, but conceivably it could clear its view of the leader when the heartbeats stop (though it certainly wouldn't work that way for a quiesced non-voter). Maybe not worth adding too many drive-by bells and whistles here.
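A minimal sketch of the extra conditional being mused about here, assuming the `raftStatus` value already consulted by the surrounding checks (illustrative only, not part of this patch):

```go
// Illustrative only: a replica that knows of no leader could also be marked
// suspect. raft.None (zero) means no leader is currently known.
if raftStatus.Lead == raft.None {
	isSuspect = true
}
```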
pkg/kv/kvserver/reset_quorum_test.go, line 196 at r3 (raw file):
```go
// Wait for n2, n3, n4, and n5 liveness to expire.
time.Sleep(livenessDuration)
```
Just for kicks, could you use the HybridLogicalClock to avoid the sleep as shown here:
cockroach/pkg/kv/kvserver/node_liveness_test.go, lines 313 to 314 in 0b01151:
```go
// Advance clock past the liveness threshold.
manualClock.Increment(tc.Servers[0].NodeLiveness().(*liveness.NodeLiveness).GetLivenessThreshold().Nanoseconds() + 1)
```
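A rough sketch of how that substitution might look in reset_quorum_test.go, assuming the test cluster is wired up with a manual clock as in node_liveness_test.go (the `manualClock` plumbing itself is not shown):

```go
// Instead of: time.Sleep(livenessDuration)
// Advance the clock past the liveness threshold so the liveness records of
// n2, n3, n4, and n5 expire immediately.
manualClock.Increment(tc.Servers[0].NodeLiveness().(*liveness.NodeLiveness).GetLivenessThreshold().Nanoseconds() + 1)
```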
pkg/kv/kvserver/reset_quorum_test.go, line 258 at r3 (raw file):
```go
checkConsistency(t, srv, tc.Server(n4), tc.Server(n5))
// Non-voting replica (n5) may or may not have been GCed, so we don't
```
The replica should ultimately be gc'ed, right? It's just that it might take ten days. This is at odds though with expecting it to always return the newest value, no? The comment here suggests that you are reading this value off the non-voting replica, but I think what in fact happens is that you read from the leaseholder (the reset replica) so it doesn't really test anything about the non-voter. If you checked for the latest value for `k` on the non-voting replica, you should see `original` (or find that the replica no longer exists), though you'd have to sneak past the NotLeaseholderError by going directly to the engine (or try to do follower reads).
It's actually interesting to think about how follower reads would work here. Would we erroneously serve stale follower reads here or would that replica not see new timestamps getting closed out? cc @andreimatei
Reviewed 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @irfansharif)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)
pkg/kv/kvserver/replica_gc_queue.go, line 174 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I wonder if there's another conditional missing here. If a replica doesn't know who the leader is, shouldn't it also be suspect? The NON_VOTER in this test should fall into that category - it's a learner, so it couldn't ever become candidate or leader, but conceivably it could clear its view of the leader when the heartbeats stop (though it certainly wouldn't work that way for a quiesced non voter). Maybe not worth adding too many drive-by bells and whistles here.
Yeah, I was considering something like that as well. In typical ResetQuorum scenarios people are usually stressed out and want as few surprises as possible. The least surprising thing to do is probably to just GC all replicas. Are non-voters the only remaining stragglers, or are there other cases we should consider too?
pkg/kv/kvserver/reset_quorum_test.go, line 196 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Just for kicks, could you use the HybridLogicalClock to avoid the sleep as shown here:
cockroach/pkg/kv/kvserver/node_liveness_test.go, lines 313 to 314 in 0b01151:
```go
// Advance clock past the liveness threshold.
manualClock.Increment(tc.Servers[0].NodeLiveness().(*liveness.NodeLiveness).GetLivenessThreshold().Nanoseconds() + 1)
```
Tried this, but started seeing test failures (writes not taking effect and such). Probably because some processes rely on time flowing forwards.
pkg/kv/kvserver/reset_quorum_test.go, line 258 at r3 (raw file):
The replica should ultimately be gc'ed, right? It's just that it might take ten days.
Yep.
I think what in fact happens is that you read from the leaseholder (the reset replica) so it doesn't really test anything about the non-voter
Yeah, I wanted to test that the non-voter would actually pick up the new n1 leaseholder even though it's not in its own range descriptor (which has n2,n3,n4 as voters). Would be interesting to try a follower read, but we should probably just GC it.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @irfansharif)
pkg/kv/kvserver/replica_gc_queue.go, line 174 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yeah, I was considering something like that as well. In typical ResetQuorum scenarios people are usually stressed out and want as few surprises as possible. The least surprising thing to do is probably to just GC all replicas. Are non-voters the only remaining stragglers, or are there other cases we should consider too?
Quiesced ranges in general are hard to GC in a timely way because nothing wakes them up. This could likely affect the non-voter in this test too. If it's quiesced (i.e. no raft heartbeats are sent to it before the outage) it isn't particularly concerned about not hearing from the leader in a while. It is only after getting woken up (by a request that prompts a lease acquisition, for example) that the heuristics here would kick in. We could reach out to all of the old replicas as part of `ResetQuorum`, but I'd be hesitant to do so in this PR.
pkg/kv/kvserver/reset_quorum_test.go, line 196 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Tried this, but started seeing test failures (writes not taking effect and such). Probably because some processes rely on time flowing forwards.
Can you give me a branch that shows this? This seems like a clear-cut case where this should work, and if it doesn't we should fix it.
pkg/kv/kvserver/reset_quorum_test.go, line 258 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
The replica should ultimately be gc'ed, right? It's just that it might take ten days.
Yep.
I think what in fact happens is that you read from the leaseholder (the reset replica) so it doesn't really test anything about the non-voter
Yeah, I wanted to test that the non-voter would actually pick up the new n1 leaseholder even though it's not in its own range descriptor (which has n2,n3,n4 as voters). Would be interesting to try a follower read, but we should probably just GC it.
Oh, I see. I would make it clear when you're talking about the node the non-voter is on vs the non-voter itself. When you say "the non-voter" I consider the canonical interpretation of that to be the replica itself, but you're not touching that.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)
pkg/kv/kvserver/replica_gc_queue.go, line 174 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Quiesced ranges in general are hard to GC in a timely way because nothing wakes them up. This could likely affect the non-voter in this test too. If it's quiesced (i.e. no raft heartbeats are sent to it before the outage) it isn't particularly concerned about not hearing from the leader in a while. It is only after getting woken up (by a request that prompts a lease acquisition, for example) that the heuristics here would kick in. We could reach out to all of the old replicas as part of `ResetQuorum`, but I'd be hesitant to do so in this PR.
Good point, but we'd want to GC it once it wakes up though.
pkg/kv/kvserver/reset_quorum_test.go, line 196 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you give me a branch that shows this? This seems like a clear-cut case where this should work, and if it doesn't we should fix it.
Pushed it to a branch here: https://github.com/erikgrinaker/cockroach/tree/resetquorum-manualclock-test

Fails on `make test PKG=./pkg/kv/kvserver TESTS=TestResetQuorum/with-leaseholder-elsewhere TESTFLAGS=-v` with `reset_quorum_test.go:140: key not empty, expected to be empty after resetting quorum`.
pkg/kv/kvserver/reset_quorum_test.go, line 258 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Oh, I see. I would make it clear when you're talking about the node the non-voter is on vs the non-voter itself. When you say "the non-voter" I consider the canonical interpretation of that to be the replica itself, but you're not touching that.
Got it, will try to be more precise.
The replica GC queue will normally wait 10 days before checking whether a replica should be GCed (e.g. by looking up the range descriptor). If the replica is considered suspect (e.g. when it is a Raft candidate), the descriptor will be checked every second instead. However, a leader was never considered suspect. This could cause a situation where a leader who lost a quorum and was replaced via `Node.ResetQuorum()` would cling onto the range until the 10 day GC timeout expired and it finally checked the range descriptor.

This patch considers leaders who are not able to make progress suspect, thereby checking the range descriptor every second and GCing the replica if it has been removed from the descriptor. It also adds back a `ResetQuorum` test case that was removed previously as it would fail in this scenario.

Release note (bug fix): A Raft leader who loses quorum will now relinquish its range lease and remove the replica if the range is recreated elsewhere, e.g. via `Node.ResetQuorum()`.
Force-pushed 8755b6b to 885eaec
bors r=tbg

Opened #62075 to follow up on suspect replica heuristics.
Build succeeded:
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/kvserver/reset_quorum_test.go, line 258 at r3 (raw file):
It's actually interesting to think about how follower reads would work here. Would we erroneously serve stale follower reads here or would that replica not see new timestamps getting closed out?
I couldn't immediately get enough context to understand the question. Can you describe the scenario if you still care?
I suspect this is worth backporting to 21.1 to let us use reset-quorum on 21.1 clusters.
The replica GC queue will normally wait 10 days before checking whether
a replica should be GCed (e.g. by looking up the range descriptor). If
the replica is considered suspect (e.g. when it is a Raft candidate),
the descriptor will be checked every second instead.
However, a leader was never considered suspect. This could cause a
situation where a leader who lost a quorum and was replaced via
`Node.ResetQuorum()` would cling onto the range until the 10 day GC
timeout expired and it finally checked the range descriptor.
This patch considers leaders who are not able to make progress suspect,
thereby checking the range descriptor every second and GCing the replica
if it has been removed from the descriptor. It also adds back a
`ResetQuorum` test case that was removed previously as it would fail in
this scenario.
Resolves #58376.
Release note (bug fix): A Raft leader who loses quorum will now
relinquish its range lease and remove the replica if the range is
recreated elsewhere, e.g. via `Node.ResetQuorum()`.
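For illustration, a self-contained sketch of the "leader that cannot make progress is suspect" idea described above. The types and names here are simplified stand-ins, not CockroachDB's actual kvserver APIs:

```go
package main

import "fmt"

// replicaState is a simplified stand-in for the information the replica GC
// queue inspects; it is not a CockroachDB type.
type replicaState struct {
	isLeader        bool
	votersTotal     int // voters listed in the range descriptor
	votersReachable int // voters (including self) the leader has heard from recently
}

// leaderLostQuorum reports whether a raft leader should be treated as suspect:
// without a quorum of reachable voters it can never commit anything, so its
// copy of the range may be permanently stale and worth checking eagerly.
func leaderLostQuorum(s replicaState) bool {
	if !s.isLeader {
		return false
	}
	quorum := s.votersTotal/2 + 1
	return s.votersReachable < quorum
}

func main() {
	// A leader of a 3-voter range that can only reach itself has lost quorum,
	// so the GC queue should check the range descriptor every second instead
	// of waiting for the ~10 day timeout.
	fmt.Println(leaderLostQuorum(replicaState{isLeader: true, votersTotal: 3, votersReachable: 1})) // true
}
```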