kv/kvserver: TestAdminRelocateRange failed #84242

Closed
cockroach-teamcity opened this issue Jul 12, 2022 · 16 comments · Fixed by #106793
Assignees: pav-kv

Labels:
- A-kv-test-failure-complex: A kv C-test-failure which requires a medium-large amount of work to address.
- branch-master: Failures and bugs on the master branch.
- C-test-failure: Broken test (automatically or manually discovered).
- O-robot: Originated from a bot.
- skipped-test

Comments

@cockroach-teamcity (Member) commented Jul 12, 2022

kv/kvserver.TestAdminRelocateRange failed with artifacts on master @ 571bfa3afb3858ae84d8a8fcdbb0a38e058402a5:

=== RUN   TestAdminRelocateRange
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestAdminRelocateRange2871936122
    test_log_scope.go:80: use -show-logs to present logs inline
    client_relocate_range_test.go:238: 
        	Error Trace:	client_relocate_range_test.go:189
        	            				client_relocate_range_test.go:238
        	Error:      	Not equal: 
        	            	expected: 3
        	            	actual  : 2
        	Test:       	TestAdminRelocateRange
        	Messages:   	wrong number of atomic changes
    client_relocate_range_test.go:238: all changes:
    client_relocate_range_test.go:238: 1: [{ADD_VOTER n4,s4} {REMOVE_VOTER n1,s1}]
    client_relocate_range_test.go:238: 2: [{ADD_VOTER n6,s6} {REMOVE_VOTER n3,s3}]
    panic.go:661: -- test log scope end --
--- FAIL: TestAdminRelocateRange (1.93s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv


Jira issue: CRDB-17542

@cockroach-teamcity added the branch-master, C-test-failure, and O-robot labels Jul 12, 2022
@cockroach-teamcity added this to the 22.2 milestone Jul 12, 2022
@blathers-crl (bot) added the T-kv label Jul 12, 2022
@cockroach-teamcity (Member, Author) commented:

kv/kvserver.TestAdminRelocateRange failed with artifacts on master @ 158ebe679bee1386df875ef8a175963bc6ac0cd1:

=== RUN   TestAdminRelocateRange
    test_log_scope.go:162: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestAdminRelocateRange1850405510
    test_log_scope.go:80: use -show-logs to present logs inline
    client_relocate_range_test.go:238: 
        	Error Trace:	client_relocate_range_test.go:189
        	            				client_relocate_range_test.go:238
        	Error:      	Not equal: 
        	            	expected: 3
        	            	actual  : 2
        	Test:       	TestAdminRelocateRange
        	Messages:   	wrong number of atomic changes
    client_relocate_range_test.go:238: all changes:
    client_relocate_range_test.go:238: 1: [{ADD_VOTER n4,s4} {REMOVE_VOTER n1,s1}]
    client_relocate_range_test.go:238: 2: [{ADD_VOTER n6,s6} {REMOVE_VOTER n3,s3}]
    panic.go:500: -- test log scope end --
--- FAIL: TestAdminRelocateRange (2.21s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)


adityamaru added a commit to adityamaru/cockroach that referenced this issue Sep 1, 2022
Refs: cockroachdb#84242

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
craig bot pushed a commit that referenced this issue Sep 5, 2022
87280: kv/kvserver: skip TestAdminRelocateRange r=tbg a=adityamaru

Refs: #84242

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: adityamaru <[email protected]>
@kvoli self-assigned this Sep 6, 2022
@kvoli (Collaborator) commented Sep 6, 2022

When stressing, after a few hundred runs this test fails.

The end result is correct:

relocate (s1,s2,s3) -> (s4,s5,s6)

However, one of the atomic swaps fails,

s2 -> sx

due to the incoming voter (s5) being in StateProbe, which prevents it from receiving the lease while the demoting learner (s2) is currently the leaseholder:

I220906 21:32:55.924113 13 kv/kvserver_test/client_relocate_range_test.go:59  [-] 359  AdminRelocateRange failed with error: while carrying out changes [{ADD_VOTER n5,s5} {REMOVE_VOTER n2,s2}]: refusing to transfer lease to (n5,s5):5VOTER_INCOMING because target may need a Raft snapshot: replica in StateProbe

The same behavior also occurs with s2 being removed as the leaseholder but with s6 being added instead.

The rejection for a replica in StateProbe is expected; the question is why the replica that swaps with the leaseholder (the incoming voter) is behind at all, when it was just streamed a snapshot.

@andrewbaptist (Collaborator) commented:

I don't know enough about the raft state machine, but it's possible that all replicas transition through StateProbe before getting to StateReplicate. Generally, this should be a "fast" transition, and it likely is in this test as well; however, it is possible to catch it between these two states.

Looking at AdminRelocateRange, it ends up calling AdminChangeReplicas which then calls initializeRaftLearners and finally sendSnapshot. This indeed waits until the snapshot has been delivered. What I can't tell is how the transition from StateProbe to StateReplicate happens. My guess is that it waits for a consensus of peers to ack the msg. This seems like it could happen after the snapshot is received...

@nvanbenschoten (Member) commented:

I think @andrewbaptist is on the right track. When a snapshot is completed successfully, CRDB calls etcd/raft's RawNode.ReportSnapshot from Replica.reportSnapshotStatus. This transitions the follower to StateProbe here. As referenced in this comment, etcd/raft will then wait until the MsgAppResp that is sent out by the follower when applying the snapshot is received before transitioning to StateReplicate.

So there does seem to be a race here if the snapshot response races ahead of the MsgAppResp and then the outgoing leaseholder begins to transfer the lease before it has received the MsgAppResp.

I don't understand why etcd/raft was designed this way. Why wait for the MsgAppResp instead of transitioning straight to StateReplicate on RawNode.ReportSnapshot? Is it because ReportSnapshot is only intended to indicate that the snapshot has been sent and not that it has been applied? Is it because snapshots can be ignored if they are redundant or from an old term and the leader wants to use its existing MsgAppResp handling logic to deal with these cases instead of duplicating them? @tbg @bdarnell do you have any recollection of this?

I ask because I wonder whether it would be worth avoiding this race by either synthesizing a MsgAppResp in Replica.reportSnapshotStatus, or by sending the corresponding MsgAppResp back in the SnapshotResponse so that it can be handed to etcd/raft directly in Replica.reportSnapshotStatus (under raftMu). We do wait for snapshot application before calling ReportSnapshot on the leader, so both of these options are available to us.
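
For concreteness, a minimal sketch of that second alternative (Go, written against the standalone go.etcd.io/raft/v3 module; SnapshotResponse, AppliedMsgAppResp, and reportSnapshotApplied are hypothetical stand-ins, not the actual CockroachDB API):

```go
package kvsketch

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// SnapshotResponse stands in for the RPC response the snapshot receiver
// sends back once it has applied the snapshot.
type SnapshotResponse struct {
	// AppliedMsgAppResp is the MsgAppResp the follower generated while
	// applying the snapshot, acking the snapshot index.
	AppliedMsgAppResp raftpb.Message
}

// reportSnapshotApplied runs on the snapshot sender once the snapshot RPC
// has returned successfully. In the real system this would run under the
// replica's raftMu so that both raft interactions happen atomically with
// respect to incoming raft traffic.
func reportSnapshotApplied(rn *raft.RawNode, to uint64, resp SnapshotResponse) error {
	// Tell raft the snapshot was delivered; the follower's Progress moves
	// from StateSnapshot to StateProbe.
	rn.ReportSnapshot(to, raft.SnapshotFinish)
	// Step the follower's own ack immediately, so the leader can move the
	// follower to StateReplicate without waiting for the same MsgAppResp to
	// arrive over the raft transport (the race described above).
	return rn.Step(resp.AppliedMsgAppResp)
}
```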

@tbg (Member) commented Sep 8, 2022

https://github.com/etcd-io/etcd/blob/bb1619f893cccb13acc4fca850f3677735f18b9a/raft/raft.go#L1251-L1262

> Why wait for the MsgAppResp instead of transitioning straight to StateReplicate on RawNode.ReportSnapshot?

That is a good question. This decision was made very early in the life of raft[^1] and I don't see a justification for it. At the time it was likely just as good to do either. Today, at least in CRDB, it seems a lot better to move to replicating state directly, which we already do (thanks to changes we made) on the MsgAppResp path[^2].

Looking at this code I think there's an actual problem that is unrelated to the issue at hand but is worse. What raft really wants when it asks for a snapshot is for a snapshot at a particular index[^3]. We ignore that and send a snapshot at whatever the index is when we call GetSnapshot[^4]. However, raft remembers the original index in the peer's progress, and so it has an effect in ReportSnapshot: the leader will consider the peer to have caught up to the requested index, where it is actually caught up to some other index.

So if we ended up sending a snapshot that has a lower index than what raft requested, we'd have the leader believe that entries that are not in fact reflected on the followers state actually made it there. You can from that derive scenarios in which raft "commits" an entry that isn't on a quorum. Now without delegated follower snaps, this couldn't happen (applied index doesn't move backwards within an instance of the progress) but could it happen with delegation? The leaseholder is asked for a snapshot at 100, it asks a follower that is behind to send a snapshot and so it does, but maybe it only has applied the log up to index 50 and thus sends a snapshot at 50. To the best of my knowledge, there is no protection here - DelegatedSnapshotRequest does not carry a "min index"[^5]. I think this is a real issue, filed #87581.

I like the first alternative in this issue, which is to fix it at the level of raft. If we end up deciding to do that, we should also move straight to StateReplicate.

However, I think for 22.2 we need a more targeted solution. We could insert a bounded wait for the follower to leave StateProbe as part of the snapshot sending step.
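
A rough sketch of what such a bounded wait could look like on the snapshot sender, polling the leader's view of the follower's progress (the helper name and polling loop are illustrative only, not the actual change):

```go
package kvsketch

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/tracker"
)

// waitForReplicate polls the leader's raft status until the follower has
// reached StateReplicate, or gives up after the timeout. Sketch only: it
// assumes it runs on the leader (only the leader tracks Progress) and that
// the caller handles the locking around Status().
func waitForReplicate(
	ctx context.Context, rn *raft.RawNode, followerID uint64, timeout time.Duration,
) error {
	deadline := time.Now().Add(timeout)
	for {
		if pr, ok := rn.Status().Progress[followerID]; ok && pr.State == tracker.StateReplicate {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("follower %d not in StateReplicate after %s", followerID, timeout)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Millisecond):
		}
	}
}
```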

Footnotes

[^1]: https://github.com/etcd-io/etcd/commit/67194c0b2272f7363ff05e7ad66246743e548c46#diff-b9adbc46e4a317ffbb3d11a66c38d6b9af41a09170d77d87efbd96d115da452fR608
[^2]: https://github.com/etcd-io/etcd/blob/bb1619f893cccb13acc4fca850f3677735f18b9a/raft/raft.go#L1251-L1262
[^3]: https://github.com/etcd-io/etcd/blob/bb1619f893cccb13acc4fca850f3677735f18b9a/raft/raft.go#L452-L470
[^4]: https://github.com/cockroachdb/cockroach/blob/3fe8ffc7f854b6ff9689735dbb1e6ba6e661e027/pkg/kv/kvserver/replica_raftstorage.go#L435-L440

@bdarnell (Contributor) commented Sep 8, 2022

My recollection is that the split between ReportSnapshot and MsgAppResp was intended to permit things like delegated snapshots at a different index: raft asks for a snapshot without specifying exactly when, a snapshot is sent, and then the receiving node reports its new state. However, this was incompletely thought out and as @tbg says it got things exactly backwards. We should be getting the matched index from MsgAppResp instead of from when we sent the snapshot (or enforce that the delegated snapshot is not sent at an index that's too low), while it should be fine to optimistically go to StateReplicate (StateReplicate vs StateProbe is just a performance optimization).

@nvanbenschoten (Member) commented:

> We could insert a bounded wait for the follower to leave StateProbe as part of the snapshot sending step.

This is interesting, as it hints at an answer to the question of "why do both ReportSnapshot and MsgAppResp transition out of StateSnapshot?" Naively, I would question why we can't get rid of ReportSnapshot, at least for successful snapshots.

Would it be correct to say that the reason we need ReportSnapshot(SnapshotFinish) is because the main raft transport can be lossy, but etcd/raft won't retry a snapshot on its own, so we need some reliable way to indicate that the snapshot attempt is complete?

If so, then maybe the "sending back the corresponding MsgAppResp in the SnapshotResponse so that it can be handed to etcd/raft directly in Replica.reportSnapshotStatus (under raftMu) to avoid the race" alternative is worth exploring further.

This relates to:

> However, I think for 22.2 we need a more targeted solution. We could insert a bounded wait for the follower to leave StateProbe as part of the snapshot sending step.

If we had such a delay to allow the MsgAppResp to propagate, would we even want to call ReportSnapshot(SnapshotFinish)? Could we just wait for the follower to leave StateSnapshot?

@tbg (Member) commented Sep 9, 2022

> Would it be correct to say that the reason we need ReportSnapshot(SnapshotFinish) is because the main raft transport can be lossy, but etcd/raft won't retry a snapshot on its own, so we need some reliable way to indicate that the snapshot attempt is complete?

Yes, that's exactly right. In StateSnapshot raft ceases to communicate with the follower at all, so if we mess it up it's deadlock time. (Except of course the way we set up the raft snap queue, it will just send another snapshot down the line and hopefully things will clear up that way since the MsgAppResp won't be lost every time).

I would like to get to the point where ReportSnapshot is just a performance optimization, even on a lossy transport. For example, let raft treat StateSnapshot exactly like StateProbing; the follower is periodically contacted (with an empty MsgApp) and so it will, once it has gotten a snap, either accept or send a rejection hint that sets the leader straight to whatever the actual index is. There would still be marginal value in calling ReportSnapshot to have raft try faster, but we really shouldn't give it an index that we promise is durable on the follower (after all, that sort of thing is better kept inside of raft) but the index would be a hint for the leader on a good log index to probe (assuming it's in advance of what the leader tracks, i.e. ~always). (I have to double check if we're promising durability as is, I think it's interpreted more as a hint, but anyway, this is the way things should work; passing a bogus index shouldn't cause correctness issues).

> If so, then maybe the "sending back the corresponding MsgAppResp in the SnapshotResponse so that it can be handed to etcd/raft directly in Replica.reportSnapshotStatus (under raftMu) to avoid the race" alternative is worth exploring further.

That seems like a good short-term solution that walks us towards adding that index to ReportSnapshot without requiring raft changes. Unfortunately, it's not obvious that we can pull the MsgAppResp out of raft; we can't call .Ready in the apply snapshot code because that method is not idempotent. We'd have to call ReadyWithoutAdvance but that's not currently exported, so we're looking at a small raft change anyway. And even then, there might be other messages waiting, so there is some awkward sifting through data that had better not happen.

I think I still favor making raft transition immediately to StateReplicate (via StateProbe) in (*Progress).BecomeProbe here: https://github.com/etcd-io/etcd/blob/6332731ea7db68fa3a058d45deaedbae371ae3f4/raft/raft.go#L1328-L1330

After we call ReportSnapshot, we are in StateProbe and have a nonzero Match, but we might as well be in StateReplicate, and that would prevent the problem. I think it would also prevent #87553, since the MsgAppResp(rej=true) would be discarded in StateReplicate (or should at least), since raft can tell that it must be a stale message. (I think it could do that in StateProbe too, but still, we need StateReplicate to do the lease transfer).
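
Roughly, the shape of that idea as a sketch against raft's tracker package (illustrative only; this is not the change that was ultimately merged): when a peer leaves StateSnapshot after a successful ReportSnapshot, move it optimistically to StateReplicate and continue appending from PendingSnapshot+1, rather than parking it in StateProbe.

```go
package raftsketch

import "go.etcd.io/raft/v3/tracker"

// becomeReplicateAfterSnapshot sketches the alternative discussed above:
// instead of calling BecomeProbe after a successful snapshot report, move
// the peer straight to StateReplicate, probing from PendingSnapshot+1.
func becomeReplicateAfterSnapshot(pr *tracker.Progress) {
	if pr.State != tracker.StateSnapshot {
		return
	}
	pendingSnapshot := pr.PendingSnapshot
	pr.BecomeReplicate() // resets state and sets Next = Match + 1
	if next := pendingSnapshot + 1; next > pr.Next {
		pr.Next = next
	}
}
```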

@tbg (Member) commented Sep 9, 2022

Most recent comment here: #87581 (comment)

I'm back to being in favor of what Nathan originally suggested, which is transporting the MsgAppResp to the leader so that it can be r.Step'ed into the state machine right after ReportSnapshot. This will move from StateSnapshot to StateProbe to StateReplicate and so we wouldn't see StateProbe in the subsequent step.

@kvoli removed this from the 22.2 milestone Mar 9, 2023
@kvoli removed their assignment Jun 29, 2023
@kvoli added the A-kv-test-failure-complex label Jul 6, 2023
@tbg (Member) commented Jul 13, 2023

re: the flakiness level, yeah it does seem extremely reduced (if not to practically zero). The historical flake reports in the issue above took ~2s to run, meaning they weren't on super overloaded machines. I'm currently stressing this test at an intensity that closely reflects that (~288 runs over 10 minutes) and have yet to see a failure; stressing with higher --cpu until my gceworker ran out of memory also didn't lead to a repro. Still, I'm not going to unskip this until the above failure mode is impossible.

tbg added a commit to tbg/cockroach that referenced this issue Jul 14, 2023
This addresses the following race:

- n1 runs a ConfChange that adds n2 as a learner.
- n1 sends MsgApp to the learner.
- n1 starts the INITIAL snapshot, say at index 100.
- n2 receives n1's MsgApp. Since it's an uninitialized
  Replica and its log is empty, it rejects this MsgApp.
- n2 receives and applies the INITIAL snapshot, which prompts it to send an
  affirmative MsgAppResp to n1.
- n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success))
- n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the
  rejection comes with a RejectHint (suggested next index to try) of zero,
  which is not in n1's log. In particular, the SnapshotIndex will likely be
  higher than the index of the snapshot actually sent, say 101.
- n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101
  so this is ignored and the follower remains in StateSnapshot.

With this commit, the last two steps cannot happen: n2 transitions straight to
StateReplicate because we step a copy of the affirmative MsgAppResp in. The
later rejection will be dropped, since it is stale (you can't hint at index zero
when you already have a positive index confirmed).

I will add that there is no great testing for the above other than stressing
the test with additional logging, noting the symptoms, and noting that they
disappear with this commit. Scripted testing of this code is within reach[^1]
but is outside of the scope of this PR.

[^1]: cockroachdb#105177

There is an additional bit of brittleness that is silently suppressed by this
commit, but which deserves to be fixed independently because how the problem
gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its
current LastIndex and uses it as the PendingSnapshot for the follower's
Progress.

At the time of writing, MsgAppResp that reconnect the follower to the log but
which are not greater than or equal to PendingSnapshot are ignored. In effect,
this means that perfectly good snapshots are thrown away if they happen to be a
little bit stale. In the example above, the snapshot is stale: PendingSnapshot
is 101, but the snapshot is at index 100. Then how does this commit (mostly)
fix the problem, i.e. why isn't the snapshot discarded? The key is that when we
synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection
hasn't arrived yet and so the follower transitions into StateReplicate with a
Match of 100. This is then enough so that raft recognizes the rejected MsgApp
as stale (since it would regress on durably stored entries). However, there is
an alternative example where the rejection arrives earlier: after the snapshot
index has been picked, but before the follower has been transitioned into
StateReplicate. For this to have a negative effect, an entry has to be appended
to the leader's log between generating the snapshot and handling the rejection.
Without the combination of delegated snapshots and sustained write activity on
the leader, this window is small, and this combination is usually not present
in tests but it may well be relevant in "real" clusters. We track addressing
this in cockroachdb#106813.

Closes cockroachdb#87554.
Closes cockroachdb#97971.
Closes cockroachdb#84242.

Epic: None
Release note (bug fix): removed a source of unnecessary Raft snapshots during
replica movement.
tbg added a commit to tbg/raft that referenced this issue Jul 14, 2023
Prior to this commit, the leader would not take into account snapshots reported
by a follower unless they matched or exceeded the tracked PendingSnapshot index
(which is the leader's last index at the time of requesting the snapshot). This
is too inflexible: the leader should take into account any snapshot that
reconnects the follower to its log. This PR makes that change.

In doing so, it addresses long-standing problems that we've encountered in
CockroachDB. Unless you create the snapshot immediately and locally when raft
emits an MsgSnap, it's difficult/impossible to later synthesize a snapshot at
the requested index. It is possible to get one above the requested index which
raft always accepted, but CockroachDB delegates snapshots to followers who
might be behind on applying the log, and it is awkward to have to wait for log
application to send the snapshot just to satisfy an overly strict condition in
raft. Additionally, CockroachDB also sends snapshots preemptively when adding a
new replica since there are qualitative differences between an initial snapshot
and one needed to reconnect to the log and one does not want to wait for raft
to round-trip to the follower to realize that a snapshot is needed. In this
case, the sent snapshot is commonly behind the PendingSnapshot since the leader
transitions the follower into StateProbe when a snapshot is already in flight.

Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Tobias Grieger <[email protected]>
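
As a sketch, the relaxed acceptance condition described in this commit amounts to something like the following (illustrative; the exact guard in raft's MsgAppResp handling differs in detail):

```go
package raftsketch

import "go.etcd.io/raft/v3/tracker"

// canResumeReplicate sketches the relaxed rule: instead of requiring the
// follower's acked index to reach the tracked PendingSnapshot, accept any
// ack that reconnects the follower to the leader's log, i.e. any ack whose
// successor index has not been compacted out of the leader's log.
func canResumeReplicate(pr *tracker.Progress, leaderFirstIndex uint64) bool {
	if pr.State != tracker.StateSnapshot {
		return false
	}
	// Old rule (too strict): pr.Match >= pr.PendingSnapshot.
	// Relaxed rule: the leader can append to this follower directly.
	return pr.Match+1 >= leaderFirstIndex
}
```
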
@tbg (Member) commented Jul 18, 2023

The combination of #106793 and (for a possibly much rarer failure mode) etcd-io/raft#84 should likely fix this issue "holistically". Since both are on Pavel's docket for review and realistically speaking they may only merge after my departure, I'm assigning this issue to Pavel.

@tbg assigned pav-kv and unassigned themselves Jul 18, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jul 20, 2023
craig bot pushed a commit that referenced this issue Jul 21, 2023
106793: kvserver: communicate snapshot index back along with snapshot response r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
@craig (bot) closed this as completed in e968604 Jul 21, 2023
THardy98 pushed a commit to THardy98/cockroach that referenced this issue Jul 24, 2023
erikgrinaker pushed a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker pushed a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
Prior to this commit, the leader would not take into account snapshots
reported by a follower unless they matched or exceeded the tracked
PendingSnapshot index (which is the leader's last index at the time of
requesting the snapshot). This is too inflexible: the leader should take
into account any snapshot that reconnects the follower to its log. This
PR adds a config option ResumeReplicateBelowPendingSnapshot that enables
this behavior.

In doing so, it addresses long-standing problems that we've encountered
in CockroachDB. Unless you create the snapshot immediately and locally
when raft emits an MsgSnap, it's difficult/impossible to later
synthesize a snapshot at the requested index. It is possible to get one
above the requested index which raft always accepted, but CockroachDB
delegates snapshots to followers who might be behind on applying the
log, and it is awkward to have to wait for log application to send the
snapshot just to satisfy an overly strict condition in raft.
Additionally, CockroachDB also sends snapshots preemptively when adding
a new replica since there are qualitative differences between an initial
snapshot and one needed to reconnect to the log and one does not want to
wait for raft to round-trip to the follower to realize that a snapshot
is needed. In this case, the sent snapshot is commonly behind the
PendingSnapshot since the leader transitions the follower into
StateProbe when a snapshot is already in flight.

Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
Touches cockroachdb/cockroach#114349.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Erik Grinaker <[email protected]>
Signed-off-by: Tobias Grieger <[email protected]>
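
If the option ships under the name used in this commit message, enabling it would look roughly like the following; the ResumeReplicateBelowPendingSnapshot field is taken from the commit message above and is an assumption here, so check the released raft API before relying on it:

```go
package main

import (
	"log"

	"go.etcd.io/raft/v3"
)

func main() {
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
		// Assumed field (named in the commit message above): accept a
		// follower MsgAppResp that reconnects it to the leader's log even
		// if its index is below the tracked PendingSnapshot.
		ResumeReplicateBelowPendingSnapshot: true,
	}
	if _, err := raft.NewRawNode(cfg); err != nil {
		log.Fatal(err)
	}
}
```
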
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 4, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 4, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 5, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 9, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 9, 2024