kv/kvserver: TestAdminRelocateRange failed #84242

Closed
cockroach-teamcity opened this issue Jul 12, 2022 · 16 comments · Fixed by #106793
Assignees: pav-kv

Labels:
- A-kv-test-failure-complex: A kv C-test-failure which requires a medium-large amount of work to address.
- branch-master: Failures and bugs on the master branch.
- C-test-failure: Broken test (automatically or manually discovered).
- O-robot: Originated from a bot.
- skipped-test

Comments

@cockroach-teamcity (Member) commented Jul 12, 2022

kv/kvserver.TestAdminRelocateRange failed with artifacts on master @ 571bfa3afb3858ae84d8a8fcdbb0a38e058402a5:

=== RUN   TestAdminRelocateRange
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestAdminRelocateRange2871936122
    test_log_scope.go:80: use -show-logs to present logs inline
    client_relocate_range_test.go:238: 
        	Error Trace:	client_relocate_range_test.go:189
        	            				client_relocate_range_test.go:238
        	Error:      	Not equal: 
        	            	expected: 3
        	            	actual  : 2
        	Test:       	TestAdminRelocateRange
        	Messages:   	wrong number of atomic changes
    client_relocate_range_test.go:238: all changes:
    client_relocate_range_test.go:238: 1: [{ADD_VOTER n4,s4} {REMOVE_VOTER n1,s1}]
    client_relocate_range_test.go:238: 2: [{ADD_VOTER n6,s6} {REMOVE_VOTER n3,s3}]
    panic.go:661: -- test log scope end --
--- FAIL: TestAdminRelocateRange (1.93s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv


Jira issue: CRDB-17542

@cockroach-teamcity added the branch-master, C-test-failure, and O-robot labels Jul 12, 2022
@cockroach-teamcity added this to the 22.2 milestone Jul 12, 2022
@blathers-crl (bot) added the T-kv label Jul 12, 2022
@cockroach-teamcity (Member, Author) commented:

kv/kvserver.TestAdminRelocateRange failed with artifacts on master @ 158ebe679bee1386df875ef8a175963bc6ac0cd1:

=== RUN   TestAdminRelocateRange
    test_log_scope.go:162: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestAdminRelocateRange1850405510
    test_log_scope.go:80: use -show-logs to present logs inline
    client_relocate_range_test.go:238: 
        	Error Trace:	client_relocate_range_test.go:189
        	            				client_relocate_range_test.go:238
        	Error:      	Not equal: 
        	            	expected: 3
        	            	actual  : 2
        	Test:       	TestAdminRelocateRange
        	Messages:   	wrong number of atomic changes
    client_relocate_range_test.go:238: all changes:
    client_relocate_range_test.go:238: 1: [{ADD_VOTER n4,s4} {REMOVE_VOTER n1,s1}]
    client_relocate_range_test.go:238: 2: [{ADD_VOTER n6,s6} {REMOVE_VOTER n3,s3}]
    panic.go:500: -- test log scope end --
--- FAIL: TestAdminRelocateRange (2.21s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)


adityamaru added a commit to adityamaru/cockroach that referenced this issue Sep 1, 2022
Refs: cockroachdb#84242

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
craig bot pushed a commit that referenced this issue Sep 5, 2022
87280: kv/kvserver: skip TestAdminRelocateRange r=tbg a=adityamaru

Refs: #84242

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: adityamaru <[email protected]>
@kvoli self-assigned this Sep 6, 2022
@kvoli (Collaborator) commented Sep 6, 2022

When stressing, after a few hundred runs this test fails.

The end result is correct:

relocate (s1,s2,s3) -> (s4,s5,s6)

However, one of the atomic swaps fails,

s2 -> sx

due to the incoming voter (s5) being in StateProbe, which prevents it from receiving the lease while the demoting learner (s2) is currently the leaseholder:

I220906 21:32:55.924113 13 kv/kvserver_test/client_relocate_range_test.go:59  [-] 359  AdminRelocateRange failed with error: while carrying out changes [{ADD_VOTER n5,s5} {REMOVE_VOTER n2,s2}]: refusing to transfer lease to (n5,s5):5VOTER_INCOMING because target may need a Raft snapshot: replica in StateProbe

The same behavior also occurs with s2 being removed as the leaseholder but with s6 being added instead.

The rejection for a replica in StateProbe is expected; the question is why the replica that swaps with the leaseholder (the incoming voter) is behind at all, when it was just streamed a snapshot.

@andrewbaptist (Collaborator) commented:

I don't know enough about the raft state machine, but it's possible that all replicas transition through StateProbe before getting to StateReplicate. Generally, this should be a "fast" transition, and it likely is in this test as well; however, it is possible to catch it between these two states.

Looking at AdminRelocateRange, it ends up calling AdminChangeReplicas which then calls initializeRaftLearners and finally sendSnapshot. This indeed waits until the snapshot has been delivered. What I can't tell is how the transition from StateProbe to StateReplicate happens. My guess is that it waits for a consensus of peers to ack the msg. This seems like it could happen after the snapshot is received...

@nvanbenschoten (Member) commented:

I think @andrewbaptist is on the right track. When a snapshot is completed successfully, CRDB calls etcd/raft's RawNode.ReportSnapshot from Replica.reportSnapshotStatus. This transitions the follower to StateProbe here. As referenced in this comment, etcd/raft will then wait until the MsgAppResp that is sent out by the follower when applying the snapshot is received before transitioning to StateReplicate.

So there does seem to be a race here if the snapshot response races ahead of the MsgAppResp and then the outgoing leaseholder begins to transfer the lease before it has received the MsgAppResp.

I don't understand why etcd/raft was designed this way. Why wait for the MsgAppResp instead of transitioning straight to StateReplicate on RawNode.ReportSnapshot? Is it because ReportSnapshot is only intended to indicate that the snapshot has been sent and not that it has been applied? Is it because snapshots can be ignored if they are redundant or from an old term and the leader wants to use its existing MsgAppResp handling logic to deal with these cases instead of duplicating them? @tbg @bdarnell do you have any recollection of this?

I ask because I wonder whether it would be worth avoiding this race by either synthesizing a MsgAppResp in Replica.reportSnapshotStatus, or by sending the corresponding MsgAppResp back in the SnapshotResponse so that it can be handed to etcd/raft directly in Replica.reportSnapshotStatus (under raftMu). We do wait for snapshot application before calling ReportSnapshot on the leader, so both of these options are available to us.
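
For concreteness, a minimal sketch of that second alternative (Go, written against the standalone go.etcd.io/raft/v3 module; SnapshotResponse, AppliedMsgAppResp, and reportSnapshotApplied are hypothetical stand-ins, not the actual CockroachDB API):

```go
package kvsketch

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// SnapshotResponse stands in for the RPC response the snapshot receiver
// sends back once it has applied the snapshot.
type SnapshotResponse struct {
	// AppliedMsgAppResp is the MsgAppResp the follower generated while
	// applying the snapshot, acking the snapshot index.
	AppliedMsgAppResp raftpb.Message
}

// reportSnapshotApplied runs on the snapshot sender once the snapshot RPC
// has returned successfully. In the real system this would run under the
// replica's raftMu so that both raft interactions happen atomically with
// respect to incoming raft traffic.
func reportSnapshotApplied(rn *raft.RawNode, to uint64, resp SnapshotResponse) error {
	// Tell raft the snapshot was delivered; the follower's Progress moves
	// from StateSnapshot to StateProbe.
	rn.ReportSnapshot(to, raft.SnapshotFinish)
	// Step the follower's own ack immediately, so the leader can move the
	// follower to StateReplicate without waiting for the same MsgAppResp to
	// arrive over the raft transport (the race described above).
	return rn.Step(resp.AppliedMsgAppResp)
}
```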

@tbg (Member) commented Sep 8, 2022

https://github.com/etcd-io/etcd/blob/bb1619f893cccb13acc4fca850f3677735f18b9a/raft/raft.go#L1251-L1262

> Why wait for the MsgAppResp instead of transitioning straight to StateReplicate on RawNode.ReportSnapshot?

That is a good question. This decision was made very early in the life of raft[^1] and I don't see a justification for it. At the time it was likely just as good to do either. Today, at least in CRDB, it seems a lot better to move to replicating state directly, which we already do (thanks to changes we made) on the MsgAppResp path[^2].

Looking at this code I think there's an actual problem that is unrelated to the issue at hand but is worse. What raft really wants when it asks for a snapshot is for a snapshot at a particular index[^3]. We ignore that and send a snapshot at whatever the index is when we call GetSnapshot[^4]. However, raft remembers the original index in the peer's progress, and so it has an effect in ReportSnapshot: the leader will consider the peer to have caught up to the requested index, where it is actually caught up to some other index.

So if we ended up sending a snapshot that has a lower index than what raft requested, we'd have the leader believe that entries that are not in fact reflected on the followers state actually made it there. You can from that derive scenarios in which raft "commits" an entry that isn't on a quorum. Now without delegated follower snaps, this couldn't happen (applied index doesn't move backwards within an instance of the progress) but could it happen with delegation? The leaseholder is asked for a snapshot at 100, it asks a follower that is behind to send a snapshot and so it does, but maybe it only has applied the log up to index 50 and thus sends a snapshot at 50. To the best of my knowledge, there is no protection here - DelegatedSnapshotRequest does not carry a "min index"[^5]. I think this is a real issue, filed #87581.

I like the first alternative in this issue, which is to fix it at the level of raft. If we end up deciding to do that, we should also move straight to StateReplicate.

However, I think for 22.2 we need a more targeted solution. We could insert a bounded wait for the follower to leave StateProbe as part of the snapshot sending step.
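
A rough sketch of what such a bounded wait could look like on the snapshot sender, polling the leader's view of the follower's progress (the helper name and polling loop are illustrative only, not the actual change):

```go
package kvsketch

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/tracker"
)

// waitForReplicate polls the leader's raft status until the follower has
// reached StateReplicate, or gives up after the timeout. Sketch only: it
// assumes it runs on the leader (only the leader tracks Progress) and that
// the caller handles the locking around Status().
func waitForReplicate(
	ctx context.Context, rn *raft.RawNode, followerID uint64, timeout time.Duration,
) error {
	deadline := time.Now().Add(timeout)
	for {
		if pr, ok := rn.Status().Progress[followerID]; ok && pr.State == tracker.StateReplicate {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("follower %d not in StateReplicate after %s", followerID, timeout)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Millisecond):
		}
	}
}
```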

Footnotes

[^1]: https://github.com/etcd-io/etcd/commit/67194c0b2272f7363ff05e7ad66246743e548c46#diff-b9adbc46e4a317ffbb3d11a66c38d6b9af41a09170d77d87efbd96d115da452fR608
[^2]: https://github.com/etcd-io/etcd/blob/bb1619f893cccb13acc4fca850f3677735f18b9a/raft/raft.go#L1251-L1262
[^3]: https://github.com/etcd-io/etcd/blob/bb1619f893cccb13acc4fca850f3677735f18b9a/raft/raft.go#L452-L470
[^4]: https://github.com/cockroachdb/cockroach/blob/3fe8ffc7f854b6ff9689735dbb1e6ba6e661e027/pkg/kv/kvserver/replica_raftstorage.go#L435-L440

@bdarnell (Contributor) commented Sep 8, 2022

My recollection is that the split between ReportSnapshot and MsgAppResp was intended to permit things like delegated snapshots at a different index: raft asks for a snapshot without specifying exactly when, a snapshot is sent, and then the receiving node reports its new state. However, this was incompletely thought out and as @tbg says it got things exactly backwards. We should be getting the matched index from MsgAppResp instead of from when we sent the snapshot (or enforce that the delegated snapshot is not sent at an index that's too low), while it should be fine to optimistically go to StateReplicate (StateReplicate vs StateProbe is just a performance optimization).

@nvanbenschoten (Member) commented:

> We could insert a bounded wait for the follower to leave StateProbe as part of the snapshot sending step.

This is interesting, as it hints at an answer to the question of "why do both ReportSnapshot and MsgAppResp transition out of StateSnapshot?" Naively, I would question why we can't get rid of ReportSnapshot, at least for successful snapshots.

Would it be correct to say that the reason we need ReportSnapshot(SnapshotFinish) is because the main raft transport can be lossy, but etcd/raft won't retry a snapshot on its own, so we need some reliable way to indicate that the snapshot attempt is complete?

If so, then maybe the "sending back the corresponding MsgAppResp in the SnapshotResponse so that it can be handed to etcd/raft directly in Replica.reportSnapshotStatus (under raftMu) to avoid the race" alternative is worth exploring further.

This relates to:

> However, I think for 22.2 we need a more targeted solution. We could insert a bounded wait for the follower to leave StateProbe as part of the snapshot sending step.

If we had such a delay to allow the MsgAppResp to propagate, would we even want to call ReportSnapshot(SnapshotFinish)? Could we just wait for the follower to leave StateSnapshot?

@tbg (Member) commented Sep 9, 2022

> Would it be correct to say that the reason we need ReportSnapshot(SnapshotFinish) is because the main raft transport can be lossy, but etcd/raft won't retry a snapshot on its own, so we need some reliable way to indicate that the snapshot attempt is complete?

Yes, that's exactly right. In StateSnapshot raft ceases to communicate with the follower at all, so if we mess it up it's deadlock time. (Except of course the way we set up the raft snap queue, it will just send another snapshot down the line and hopefully things will clear up that way since the MsgAppResp won't be lost every time).

I would like to get to the point where ReportSnapshot is just a performance optimization, even on a lossy transport. For example, let raft treat StateSnapshot exactly like StateProbing; the follower is periodically contacted (with an empty MsgApp) and so it will, once it has gotten a snap, either accept or send a rejection hint that sets the leader straight to whatever the actual index is. There would still be marginal value in calling ReportSnapshot to have raft try faster, but we really shouldn't give it an index that we promise is durable on the follower (after all, that sort of thing is better kept inside of raft) but the index would be a hint for the leader on a good log index to probe (assuming it's in advance of what the leader tracks, i.e. ~always). (I have to double check if we're promising durability as is, I think it's interpreted more as a hint, but anyway, this is the way things should work; passing a bogus index shouldn't cause correctness issues).

> If so, then maybe the "sending back the corresponding MsgAppResp in the SnapshotResponse so that it can be handed to etcd/raft directly in Replica.reportSnapshotStatus (under raftMu) to avoid the race" alternative is worth exploring further.

That seems like a good short-term solution that walks us towards adding that index to ReportSnapshot without requiring raft changes. Unfortunately, it's not obvious that we can pull the MsgAppResp out of raft; we can't call .Ready in the apply snapshot code because that method is not idempotent. We'd have to call ReadyWithoutAdvance but that's not currently exported, so we're looking at a small raft change anyway. And even then, there might be other messages waiting, so there is some awkward sifting through data that had better not happen.

I think I still favor making raft transition immediately to StateReplicate (via StateProbe) in (*Progress).BecomeProbe here: https://github.com/etcd-io/etcd/blob/6332731ea7db68fa3a058d45deaedbae371ae3f4/raft/raft.go#L1328-L1330

After we call ReportSnapshot, we are in StateProbe and have a nonzero Match, but we might as well be in StateReplicate, and that would prevent the problem. I think it would also prevent #87553, since the MsgAppResp(rej=true) would be discarded in StateReplicate (or should at least), since raft can tell that it must be a stale message. (I think it could do that in StateProbe too, but still, we need StateReplicate to do the lease transfer).
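
Roughly, the shape of that idea as a sketch against raft's tracker package (illustrative only; this is not the change that was ultimately merged): when a peer leaves StateSnapshot after a successful ReportSnapshot, move it optimistically to StateReplicate and continue appending from PendingSnapshot+1, rather than parking it in StateProbe.

```go
package raftsketch

import "go.etcd.io/raft/v3/tracker"

// becomeReplicateAfterSnapshot sketches the alternative discussed above:
// instead of calling BecomeProbe after a successful snapshot report, move
// the peer straight to StateReplicate, probing from PendingSnapshot+1.
func becomeReplicateAfterSnapshot(pr *tracker.Progress) {
	if pr.State != tracker.StateSnapshot {
		return
	}
	pendingSnapshot := pr.PendingSnapshot
	pr.BecomeReplicate() // resets state and sets Next = Match + 1
	if next := pendingSnapshot + 1; next > pr.Next {
		pr.Next = next
	}
}
```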

@tbg (Member) commented Sep 9, 2022

Most recent comment here: #87581 (comment)

I'm back to being in favor of what Nathan originally suggested, which is transporting the MsgAppResp to the leader so that it can be r.Step'ed into the state machine right after ReportSnapshot. This will move from StateSnapshot to StateProbe to StateReplicate and so we wouldn't see StateProbe in the subsequent step.

@kvoli removed this from the 22.2 milestone Mar 9, 2023
@kvoli removed their assignment Jun 29, 2023
@kvoli added the A-kv-test-failure-complex label Jul 6, 2023
@tbg (Member) commented Jul 13, 2023

re: the flakiness level, yeah it does seem extremely reduced (if not to practically zero). The historical flake reports in the issue above took ~2s to run, meaning they weren't on super overloaded machines. I'm currently stressing this test at an intensity that closely reflects that (~288 runs over 10 minutes) and have yet to see a failure; stressing with higher --cpu until my gceworker ran out of memory also didn't lead to a repro. Still, I'm not going to unskip this until the above failure mode is impossible.

tbg added a commit to tbg/cockroach that referenced this issue Jul 14, 2023
This addresses the following race:

- n1 runs a ConfChange that adds n2 as a learner.
- n1 sends MsgApp to the learner.
- n1 starts the INITIAL snapshot, say at index 100.
- n2 receives n1's MsgApp. Since it's an uninitialized
  Replica and its log is empty, it rejects this MsgApp.
- n2 receives and applies the INITIAL snapshot, which prompts it to send an
  affirmative MsgAppResp to n1.
- n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success))
- n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the
  rejection comes with a RejectHint (suggested next index to try) of zero,
  which is not in n1's log. In particular, the SnapshotIndex will likely be
  higher than the index of the snapshot actually sent, say 101.
- n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101
  so this is ignored and the follower remains in StateSnapshot.

With this commit, the last two steps cannot happen: n2 transitions straight to
StateReplicate because we step a copy of the affirmative MsgAppResp in. The
later rejection will be dropped, since it is stale (you can't hint at index zero
when you already have a positive index confirmed).

I will add that there is no great testing for the above other than stressing
the test with additional logging, noting the symptoms, and noting that they
disappear with this commit. Scripted testing of this code is within reach[^1]
but is outside of the scope of this PR.

[^1]: cockroachdb#105177

There is an additional bit of brittleness that is silently suppressed by this
commit, but which deserves to be fixed independently because how the problem
gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its
current LastIndex and uses it as the PendingSnapshot for the follower's
Progress.

At the time of writing, MsgAppResp that reconnect the follower to the log but
which are not greater than or equal to PendingSnapshot are ignored. In effect,
this means that perfectly good snapshots are thrown away if they happen to be a
little bit stale. In the example above, the snapshot is stale: PendingSnapshot
is 101, but the snapshot is at index 100. Then how does this commit (mostly)
fix the problem, i.e. why isn't the snapshot discarded? The key is that when we
synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection
hasn't arrived yet and so the follower transitions into StateReplicate with a
Match of 100. This is then enough so that raft recognizes the rejected MsgApp
as stale (since it would regress on durably stored entries). However, there is
an alternative example where the rejection arrives earlier: after the snapshot
index has been picked, but before the follower has been transitioned into
StateReplicate. For this to have a negative effect, an entry has to be appended
to the leader's log between generating the snapshot and handling the rejection.
Without the combination of delegated snapshots and sustained write activity on
the leader, this window is small, and this combination is usually not present
in tests but it may well be relevant in "real" clusters. We track addressing
this in cockroachdb#106813.

Closes cockroachdb#87554.
Closes cockroachdb#97971.
Closes cockroachdb#84242.

Epic: None
Release note (bug fix): removed a source of unnecessary Raft snapshots during
replica movement.
tbg added a commit to tbg/raft that referenced this issue Jul 14, 2023
Prior to this commit, the leader would not take into account snapshots reported
by a follower unless they matched or exceeded the tracked PendingSnapshot index
(which is the leader's last index at the time of requesting the snapshot). This
is too inflexible: the leader should take into account any snapshot that
reconnects the follower to its log. This PR makes that change.

In doing so, it addresses long-standing problems that we've encountered in
CockroachDB. Unless you create the snapshot immediately and locally when raft
emits an MsgSnap, it's difficult/impossible to later synthesize a snapshot at
the requested index. It is possible to get one above the requested index which
raft always accepted, but CockroachDB delegates snapshots to followers who
might be behind on applying the log, and it is awkward to have to wait for log
application to send the snapshot just to satisfy an overly strict condition in
raft. Additionally, CockroachDB also sends snapshots preemptively when adding a
new replica since there are qualitative differences between an initial snapshot
and one needed to reconnect to the log and one does not want to wait for raft
to round-trip to the follower to realize that a snapshot is needed. In this
case, the sent snapshot is commonly behind the PendingSnapshot since the leader
transitions the follower into StateProbe when a snapshot is already in flight.

Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Tobias Grieger <[email protected]>
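
As a sketch, the relaxed acceptance condition described in this commit amounts to something like the following (illustrative; the exact guard in raft's MsgAppResp handling differs in detail):

```go
package raftsketch

import "go.etcd.io/raft/v3/tracker"

// canResumeReplicate sketches the relaxed rule: instead of requiring the
// follower's acked index to reach the tracked PendingSnapshot, accept any
// ack that reconnects the follower to the leader's log, i.e. any ack whose
// successor index has not been compacted out of the leader's log.
func canResumeReplicate(pr *tracker.Progress, leaderFirstIndex uint64) bool {
	if pr.State != tracker.StateSnapshot {
		return false
	}
	// Old rule (too strict): pr.Match >= pr.PendingSnapshot.
	// Relaxed rule: the leader can append to this follower directly.
	return pr.Match+1 >= leaderFirstIndex
}
```
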
@tbg (Member) commented Jul 18, 2023

The combination of #106793 and (for a possibly much rarer failure mode) etcd-io/raft#84 should likely fix this issue "holistically". Since both are on Pavel's docket for review and realistically speaking they may only merge after my departure, I'm assigning this issue to Pavel.

@tbg assigned pav-kv and unassigned themselves Jul 18, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jul 20, 2023
craig bot pushed a commit that referenced this issue Jul 21, 2023
106793: kvserver: communicate snapshot index back along with snapshot response r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
@craig (bot) closed this as completed in e968604 Jul 21, 2023
THardy98 pushed a commit to THardy98/cockroach that referenced this issue Jul 24, 2023
erikgrinaker pushed a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker pushed a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
Prior to this commit, the leader would not take into account snapshots
reported by a follower unless they matched or exceeded the tracked
PendingSnapshot index (which is the leader's last index at the time of
requesting the snapshot). This is too inflexible: the leader should take
into account any snapshot that reconnects the follower to its log. This
PR adds a config option ResumeReplicateBelowPendingSnapshot that enables
this behavior.

In doing so, it addresses long-standing problems that we've encountered
in CockroachDB. Unless you create the snapshot immediately and locally
when raft emits an MsgSnap, it's difficult/impossible to later
synthesize a snapshot at the requested index. It is possible to get one
above the requested index which raft always accepted, but CockroachDB
delegates snapshots to followers who might be behind on applying the
log, and it is awkward to have to wait for log application to send the
snapshot just to satisfy an overly strict condition in raft.
Additionally, CockroachDB also sends snapshots preemptively when adding
a new replica since there are qualitative differences between an initial
snapshot and one needed to reconnect to the log and one does not want to
wait for raft to round-trip to the follower to realize that a snapshot
is needed. In this case, the sent snapshot is commonly behind the
PendingSnapshot since the leader transitions the follower into
StateProbe when a snapshot is already in flight.

Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
Touches cockroachdb/cockroach#114349.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Erik Grinaker <[email protected]>
Signed-off-by: Tobias Grieger <[email protected]>
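
If the option ships under the name used in this commit message, enabling it would look roughly like the following; the ResumeReplicateBelowPendingSnapshot field is taken from the commit message above and is an assumption here, so check the released raft API before relying on it:

```go
package main

import (
	"log"

	"go.etcd.io/raft/v3"
)

func main() {
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
		// Assumed field (named in the commit message above): accept a
		// follower MsgAppResp that reconnects it to the leader's log even
		// if its index is below the tracked PendingSnapshot.
		ResumeReplicateBelowPendingSnapshot: true,
	}
	if _, err := raft.NewRawNode(cfg); err != nil {
		log.Fatal(err)
	}
}
```
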
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 4, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 4, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 5, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 9, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 9, 2024