kvserver: communicate snapshot index back along with snapshot response #106793

tbg · 2023-07-13T22:27:03Z

This addresses the following race:

n1 runs a ConfChange that adds n2 as a learner.
n1 sends MsgApp to the learner.
n1 starts the INITIAL snapshot, say at index 100.
n2 receives n1's MsgApp. Since it's an uninitialized
Replica and its log is empty, it rejects this MsgApp.
n2 receives and applies the INITIAL snapshot, which prompts it to send an
affirmative MsgAppResp to n1.
n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success))
n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the
rejection comes with a RejectHint (suggested next index to try) of zero,
which is not in n1's log. In particular, the SnapshotIndex will likely be
higher than the index of the snapshot actually sent, say 101.
n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101
so this is ignored and the follower remains in StateSnapshot.

With this commit, the last two steps cannot happen: n2 transitions straight to
StateReplicate because we step a copy of the affirmative MsgAppResp in. The
later rejection will be dropped, since it is stale (you can't hint at index zero
when you already have a positive index confirmed).

I will add that there is no great testing for the above other than stressing
the test with additional logging, noting the symptoms, and noting that they
disappear with this commit. Scripted testing of this code is within reach¹
but is outside of the scope of this PR.

There is an additional bit of brittleness that is silently suppressed by this
commit, but which deserves to be fixed independently because how the problem
gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its
current LastIndex and uses it as the PendingSnapshot for the follower's
Progress.

At the time of writing, MsgAppResp that reconnect the follower to the log but
which are not greater than or equal to PendingSnapshot are ignored. In effect,
this means that perfectly good snapshots are thrown away if they happen to be a
little bit stale. In the example above, the snapshot is stale: PendingSnapshot
is 101, but the snapshot is at index 100. Then how does this commit (mostly)
fix the problem, i.e. why isn't the snapshot discarded? The key is that when we
synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection
hasn't arrived yet and so the follower transitions into StateReplicate with a
Match of 100. This is then enough so that raft recognizes the rejected MsgApp
as stale (since it would regress on durably stored entries). However, there is
an alternative example where the rejection arrives earlier: after the snapshot
index has been picked, but before the follower has been transitioned into
StateReplicate. For this to have a negative effect, an entry has to be appended
to the leader's log between generating the snapshot and handling the rejection.
Without the combination of delegated snapshots and sustained write activity on
the leader, this window is small, and this combination is usually not present
in tests but it may well be relevant in "real" clusters. We track addressing
this in #106813.

Closes #87554.
Closes #97971.
Closes #84242.

Epic: None
Release note (bug fix): removed a source of unnecessary Raft snapshots during
replica movement.

https://github.com/cockroachdb/cockroach/issues/105177 ↩

cockroach-teamcity · 2023-07-13T22:27:37Z

This change is

pkg/kv/kvserver/raft_transport.go

pkg/kv/kvserver/replica_command.go

pkg/kv/kvserver/replica_learner_test.go

pkg/kv/kvserver/replica_raft.go

This will be used to hand the MsgAppResp generated at snapshot application back to the sender.

We want the initiator of a (potentially delegated) snapshot to be able to see the MsgAppResp that is generated on the recipient of the snapshot as a result of application. This commit does the plumbing, but the `*MsgAppResp` is always `nil`, i.e. no actual logic was added yet.

This addresses the following race: - n1 runs a ConfChange that adds n2 as a learner. - n1 sends MsgApp to the learner. - n1 starts the INITIAL snapshot, say at index 100. - n2 receives n1's MsgApp. Since it's an uninitialized Replica and its log is empty, it rejects this MsgApp. - n2 receives and applies the INITIAL snapshot, which prompts it to send an affirmative MsgAppResp to n1. - n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success)) - n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the rejection comes with a RejectHint (suggested next index to try) of zero, which is not in n1's log. In particular, the SnapshotIndex will likely be higher than the index of the snapshot actually sent, say 101. - n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101 so this is ignored and the follower remains in StateSnapshot. With this commit, the last two steps cannot happen: n2 transitions straight to StateReplicate because we step a copy of the affirmative MsgAppResp in. The later rejection will be dropped, since it is stale (you can't hint at index zero when you already have a positive index confirmed). I will add that there is no great testing for the above other than stressing the test with additional logging, noting the symptoms, and noting that they disappear with this commit. Scripted testing of this code is within reach[^1] but is outside of the scope of this PR. [^1]: cockroachdb#105177 There is an additional bit of brittleness that is silently suppressed by this commit, but which deserves to be fixed independently because how the problem gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its current LastIndex and uses it as the PendingSnapshot for the follower's Progress. At the time of writing, MsgAppResp that reconnect the follower to the log but which are not greater than or equal to PendingSnapshot are ignored. In effect, this means that perfectly good snapshots are thrown away if they happen to be a little bit stale. In the example above, the snapshot is stale: PendingSnapshot is 101, but the snapshot is at index 100. Then how does this commit (mostly) fix the problem, i.e. why isn't the snapshot discarded? The key is that when we synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection hasn't arrived yet and so the follower transitions into StateReplicate with a Match of 100. This is then enough so that raft recognizes the rejected MsgApp as stale (since it would regress on durably stored entries). However, there is an alternative example where the rejection arrives earlier: after the snapshot index has been picked, but before the follower has been transitioned into StateReplicate. For this to have a negative effect, an entry has to be appended to the leader's log between generating the snapshot and handling the rejection. Without the combination of delegated snapshots and sustained write activity on the leader, this window is small, and this combination is usually not present in tests but it may well be relevant in "real" clusters. We track addressing this in cockroachdb#106813. Closes cockroachdb#87554. Closes cockroachdb#97971. Closes cockroachdb#84242. Epic: None Release note (bug fix): removed a source of unnecessary Raft snapshots during replica movement.

This commit introduces a simple test that attempts to create a single range and add it to two followers. The expectation is that this should succeed using the replicate queue to send the snapshots. When run under `--stress`, in the past it would occasionally fail due to an incorrect state transition related to the StateProbe state. Release note: None Epic: none

tbg · 2023-07-21T11:30:49Z

TestSqlActivityUpdateTopLimitJob flaked in CI with a package timeout¹. However, the test itself ran only for 15s, so it was something else in that package that took it over the 14m55s threshold.

walkthrough is the video in https://github.com/cockroachdb/cockroach/issues/107339 ↩

tbg · 2023-07-21T11:58:30Z

bors r=erikgrinaker
TFTR!

craig · 2023-07-21T12:50:24Z

Build succeeded:

Bazel Essential CI (Cockroach)

The issue causing TestMergeQueueWithSlowNonVoterSnaps to flake was fixed in cockroachdb#106793. Part of: cockroachdb#85372 Release note: None

110858: kvserver: unskip TestMergeQueueWithSlowNonVoterSnaps r=miraradeva a=miraradeva The issue causing TestMergeQueueWithSlowNonVoterSnaps to flake was fixed in #106793. Part of: #85372 Release note: None Co-authored-by: Mira Radeva <[email protected]>

The issue causing TestMergeQueueWithSlowNonVoterSnaps to flake was fixed in #106793. Part of: #85372 Release note: None

tbg force-pushed the raft-snap-resp branch 2 times, most recently from eaa4258 to 49d8d4b Compare July 14, 2023 10:03

This was referenced Jul 14, 2023

kv: Test to verify raft snapshots are not needed #87554

Closed

kvserver: avoid ignored snapshot when snap index < raft's PendingSnapshot index #106813

Closed

tbg force-pushed the raft-snap-resp branch from 49d8d4b to 3ee798a Compare July 14, 2023 12:47

tbg requested a review from pav-kv July 17, 2023 14:20

tbg marked this pull request as ready for review July 17, 2023 14:21

tbg requested a review from a team as a code owner July 17, 2023 14:21

tbg requested a review from a team July 17, 2023 14:21

tbg mentioned this pull request Jul 18, 2023

kv/kvserver: TestAdminRelocateRange failed #84242

Closed

erikgrinaker self-requested a review July 19, 2023 08:11

erikgrinaker approved these changes Jul 19, 2023

View reviewed changes

tbg added 2 commits July 20, 2023 12:32

kvserverpb: add MsgAppResp field to {Delegate,}SnapshotResponse

9d75c37

This will be used to hand the MsgAppResp generated at snapshot application back to the sender.

tbg force-pushed the raft-snap-resp branch from 3ee798a to f59e019 Compare July 20, 2023 11:44

tbg and others added 4 commits July 21, 2023 11:03

kvserver: unskip TestAddVotersWithoutRaftQueue

5cb40f2

kvserver: unskip TestAdminRelocateRange

0e3b84b

tbg force-pushed the raft-snap-resp branch from f59e019 to 0e3b84b Compare July 21, 2023 09:03

tbg added the db-cy-23 label Jul 21, 2023

craig bot merged commit 4d7ec64 into cockroachdb:master Jul 21, 2023

This was referenced Jul 25, 2023

raft: allow ReportSnapshot to take the index of the snapshot #87583

Open

kv: Raft snapshots required when they should not be #87553

Closed

tbg deleted the raft-snap-resp branch July 26, 2023 08:08

tbg mentioned this pull request Jul 26, 2023

roachtest: splits/largerange/size=32GiB,nodes=6 failed [raft snaps; needs #106813] #104588

Closed

miraradeva mentioned this pull request Sep 18, 2023

kvserver: unskip TestMergeQueueWithSlowNonVoterSnaps #110858

Merged

blathers-crl bot pushed a commit that referenced this pull request Sep 20, 2023

kvserver: unskip TestMergeQueueWithSlowNonVoterSnaps

9036a19

The issue causing TestMergeQueueWithSlowNonVoterSnaps to flake was fixed in #106793. Part of: #85372 Release note: None

blathers-crl bot mentioned this pull request Sep 20, 2023

release-23.1: kvserver: unskip TestMergeQueueWithSlowNonVoterSnaps #111009

Closed

erikgrinaker mentioned this pull request Jan 8, 2024

Accept any snapshot that allows replication etcd-io/raft#110

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: communicate snapshot index back along with snapshot response #106793

kvserver: communicate snapshot index back along with snapshot response #106793

tbg commented Jul 13, 2023 •

edited

Loading

cockroach-teamcity commented Jul 13, 2023

tbg commented Jul 21, 2023 •

edited

Loading

tbg commented Jul 21, 2023

craig bot commented Jul 21, 2023

kvserver: communicate snapshot index back along with snapshot response #106793

kvserver: communicate snapshot index back along with snapshot response #106793

Conversation

tbg commented Jul 13, 2023 • edited Loading

Footnotes

cockroach-teamcity commented Jul 13, 2023

tbg commented Jul 21, 2023 • edited Loading

Footnotes

tbg commented Jul 21, 2023

craig bot commented Jul 21, 2023

tbg commented Jul 13, 2023 •

edited

Loading

tbg commented Jul 21, 2023 •

edited

Loading