kvserver: delegated snapshots must carry a min index #87581

tbg · 2022-09-08T08:33:06Z

Describe the problem

When a leader delegates a snapshot to a follower, it must send along a minimum index for the snapshot.

Otherwise, we risk the following:

raft on r1/1 (leader+leaseholder) requests a snapshot for index 100 to r1/3
r1/1 delegates to r1/2
r1/2 is a bit slow, it has only caught up to index 50
r1/2 sends a snapshot at index 50 and acks to r1/1
r1/1 calls (*rawNode).ReportSnapshot and this causes raft to believe that index 100 is now reflected on r1/3
it now considers index 100 committed
but it isn't: only r1/1 has that index

It's worth looking at the first alternative as well, which I find appealing. The mismatch between how raft thinks snapshots work and how they really work in CRDB has concerned me at various times over the years. But I don't think we can do this for 22.2, so should consider the more targeted fix above. I'm actually not sure if 22.2 delegates any snapshots, so depending on whether we do this is a current bug in need of a fix, or something we need to fix before we actually delegate.

See #84242.

Alternatives

We could fix this in raft by passing an index to ReportSnapshot. In effect, acknowledging that at least in CRDB raft has much less control over the snapshot index than it believes. We would track the index at which we've actually send the snapshots, hand that to raft, and remove any sketchy corner cases that way. Other users of raft may like the current behavior though, so we should allow an index of zero to behave like the index raft previously requested. While we're there, we should also let the follower transition straight into StateReplicate (see kv/kvserver: TestAdminRelocateRange failed #84242).

Worse alternatives:

Could send the snapshot at the actual index requested. This is actually not really an option since it precludes us from using delegated snapshots - you can't magically materialize a snapshot at a lower applied index; this effectively means undoing log entries that you may not even have any more and even if you do, reverting WriteBatches is not an operation we support, not to mention the numerous side effects entries can have such as splits, etc.
We could let the follower return the index to the leader, who can then avoid calling ReportSnapshot or manually massage the raft state (..somehow) to make it safe. (Sketchy)
We could let our impl of raft.Storage return an index of 10 (raftInitialIndex) from Snapshot(), in effect making sure that ReportSnapshot can never move the follower's Match above its true Match. (Bit sketchy)
We could stop calling ReportSnapshot and rely fully on the MsgAppResp. If the MsgAppResp got lost for any reason, the raft snap queue would (ultimately and possibly after a scanner cycle only) pick up the replica. (Not good)

Jira issue: CRDB-19425

The text was updated successfully, but these errors were encountered:

nvanbenschoten · 2022-09-08T19:58:56Z

r1/1 calls (*rawNode).ReportSnapshot and this causes raft to believe that index 100 is now reflected on r1/3

This is the crux of the issue, but I don't see how it's possible. Raft remembers the index of the pending snapshot, but it doesn't seem to assign it to the peer's Match index (used for quorum calculations). Instead, it only assigns it to the peer's Next index when transitioning to StateProbe.

Do you mind pointing at the code that makes this possible?

andrewbaptist · 2022-09-08T20:50:57Z

In the "complete delegated" snapshot branch - the truncated index is already added.

I agree with @nvanbenschoten however that today it calls BecomeProbe after a ReportSnapshot so I don't think that it would automatically adjust its index too far forward. However, the whole probing after snapshot seems inefficient since we know the exact index at that point of time and could go straight to StateReplicate

I'm curious why we start sending MsgApp messages at all when we are in the StateSnapshot state. The best case is that the new replica will ignore them, the worse case is that the replica responds in a way that makes the snapshot useless.

tbg · 2022-09-09T14:06:10Z

Ah, good catch, Nathan! I misread that as moving Match forward, based on the information that we promised to have made the snapshot at the raft-requested index durably applied on the follower.

This then also explains why we go to probe, and not immediately to replicate. We need the MsgAppResp for that, which actually provides the "proof" of what the latest log index is now. So if we want to bypass StateProbe after the snapshot, sadly we need to pass in the index of the snapshot and are on the hook for it actually reflecting the durable state of the follower.

With all this, Nathan's suggestion to transfer the actual MsgAppResp with the snapshot response and to step it into the leader makes a ton of sense, since that way we avoid making taking on any new duties w.r.t correctness - we're just delivering a message Raft is sending anyway, to make sure it has been received when we consider the snapshot "done". We should look at how ugly this gets in practice. I think it's sensible to add ReadyWithoutAdvance to raft and then peek at the last message, which has to be this MsgAppResp if there is going to be one. (If there isn't one, the snapshot was discarded and that will be good to know too!).

I'm curious why we start sending MsgApp messages at all when we are in the StateSnapshot state. The best case is that the new replica will ignore them, the worse case is that the replica responds in a way that makes the snapshot useless.

I don't think we do, in StateSnapshot raft just hangs tight. However, you do send one message to get into StateSnapshot. Initially, the follower is in StateProbe and the leader tries to append to it. That will get rejected, and now we're in StateSnapshot (since the rejection hint will be zero, i.e. below what the leader can catch the follower up from via the log). The follower sends an MsgAppResp upon applying the snapshot, which fast-tracks the leader back to StateReplicate. Maybe I misunderstood your question.

Closes cockroachdb#87581. Release note: None

tbg · 2022-09-09T14:20:14Z

Since there isn't a bug here, and we have #84242 to track the original issue (follower in StateProbe even though we thought we were "done" catching it up), I sent #87702 to close this with a comment update.

87614: sql/opt/norm: propagate kv errors from cast folding r=ajwerner a=ajwerner Swallowing KV errors here leads to incorrect results. Writes can be missed and serializability can be silently violated. This comes up in the context of the randomized schema change testing. May deal with #85677 relates to #80764 Release note: None 87646: backupccl: deflake backup tests r=adityamaru a=adityamaru See individual commits. Release note: None 87667: server: clean up some logging r=ajwerner a=ajwerner For one, this fixes a format directive issue in the first line. It also stops rendering some arguments to strings directly so that redaction works correctly. Release note: None 87702: kvserver: explain our use of (*raftGroup).ReportSnapshot r=andrewbaptist a=tbg Closes #87581. Release note: None 87713: ci: skip "integration"-style tests in testrace r=knz a=rickystewart Closes #87700. Release note: None Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: adityamaru <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>

tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team labels Sep 8, 2022

This was referenced Sep 8, 2022

kv/kvserver: TestAdminRelocateRange failed #84242

Closed

raft: allow ReportSnapshot to take the index of the snapshot #87583

Open

andrewbaptist self-assigned this Sep 8, 2022

tbg mentioned this issue Sep 9, 2022

kvserver: explain our use of (*raftGroup).ReportSnapshot #87702

Merged

tbg added a commit to tbg/cockroach that referenced this issue Sep 9, 2022

kvserver: explain our use of (*raftGroup).ReportSnapshot

b34ef27

Closes cockroachdb#87581. Release note: None

craig bot closed this as completed in #87702 Sep 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: delegated snapshots must carry a min index #87581

kvserver: delegated snapshots must carry a min index #87581

tbg commented Sep 8, 2022 •

edited

Loading

nvanbenschoten commented Sep 8, 2022

andrewbaptist commented Sep 8, 2022 •

edited

Loading

tbg commented Sep 9, 2022

tbg commented Sep 9, 2022

kvserver: delegated snapshots must carry a min index #87581

kvserver: delegated snapshots must carry a min index #87581

Comments

tbg commented Sep 8, 2022 • edited Loading

nvanbenschoten commented Sep 8, 2022

andrewbaptist commented Sep 8, 2022 • edited Loading

tbg commented Sep 9, 2022

tbg commented Sep 9, 2022

tbg commented Sep 8, 2022 •

edited

Loading

andrewbaptist commented Sep 8, 2022 •

edited

Loading