Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: delegated snapshots must carry a min index #87581

Closed
tbg opened this issue Sep 8, 2022 · 4 comments · Fixed by #87702
Closed

kvserver: delegated snapshots must carry a min index #87581

tbg opened this issue Sep 8, 2022 · 4 comments · Fixed by #87702
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@tbg
Copy link
Member

tbg commented Sep 8, 2022

Describe the problem

When a leader delegates a snapshot to a follower, it must send along a minimum index for the snapshot.

Otherwise, we risk the following:

  • raft on r1/1 (leader+leaseholder) requests a snapshot for index 100 to r1/3
  • r1/1 delegates to r1/2
  • r1/2 is a bit slow, it has only caught up to index 50
  • r1/2 sends a snapshot at index 50 and acks to r1/1
  • r1/1 calls (*rawNode).ReportSnapshot and this causes raft to believe that index 100 is now reflected on r1/3
  • it now considers index 100 committed
  • but it isn't: only r1/1 has that index

It's worth looking at the first alternative as well, which I find appealing. The mismatch between how raft thinks snapshots work and how they really work in CRDB has concerned me at various times over the years. But I don't think we can do this for 22.2, so should consider the more targeted fix above. I'm actually not sure if 22.2 delegates any snapshots, so depending on whether we do this is a current bug in need of a fix, or something we need to fix before we actually delegate.

See #84242.

Alternatives

  • We could fix this in raft by passing an index to ReportSnapshot. In effect, acknowledging that at least in CRDB raft has much less control over the snapshot index than it believes. We would track the index at which we've actually send the snapshots, hand that to raft, and remove any sketchy corner cases that way. Other users of raft may like the current behavior though, so we should allow an index of zero to behave like the index raft previously requested. While we're there, we should also let the follower transition straight into StateReplicate (see kv/kvserver: TestAdminRelocateRange failed #84242).

Worse alternatives:

  • Could send the snapshot at the actual index requested. This is actually not really an option since it precludes us from using delegated snapshots - you can't magically materialize a snapshot at a lower applied index; this effectively means undoing log entries that you may not even have any more and even if you do, reverting WriteBatches is not an operation we support, not to mention the numerous side effects entries can have such as splits, etc.

  • We could let the follower return the index to the leader, who can then avoid calling ReportSnapshot or manually massage the raft state (..somehow) to make it safe. (Sketchy)

  • We could let our impl of raft.Storage return an index of 10 (raftInitialIndex) from Snapshot(), in effect making sure that ReportSnapshot can never move the follower's Match above its true Match. (Bit sketchy)

  • We could stop calling ReportSnapshot and rely fully on the MsgAppResp. If the MsgAppResp got lost for any reason, the raft snap queue would (ultimately and possibly after a scanner cycle only) pick up the replica. (Not good)

Jira issue: CRDB-19425

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team labels Sep 8, 2022
@nvanbenschoten
Copy link
Member

r1/1 calls (*rawNode).ReportSnapshot and this causes raft to believe that index 100 is now reflected on r1/3

This is the crux of the issue, but I don't see how it's possible. Raft remembers the index of the pending snapshot, but it doesn't seem to assign it to the peer's Match index (used for quorum calculations). Instead, it only assigns it to the peer's Next index when transitioning to StateProbe.

Do you mind pointing at the code that makes this possible?

@andrewbaptist andrewbaptist self-assigned this Sep 8, 2022
@andrewbaptist
Copy link
Collaborator

andrewbaptist commented Sep 8, 2022

In the "complete delegated" snapshot branch - the truncated index is already added.

I agree with @nvanbenschoten however that today it calls BecomeProbe after a ReportSnapshot so I don't think that it would automatically adjust its index too far forward. However, the whole probing after snapshot seems inefficient since we know the exact index at that point of time and could go straight to StateReplicate

I'm curious why we start sending MsgApp messages at all when we are in the StateSnapshot state. The best case is that the new replica will ignore them, the worse case is that the replica responds in a way that makes the snapshot useless.

@tbg
Copy link
Member Author

tbg commented Sep 9, 2022

Ah, good catch, Nathan! I misread that as moving Match forward, based on the information that we promised to have made the snapshot at the raft-requested index durably applied on the follower.

This then also explains why we go to probe, and not immediately to replicate. We need the MsgAppResp for that, which actually provides the "proof" of what the latest log index is now. So if we want to bypass StateProbe after the snapshot, sadly we need to pass in the index of the snapshot and are on the hook for it actually reflecting the durable state of the follower.

With all this, Nathan's suggestion to transfer the actual MsgAppResp with the snapshot response and to step it into the leader makes a ton of sense, since that way we avoid making taking on any new duties w.r.t correctness - we're just delivering a message Raft is sending anyway, to make sure it has been received when we consider the snapshot "done". We should look at how ugly this gets in practice. I think it's sensible to add ReadyWithoutAdvance to raft and then peek at the last message, which has to be this MsgAppResp if there is going to be one. (If there isn't one, the snapshot was discarded and that will be good to know too!).

I'm curious why we start sending MsgApp messages at all when we are in the StateSnapshot state. The best case is that the new replica will ignore them, the worse case is that the replica responds in a way that makes the snapshot useless.

I don't think we do, in StateSnapshot raft just hangs tight. However, you do send one message to get into StateSnapshot. Initially, the follower is in StateProbe and the leader tries to append to it. That will get rejected, and now we're in StateSnapshot (since the rejection hint will be zero, i.e. below what the leader can catch the follower up from via the log). The follower sends an MsgAppResp upon applying the snapshot, which fast-tracks the leader back to StateReplicate. Maybe I misunderstood your question.

tbg added a commit to tbg/cockroach that referenced this issue Sep 9, 2022
@tbg
Copy link
Member Author

tbg commented Sep 9, 2022

Since there isn't a bug here, and we have #84242 to track the original issue (follower in StateProbe even though we thought we were "done" catching it up), I sent #87702 to close this with a comment update.

craig bot pushed a commit that referenced this issue Sep 9, 2022
87614: sql/opt/norm: propagate kv errors from cast folding r=ajwerner a=ajwerner

Swallowing KV errors here leads to incorrect results. Writes can be missed and serializability can be silently violated. This comes up in the context of the randomized schema change testing.

May deal with #85677
relates to #80764

Release note: None

87646: backupccl: deflake backup tests r=adityamaru a=adityamaru

See individual commits.

Release note: None

87667: server: clean up some logging r=ajwerner a=ajwerner

For one, this fixes a format directive issue in the first line. It also stops rendering some arguments to strings directly so that redaction works correctly.

Release note: None

87702: kvserver: explain our use of (*raftGroup).ReportSnapshot r=andrewbaptist a=tbg

Closes #87581.

Release note: None


87713: ci: skip "integration"-style tests in testrace r=knz a=rickystewart

Closes #87700.
Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@craig craig bot closed this as completed in #87702 Sep 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants