Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: enable raft.Config.ResumeReplicateBelowPendingSnapshot #114640

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Nov 17, 2023

Blocked on etcd-io/raft#110.

Note to self: address this TODO

// TODO(dan): This is super brittle and non-obvious. In the common case,
// this check avoids duplicate work, but in rare cases, we send the
// learner snap at an index before the one raft wanted here. The raft
// leader should be able to use logs to get the rest of the way, but it
// doesn't try. In this case, skipping the raft snapshot would mean that
// we have to wait for the next scanner cycle of the raft snapshot queue
// to pick it up again. So, punt the responsibility back to raft by
// telling it that the snapshot failed. If the learner snap ends up being
// sufficient, this message will be ignored, but if we hit the case
// described above, this will cause raft to keep asking for a snap and at
// some point the snapshot lock above will be released and we'll fall
// through to the logic below.
repl.reportSnapshotStatus(ctx, repDesc.ReplicaID, err)


This allows leaders to resume replication to a follower that applies a snapshot below the leader's PendingSnapshot, as long as the snapshot connects the follower to the leader's log. Otherwise, if we delegate snapshot sending to a replica that's lagging behind the leader, it may send a snapshot below the leader's PendingSnapshot, and this would then be rejected even though it's a perfectly fine snapshot.

Resolves #106813.
Resolves #114349.
Epic: none

Release note (bug fix): Previously, Raft snapshots delegated to followers could be rejected by the leader after application if the delegate was lagging behind the leader. These snapshots are now accepted as long as they connect the follower to the leader's log.

@erikgrinaker erikgrinaker added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Nov 17, 2023
@erikgrinaker erikgrinaker requested a review from pav-kv November 17, 2023 12:45
@erikgrinaker erikgrinaker self-assigned this Nov 17, 2023
Copy link

blathers-crl bot commented Nov 17, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

This allows leaders to resume replication to a follower that applies a
snapshot below the leader's `PendingSnapshot`, as long as the snapshot
connects the follower to the leader's log. Otherwise, if we delegate
snapshot sending to a replica that's lagging behind the leader, it may
send a snapshot below the leader's `PendingSnapshot`, and this would
then be rejected even though it's a perfectly fine snapshot.

Epic: none
Release note (bug fix): Previously, Raft snapshots delegated to
followers could be rejected by the leader after application if the
delegate was lagging behind the leader. These snapshots are now accepted
as long as they connect the follower to the leader's log.
@erikgrinaker erikgrinaker force-pushed the raft-resume-replicate-below-pending-snapshot branch from 87b6ef5 to 652d8f3 Compare January 5, 2024 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
2 participants