kv: use RaftAppliedIndexTerm to generate SnapshotMetadata, don't scan log #88596
Conversation
kv: use RaftAppliedIndexTerm to generate SnapshotMetadata, don't scan log

This commit replaces the call to `Term(raftAppliedIndex)` with direct use of the new `RaftAppliedIndexTerm` field (added in c3bc064) when generating a `SnapshotMetadata` in service of the `raft.Storage.Snapshot` interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also helps with projects like cockroachdb#87050, where async Raft log writes make it possible for a Raft leader to apply an entry before it has been appended to the leader's own log. Such flexibility[^1] would help smooth out tail latency in any single replica's local log writes, even if that replica is the leader itself. This is an important characteristic of quorum systems that we fail to provide because of the tight coupling between the Raft leader's own log writes and the Raft leader's acknowledgement of committed proposals.

[^1]: if safe, I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely coupled log truncation.
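For illustration, here is a minimal sketch of the shape of that change, assuming a small stand-in struct for the replica's in-memory applied state (the `replicaState` type and `snapshotMetadata` helper below are hypothetical, not the actual cockroach code):

```go
package snapshotsketch

import "go.etcd.io/etcd/raft/v3/raftpb"

// replicaState is a hypothetical stand-in for the replica's in-memory applied
// state; in CRDB the real field is the ReplicaState's RaftAppliedIndexTerm,
// added in c3bc064.
type replicaState struct {
	RaftAppliedIndex     uint64
	RaftAppliedIndexTerm uint64
}

// snapshotMetadata sketches the simplification: the term of the applied index
// is read directly from applied state rather than looked up via
// Term(raftAppliedIndex), which would otherwise consult the raft log and can
// fail once async log writes let application outpace the local log.
func snapshotMetadata(s replicaState) raftpb.SnapshotMetadata {
	return raftpb.SnapshotMetadata{
		Index: s.RaftAppliedIndex,
		Term:  s.RaftAppliedIndexTerm,
	}
}
```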
This code is no longer needed on master.
Force-pushed from 8fe893b to 89058d5.
Reviewed 1 of 1 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
-- commits
line 20 at r1:
I didn't quite understand this. Is it saying that strongly coupled truncation is not safe if the leader is lagging? But the leader is the one making the decision and knows its own state, so I must be missing something?
Nice cleanup.
I didn't quite understand this. Is it saying that strongly coupled truncation is not safe if the leader is lagging? But the leader is the one making the decision and knows its own state, so I must be missing something?
I can't really get a big problem out of this either:
- n1 (leader) has its `append` method stuck (in particular, let's assume nothing gets made durable)
- n2 and n3 are working fine
- indexes 90..100 get appended to the log and commit via n2 and n3
- n1 starts a non-cooperative truncation; this may affect indexes <= 100 (since 100 is committed and there's currently no special-casing of the leader's local state in that code IIRC)
- this commits at index 101 with quorum (n2,n3)
- n2 and n3 can apply the truncation, that's fine - worst case it'll lead to a snapshot for n1 later
- but if n1 applies it and then crash-restarts, we have:
- applied index = 90 (say)
- first index = 100
This isn't a great counter-example because we can easily fix this. The truncation code today already has to make sure the truncation is compatible with the local raft log, so it would be straightforward to turn it into a truncation that only affects [0, ..., appliedIndex], if that isn't already the case anyway.
Today we can't see a truncation that extends "past" the local log, which we do on n1 in this example, but this isn't too important either and can be fixed.
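As a concrete reading of that fix, a hypothetical helper (not existing CRDB code; the name and arguments are illustrative) that clamps a proposed truncation to the local applied index might look like:

```go
// clampTruncationIndex sketches the fix suggested above: a replica never
// enacts a truncation past its own applied index, so a crash cannot leave
// firstIndex ahead of appliedIndex.
func clampTruncationIndex(proposedIndex, appliedIndex uint64) uint64 {
	if proposedIndex > appliedIndex {
		return appliedIndex
	}
	return proposedIndex
}
```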
@nvanbenschoten re: "if safe, I haven't convinced myself that it is in all cases" I think we briefly chatted about this a while ago and I mentioned that the leader completeness property would get in the way. I no longer think so. The property is:
> If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.

There's nothing requiring the current leader to have all entries for the current term on stable storage. The leader will certainly have to be aware of all entries in its term (after all, how else is it going to assign log indexes etc.), but the entries do not have to be on stable storage in order for that to hold true. I wrote up #88699 to capture the explicit goal of reducing the blast radius of a slow leader.
I'm still confused (bear with me), so I'll try to elaborate on my understanding and you can correct me.

My understanding was that the async work was making the appends to the raft log async, but what is appended is fsynced. So when the leader decides on a truncation up to 100 and sends it via the raft log, it would be enacted (in the strongly-coupled case) by the async worker thread after it has finished the work of appending and fsyncing up to 100.

I just realized you said "applied index = 90", which is the state machine. We would already have a problem because we do non-fsynced application, but we don't, because both are synchronous in the same thread and share the same engine, so the fsyncing for the log-appending case takes care of fsyncing the previous state machine applications.

With async appending to the raft log, state machine application can both lead and lag what is durable in the local raft log. If it leads, then by virtue of sharing the same engine, the fsyncing of the appends will make the applied index durable (we will still need to fix up the mutual consistency of the raft log and the state machine at crash recovery time, like what is done in ReplicasStorage.Init). If the application lags, because it happens asynchronously via the

What am I missing?
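To make the lead/lag distinction concrete, here is a purely illustrative sketch of the condition a ReplicasStorage.Init-style recovery step would have to tolerate once appends are async (the `logState` type and helper are hypothetical, not CRDB code):

```go
// logState is a hypothetical summary of what a crash-recovery step would
// observe from the shared engine.
type logState struct {
	durableLastIndex uint64 // highest raft log entry that survived the crash
	appliedIndex     uint64 // state machine's applied index, durable via the same engine
}

// applicationLeadsLog reports whether the state machine got ahead of the
// durable raft log, a case that only becomes possible once log appends are
// made async. Recovery has to tolerate this rather than treat it as
// corruption; how exactly the log is then repaired is out of scope here.
func applicationLeadsLog(s logState) bool {
	return s.appliedIndex > s.durableLastIndex
}
```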
I agree that tightly coupled log truncation is safe even if

When I wrote this, I was conflating a discussion about the safety (in terms of the Raft protocol's invariants) of a leader's applied index outpacing its stable last index with the safety (in terms of CRDB's use of Raft) of any replica's applied index outpacing its stable last index. The case where you can see this go wrong is on follower replicas with tightly coupled log truncation. In a world with async log writes and async log application, you can pretty easily construct a situation that looks like:
bors r+
Build succeeded:
Oh, are we planning to let replicas apply entries they don't have in their durable log? I hadn't considered that but it makes sense to decouple the two as much as possible.
My plan was to expose this as an option (default off) through
This commit replaces the call to `Term(raftAppliedIndex)` with direct use of the new `RaftAppliedIndexTerm` field (added in c3bc064) when generating a `SnapshotMetadata` in service of the `raft.Storage.Snapshot` interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also helps with projects like #87050, where async Raft log writes make it possible for a Raft leader to apply an entry before it has been appended to the leader's own log. Such flexibility[^1] would help smooth out tail latency in any single replica's local log writes, even if that replica is the leader itself. This is an important characteristic of quorum systems that we fail to provide because of the tight coupling between the Raft leader's own log writes and the Raft leader's acknowledgment of committed proposals.
Release justification: None. Don't backport to release-22.2.
Release note: None.
Footnotes
if safe, I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely coupled log truncation.