kv: use RaftAppliedIndexTerm to generate SnapshotMetadata, don't scan log #88596
Conversation
kv: use RaftAppliedIndexTerm to generate SnapshotMetadata, don't scan log

This commit replaces the call to `Term(raftAppliedIndex)` with direct use of the new `RaftAppliedIndexTerm` field (added in c3bc064) when generating a `SnapshotMetadata` in service of the `raft.Storage.Snapshot` interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also helps with projects like cockroachdb#87050, where async Raft log writes make it possible for a Raft leader to apply an entry before it has been appended to the leader's own log. Such flexibility[^1] would help smooth out tail latency in any single replica's local log writes, even if that replica is the leader itself. This is an important characteristic of quorum systems that we fail to provide because of the tight coupling between the Raft leader's own log writes and the Raft leader's acknowledgement of committed proposals.

[^1]: if safe, I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely coupled log truncation.
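For illustration, here is a minimal sketch of the shape of that change, assuming a small stand-in struct for the replica's in-memory applied state (the `replicaState` type and `snapshotMetadata` helper below are hypothetical, not the actual cockroach code):

```go
package snapshotsketch

import "go.etcd.io/etcd/raft/v3/raftpb"

// replicaState is a hypothetical stand-in for the replica's in-memory applied
// state; in CRDB the real field is the ReplicaState's RaftAppliedIndexTerm,
// added in c3bc064.
type replicaState struct {
	RaftAppliedIndex     uint64
	RaftAppliedIndexTerm uint64
}

// snapshotMetadata sketches the simplification: the term of the applied index
// is read directly from applied state rather than looked up via
// Term(raftAppliedIndex), which would otherwise consult the raft log and can
// fail once async log writes let application outpace the local log.
func snapshotMetadata(s replicaState) raftpb.SnapshotMetadata {
	return raftpb.SnapshotMetadata{
		Index: s.RaftAppliedIndex,
		Term:  s.RaftAppliedIndexTerm,
	}
}
```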
This code is no longer needed on master.
Force-pushed from 8fe893b to 89058d5.
Reviewed 1 of 1 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
-- commits
line 20 at r1:
I didn't quite understand this. Is it saying that strongly coupled truncation is not safe if the leader is lagging? But the leader is the one making the decision and knows its own state, so I must be missing something?
Nice cleanup.
I didn't quite understand this. Is it saying that strongly coupled truncation is not safe if the leader is lagging? But the leader is the one making the decision and knows its own state, so I must be missing something?
I can't really get a big problem out of this either:
- n1 (leader) has its `append` method stuck (in particular, let's assume nothing gets made durable)
- n2 and n3 are working fine
- indexes 90..100 get appended to the log and commit via n2 and n3
- n1 starts a non-cooperative truncation; this may affect indexes <= 100 (since 100 is committed and there's currently no special-casing of the leader's local state in that code IIRC)
- this commits at index 101 with quorum (n2,n3)
- n2 and n3 can apply the truncation, that's fine - worst case it'll lead to a snapshot for n1 later
- but if n1 applies it and then crash-restarts, we have:
- applied index = 90 (say)
- first index = 100
This isn't a great counter-example because we can easily fix this. The truncation code today already has to make sure the truncation is compatible with the local raft log, so it would be straightforward to turn it into a truncation that only affects [0, ..., appliedIndex], if that isn't already the case anyway.
Today we can't see a truncation that extends "past" the local log, which we do on n1 in this example, but this isn't too important either and can be fixed.
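As a concrete reading of that fix, a hypothetical helper (not existing CRDB code; the name and arguments are illustrative) that clamps a proposed truncation to the local applied index might look like:

```go
// clampTruncationIndex sketches the fix suggested above: a replica never
// enacts a truncation past its own applied index, so a crash cannot leave
// firstIndex ahead of appliedIndex.
func clampTruncationIndex(proposedIndex, appliedIndex uint64) uint64 {
	if proposedIndex > appliedIndex {
		return appliedIndex
	}
	return proposedIndex
}
```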
@nvanbenschoten re: "if safe, I haven't convinced myself that it is in all cases" I think we briefly chatted about this a while ago and I mentioned that the leader completeness property would get in the way. I no longer think so. The property is:
> If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.

There's nothing requiring the current leader to have all entries for the current term on stable storage. The leader will certainly have to be aware of all entries in its term (after all, how else is it going to assign log indexes etc.), but the entries do not have to be on stable storage in order for that to hold true. I wrote up #88699 to capture the explicit goal of reducing the blast radius of a slow leader.
I'm still confused (bear with me), so I'll try to elaborate on my understanding and you can correct me.

My understanding was that the async work was making the appends to the raft log async, but what is appended is fsynced. So when the leader decides on a truncation up to 100 and sends it via the raft log, it would be enacted (in the strongly-coupled case) by the async worker thread after it has finished the work of appending and fsyncing up to 100.

I just realized you said "applied index = 90", which is the state machine. We would already have a problem because we do non-fsynced application, but we don't, because both are synchronous in the same thread and share the same engine, so the fsyncing for the log-appending case takes care of fsyncing the previous state machine applications.

With async appending to the raft log, state machine application can both lead and lag what is durable in the local raft log. If it leads, then by virtue of sharing the same engine, the fsyncing of the appends will make the applied index durable (we will still need to fix up the mutual consistency of the raft log and the state machine at crash recovery time, like what is done in ReplicasStorage.Init). If the application lags, because it happens asynchronously via the

What am I missing?
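To make the lead/lag distinction concrete, here is a purely illustrative sketch of the condition a ReplicasStorage.Init-style recovery step would have to tolerate once appends are async (the `logState` type and helper are hypothetical, not CRDB code):

```go
// logState is a hypothetical summary of what a crash-recovery step would
// observe from the shared engine.
type logState struct {
	durableLastIndex uint64 // highest raft log entry that survived the crash
	appliedIndex     uint64 // state machine's applied index, durable via the same engine
}

// applicationLeadsLog reports whether the state machine got ahead of the
// durable raft log, a case that only becomes possible once log appends are
// made async. Recovery has to tolerate this rather than treat it as
// corruption; how exactly the log is then repaired is out of scope here.
func applicationLeadsLog(s logState) bool {
	return s.appliedIndex > s.durableLastIndex
}
```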
I agree that tightly coupled log truncation is safe even if

When I wrote this, I was conflating a discussion about the safety (in terms of the Raft protocol's invariants) of a leader's applied index outpacing its stable last index with the safety (in terms of CRDB's use of Raft) of any replica's applied index outpacing its stable last index. The case where you can see this go wrong is on follower replicas with tightly coupled log truncation. In a world with async log writes and async log application, you can pretty easily construct a situation that looks like:
bors r+
Build succeeded:
Oh, are we planning to let replicas apply entries they don't have in their durable log? I hadn't considered that but it makes sense to decouple the two as much as possible.
My plan was to expose this as an option (default off) through
This commit replaces the call to `Term(raftAppliedIndex)` with direct use of the new `RaftAppliedIndexTerm` field (added in c3bc064) when generating a `SnapshotMetadata` in service of the `raft.Storage.Snapshot` interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also helps with projects like #87050, where async Raft log writes make it possible for a Raft leader to apply an entry before it has been appended to the leader's own log. Such flexibility[^1] would help smooth out tail latency in any single replica's local log writes, even if that replica is the leader itself. This is an important characteristic of quorum systems that we fail to provide because of the tight coupling between the Raft leader's own log writes and the Raft leader's acknowledgment of committed proposals.
Release justification: None. Don't backport to release-22.2.
Release note: None.
Footnotes
if safe, I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely coupled log truncation.