
release-20.1: raftentry: fix accidental no-op in truncateFrom #63475

Merged
merged 3 commits into cockroachdb:release-20.1 from backport20.1-63302 on Apr 20, 2021

Conversation

tbg
Member

@tbg tbg commented Apr 12, 2021

Backport 3/3 commits from #63302.

/cc @cockroachdb/release


When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, 100, 101, 102 to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log (103 and up). The raft entry cache must
follow the same semantics, so that it is always a coherent subset of
the raft log.
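
To make that invariant concrete, here is a minimal sketch using a toy map-based cache. This is not the raftentry implementation (which uses a per-range ring buffer and a byte budget); the `entryCache` type and its `add` method are invented for illustration.

```go
package main

import "fmt"

// entryCache is a toy stand-in for the raft entry cache: a map from log
// index to an opaque payload.
type entryCache struct {
	entries map[uint64]string
}

// add caches payloads starting at firstIndex. Mirroring the raft log, it
// must first evict every cached entry at firstIndex and above, so that the
// cache never retains indexes the log has overwritten.
func (c *entryCache) add(firstIndex uint64, payloads []string) {
	for idx := range c.entries {
		if idx >= firstIndex {
			delete(c.entries, idx)
		}
	}
	for i, p := range payloads {
		c.entries[firstIndex+uint64(i)] = p
	}
}

func main() {
	c := &entryCache{entries: map[uint64]string{}}
	c.add(100, []string{"a", "b", "c"})         // caches 100, 101, 102
	c.add(101, []string{"B", "C"})              // evicts 101, 102, then re-adds them
	fmt.Println(len(c.entries), c.entries[102]) // prints: 3 C
}
```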

It turns out the cache didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

[101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes [50, 51, 52] which cannot fit into the cache at all (i.e. their aggregate size exceeds that of the cache).

The result of this addition should be an empty cache, because the true
raft log now ends at index 52. In reality, though, the cache
accidentally turned this operation into a no-op: it was using an
iterator that would be invalid when positioned at an index not
currently cached (in this example, index 50, whereas the first
cached index would be 101).
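
The commit list further down notes that the fix is to truncate before adding in `(*Cache).Add()`. As a hedged sketch of why that ordering matters, the toy cache from the previous snippet can be extended with a capacity check; the `addWithBudget` method and `maxEntries` parameter are invented here and only approximate the real byte budget.

```go
// addWithBudget behaves like add, but refuses to cache payloads that exceed
// the (toy) capacity. The eviction happens unconditionally and *before* the
// capacity check: the log now ends at firstIndex+len(payloads)-1, so any
// previously cached higher index is stale regardless of whether the new
// entries themselves fit. The bug described above effectively skipped both
// steps when firstIndex preceded the cached range.
func (c *entryCache) addWithBudget(firstIndex uint64, payloads []string, maxEntries int) {
	for idx := range c.entries {
		if idx >= firstIndex {
			delete(c.entries, idx)
		}
	}
	if len(payloads) > maxEntries {
		return // too large to cache; the cache stays (correctly) truncated
	}
	for i, p := range payloads {
		c.entries[firstIndex+uint64(i)] = p
	}
}
```

With the cache holding 101..104, a call with firstIndex 50 and three oversized entries must leave the cache empty; the pre-fix behavior instead left all four stale entries in place.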

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (#61990) of the
hotspotsplits/nodes=4 roachtest (which was only caught thanks to
#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contained the entire history of disk
writes for the cluster, we found that the inconsistency had occurred
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced with entries for term 9. Of
course, these discarded entries should never have been committed, and
in particular the term can never regress in the raft log, even for
uncommitted entries, at least when Raft is working correctly.
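
That invariant (within a single replica's log, terms never decrease as the index increases) is easy to state as a check. The following is a generic sketch, not CockroachDB code; `termsNonDecreasing` is an invented helper.

```go
// termsNonDecreasing reports whether a raft log's entry terms, listed in
// index order, never regress. The divergent follower above (terms
// 9 9 7 7 7 9 9 9) fails this check, signaling that the log contents
// served to it were corrupt.
func termsNonDecreasing(terms []uint64) bool {
	for i := 1; i < len(terms); i++ {
		if terms[i] < terms[i-1] {
			return false
		}
	}
	return true
}
```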

From the WAL writes, we were able to ascertain that the leader (which
itself had the correct writes in its log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that we were
either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial step in identifying this bug was adding invariant checks
(which will be re-added in a future commit) that verified that the
truncation had indeed removed subsequent entries. This highlighted that
we were hitting this bug in at least ~5% of hotspotsplits runs, but
possibly more frequently (the assertion did not catch all instances of
the bug). hotspotsplits brings the crucial ingredients here: it does
concentrated 256KiB inserts, which put the raft group into mild
overload; many entries pile up and get processed in aggregate, which
leads to constantly taking the path in which truncations are
erroneously ignored. However, most of the time omitting the eviction
does not yield replica inconsistency. For that to happen, the
erroneously cached entries need to be picked up by raft for sending to
a follower; they need to be the "incorrect" ones (which necessitates a
discarded log and a term change at the right moment); and finally they
must be in log positions at which the correct entries actually had an
effect (which, to be fair, is mostly the case).
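
Shaped against the toy cache from the earlier sketches, the check looks roughly like this; the real assertion lives inside the raftentry package and is not reproduced here, and `assertTruncated` is an invented name.

```go
// assertTruncated panics if any cached index survives beyond lastIndex, the
// index of the last entry just added. Run after every add, this is the kind
// of post-condition that flagged the erroneous no-op described above.
func assertTruncated(c *entryCache, lastIndex uint64) {
	for idx := range c.entries {
		if idx > lastIndex {
			panic(fmt.Sprintf("index %d still cached after append ending at %d", idx, lastIndex))
		}
	}
}
```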

The history of this bug is actually interesting. The code that has
the bug was introduced in #36360. However, that PR fixed a much more
egregious version of this bug: prior to it, the raft entry cache
simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
led to more frequent inconsistencies (as referenced in the issue).
So we went from a bad, moderately rare bug to a bad, very rare bug
and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800,000 minutes of
n2-standard-4 instance time on reproductions related to this bug
in this instance alone, which I think boils down to around $39,000. I
like to think we got off cheap here.

Closes #61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. Such divergence would be reported by the replica consistency
checker, typically within 24 hours of occurring, and would cause the
affected nodes to terminate.

tbg added 2 commits April 12, 2021 16:05
This productionizes a number of assertions that were added in cockroachdb#62791,
while investigating cockroachdb#61990.
We don't have a good story about whether assertions always need to
be behind a build tag. I added the assertions that are dirt cheap
without a build tag, and where I augmented an existing assertion
I kept it behind the existing build tag gating.

If any of these assertions fires, there is a good chance that we are
facing potential log corruption (the one in the entry cache is often
benign, but once it leads to a term regression, it very likely means a
bad time), so if anything, in the (relatively distant) future these
assertions should fail "gracefully" at the replica level as opposed to
taking down the node; we wouldn't want to disable them. Since we can't
fail assertions at the replica level now, per-node is fine, and they
will have plenty of time to bake before they make it into a release.

Release note: None
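
For readers unfamiliar with the build-tag gating mentioned in the commit message above, here is a generic sketch of the pattern in Go: expensive checks compile to a constant-false branch by default and are enabled only when a tag is set. The tag name `expensive_checks`, the file names, and the package name are illustrative, not the ones CockroachDB uses.

```go
// File expensive_on.go: compiled only with `go build -tags expensive_checks`.
//go:build expensive_checks

package cache

const expensiveChecksEnabled = true
```

```go
// File expensive_off.go: the default build; the gated checks compile away.
//go:build !expensive_checks

package cache

const expensiveChecksEnabled = false

// Call sites then gate the costly work on the constant, e.g.:
//
//	if expensiveChecksEnabled {
//		verifyCacheIsCoherentSubsetOfLog() // hypothetical expensive check
//	}
```

Cheap checks, by contrast, are left unconditional, which is the split described in the commit message.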
@tbg tbg requested a review from nvanbenschoten April 12, 2021 14:07
@cockroach-teamcity
Member

This change is Reviewable

- simplify `lo` replacement in truncateFrom
- truncate before adding in `(*Cache).Add()`
- add unit test coverage for ringBuf

This commit also fixes a bug where `first()` did not return an invalid
iterator as advertised. Since the resulting iterator's `.next()` would
wrap around, the iterator would accidentally stay valid forever and
traverse the buffer. `first()` now actually returns an invalid iterator,
and so does `last()`, and a full code audit of all callers was carried
out.

The tests were updated to use simplified entry sizes. Previously, the
entry size in tests sized the `ent.Data` slice, but the ring buffer
uses `ent.Size()`. Now we reverse-engineer entries of the correct
size in testing, which makes test cases vastly easier to follow.

Release note: None
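
To illustrate the iterator hazard described in the commit message above, here is a self-contained sketch of a toy ring buffer. It is not the actual `ringBuf`/`first()`/`last()` code, just a demonstration of how a purely modulo-based `next` keeps an iterator "valid" forever.

```go
package ringdemo

// ring is a toy fixed-capacity ring buffer: the logical elements occupy
// positions [head, head+length) modulo len(buf).
type ring struct {
	buf    []int
	head   int
	length int
}

// buggyNext advances a position purely modulo the buffer capacity. It can
// never report "past the end", so an iterator that ought to be invalid
// (for example, one derived from an empty ring) silently keeps walking the
// buffer forever, which is the wrap-around failure mode described above.
func (r *ring) buggyNext(pos int) int {
	return (pos + 1) % len(r.buf)
}

// fixedNext additionally reports whether the advanced position still refers
// to a logical element, so callers can terminate at the end of the ring.
func (r *ring) fixedNext(pos int) (int, bool) {
	next := (pos + 1) % len(r.buf)
	end := (r.head + r.length) % len(r.buf) // one past the last logical element
	return next, r.length > 0 && next != end
}
```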
@tbg tbg force-pushed the backport20.1-63302 branch from fd1f712 to d60fbfe Compare April 15, 2021 08:23
@tbg tbg merged commit e0c5f34 into cockroachdb:release-20.1 Apr 20, 2021
@tbg tbg deleted the backport20.1-63302 branch April 20, 2021 12:44