storage: bump LeaseAppliedIndex instead of reproposing #27054
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
In #26830, we observed an instance in which the uncommitted portion of the Raft log grew to over 7GB of size, leading to an out-of-memory error due to being loaded into memory on election. This particular out-of-memory issue has been fixed, but not the process by which the uncommitted Raft log grew to 7GB in the first place.
This likely happened because the split queue retried the split repeatedly, each time proposing on the order of 100mb into Raft. This in itself is a problem (that has partially been addressed, see #25233) but a problematic piece of code related to this is the reproposal logic:
https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica.go#L4396-L4417
which repeatedly creates copies of proposals that don't commit within a target interval. Consider the case in which Raft commit is significantly delayed or even stalled (lost quorum) -- the mechanism is essentially blowing up the uncommitted tail of the Raft log for no good reason.
This raises the question why we're even trying to repropose these commands. We should simply propose a "no-op" through Raft that allocates a new LeaseAppliedIndex, after which the pending proposals below it are discarded (to be retried externally).
The semantics of that approach seem saner as we never duplicate large commands and as proposers already have to expect retries and are equipped to handle them.
@bdarnell this seems worth doing for 2.1, if you don't have any reservations. The change shouldn't be too big (probably most of the the time spent fixing up the tests that test this through intimate knowledge of the current code).
The text was updated successfully, but these errors were encountered: