backport-2.0: storage: prevent unbounded raft log growth without quorum #27868

Fixes cockroachdb#27772. This change adds safeguards to prevent cases where a raft log would grow without bound during loss of quorum scenarios. It also adds a new test that demonstrates that the raft log does not grow without bound in these cases. There are two cases that need to be handled to prevent the unbounded raft log growth observed in cockroachdb#27772. 1. When the leader proposes a command and cannot establish a quorum. In this case, we know the leader has the entry in its log, so there's no need to refresh it with `reasonTicks`. To avoid this, we no longer use `refreshTicks` as a leader. 2. When a follower proposes a command that is forwarded to the leader who cannot establish a quorum. In this case, the follower can't be sure (currently) that the leader got the proposal, so it needs to refresh using `reasonTicks`. However, the leader now detects duplicate forwarded proposals and avoids appending redundant entries to its log. It does so by maintaining a set of in-flight forwarded proposals that it has received during its term as leader. This set is reset after every leadership change. Both of these cases are tested against in the new TestLogGrowthWhenRefreshingPendingCommands. Without both of the safeguards introduced in this commit, the test fails. Release note (bug fix): Prevent unbounded growth of the raft log caused by a loss of quorum.

Release note: None

Fixes cockroachdb#27878. The test was flaky because a NodeLiveness update was getting stuck when killing a majority of nodes in a Range. The fix was to tie NodeLiveness' update context to its stopper so that liveness updates are canceled when their node begins to shut down. This was a real issue that I suspect would occasionally make nodes hang on shutdown when their liveness range was becoming unavailable. There are probably other issues in the same class of bugs, but stressing `TestLogGrowthWhenRefreshingPendingCommands` isn't showing anything else. In doing so, the change needed to extend `stopper.WithCancel` into `stopper.WithCancelOnQuiesce` and `stopper.WithCancelOnStop`. Release note: None

Before this change, `WithCancelOnQuiesce` and `WithCancelOnStop` were dangerous to use because they would never clean up memory. This meant that any use of the methods that happened more than a constant number of times would slowly leak memory. This was an issue in `client.Txn.rollback`. This change fixes the methods so that it's possible for callers to clean up after themselves. Added a warning to `Stopper.AddCloser` because this had a similar issue. Release note: None

Fixes cockroachdb#27908. Release note: None

Before this change the Stopper made no guarantee that `Closers` registered by the `AddCloser` method or `Contexts` tied to the `Stopper` with `WithCancelOn{Quiesce,Stop}` were ever properly cleaned up. In cases where these methods were called concurrently with the Stopper stopping, it was possible for the resources to leak. This change fixes this by handling cases where these methods are called concurrently with or after the Stopper has stopped. This allows them to provide stronger external guarantees. Release note: None

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

backport-2.0: storage: prevent unbounded raft log growth without quorum #27868

backport-2.0: storage: prevent unbounded raft log growth without quorum #27868

Commits on Aug 6, 2018