Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backport-2.0: storage: prevent unbounded raft log growth without quorum #27868

Merged
merged 6 commits into from
Aug 6, 2018

Commits on Aug 6, 2018

  1. storage: prevent unbounded raft log growth without quorum

    Fixes cockroachdb#27772.
    
    This change adds safeguards to prevent cases where a raft log
    would grow without bound during loss of quorum scenarios. It
    also adds a new test that demonstrates that the raft log does
    not grow without bound in these cases.
    
    There are two cases that need to be handled to prevent the
    unbounded raft log growth observed in cockroachdb#27772.
    1. When the leader proposes a command and cannot establish a
       quorum. In this case, we know the leader has the entry in
       its log, so there's no need to refresh it with `reasonTicks`.
       To avoid this, we no longer use `refreshTicks` as a leader.
    2. When a follower proposes a command that is forwarded to the
       leader who cannot establish a quorum. In this case, the
       follower can't be sure (currently) that the leader got the
       proposal, so it needs to refresh using `reasonTicks`. However,
       the leader now detects duplicate forwarded proposals and
       avoids appending redundant entries to its log. It does so
       by maintaining a set of in-flight forwarded proposals that
       it has received during its term as leader. This set is reset
       after every leadership change.
    
    Both of these cases are tested against in the new
    TestLogGrowthWhenRefreshingPendingCommands. Without both of
    the safeguards introduced in this commit, the test fails.
    
    Release note (bug fix): Prevent unbounded growth of the raft log
    caused by a loss of quorum.
    nvanbenschoten committed Aug 6, 2018
    Configuration menu
    Copy the full SHA
    20cfb25 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    bc1e14a View commit details
    Browse the repository at this point in the history
  3. storage: cancel liveness update context on Stopper cancellation

    Fixes cockroachdb#27878.
    
    The test was flaky because a NodeLiveness update was getting stuck when
    killing a majority of nodes in a Range. The fix was to tie NodeLiveness'
    update context to its stopper so that liveness updates are canceled when
    their node begins to shut down.
    
    This was a real issue that I suspect would occasionally make nodes hang on
    shutdown when their liveness range was becoming unavailable. There are
    probably other issues in the same class of bugs, but stressing
    `TestLogGrowthWhenRefreshingPendingCommands` isn't showing anything else.
    
    In doing so, the change needed to extend `stopper.WithCancel` into
    `stopper.WithCancelOnQuiesce` and `stopper.WithCancelOnStop`.
    
    Release note: None
    nvanbenschoten committed Aug 6, 2018
    Configuration menu
    Copy the full SHA
    723f1af View commit details
    Browse the repository at this point in the history
  4. stopper: avoid memory leak in Stopper.WithCancel methods

    Before this change, `WithCancelOnQuiesce` and `WithCancelOnStop` were
    dangerous to use because they would never clean up memory. This meant
    that any use of the methods that happened more than a constant number
    of times would slowly leak memory. This was an issue in `client.Txn.rollback`.
    This change fixes the methods so that it's possible for callers to clean
    up after themselves.
    
    Added a warning to `Stopper.AddCloser` because this had a similar
    issue.
    
    Release note: None
    nvanbenschoten committed Aug 6, 2018
    Configuration menu
    Copy the full SHA
    b201df1 View commit details
    Browse the repository at this point in the history
  5. stopper: synchronize access to sCancels functions

    Fixes cockroachdb#27908.
    
    Release note: None
    nvanbenschoten committed Aug 6, 2018
    Configuration menu
    Copy the full SHA
    e2adebe View commit details
    Browse the repository at this point in the history
  6. stopper: Ensure that Closers and Ctxs are always cleaned up

    Before this change the Stopper made no guarantee that `Closers`
    registered by the `AddCloser` method or `Contexts` tied to
    the `Stopper` with `WithCancelOn{Quiesce,Stop}` were ever properly
    cleaned up. In cases where these methods were called concurrently
    with the Stopper stopping, it was possible for the resources to
    leak. This change fixes this by handling cases where these methods
    are called concurrently with or after the Stopper has stopped. This
    allows them to provide stronger external guarantees.
    
    Release note: None
    nvanbenschoten committed Aug 6, 2018
    Configuration menu
    Copy the full SHA
    7ffc9e1 View commit details
    Browse the repository at this point in the history