Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backport-2.0: storage: prevent unbounded raft log growth without quorum #27868

Merged
merged 6 commits into from
Aug 6, 2018

Conversation

nvanbenschoten
Copy link
Member

Backport 2/2 commits from #27774.

/cc @cockroachdb/release


Fixes #27772.

This change adds safeguards to prevent cases where a raft log
would grow without bound during loss of quorum scenarios. It
also adds a new test that demonstrates that the raft log does
not grow without bound in these cases.

There are two cases that need to be handled to prevent the
unbounded raft log growth observed in #27772.

  1. When the leader proposes a command and cannot establish a
    quorum. In this case, we know the leader has the entry in
    its log, so there's no need to refresh it with reasonTicks.
    To avoid this, we no longer use refreshTicks as a leader.
  2. When a follower proposes a command that is forwarded to the
    leader who cannot establish a quorum. In this case, the
    follower can't be sure (currently) that the leader got the
    proposal, so it needs to refresh using reasonTicks. However,
    the leader now detects duplicate forwarded proposals and
    avoids appending redundant entries to its log. It does so
    by maintaining a set of in-flight forwarded proposals that
    it has received during its term as leader. This set is reset
    after every leadership change.

Both of these cases are tested against in the new
TestLogGrowthWhenRefreshingPendingCommands. Without both of
the safeguards introduced in this commit, the test fails.

Release note (bug fix): Prevent loss of quorum situations from
allowing unbounded growth of a Range's Raft log.

@nvanbenschoten nvanbenschoten requested review from bdarnell and a team July 23, 2018 23:12
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten
Copy link
Member Author

I've been delaying this until all of the fallout from #27774 is addressed. None of it has been with the actual approach here, but there was a tail of flakes (#27888 and #27920) revealed by the test introduced in this change.

@bdarnell
Copy link
Contributor

bdarnell commented Aug 6, 2018

Is this ready to merge? We want to get it in to 2.0.5 (which starts qualifying today).

@nvanbenschoten
Copy link
Member Author

I've included #27888 and #27920 in the backport, so this won't be introducing a flaky test. PTAL.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 5 files at r1, 3 of 3 files at r2, 3 of 5 files at r4, 2 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Fixes cockroachdb#27772.

This change adds safeguards to prevent cases where a raft log
would grow without bound during loss of quorum scenarios. It
also adds a new test that demonstrates that the raft log does
not grow without bound in these cases.

There are two cases that need to be handled to prevent the
unbounded raft log growth observed in cockroachdb#27772.
1. When the leader proposes a command and cannot establish a
   quorum. In this case, we know the leader has the entry in
   its log, so there's no need to refresh it with `reasonTicks`.
   To avoid this, we no longer use `refreshTicks` as a leader.
2. When a follower proposes a command that is forwarded to the
   leader who cannot establish a quorum. In this case, the
   follower can't be sure (currently) that the leader got the
   proposal, so it needs to refresh using `reasonTicks`. However,
   the leader now detects duplicate forwarded proposals and
   avoids appending redundant entries to its log. It does so
   by maintaining a set of in-flight forwarded proposals that
   it has received during its term as leader. This set is reset
   after every leadership change.

Both of these cases are tested against in the new
TestLogGrowthWhenRefreshingPendingCommands. Without both of
the safeguards introduced in this commit, the test fails.

Release note (bug fix): Prevent unbounded growth of the raft log
caused by a loss of quorum.
Fixes cockroachdb#27878.

The test was flaky because a NodeLiveness update was getting stuck when
killing a majority of nodes in a Range. The fix was to tie NodeLiveness'
update context to its stopper so that liveness updates are canceled when
their node begins to shut down.

This was a real issue that I suspect would occasionally make nodes hang on
shutdown when their liveness range was becoming unavailable. There are
probably other issues in the same class of bugs, but stressing
`TestLogGrowthWhenRefreshingPendingCommands` isn't showing anything else.

In doing so, the change needed to extend `stopper.WithCancel` into
`stopper.WithCancelOnQuiesce` and `stopper.WithCancelOnStop`.

Release note: None
Before this change, `WithCancelOnQuiesce` and `WithCancelOnStop` were
dangerous to use because they would never clean up memory. This meant
that any use of the methods that happened more than a constant number
of times would slowly leak memory. This was an issue in `client.Txn.rollback`.
This change fixes the methods so that it's possible for callers to clean
up after themselves.

Added a warning to `Stopper.AddCloser` because this had a similar
issue.

Release note: None
Before this change the Stopper made no guarantee that `Closers`
registered by the `AddCloser` method or `Contexts` tied to
the `Stopper` with `WithCancelOn{Quiesce,Stop}` were ever properly
cleaned up. In cases where these methods were called concurrently
with the Stopper stopping, it was possible for the resources to
leak. This change fixes this by handling cases where these methods
are called concurrently with or after the Stopper has stopped. This
allows them to provide stronger external guarantees.

Release note: None
@nvanbenschoten
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 6, 2018

Build failed

@nvanbenschoten
Copy link
Member Author

Flake is #28228.

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 6, 2018

Build failed

@nvanbenschoten
Copy link
Member Author

Same flake. Retrying one more time before taking more drastic measures.

bors r+

craig bot pushed a commit that referenced this pull request Aug 6, 2018
27868: backport-2.0: storage: prevent unbounded raft log growth without quorum r=nvanbenschoten a=nvanbenschoten

Backport 2/2 commits from #27774.

/cc @cockroachdb/release

---

Fixes #27772.

This change adds safeguards to prevent cases where a raft log
would grow without bound during loss of quorum scenarios. It
also adds a new test that demonstrates that the raft log does
not grow without bound in these cases.

There are two cases that need to be handled to prevent the
unbounded raft log growth observed in #27772.
1. When the leader proposes a command and cannot establish a
   quorum. In this case, we know the leader has the entry in
   its log, so there's no need to refresh it with `reasonTicks`.
   To avoid this, we no longer use `refreshTicks` as a leader.
2. When a follower proposes a command that is forwarded to the
   leader who cannot establish a quorum. In this case, the
   follower can't be sure (currently) that the leader got the
   proposal, so it needs to refresh using `reasonTicks`. However,
   the leader now detects duplicate forwarded proposals and
   avoids appending redundant entries to its log. It does so
   by maintaining a set of in-flight forwarded proposals that
   it has received during its term as leader. This set is reset
   after every leadership change.

Both of these cases are tested against in the new
TestLogGrowthWhenRefreshingPendingCommands. Without both of
the safeguards introduced in this commit, the test fails.

Release note (bug fix): Prevent loss of quorum situations from
allowing unbounded growth of a Range's Raft log.


28225: release-2.0: importccl: Preserve '\r\n' during CSV import r=dt a=dt

Backport 1/1 commits from #28181.

/cc @cockroachdb/release

---

See #25344.


Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: neeral <[email protected]>
Co-authored-by: David Taylor <[email protected]>
@craig
Copy link
Contributor

craig bot commented Aug 6, 2018

Build succeeded

@craig craig bot merged commit 7ffc9e1 into cockroachdb:release-2.0 Aug 6, 2018
@nvanbenschoten nvanbenschoten deleted the backport2.0-27774 branch August 9, 2018 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants