Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: re-enqueue Raft groups on paginated application #31568

Merged

Conversation

nvanbenschoten
Copy link
Member

Fixes #31330.

This change re-enqueues Raft groups for processing immediately if they
still have more to do after a Raft ready iteration. This comes up in
practice when a Range has sufficient load to force Raft application
pagination. See #31330 for a discussion on the symptoms this can
cause.

Release note (bug fix): Fix bug where Raft followers could fall behind
leaders will entry application, causing stalls during splits.

@nvanbenschoten nvanbenschoten requested review from bdarnell, petermattis and a team October 18, 2018 00:16
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Have you stressed this new test? Mucking with proposal application like that seems fragile. I don't have a better suggestion, though.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_test.go, line 9622 at r1 (raw file):

	repl := tc.repl

	// Propose a command to Raft and block its application.

Technically, you block application and then propose the command.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/divergentRaft branch from 88f4361 to 6666008 Compare October 18, 2018 01:31
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR.

Have you stressed this new test?

I stressed it for 10 minutes without issue.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_test.go, line 9622 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Technically, you block application and then propose the command.

Done.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica_test.go, line 9586 at r2 (raw file):

// TestApplyPaginatedCommittedEntries tests that a Raft group's committed
// entries are quickly applied, even if their application is paginated due to
// the RaftMaxSizePerMsg configuration. This is a regession test for #31330.

nit: regession


pkg/storage/replica_test.go, line 9659 at r2 (raw file):

	// small RaftMaxSizePerMsg.
	close(blockRaftApplication)
	const maxWait = 5 * time.Second

5s might be too aggressive to apply 50 commands when stressed sufficiently. Perhaps bump this a bit. Or we'll just leave it and find out.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

I'm OK with putting this in 2.1RC2.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale)

Fixes cockroachdb#31330.

This change re-enqueues Raft groups for processing immediately if they
still have more to do after a Raft ready iteration. This comes up in
practice when a Range has sufficient load to force Raft application
pagination. See cockroachdb#31330 for a discussion on the symptoms this can
cause.

Release note (bug fix): Fix bug where Raft followers could fall behind
leaders will entry application, causing stalls during splits.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/divergentRaft branch from 6666008 to 38b4827 Compare October 18, 2018 20:13
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs.

@bdarnell This is currently based on #31408, so the backport won't be completely clean because of the test. I'm assuming you're fine with that and still want to keep #31408 for 2.1.1.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_test.go, line 9586 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: regession

Done.


pkg/storage/replica_test.go, line 9659 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

5s might be too aggressive to apply 50 commands when stressed sufficiently. Perhaps bump this a bit. Or we'll just leave it and find out.

Bumped to 10s.

@nvanbenschoten
Copy link
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Oct 18, 2018
31568: storage: re-enqueue Raft groups on paginated application r=nvanbenschoten a=nvanbenschoten

Fixes #31330.

This change re-enqueues Raft groups for processing immediately if they
still have more to do after a Raft ready iteration. This comes up in
practice when a Range has sufficient load to force Raft application
pagination. See #31330 for a discussion on the symptoms this can
cause.

Release note (bug fix): Fix bug where Raft followers could fall behind
leaders will entry application, causing stalls during splits.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig
Copy link
Contributor

craig bot commented Oct 18, 2018

Build succeeded

@craig craig bot merged commit 38b4827 into cockroachdb:master Oct 18, 2018
@bdarnell
Copy link
Contributor

This is currently based on #31408, so the backport won't be completely clean because of the test. I'm assuming you're fine with that and still want to keep #31408 for 2.1.1.

Correct.

@nvanbenschoten
Copy link
Member Author

@bdarnell how do you feel about this being backported to 2.0?

@bdarnell
Copy link
Contributor

I think it's a good idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: divergent entry application can lead to request stalls on Range splits
5 participants