Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: deflake TestRefreshPendingCommands #12425

Conversation

petermattis
Copy link
Collaborator

@petermattis petermattis commented Dec 15, 2016

Node liveness heartbeats were causing a range lease to be reacquired
when the test was waiting for them to be drained. Due to the manual
clock, the draining would never properly complete.

Fixes #11771


This change is Reviewable

@tamird
Copy link
Contributor

tamird commented Dec 15, 2016

:lgtm:


Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@petermattis
Copy link
Collaborator Author

Ha, I ran stress for 10m, then sent this PR and then the stress failed. Gah. There is something else that is also requesting the lease again.

@petermattis petermattis force-pushed the pmattis/deflake-test-refresh-pending-commands branch from a755b5c to 89881b7 Compare December 15, 2016 20:43
@petermattis
Copy link
Collaborator Author

Second attempt. I'm trying to spur the inevitable failure by proclaiming that the flakiness is really fixed this time. PTAL.

@petermattis
Copy link
Collaborator Author

Making progress:

~ make stress TAGS=deadlock PKG=./storage TESTS=TestRefreshPendingCommands STRESSFLAGS="-maxfails 1 -stderr -p 16"
...
    --- FAIL: TestRefreshPendingCommands/#00 (2.34s)
    	client_raft_test.go:1000: range 1: replica {3 3 3} not lease holder; current lease is repl={3 3 3} start=0.900000123,17 exp=2.700000125,42 pro=1.800000125,43
FAIL


ERROR: exit status 1

7310 runs completed, 1 failures, over 20m31s

Before this PR the above stress invocation would fail within a minute or two.

@spencerkimball, @andreimatei How could the above error occur?

@petermattis
Copy link
Collaborator Author

Consulted with @andreimatei in person and he doesn't know how this error is being generated. I've added more logging and will reproduce.

@petermattis
Copy link
Collaborator Author

PS We saw this error before. We eventually attributed the error to strangeness in multiTestContext and the use of a manual clock. We're not doing any explicit lease transfers in TestRefreshPendingCommands, but perhaps the replicate queue is.

Node liveness heartbeats were causing a range lease to be reacquired
when the test was waiting for them to be drained. Due to the manual
clock, the draining would never properly complete.

Fixes cockroachdb#11771
@petermattis petermattis force-pushed the pmattis/deflake-test-refresh-pending-commands branch from 691bf11 to 79838d5 Compare December 16, 2016 14:12
@petermattis
Copy link
Collaborator Author

Merging as this removes one source of flakiness from this test. I'll tackle the next bit of flakiness in a separate PR.

@petermattis petermattis merged commit 6559f07 into cockroachdb:master Dec 16, 2016
@petermattis petermattis deleted the pmattis/deflake-test-refresh-pending-commands branch December 16, 2016 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants