storage: deflake TestStoreMetrics #7809
Partially reviewed, will resume later. Reviewed 4 of 5 files at r1.
storage/client_metrics_test.go, line 82 [r1] (raw file):
danger zone
storage/client_metrics_test.go, line 102 [r1] (raw file):
since you're here, uncapitalize these
storage/client_metrics_test.go, line 132 [r1] (raw file):
this comment is truly worthless
storage/client_test.go, line 544 [r1] (raw file):
remove?
Comments from Reviewable
Review status: 2 of 5 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending. storage/client_metrics_test.go, line 82 [r1] (raw file):
Reviewed 1 of 5 files at r1, 2 of 3 files at r2. storage/client_metrics_test.go, line 82 [r1] (raw file):
Reviewed 2 of 5 files at r1, 1 of 3 files at r2. storage/client_test.go, line 544 [r2] (raw file):
clientStopper is kind of an anachronism: prior to #4926 there was only one distSender, so it got its own stopper. Now that each store gets its own distSender, those distSenders should probably use the store's stopper as well. I'm not sure what pulling on that thread would do, but it might be a way around this extra channel and goroutine.
Comments from Reviewable
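For readers following along, the "extra channel and goroutine" can be pictured roughly as below. This is only a sketch under assumptions, not the code from this PR: `mergedStopper`, `setStore`, and `done` are hypothetical names, and plain `chan struct{}` values stand in for the real stoppers. The idea is a single channel that closes when either the long-lived client stopper or the current store's stopper fires, where the store side can be swapped out after a restart.

```go
package main

import (
	"fmt"
	"sync"
)

// mergedStopper is a hypothetical helper (not taken from the PR). It merges
// the stop signals of a long-lived client stopper and a per-store stopper
// that may be replaced each time the store is restarted.
type mergedStopper struct {
	mu     sync.Mutex
	client <-chan struct{} // closed when the shared client stopper stops
	store  <-chan struct{} // closed when the current store quiesces
}

// setStore installs the stop channel of a freshly restarted store.
func (m *mergedStopper) setStore(ch <-chan struct{}) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.store = ch
}

// done returns a channel that is closed as soon as either source closes.
// It snapshots the current store channel, so callers should ask for a new
// channel after a restart. The helper goroutine lives until one source closes.
func (m *mergedStopper) done() <-chan struct{} {
	m.mu.Lock()
	client, store := m.client, m.store
	m.mu.Unlock()

	out := make(chan struct{})
	go func() {
		defer close(out)
		select {
		case <-client:
		case <-store:
		}
	}()
	return out
}

func main() {
	client := make(chan struct{})
	store1 := make(chan struct{})
	m := &mergedStopper{client: client, store: store1}

	d1 := m.done()
	close(store1) // the first incarnation of the store quiesces
	<-d1
	fmt.Println("first store stopped")

	// The store is restarted with a new stopper; the client stopper is reused.
	m.setStore(make(chan struct{}))
	d2 := m.done()
	close(client)
	<-d2
	fmt.Println("client stopped")
}
```

Whether the real distSenders could simply use the store's stopper instead, as suggested above, is not something this sketch settles.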
Review status: 4 of 5 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending. storage/client_test.go, line 547 [r1] (raw file):
Reviewed 1 of 1 files at r4. storage/client_test.go, line 486 [r4] (raw file):
this pattern seems valid only when
Comments from Reviewable
Review status: 4 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending. storage/client_test.go, line 486 [r4] (raw file):
Reviewed 1 of 1 files at r5.
Comments from Reviewable
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks pending. storage/client_test.go, line 486 [r4] (raw file):
This was a tough one. Several problems were addressed, all variations on the
same theme:

- DistSenders in multiTestContext use a shared global stopper, but they
  may be called on goroutines which belong to a Store-level task. If that
  Store wanted to quiesce and the DistSender couldn't finish its task because
  that same Store was already quiescing, deadlocks occurred.
  The unfortunate solution is plugging in a channel which draws from two
  Stoppers, one of which may be quiesced and replaced multiple times.
- Additional deadlocks were caused by multiTestContext's transport,
  which acquired a read lock that was formerly held in write mode throughout
  mtc.stopStore() (circumvented by dropping the lock there while quiescing).
- verifyStats was stopping individual Stores to perform computations without
  moving parts. Stopping individual Stores is tough when their tasks may be
  stuck on other Stores but can't complete while their own Store is already
  quiescing. Instead, verifyStats stops *all stores* simultaneously, regardless
  of which Store is actively being investigated.

Prior to these changes, the test failed within a few hundred to a few thousand
iterations (depending on how many of the above were partially addressed). With
the changes applied:

```
$ make stressrace PKG=./storage TESTS=TestStoreMetrics TESTTIMEOUT=10s STRESSFLAGS='-maxfails 1 -stderr -p 128 -timeout 15m'
15784 runs so far, 0 failures, over 8m0s
```

Fixes cockroachdb#7678.
Review status: 4 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending. storage/client_test.go, line 486 [r4] (raw file):