Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: deflake TestStoreMetrics #7809

Merged
merged 4 commits into from
Jul 14, 2016
Merged

Conversation

tbg
Copy link
Member

@tbg tbg commented Jul 13, 2016

This was a tough one. Several problems were addressed, all variations on the
same theme:

  • DistSenders in multiTestContext use a shared global stopper, but they
    may be called on goroutines which belong to a Store-level task. If that
    Store wants to quiesce and the DistSender can't finish its task because
    that same Store is already in quiescing mode, deadlocks occurred.
    The unfortunate solution is plugging in a channel which draws from two
    Stoppers, one of which may be quiesced and replaced multiple times.
  • Additional deadlocks were caused due to multiTestContext's transport,
    which acquired a read lock that was formerly held in write mode throughout
    mtc.stopStore() (circumvented by dropping the lock there while quiescing).
  • verifyStats was stopping individual Stores to perform computations without
    moving parts. Stopping individual Stores is tough when their tasks may be
    stuck on other Stores but can't complete while their own Store is already
    quiescing. Instead, verifyStats stops all stores simultaneously, regardless
    of which Store is actively being investigated.

Prior to these changes, failed in a few hundred to a few thousand iters
(depending on how many of the above were partially addressed):

$ make stressrace PKG=./storage TESTS=TestStoreMetrics TESTTIMEOUT=10s STRESSFLAGS='-maxfails 1 -stderr -p 128 -timeout 15m'
15784 runs so far, 0 failures, over 8m0s

Fixes #7678.


This change is Reviewable

@tamird
Copy link
Contributor

tamird commented Jul 13, 2016

Partially reviewed, will resume later.


Reviewed 4 of 5 files at r1.
Review status: 4 of 5 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/client_metrics_test.go, line 82 [r1] (raw file):

  // least when bringing down the nodes jeopardizes majorities.
  for i := 0; i < numStores; i++ {
      go func(i int) {

danger zone


storage/client_metrics_test.go, line 102 [r1] (raw file):

      // Sanity regression check for bug #4624: ensure intent count is zero.
      if a := realStats.IntentCount; a != 0 {
          fatalf("Expected intent count to be zero, was %d", a)

since you're here, uncapitalize these


storage/client_metrics_test.go, line 132 [r1] (raw file):

  for i := 0; i < numStores; i++ {
      // Restart the store at the provided index.

this comment is truly worthless


storage/client_test.go, line 544 [r1] (raw file):

  if len(m.dbs) <= idx {
      retryOpts := base.DefaultRetryOptions()
      retryOpts.Closer = m.clientStopper.ShouldQuiesce()

remove?


Comments from Reviewable

@tbg tbg force-pushed the fix-store-metrics branch from 46af5e3 to 8f31839 Compare July 13, 2016 21:15
@tbg
Copy link
Member Author

tbg commented Jul 13, 2016

Review status: 2 of 5 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


storage/client_metrics_test.go, line 82 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

danger zone

?

storage/client_metrics_test.go, line 102 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

since you're here, uncapitalize these

Done.

storage/client_metrics_test.go, line 132 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this comment is truly worthless

Done.

storage/client_test.go, line 544 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

remove?

Done.

Comments from Reviewable

@tamird
Copy link
Contributor

tamird commented Jul 13, 2016

Reviewed 1 of 5 files at r1, 2 of 3 files at r2.
Review status: 4 of 5 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.


storage/client_metrics_test.go, line 82 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

?

Nothing, this just looks like a race condition to me, but I guess that's what we're dealing with here.

storage/client_test.go, line 547 [r1] (raw file):

      retryOpts.Closer = func() chan struct{} {
          ch := make(chan struct{})
          // Feed the channel periodically as long as our "real" stopper

this comment doesn't help me understand why it's ok to "feed" the closer channel - this is not a semantic we've ever used before, and it's not clear what's being achieved by doing this (there's no way to control who's on the other side of this "feeding").


storage/client_test.go, line 563 [r1] (raw file):

                      return true
                  }
                  return false

nit: more readable in the ch <- struct{}{} case


storage/client_test.go, line 571 [r1] (raw file):

                  for grabbedStopper != nil {
                      m.mu.RLock()
                      if len(m.stoppers) <= idx {

how come this check is needed?


storage/client_test.go, line 683 [r1] (raw file):

func (m *multiTestContext) stopStore(i int) {
  m.mu.Lock()
  // Stopping when multiple stoppers (which are not aware of each other) is

grammar


storage/client_test.go, line 688 [r1] (raw file):

  // it's already in a task, so if we simply grabbed a write lock here while
  // stopping we could deadlock (see #7678).
  // So we initiate stopping under a write lock, and then release the lock

So we initiate quiescing under a write lock, and then release the lock during stopping.


storage/client_test.go, line 692 [r1] (raw file):

  stopper := m.stoppers[i]
  m.stoppers[i] = nil
  go func() {

go stopper.Quiesce()?


storage/client_test.go, line 700 [r1] (raw file):

  m.mu.Lock()
  defer m.mu.Unlock()

nit: this defer looks awkward now, i'd just move it down


Comments from Reviewable

@bdarnell
Copy link
Contributor

Reviewed 2 of 5 files at r1, 1 of 3 files at r2.
Review status: 4 of 5 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


storage/client_test.go, line 544 [r2] (raw file):

  if len(m.dbs) <= idx {
      retryOpts := base.DefaultRetryOptions()
      retryOpts.Closer = func() chan struct{} {

clientStopper is kind of an anachronism - prior to #4926 there was only one distSender so it got its own stopper. Now that each store gets its own distSender, those distSenders should probably use the store's stopper as well. I'm not sure what pulling on that thread would do, but it might be a way around this extra channel and goroutine.


Comments from Reviewable

@tbg tbg force-pushed the fix-store-metrics branch from 8f31839 to 74ceb67 Compare July 14, 2016 07:22
@tbg
Copy link
Member Author

tbg commented Jul 14, 2016

Review status: 4 of 5 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


storage/client_test.go, line 547 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this comment doesn't help me understand why it's ok to "feed" the closer channel - this is not a semantic we've ever used before, and it's not clear what's being achieved by doing this (there's no way to control who's on the other side of this "feeding").

This simulates a channel which is closed when either of two parent channels is closed (or nil). Horrible stuff and I hope that with Ben's comment above addressed, everything else remains stable and we can forget about this part.

storage/client_test.go, line 688 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

So we initiate quiescing under a write lock, and then release the lock during stopping.

Done.

storage/client_test.go, line 692 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

go stopper.Quiesce()?

Done.

storage/client_test.go, line 700 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: this defer looks awkward now, i'd just move it down

`RemoveStore` could panic (or at least I don't want to assume it can't).

storage/client_test.go, line 544 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

clientStopper is kind of an anachronism - prior to #4926 there was only one distSender so it got its own stopper. Now that each store gets its own distSender, those distSenders should probably use the store's stopper as well. I'm not sure what pulling on that thread would do, but it might be a way around this extra channel and goroutine.

Good call, PTAL.

Comments from Reviewable

@tbg tbg force-pushed the fix-store-metrics branch from 74ceb67 to 2d5c500 Compare July 14, 2016 08:19
@tamird
Copy link
Contributor

tamird commented Jul 14, 2016

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


storage/client_test.go, line 486 [r4] (raw file):

func (m *multiTestContext) populateDB(idx int, stopper *stop.Stopper) {
  if len(m.dbs) <= idx {

this pattern seems valid only when len(m.dbs) == idx, and yet it's peppered throughout this file =/


Comments from Reviewable

@tbg
Copy link
Member Author

tbg commented Jul 14, 2016

Review status: 4 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/client_test.go, line 486 [r4] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this pattern seems valid only when len(m.dbs) == idx, and yet it's peppered throughout this file =/

PTAL

Comments from Reviewable

@tamird
Copy link
Contributor

tamird commented Jul 14, 2016

:lgtm:


Reviewed 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

@tbg
Copy link
Member Author

tbg commented Jul 14, 2016

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/client_test.go, line 486 [r4] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

PTAL

Have to look at this again actually. All of this opportunistic appending was done for a reason as some tests inject some of the items prior to the test starting (store context, clocks, ...). I'm going to put in some assertions that make sure we only override what can be overridden.

Comments from Reviewable

tbg added 4 commits July 14, 2016 06:10
This was a tough one. Several problems were addressed, all variations on the
same theme:

- DistSenders in multiTestContext use a shared global stopper, but they
  may be called on goroutines which belong to a Store-level task. If that
  Store wants to quiesce and the DistSender can't finish its task because
  that same Store is already in quiescing mode, deadlocks occurred.
  The unfortunate solution is plugging in a channel which draws from two
  Stoppers, one of which may be quiesced and replaced multiple times.
- Additional deadlocks were caused due to multiTestContext's transport,
  which acquired a read lock that was formerly held in write mode throughout
  mtc.stopStore() (circumvented by dropping the lock there while quiescing).
- verifyStats was stopping individual Stores to perform computations without
  moving parts. Stopping individual Stores is tough when their tasks may be
  stuck on other Stores but can't complete while their own Store is already
  quiescing. Instead, verifyStats stops *all stores* simultaneously, regardless
  of which Store is actively being investigated.

Prior to these changes, failed in a few hundred to a few thousand iters
(depending on how many of the above were partially addressed):

```
$ make stressrace PKG=./storage TESTS=TestStoreMetrics TESTTIMEOUT=10s STRESSFLAGS='-maxfails 1 -stderr -p 128 -timeout 15m'
15784 runs so far, 0 failures, over 8m0s
```

Fixes cockroachdb#7678.
@tbg tbg force-pushed the fix-store-metrics branch from babf2d2 to da75979 Compare July 14, 2016 10:11
@tbg
Copy link
Member Author

tbg commented Jul 14, 2016

Review status: 4 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/client_test.go, line 486 [r4] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Have to look at this again actually. All of this opportunistic appending was done for a reason as some tests inject some of the items prior to the test starting (store context, clocks, ...).
I'm going to put in some assertions that make sure we only override what can be overridden.

Ok, good to go now with last commit.

Comments from Reviewable

@tbg tbg merged commit 34fc712 into cockroachdb:master Jul 14, 2016
@tbg tbg deleted the fix-store-metrics branch July 14, 2016 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants