
storage: deflake TestRaftRemoveRace #16399

Merged

Conversation

irfansharif
Contributor

Fixes #16376.

Under stress TestRaftRemoveRace failed with:

--- FAIL: TestRaftRemoveRace (1.45s)
  client_test.go:1031: change replicas of r1 failed: quota pool no longer in use

Consider the following:

  • 'add replica' commands get queued up on the replicate queue
  • the leader replica steps down as leader due to timeouts, thus closing
    the quota pool
  • commands come off the queue but cannot acquire quota because the quota
    pool is closed, and so fail with an error saying as much (see the sketch
    below)

TestRaftRemoveRace fails as it expects all replica additions to go
through without failure.
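
For context, the following is a minimal, hypothetical sketch of a closable
quota pool (the names quotaPool, acquire, and close are illustrative
assumptions; the actual pkg/storage implementation differs). It only shows
why a command that reaches the pool after the leader has stepped down fails
with the error above.

package main

import (
    "errors"
    "fmt"
    "sync"
)

var errQuotaPoolClosed = errors.New("quota pool no longer in use")

// quotaPool is an illustrative stand-in for the proposal quota pool.
type quotaPool struct {
    mu     sync.Mutex
    quota  int64
    closed bool
}

// close marks the pool as unusable, e.g. when the raft leader steps down.
func (qp *quotaPool) close() {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    qp.closed = true
}

// acquire takes n units of quota, failing outright if the pool was closed.
// (The actual pool handles waiting for quota; this sketch only models the
// closed case.)
func (qp *quotaPool) acquire(n int64) error {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    if qp.closed {
        return errQuotaPoolClosed
    }
    if qp.quota < n {
        return errors.New("insufficient quota")
    }
    qp.quota -= n
    return nil
}

func main() {
    qp := &quotaPool{quota: 1 << 20}
    qp.close()                       // leader steps down
    fmt.Println(qp.acquire(1 << 10)) // prints: quota pool no longer in use
}

Under this model, any 'add replica' command still sitting on the replicate
queue when the leader steps down observes the closed pool and returns the
error seen in the test failure.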

To reproduce this, the following minimal test (run under stress) is
sufficient; we lower RaftTickInterval and RaftElectionTimeoutTicks to make
leadership changes more likely.

func TestReplicateQueueQuota(t *testing.T) {
    defer leaktest.AfterTest(t)()
    sc := storage.TestStoreConfig(nil)
    sc.RaftElectionTimeoutTicks = 2             // Default: 15
    sc.RaftTickInterval = 10 * time.Millisecond // Default: 200ms
    mtc := &multiTestContext{storeConfig: &sc}
    defer mtc.Stop()
    mtc.Start(t, 3)

    const rangeID = roachpb.RangeID(1)
    mtc.replicateRange(rangeID, 1, 2)

    for i := 0; i < 10; i++ {
        mtc.unreplicateRange(rangeID, 2)
        mtc.replicateRange(rangeID, 2)
    }
}

The earlier version of TestRaftRemoveRace was written to reproduce the
failure seen in #9037; it was agnostic to raft leadership changes.

irfansharif requested a review from bdarnell June 7, 2017 22:32

@petermattis
Collaborator

:lgtm:

Before the quota pool additions this wasn't a problem because we'd propose the command despite the replica not being the leader?



@irfansharif
Contributor Author

yup, TFTR.



irfansharif merged commit 5bef6b6 into cockroachdb:master Jun 8, 2017
irfansharif deleted the TestRaftRemoveRace-flakiness branch June 8, 2017 14:52
@bdarnell
Contributor

bdarnell commented Jun 8, 2017

We require that raft commands are proposed on the lease holder, not the raft leader (when the two differ, the lease holder forwards the proposal to the leader). That's what was happening in the test: the ChangeReplicas transaction was always sent to the lease holder. I think this test was uncovering an actual bug in the quota pool.
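
To make that routing concrete, here is a purely conceptual Go sketch (the
replica type, its fields, and propose are hypothetical stand-ins, not
CockroachDB's actual code): commands enter on the lease holder, and raft
forwards the proposal when that replica is not also the raft leader.

package main

import "fmt"

// replica is a hypothetical stand-in, not CockroachDB's Replica type; it
// only illustrates where a command enters the system.
type replica struct {
    isLeaseHolder bool
    isRaftLeader  bool
}

// propose models the routing described above: commands are proposed on the
// lease holder, and if that replica is not currently the raft leader, the
// raft layer forwards the proposal to the leader.
func (r *replica) propose(cmd string) string {
    if !r.isLeaseHolder {
        return "redirect: not the lease holder"
    }
    // Quota would be acquired here, on the lease holder, regardless of
    // whether this replica also happens to be the raft leader.
    if !r.isRaftLeader {
        return "proposed on lease holder; raft forwards to the leader"
    }
    return "proposed on co-located lease holder and raft leader"
}

func main() {
    r := &replica{isLeaseHolder: true, isRaftLeader: false}
    fmt.Println(r.propose("change-replicas"))
}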



@irfansharif
Contributor Author

Re-opened #16376 while I think about this some more.



irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jun 8, 2017
(actually) Fixes cockroachdb#16376, reverts cockroachdb#16399.

TestRaftRemoveRace touched the short window of time where it was
possible that the lease holder and the raft leader were not the same
replica (raft leadership could change from under us, but the lease
holder stayed steady).

Consider the following sequence of events:
- the lease holder and the raft leader are co-located
- 'add replica' commands get queued up on the replicate queue
- the leader replica steps down as leader, thus closing the quota pool (on
  the lease holder, because they are one and the same)
- commands come off the queue but cannot acquire quota because the quota
  pool is closed (on the lease holder), and so fail with an error saying as
  much

We make two observations:
- quotaPool.close() only takes place when a raft leader becomes a follower,
  which fails all ongoing acquisitions
- ongoing acquisitions only take place on the lease holder replica

The quota pool was implemented in a manner such that it is effectively
disabled when the lease holder and the raft leader are not co-located.
Failing with an error here (now that the raft leader has changed, the lease
holder and raft leader are no longer co-located) runs contrary to this. What
we really want is to "fail open" in this case instead, i.e. allow the
acquisition to proceed as if the quota pool were effectively disabled.
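
Building on the hypothetical quotaPool sketch earlier in this thread (again,
an illustration rather than the actual pkg/storage change), the "fail open"
idea amounts to treating a closed pool as a disabled one:

// acquireFailOpen sketches the "fail open" behaviour on the hypothetical
// quotaPool from the earlier sketch: if the pool was closed because the raft
// leader stepped down, treat it as disabled and let the acquisition through
// instead of returning an error.
func (qp *quotaPool) acquireFailOpen(n int64) error {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    if qp.closed {
        // The lease holder and raft leader are no longer co-located; the
        // quota pool is effectively disabled, so allow the proposal through.
        return nil
    }
    if qp.quota < n {
        return errors.New("insufficient quota")
    }
    qp.quota -= n
    return nil
}

Under this approach, queued replica additions proceed once raft leadership
moves away, rather than erroring out on the lease holder.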
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jun 9, 2017