storage: TestRaftRemoveRace failed under stress #16376
Fixes cockroachdb#16376.

Under stress TestRaftRemoveRace failed with:

```
--- FAIL: TestRaftRemoveRace (1.45s)
    client_test.go:1031: change replicas of r1 failed: quota pool no longer in use
```

Consider the following:

- 'add replica' commands get queued up on the replicate queue
- the leader replica steps down as leader due to timeouts, thus closing the quota pool
- commands get out of the queue, cannot acquire quota because the quota pool is closed, and fail with an error indicating so

TestRaftRemoveRace fails as it expects all replica additions to go through without failure. To reproduce this, the following minimal test (run under stress) is sufficient; we lower RaftTickInterval and RaftElectionTimeoutTicks to make it more likely that leadership changes take place.

```go
func TestReplicateQueueQuota(t *testing.T) {
	defer leaktest.AfterTest(t)()
	sc := storage.TestStoreConfig(nil)
	sc.RaftElectionTimeoutTicks = 2             // Default: 15
	sc.RaftTickInterval = 10 * time.Millisecond // Default: 200ms
	mtc := &multiTestContext{storeConfig: &sc}
	defer mtc.Stop()
	mtc.Start(t, 3)

	const rangeID = roachpb.RangeID(1)
	mtc.replicateRange(rangeID, 1, 2)
	for i := 0; i < 10; i++ {
		mtc.unreplicateRange(rangeID, 2)
		mtc.replicateRange(rangeID, 2)
	}
}
```

The earlier version of TestRaftRemoveRace was written to reproduce the failure seen in cockroachdb#9037; it was agnostic to raft leadership changes.
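To make the failure mode concrete, here is a minimal, hypothetical Go sketch of a quota pool whose `close()` fails any ongoing or subsequent acquisition with the same kind of error. It is not CockroachDB's actual `quotaPool` implementation; all names and signatures below are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errClosed mirrors the "quota pool no longer in use" failure described above.
var errClosed = errors.New("quota pool no longer in use")

// quotaPool is a hypothetical, simplified quota pool: a counter guarded by a
// mutex, with a condition variable to wake up blocked acquisitions.
type quotaPool struct {
	mu     sync.Mutex
	cond   *sync.Cond
	quota  int64
	closed bool
}

func newQuotaPool(q int64) *quotaPool {
	qp := &quotaPool{quota: q}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

// acquire blocks until v units of quota are available, or fails once the pool
// has been closed (e.g. because the raft leader stepped down).
func (qp *quotaPool) acquire(v int64) error {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	for qp.quota < v {
		if qp.closed {
			return errClosed
		}
		qp.cond.Wait()
	}
	if qp.closed {
		return errClosed
	}
	qp.quota -= v
	return nil
}

// release returns quota to the pool and wakes up blocked acquisitions.
func (qp *quotaPool) release(v int64) {
	qp.mu.Lock()
	qp.quota += v
	qp.mu.Unlock()
	qp.cond.Broadcast()
}

// close fails all ongoing and future acquisitions.
func (qp *quotaPool) close() {
	qp.mu.Lock()
	qp.closed = true
	qp.mu.Unlock()
	qp.cond.Broadcast()
}

func main() {
	qp := newQuotaPool(1)
	qp.close()
	// A queued 'add replica' command trying to acquire quota now fails,
	// which is what TestRaftRemoveRace tripped over.
	fmt.Println(qp.acquire(1)) // quota pool no longer in use
}
```

With a pool like this, once the leader steps down and `close()` runs, every command still waiting in the replicate queue gets the "quota pool no longer in use" error shown in the test failure above.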
If I've understood correctly, this touches the short window of time where it's possible for the lease holder and the raft leader to not be the same replica.
Re-opening while I think about this some more. At a first glance it comes down to this comment added in #16399:

```go
// Raft may propose commands itself (specifically the empty commands when
// leadership changes), and these commands don't go through the code paths where
// we acquire quota from the pool. To offset this we reset the quota pool whenever
// leadership changes hands.
```
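Continuing the hypothetical sketch above (again, not the actual CockroachDB code), the idea described in that comment roughly corresponds to recreating the pool when leadership is gained and closing it when leadership is lost:

```go
// replicaState is a hypothetical stand-in for the per-replica state that owns
// the quota pool; it is not the real Replica struct.
type replicaState struct {
	maxQuota  int64
	quotaPool *quotaPool
}

// onRaftLeadershipChange sketches the (since reverted) idea from the comment
// above: reset quota accounting whenever raft leadership changes hands, so
// that raft-internal proposals (e.g. the empty entry proposed on a leadership
// change) don't leave the pool permanently under-counted.
func (r *replicaState) onRaftLeadershipChange(nowLeader bool) {
	if nowLeader {
		// Gaining leadership: start over with a fresh, full pool.
		r.quotaPool = newQuotaPool(r.maxQuota)
		return
	}
	// Losing leadership: close the pool, which fails any ongoing
	// acquisitions -- the behavior TestRaftRemoveRace then ran into on the
	// lease holder.
	r.quotaPool.close()
}
```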
From #16399:

> Yes. As I said when the quota pool was introduced (#15802 (comment)), it's fine if the quota pool mechanism is effectively disabled when the leader and lease holder are not the same. We want to "fail open" in this case instead of returning an error.
(actually) Fixes cockroachdb#16376, reverts cockroachdb#16399.

TestRaftRemoveRace touched the short window of time where it was possible that the lease holder and the raft leader were not the same replica (raft leadership could change from under us, but the lease holder stayed steady). Consider the following sequence of events:

- the lease holder and the raft leader are co-located
- 'add replica' commands get queued up on the replicate queue
- the leader replica steps down as leader, thus closing the quota pool (on the lease holder, because they're one and the same)
- commands get out of the queue, cannot acquire quota because the quota pool is closed (on the lease holder), and fail with an error indicating so

We make two observations:

- quotaPool.close() only takes place when a raft leader is becoming a follower, thus causing all ongoing acquisitions to fail
- ongoing acquisitions only take place on the lease holder replica

The quota pool was implemented in a manner such that it is effectively disabled when the lease holder and the raft leader are not co-located. Failing with an error here (now that the raft leader has changed, the lease holder and raft leader are no longer co-located) runs contrary to this. What we really want is to "fail open" in this case, i.e. allow the acquisition to proceed as if the quota pool were effectively disabled.
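As a hypothetical illustration of the "fail open" behavior described above (continuing the earlier sketch, not the actual fix), a closed pool would let acquisitions proceed instead of returning an error:

```go
// acquireFailOpen is a variant of acquire from the sketch above. Once the
// pool has been closed (raft leadership has moved elsewhere, so the lease
// holder and raft leader are no longer co-located), the acquisition proceeds
// as if the quota pool were disabled rather than returning an error.
func (qp *quotaPool) acquireFailOpen(v int64) error {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	for qp.quota < v {
		if qp.closed {
			return nil // fail open: no quota enforcement without co-location
		}
		qp.cond.Wait()
	}
	if qp.closed {
		return nil // fail open
	}
	qp.quota -= v
	return nil
}
```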
SHA: https://github.com/cockroachdb/cockroach/commits/e1ec2fd44f7cd23c1835c61e20070d64d2bf7f9c
Parameters:
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=266258&tab=buildLog