Skip to content

Commit

Permalink
kv: deflake TestInitRaftGroupOnRequest
Browse files Browse the repository at this point in the history
Fixes #42808.
Fixes #44146.
Fixes #47020.
Fixes #47551.
Fixes #47231.

Disable async intent resolution. This can lead to flakiness in the test
because it allows for the intents written by the split transaction to be
resolved at any time, including after the nodes are restarted. The intent
resolution on the RHS's local range descriptor intent can both wake up
the RHS range's Raft group and result in the wrong replica acquiring the
lease.

I was always seeing this in conjunction with the log line:
```
kv/kvserver/intentresolver/intent_resolver.go:746  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/50/RangeDescriptor: node unavailable; try another peer
```

Before the fix, the test failed almost immediately when stressed on a roachprod
cluster. After, I've never seen it flake:
```
576962 runs so far, 0 failures, over 19m35s
```

I think this may have gotten more flaky after we began batching intent
resolution, as this batching also introduced a delay to the async task.

I'll backport this to the past few release branches.
  • Loading branch information
nvanbenschoten committed Apr 18, 2020
1 parent 67b3242 commit ca2a4f1
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions pkg/storage/client_raft_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -3932,6 +3932,12 @@ func TestInitRaftGroupOnRequest(t *testing.T) {
defer leaktest.AfterTest(t)()
storeCfg := storage.TestStoreConfig(nil /* clock */)
storeCfg.TestingKnobs.DisableMergeQueue = true
// Disable async intent resolution. This can lead to flakiness in the test
// because it allows for the intents written by the split transaction to be
// resolved at any time, including after the nodes are restarted. The intent
// resolution on the RHS's local range descriptor can both wake up the RHS
// range's Raft group and result in the wrong replica acquiring the lease.
storeCfg.TestingKnobs.IntentResolverKnobs.DisableAsyncIntentResolution = true
mtc := &multiTestContext{
storeConfig: &storeCfg,
// TODO(andrei): This test was written before multiTestContexts started with
Expand Down Expand Up @@ -3963,6 +3969,8 @@ func TestInitRaftGroupOnRequest(t *testing.T) {
mtc.restart()

// Get replica from the store which isn't the leaseholder.
// NOTE: StoreID is 1-indexed and storeIdx is 0-indexed, so despite what
// this might look like, this is grabbing the replica without the lease.
storeIdx := int(lease.Replica.StoreID) % len(mtc.stores)
if repl = mtc.stores[storeIdx].LookupReplica(roachpb.RKey(splitKey)); repl == nil {
t.Fatal("replica should not be nil for RHS range")
Expand Down

0 comments on commit ca2a4f1

Please sign in to comment.