release-20.1: kv: deflake TestInitRaftGroupOnRequest #47664

Merged

nvanbenschoten

Backport 1/1 commits from #47625.

/cc @cockroachdb/release


Fixes #42808.
Fixes #44146.
Fixes #47020.
Fixes #47551.
Fixes #47231.

Disable async intent resolution. Async resolution can lead to flakiness in
the test because it allows the intents written by the split transaction to be
resolved at any time, including after the nodes are restarted. Resolving the
intent on the RHS's local range descriptor can both wake up the RHS range's
Raft group and result in the wrong replica acquiring the lease.
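
For readers unfamiliar with this kind of fix, here is a minimal, self-contained sketch of the general pattern: a testing knob that stops background work from being scheduled, so nothing can fire at an unpredictable point later in a test. It is not the actual patch and not CockroachDB's API. The names `resolverKnobs`, `disableAsync`, `resolver`, and `resolveAsync` are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// resolverKnobs is a hypothetical stand-in for a set of testing knobs.
type resolverKnobs struct {
	// disableAsync, when true, drops background resolution requests.
	disableAsync bool
}

// resolver is a toy intent resolver, not CockroachDB's IntentResolver.
type resolver struct {
	knobs    resolverKnobs
	wg       sync.WaitGroup
	mu       sync.Mutex
	resolved []string
}

// resolveAsync schedules resolution of key in the background unless the
// knob disables it. It reports whether a background task was started.
func (r *resolver) resolveAsync(key string) bool {
	if r.knobs.disableAsync {
		return false
	}
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		r.mu.Lock()
		defer r.mu.Unlock()
		r.resolved = append(r.resolved, key)
	}()
	return true
}

// wait blocks until all background tasks finish and returns a copy of the
// keys resolved so far.
func (r *resolver) wait() []string {
	r.wg.Wait()
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]string(nil), r.resolved...)
}

func main() {
	for _, disable := range []bool{false, true} {
		r := &resolver{knobs: resolverKnobs{disableAsync: disable}}
		started := r.resolveAsync("range-descriptor")
		fmt.Printf("disableAsync=%t started=%t resolved=%d\n",
			disable, started, len(r.wait()))
	}
}
```

With the knob set, no goroutine is ever started, so there is no late side effect left to race with a restart. That is the property the test relies on here.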

I was always seeing this in conjunction with the log line:

kv/kvserver/intentresolver/intent_resolver.go:746  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/50/RangeDescriptor: node unavailable; try another peer

Before the fix, the test failed almost immediately when stressed on a roachprod
cluster. After the fix, I have not seen it flake:

576962 runs so far, 0 failures, over 19m35s

I think this may have gotten more flaky after we began batching intent
resolution, as this batching also introduced a delay to the async task.

I'll backport this to the past few release branches.

@nvanbenschoten nvanbenschoten requested a review from tbg April 18, 2020 19:09
@cockroach-teamcity
This change is Reviewable

@nvanbenschoten nvanbenschoten merged commit 56b3d25 into cockroachdb:release-20.1 Apr 18, 2020
@nvanbenschoten nvanbenschoten deleted the backport20.1-47625 branch April 23, 2020 02:01