kv: deflake TestInitRaftGroupOnRequest #47625

Conversation

nvanbenschoten
Member

Fixes #42808.
Fixes #44146.
Fixes #47020.
Fixes #47551.
Fixes #47231.

Disable async intent resolution in the test. Async intent resolution can lead
to flakiness because it allows the intents written by the split transaction
to be resolved at any time, including after the nodes are restarted. Resolving
the intent on the RHS's local range descriptor can both wake up the RHS
range's Raft group and result in the wrong replica acquiring the lease.
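
To make the race concrete, here is a minimal, self-contained toy model of it in Go. It is only a sketch and does not use CockroachDB's actual types: the `knobs` and `store` types, the `DisableAsyncIntentResolution` field name, and all helper functions are hypothetical stand-ins chosen to mirror the wording above. It also simplifies by letting queued work survive a restart, which stands in for a resolution request arriving after the nodes come back up.

```go
package main

import "fmt"

// knobs is a hypothetical stand-in for a store's testing knobs.
// It is not CockroachDB's actual type.
type knobs struct {
	DisableAsyncIntentResolution bool
}

// store is a toy model of a node that queues intent-resolution work.
type store struct {
	knobs   knobs
	pending []string // intents awaiting asynchronous resolution
	woken   []string // ranges whose Raft group has been woken up
}

// commitSplit models the split transaction committing. It leaves an intent on
// the RHS range descriptor, which is queued for async resolution unless the
// knob is set.
func (s *store) commitSplit(rhs string) {
	if s.knobs.DisableAsyncIntentResolution {
		return
	}
	s.pending = append(s.pending, rhs)
}

// restart models restarting the node: all Raft groups start out dormant.
func (s *store) restart() {
	s.woken = nil
}

// runAsyncTasks models the delayed async task firing at an arbitrary later
// time. Resolving an intent touches the range and wakes its Raft group.
func (s *store) runAsyncTasks() {
	s.woken = append(s.woken, s.pending...)
	s.pending = nil
}

// wokenAfterRestart runs the split, the restart, and the late async task in
// that order, and returns the ranges that were woken up unexpectedly.
func wokenAfterRestart(disable bool) []string {
	s := &store{knobs: knobs{DisableAsyncIntentResolution: disable}}
	s.commitSplit("rhs")
	s.restart()
	s.runAsyncTasks()
	return s.woken
}

func main() {
	fmt.Println("knob off:", wokenAfterRestart(false))
	fmt.Println("knob on: ", wokenAfterRestart(true))
}
```

With the knob off, the late task wakes the RHS range after the restart, which is the situation the test does not expect. With the knob on, nothing is queued, so the RHS stays dormant until the test itself sends a request.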

I was always seeing this in conjunction with the log line:

```
kv/kvserver/intentresolver/intent_resolver.go:746  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/50/RangeDescriptor: node unavailable; try another peer
```

Before the fix, the test failed almost immediately when stressed on a roachprod
cluster. After, I've never seen it flake:

```
576962 runs so far, 0 failures, over 19m35s
```

I think this may have gotten more flaky after we began batching intent
resolution, as this batching also introduced a delay to the async task.

I'll backport this to the past few release branches.

@nvanbenschoten nvanbenschoten requested a review from tbg April 17, 2020 20:38
@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Apr 17, 2020

Build failed

@nvanbenschoten
Member Author

#45337 flaked.

bors r+

@craig
Contributor

craig bot commented Apr 18, 2020

Build succeeded

@craig craig bot merged commit 25bcf0b into cockroachdb:master Apr 18, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/deflakeTestInitRaftGroupOnRequest branch April 23, 2020 02:01