kv/kvserver: TestInitRaftGroupOnRequest failed #47551

Closed
cockroach-teamcity opened this issue Apr 16, 2020 · 1 comment · Fixed by #47625
@cockroach-teamcity (Member)

(kv/kvserver).TestInitRaftGroupOnRequest failed on provisional_202004152236_v20.1.0-rc.2@778550737c285409cc1d9c6d96584fc985071c79:

=== RUN   TestInitRaftGroupOnRequest
I200416 00:41:42.532617 201874 gossip/gossip.go:395  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:43921" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 cluster_name:"" sql_address:<network_field:"" address_field:"" > 
W200416 00:41:42.664153 201874 gossip/gossip.go:1513  [n2] no incoming or outgoing connections
I200416 00:41:42.667307 201874 gossip/gossip.go:395  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:42651" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 cluster_name:"" sql_address:<network_field:"" address_field:"" > 
I200416 00:41:42.680184 201960 gossip/client.go:124  [n2] started gossip client to 127.0.0.1:43921
I200416 00:41:42.798894 201874 kv/kvserver/replica_command.go:397  [s1,r1/1:/M{in-ax}] initiating a split of this range at key /Table/50 [r2] (manual)
I200416 00:41:42.876444 201874 kv/kvserver/replica_command.go:1602  [s1,r2/1:/{Table/50-Max}] change replicas (add [(n2,s2):2LEARNER] remove []): existing descriptor r2:/{Table/50-Max} [(n1,s1):1, next=2, gen=1]
I200416 00:41:42.902066 201874 kv/kvserver/replica_raft.go:247  [s1,r2/1:/{Table/50-Max},txn=731cbfae] proposing SIMPLE(l2) ADD_REPLICA[(n2,s2):2LEARNER]: after=[(n1,s1):1 (n2,s2):2LEARNER] next=3
I200416 00:41:42.955133 201874 kv/kvserver/store_snapshot.go:932  [s1,r2/1:/{Table/50-Max}] sending LEARNER snapshot 51314550 at applied index 14
I200416 00:41:42.957069 201874 kv/kvserver/store_snapshot.go:975  [s1,r2/1:/{Table/50-Max}] streamed snapshot to (n2,s2):2LEARNER: kv pairs: 6, log entries: 0, rate-limit: 8.0 MiB/sec, 0.02s
I200416 00:41:42.964347 202234 kv/kvserver/replica_raftstorage.go:780  [s2,r2/2:{-}] applying LEARNER snapshot [id=51314550 index=14]
I200416 00:41:42.980702 202211 kv/kvserver/raft_snapshot_queue.go:126  [raftsnapshot,s1,r2/1:/{Table/50-Max}] skipping snapshot; replica is likely a learner in the process of being added: (n2,s2):2LEARNER
I200416 00:41:42.987315 202234 kv/kvserver/replica_raftstorage.go:801  [s2,r2/2:/{Table/50-Max}] applied LEARNER snapshot [total=22ms ingestion=4@18ms id=51314550 index=14]
I200416 00:41:43.004059 201874 kv/kvserver/replica_command.go:1602  [s1,r2/1:/{Table/50-Max}] change replicas (add [(n2,s2):2] remove []): existing descriptor r2:/{Table/50-Max} [(n1,s1):1, (n2,s2):2LEARNER, next=3, gen=2]
I200416 00:41:43.060218 201874 kv/kvserver/replica_raft.go:247  [s1,r2/1:/{Table/50-Max},txn=8edbbb7c] proposing SIMPLE(v2) ADD_REPLICA[(n2,s2):2]: after=[(n1,s1):1 (n2,s2):2] next=3
I200416 00:41:43.148540 201874 util/stop/stopper.go:539  quiescing
W200416 00:41:43.163652 202285 kv/kvserver/intentresolver/intent_resolver.go:745  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/50/RangeDescriptor: node unavailable; try another peer
W200416 00:41:43.191006 202039 kv/kvserver/store.go:1541  [s1,r1/1:/{Min-Table/50}] could not gossip first range descriptor: node unavailable; try another peer
I200416 00:41:43.228577 201874 util/stop/stopper.go:539  quiescing
W200416 00:41:43.244155 202198 kv/txn.go:603  [liveness-hb] failure aborting transaction: node unavailable; try another peer; abort caused by: node unavailable; try another peer
W200416 00:41:43.244539 202198 kv/kvserver/node_liveness.go:471  [liveness-hb] failed node liveness heartbeat: node unavailable; try another peer
I200416 00:41:43.708939 202667 util/stop/stopper.go:539  quiescing
I200416 00:41:43.709581 202668 util/stop/stopper.go:539  quiescing
I200416 00:41:43.770157 202666 util/stop/stopper.go:539  quiescing
W200416 00:41:43.776863 202235 kv/kvserver/raft_transport.go:637  while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
W200416 00:41:43.781537 201970 kv/kvserver/raft_transport.go:637  while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W200416 00:41:43.782659 202209 kv/kvserver/raft_transport.go:637  while processing outgoing Raft queue to node 1: EOF:
I200416 00:41:43.799518 202666 util/stop/stopper.go:539  quiescing
I200416 00:41:43.802667 202666 util/stop/stopper.go:539  quiescing
--- FAIL: TestInitRaftGroupOnRequest (1.37s)
    client_raft_test.go:3968: expected NotLeaseHolderError; got <nil>
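
For context on the failure mode: the test expects the request it sends after the restart to be rejected with a NotLeaseHolderError, so a nil error means the replica unexpectedly held (or acquired) the lease and served the request. The snippet below is only an illustrative sketch of that kind of assertion; it uses a stand-in error type rather than CockroachDB's actual roachpb.NotLeaseHolderError and test plumbing.

```go
package main

import (
	"errors"
	"fmt"
)

// notLeaseHolderError is a stand-in for roachpb.NotLeaseHolderError, defined
// here only to keep the sketch self-contained and runnable.
type notLeaseHolderError struct{}

func (notLeaseHolderError) Error() string { return "not lease holder" }

// checkNotLeaseHolder mirrors the shape of the assertion that produced the
// failure above: a nil error (the request succeeded) yields the
// "expected NotLeaseHolderError; got <nil>" message.
func checkNotLeaseHolder(err error) error {
	var nlhe notLeaseHolderError
	if !errors.As(err, &nlhe) {
		return fmt.Errorf("expected NotLeaseHolderError; got %v", err)
	}
	return nil
}

func main() {
	fmt.Println(checkNotLeaseHolder(nil))                   // expected NotLeaseHolderError; got <nil>
	fmt.Println(checkNotLeaseHolder(notLeaseHolderError{})) // <nil>
}
```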


Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestInitRaftGroupOnRequest PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

cockroach-teamcity added the C-test-failure (Broken test, automatically or manually discovered), O-robot (Originated from a bot), and branch-provisional_202004152236_v20.1.0-rc.2 labels on Apr 16, 2020
cockroach-teamcity added this to the 20.1 milestone on Apr 16, 2020
@irfansharif (Contributor)

Duplicate #47231.

tbg assigned tbg and unassigned spencerkimball on Apr 16, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 17, 2020
Fixes cockroachdb#42808.
Fixes cockroachdb#44146.
Fixes cockroachdb#47020.
Fixes cockroachdb#47551.
Fixes cockroachdb#47231.

Disable async intent resolution. Async resolution can lead to flakiness in the
test because it allows the intents written by the split transaction to be
resolved at any time, including after the nodes are restarted. Resolving the
intent on the RHS's local range descriptor can both wake up the RHS range's
Raft group and result in the wrong replica acquiring the lease.

I was always seeing this in conjunction with the log line:
```
kv/kvserver/intentresolver/intent_resolver.go:746  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/50/RangeDescriptor: node unavailable; try another peer
```

Before the fix, the test failed almost immediately when stressed on a roachprod
cluster. After, I've never seen it flake:
```
576962 runs so far, 0 failures, over 19m35s
```

I think this may have gotten more flaky after we began batching intent
resolution, as this batching also introduced a delay to the async task.

I'll backport this to the past few release branches.
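
For readers unfamiliar with how such a fix is usually wired up: CockroachDB tests typically toggle behavior like this through store testing knobs passed to the test server. The sketch below only illustrates the shape of such a knob; the struct and field names (storeTestingKnobs, disableAsyncIntentResolution) are assumptions for illustration, not the actual API used by #47625.

```go
package main

import "fmt"

// storeTestingKnobs is a hypothetical stand-in for the kind of testing-knob
// struct a store test would configure; the real CockroachDB knob and field
// names may differ from what is shown here.
type storeTestingKnobs struct {
	// disableAsyncIntentResolution, when set, keeps intents (such as the split
	// transaction's intent on the RHS range descriptor) from being resolved by
	// a background task, so nothing can wake the RHS Raft group or move the
	// lease behind the test's back after the nodes restart.
	disableAsyncIntentResolution bool
}

func main() {
	knobs := storeTestingKnobs{disableAsyncIntentResolution: true}
	fmt.Printf("starting test store with knobs: %+v\n", knobs)
}
```

With the background task disabled, the timing of intent resolution becomes deterministic from the test's point of view, which is why the flake disappears under stress.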
craig bot pushed a commit that referenced this issue Apr 18, 2020
47625: kv: deflake TestInitRaftGroupOnRequest r=nvanbenschoten a=nvanbenschoten

(PR description identical to the commit message above.)
Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 18, 2020 (same commit message as above)
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 18, 2020 (same commit message as above)