Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: clearrange/checks=true/rangeTs=true failed #104697

Closed
cockroach-teamcity opened this issue Jun 10, 2023 · 3 comments · Fixed by #104699
Closed

roachtest: clearrange/checks=true/rangeTs=true failed #104697

cockroach-teamcity opened this issue Jun 10, 2023 · 3 comments · Fixed by #104699
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team X-duplicate Closed as a duplicate of another issue.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 10, 2023

roachtest.clearrange/checks=true/rangeTs=true failed with artifacts on master @ 61806f42e4c833b863ca9c2f62ce34918f1c8277:

test artifacts and logs in: /artifacts/clearrange/checks=true/rangeTs=true/run_1
(cluster.go:2276).Run: output in run_124127.557601805_n1_cockroach-workload-i: ./cockroach workload init kv returned: context canceled
(monitor.go:137).Wait: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 7)
(test_runner.go:1154).func1: 1 dead node(s) detected

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-28675

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 10, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 10, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label Jun 10, 2023
@irfansharif irfansharif self-assigned this Jun 10, 2023
@irfansharif
Copy link
Contributor

Same as #104696 (comment).

@irfansharif irfansharif removed the T-storage Storage Team label Jun 10, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Jun 10, 2023
craig bot pushed a commit that referenced this issue Jun 10, 2023
104699: kvserver: fix clearrange/* tests r=irfansharif a=irfansharif

Fixes #104696.
Fixes #104697.
Fixes #104698.
Part of #98703.

In 072c16d (added as part of #95637) we re-worked the locking structure around the RaftTransport's per-RPC class level send queues. When new send queues are instantiated or old ones deleted, we now also maintain the kvflowcontrol connection tracker, so such maintenance now needs to happen while holding a kvflowcontrol mutex. When rebasing \#95637 onto master, we accidentally included earlier queue deletion code without holding the appropriate mutex. Queue deletions now happened twice which made it possible to hit a RaftTransport assertion about expecting the right send queue to already exist.

Specifically, the following sequence was possible:
- `(*RaftTransport).SendAsync` is invoked, observes no queue for `<nodeid,class>`, creates it, and tracks it in the queues map.
  - It invokes an async worker W1 to process that send queue through `(*RaftTransport).startProcessNewQueue`. The async worker is responsible for clearing the tracked queue in the queues map once done.
- W1 expects to find the tracked queue in the queues map, finds it, proceeds.
- W1 is done processing. On its way out, W1 clears `<nodeid,class>` from the queues map the first time.
- `(*RaftTransport).SendAsync` is invoked by another goroutine, observes no queue for <nodeid,class>, creates it, and tracks it in the queues map.
  - It invokes an async worker W2 to process that send queue through `(*RaftTransport).startProcessNewQueue`. The async worker is responsible for clearing the tracked queue in the queues map once done.
- W1 blindly clears the `<nodeid,class>` raft send queue the second time.
- W2 expects to find the queue in the queues map, but doesn't, and fatals.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
@craig craig bot closed this as completed in 154b8d2 Jun 10, 2023
@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jun 13, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 13, 2023

cc @cockroachdb/replication

@pav-kv pav-kv added X-duplicate Closed as a duplicate of another issue. T-kv-replication and removed T-kv-replication labels Jun 13, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 13, 2023

cc @cockroachdb/replication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team X-duplicate Closed as a duplicate of another issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants