roachtest: failover/system-non-liveness/blackhole-send/lease=expiration failed #104694

Closed
cockroach-teamcity opened this issue Jun 10, 2023 · 4 comments · Fixed by #105190
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 10, 2023

roachtest.failover/system-non-liveness/blackhole-send/lease=expiration failed with artifacts on master @ 61806f42e4c833b863ca9c2f62ce34918f1c8277:

test artifacts and logs in: /artifacts/failover/system-non-liveness/blackhole-send/lease=expiration/run_1
(cluster.go:1616).FailOnInvalidDescriptors: invalid descriptors check failed: pq: replica unavailable: (n6,s6):6 unable to serve request to r32:/NamespaceTable/{30-Max} [(n6,s6):6, (n5,s5):5, (n4,s4):3, next=7, gen=16]: closed timestamp: 1686388181.246911253,0 (2023-06-10 09:09:41); raft status: {"id":"6","term":12,"vote":"3","commit":413,"lead":"0","raftState":"StatePreCandidate","applied":413,"progress":{},"leadtransferee":"0"}: have been waiting 60.50s for slow proposal RequestLease [/NamespaceTable/30,/Min)

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-28672

Epic CRDB-27234

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jun 10, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 10, 2023
@erikgrinaker
Contributor

erikgrinaker commented Jun 12, 2023

It's unexpected that the replica circuit breaker tripped here. We should recover from blackhole failures within 20 seconds, and the circuit breaker doesn't trip until 60 seconds. We saw a similar case over in #104709, where the circuit breaker also tripped unexpectedly.

Will dig into this when I have time.

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jun 12, 2023
@blathers-crl

blathers-crl bot commented Jun 12, 2023

cc @cockroachdb/replication

@erikgrinaker erikgrinaker self-assigned this Jun 12, 2023
@erikgrinaker
Contributor

The problem seems to be that we don't give the cluster time to recover after running all failures, so the cluster hasn't fully recovered when the post-test assertions run, and they error out.

We recover the node at 09:10:42:

09:09:41 failover.go:1040: failing n6 (blackhole-send)
09:10:42 failover.go:1045: recovering n6 (blackhole-send)
09:10:43 test_runner.go:1068: tearing down after success; see teardown.log

The post-test assertions run at 09:10:45, failing on r32:

09:10:45 (cluster.go:1616).FailOnInvalidDescriptors: invalid descriptors check failed: pq: replica unavailable: (n6,s6):6 unable to serve request to r32:/NamespaceTable/{30-Max} [(n6,s6):6, (n5,s5):5, (n4,s4):3, next=7, gen=16]: closed timestamp: 1686388181.246911253,0 (2023-06-10 09:09:41); raft status: {"id":"6","term":12,"vote":"3","commit":413,"lead":"0","raftState":"StatePreCandidate","applied":413,"progress":{},"leadtransferee":"0"}: have been waiting 60.50s for slow proposal RequestLease [/NamespaceTable/30,/Min)

r32 only recovers at 09:10:45:

09:10:45.124706 212213 kv/kvserver/replica_circuit_breaker.go:150 ⋮ [T1,n6,s6,r32/6:‹/NamespaceTable/{30-Max}›] 2605  breaker: breaker reset

@erikgrinaker
Contributor

erikgrinaker commented Jun 20, 2023

> It's unexpected that the replica circuit breaker tripped here. We should recover from blackhole failures within 20 seconds, and the circuit breaker doesn't trip until 60 seconds.

Also, this isn't accurate. The failure itself (and thus the circuit breaker) won't recover until we actually heal the partition, but the workload will recover.

@erikgrinaker erikgrinaker added the A-testing Testing tools and infrastructure label Jun 21, 2023
@craig craig bot closed this as completed in 776024b Jun 21, 2023