roachtest: failover/chaos/read-write/lease=expiration failed #119361

Closed
cockroach-teamcity opened this issue Feb 19, 2024 · 10 comments · Fixed by #119650
Labels:

  • branch-master: Failures and bugs on the master branch.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • P-2: Issues/test failures with a fix SLA of 3 months
  • T-kv: KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 19, 2024

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ e39dafe6d8c153301ff43ed2b3ed3e13af9ec72a:

(test_runner.go:1153).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-36166

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Feb 19, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Feb 19, 2024
@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ a36097be277adef635f55d317579ca79b450bfef:

(test_runner.go:1153).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ c9c3cc5f3c3a4a6ab556f4b9d5b6ec0381901bdb:

(test_runner.go:1153).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 26, 2024
@andrewbaptist
Collaborator

Dupe of #119085

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Feb 26, 2024
Previously, the multiple failures were started and finished
independently. This caused a problem if the ability to recover from one
failure depended on a different failure recovering first. To mitigate
this and to add a little more chaos, start and recover each failure in a
separate goroutine. This will allow the "most important" failure to
recover first so that the others can recover if they depend on each
other.

Note that this is more important today while we don't support all the
failure modes that chaos implements. Specifically, we don't handle
partial partitions yet.

Epic: none
Fixes: cockroachdb#119085
Fixes: cockroachdb#119347
Fixes: cockroachdb#119361
Fixes: cockroachdb#119454

Release note: None
@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ bf013ea0a5311726e65d37e8f047ce39ea2d5f10:

(test_runner.go:1161).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ ce5f34ea97475f45fa354e58aacf424779d0de49:

(test_runner.go:1161).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ 067e48d29b9093038f6fcf2074cd761ffdcd4fe2:

(failover.go:1774).sleepFor: sleep failed: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
(cluster.go:2332).Run: context canceled
(cluster.go:2332).Run: context canceled
(cluster.go:2332).Run: context canceled
(cluster.go:2332).Run: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ c561383c9d86a52cf63a2c8aaa3ee20270635f11:

(test_runner.go:1185).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ c994982a8be5af89f594e115e897dd6d62cf99d8:

(test_runner.go:1185).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ c43f54cdde5b7578f4a0ca61de41463f0d690993:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1474
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:336
	            				main/pkg/cmd/roachtest/monitor.go:120
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should NOT be empty, but was []
	Test:       	failover/chaos/read-write/lease=expiration
	Messages:   	didn't lock any replicas
(require.go:1448).NotEmpty: FailNow called
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
(cluster.go:2344).Run: context canceled
(cluster.go:2344).Run: context canceled
(cluster.go:2344).Run: context canceled
(cluster.go:2344).Run: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@kvoli kvoli added the P-2 Issues/test failures with a fix SLA of 3 months label Apr 8, 2024
@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-write/lease=expiration failed with artifacts on master @ 7dfff9430b0aedaac5bb57e06d704a527336863e:

(test_runner.go:1198).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/failover/chaos/read-write/lease=expiration/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

craig bot pushed a commit that referenced this issue Apr 12, 2024
119650: roachtest: make failure recovery independent r=nvanbenschoten a=andrewbaptist

Previously, the multiple failures were started and finished
independently. This caused a problem if the ability to recover from one
failure depended on a different failure recovering first. To mitigate
this, recover each failure in a separate goroutine. This will allow the
"most important" failure to recover first so that the others can recover
if they depend on each other.

This is more important today while we don't recover from all the failure
modes that chaos implements. Specifically we don't handle partial
partitions fully with epoch leases.

Epic: none
Fixes: #119085
Fixes: #119347
Fixes: #119361
Fixes: #119454

Release note: None

122283: server: don't log for missing locality r=yuzefovich a=andrewbaptist

Previously we would always log a message that the locality was unknown for requests from sql gateways. We should remove unnecessary logs from traces.

Epic: none

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
@craig craig bot closed this as completed in 31a39cf Apr 12, 2024
blathers-crl bot pushed a commit that referenced this issue Apr 12, 2024
blathers-crl bot pushed a commit that referenced this issue May 23, 2024
@github-project-automation github-project-automation bot moved this to roachtest/unit test backlog in KV Aug 28, 2024

3 participants