roachtest: disk-stalled/cgroup/read-write/logs-too=true failed #101904

Closed
cockroach-teamcity opened this issue Apr 20, 2023 · 4 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 20, 2023

roachtest.disk-stalled/cgroup/read-write/logs-too=true failed with artifacts on release-23.1.0 @ 4524616140097c0f6b921e5d1e94ebd405f0809f:

test artifacts and logs in: /artifacts/disk-stalled/cgroup/read-write/logs-too=true/run_1
(disk_stall.go:233).runDiskStalledDetection: post-stall TPS 804.06 is less than 50% of pre-stall TPS 1614.95
(cluster.go:1981).Run: cluster.RunE: context canceled
(cluster.go:1981).Run: output in run_120135.203044949_n4_cockroach-workload-r: ./cockroach workload run kv --read-percent 50 --duration 10m --concurrency 256 --max-rate 2048 --tolerate-errors  --min-block-bytes=512 --max-block-bytes=512 {pgurl:1-3} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-27302

cockroach-teamcity added the branch-release-23.1.0, C-test-failure, O-roachtest, O-robot, and release-blocker labels Apr 20, 2023
cockroach-teamcity added this to the 23.1 milestone Apr 20, 2023
@cockroach-teamcity
Member Author

roachtest.disk-stalled/cgroup/read-write/logs-too=true failed with artifacts on release-23.1.0 @ 06fea49dd3b665758ff016a38622e3ce0f1ccfc5:

test artifacts and logs in: /artifacts/disk-stalled/cgroup/read-write/logs-too=true/run_1
(disk_stall.go:233).runDiskStalledDetection: post-stall TPS 810.60 is less than 50% of pre-stall TPS 1621.54
(cluster.go:1981).Run: cluster.RunE: context canceled
(cluster.go:1981).Run: output in run_121429.816223411_n4_cockroach-workload-r: ./cockroach workload run kv --read-percent 50 --duration 10m --concurrency 256 --max-rate 2048 --tolerate-errors  --min-block-bytes=512 --max-block-bytes=512 {pgurl:1-3} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

jbowens added the GA-blocker label and removed the release-blocker label Apr 24, 2023
@jbowens
Collaborator

jbowens commented Apr 24, 2023

We've seen two of these failures on the release-23.1.0 branch but none on master. In both, the post-stall transactions per second (TPS) came in just a hair shy of half the pre-stall TPS. I'll run this a few times on the .0 branch today to see whether there's some kind of regression here. It may also be that the roachtest's post-recovery TPS threshold is still too aggressive.
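
For readers following along, here is a minimal sketch of the arithmetic behind "a hair shy of half", using the pre-stall and post-stall TPS figures from the two failure reports above. The function name `belowThreshold` and the 0.5 fraction are assumptions taken from the wording of the failure message; this is not the code in disk_stall.go.

```go
package main

import "fmt"

// belowThreshold is a hypothetical helper (not from disk_stall.go).
// It reports whether postTPS is less than fraction*preTPS and returns
// the observed post/pre ratio.
func belowThreshold(preTPS, postTPS, fraction float64) (bool, float64) {
	ratio := postTPS / preTPS
	return postTPS < fraction*preTPS, ratio
}

func main() {
	// TPS values copied from the two failure reports in this thread.
	runs := []struct{ pre, post float64 }{
		{1614.95, 804.06},
		{1621.54, 810.60},
	}
	for _, r := range runs {
		failed, ratio := belowThreshold(r.pre, r.post, 0.5)
		fmt.Printf("pre=%.2f post=%.2f ratio=%.4f failed=%t\n",
			r.pre, r.post, ratio, failed)
	}
}
```

Both runs land within a fraction of a percentage point of the 50% line, which is consistent with either a small regression or a threshold with too little margin.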

@cockroach-teamcity
Member Author

roachtest.disk-stalled/cgroup/read-write/logs-too=true failed with artifacts on release-23.1.0 @ 358e0d87912365b8976c55ab9b3292e999cf720d:

test artifacts and logs in: /artifacts/disk-stalled/cgroup/read-write/logs-too=true/run_1
(disk_stall.go:233).runDiskStalledDetection: post-stall TPS 790.96 is less than 50% of pre-stall TPS 1610.06
(cluster.go:1981).Run: cluster.RunE: context canceled
(cluster.go:1981).Run: output in run_114135.464213822_n4_cockroach-workload-r: ./cockroach workload run kv --read-percent 50 --duration 10m --concurrency 256 --max-rate 2048 --tolerate-errors  --min-block-bytes=512 --max-block-bytes=512 {pgurl:1-3} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@jbowens
Collaborator

jbowens commented May 31, 2023

I suspect there was some regression in 23.1.{0,1} that has since been fixed in release-23.1 / 23.1.2. We haven't seen any additional failures.

Since it's been resolved, I don't think there's anything to do here.

One observation: when a node goes down we offer less load, because the workload offers a fixed load per node.

[Two screenshots, captured 2023-05-31 at 2:42:30 PM and 2:42:36 PM; images not preserved]
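
The per-node observation can be illustrated with a small sketch. It assumes a simplified model in which the workload offers the same fixed rate to every live node, so the total offered load scales with the number of live nodes. The per-node rate of 500 is a made-up number for illustration, and the three-node count is inferred from `{pgurl:1-3}` in the workload command; neither is a measured value from the test.

```go
package main

import "fmt"

// offeredLoad is a hypothetical model: a fixed rate offered to each
// live node, so total offered load is perNodeRate * liveNodes.
func offeredLoad(perNodeRate float64, liveNodes int) float64 {
	return perNodeRate * float64(liveNodes)
}

func main() {
	const perNodeRate = 500.0 // hypothetical ops/sec per node, for illustration only

	before := offeredLoad(perNodeRate, 3) // all three nodes up (assumed from {pgurl:1-3})
	after := offeredLoad(perNodeRate, 2)  // one node stalled

	fmt.Printf("offered before=%.0f after=%.0f fraction=%.3f\n",
		before, after, after/before)
}
```

Under this model, offered load already drops when a node is lost, so a fixed post-stall threshold has less headroom than the raw pre-stall number suggests.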
