
roachtest: disk-stalled/wal-failover/among-stores failed #135983

Open
cockroach-teamcity opened this issue Nov 22, 2024 · 5 comments
Assignees
jbowens
Labels
  • A-storage: Relating to our storage engine (Pebble) on-disk storage.
  • B-runtime-assertions-enabled
  • branch-release-24.3: Used to mark GA and release blockers, technical advisories, and bugs for 24.3
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • P-3: Issues/test failures with no fix SLA
  • T-storage: Storage Team

Comments


cockroach-teamcity commented Nov 22, 2024

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.3 @ 82c427f7a58cec597a46a516a1d7d47b68296d18:

(disk_stall.go:151).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2455).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • runtimeAssertionsBuild=true
  • ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-44813

cockroach-teamcity added the B-runtime-assertions-enabled, branch-release-24.3, C-test-failure, O-roachtest, O-robot, release-blocker, and T-storage labels on Nov 22, 2024
blathers-crl bot added the A-storage label on Nov 22, 2024

blathers-crl bot commented Nov 22, 2024

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.3 @ 2c6a0102b055cc6180aa32c5ee9962454b35ce4c:

(disk_stall.go:164).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.54252547s at 2024-11-23T11:00:00Z
(disk_stall.go:164).runDiskStalledWALFailover: unexpectedly high p99.99 latency 2.186540992s at 2024-11-23T11:01:00Z
(cluster.go:2455).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • runtimeAssertionsBuild=true
  • ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!


jbowens commented Nov 25, 2024

Most recent failure:

(disk_stall.go:164).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.54252547s at 2024-11-23T11:00:00Z
(disk_stall.go:164).runDiskStalledWALFailover: unexpectedly high p99.99 latency 2.186540992s at 2024-11-23T11:01:00Z

https://grafana.testeng.crdb.io/d/StorageAvKxELVz/storage?from=1732358188352&to=1732362351437&var-cluster=teamcity-17895377-1732344094-72-n4cpu16&orgId=1

Unclear why n1 seems to have been CPU-saturated the entire time.

First failure:

2024/11/22 11:17:23 disk_stall.go:117: test status: pausing 2m1.995771852s before next simulated disk stall on n1
2024/11/22 11:18:24 disk_stall.go:117: test status: pausing 1m1.010603551s before next simulated disk stall on n1
2024/11/22 11:19:24 disk_stall.go:117: test status: pausing 990.728199ms before next simulated disk stall on n1
2024/11/22 11:19:26 cluster.go:2475: running cmd `sudo dmsetup suspend --nofl...` on nodes [:1]
2024/11/22 11:19:26 cluster.go:2477: details in run_111926.045419763_n1_sudo-dmsetup-suspend.log
2024/11/22 11:20:53 cluster.go:2475: running cmd `sudo dmsetup resume data1` on nodes [:1]
2024/11/22 11:20:53 cluster.go:2477: details in run_112053.154557363_n1_sudo-dmsetup-resume-.log
2024/11/22 11:20:54 disk_stall.go:117: test status: pausing 9m58.996684485s before next simulated disk stall on n1
2024/11/22 11:21:54 disk_stall.go:117: test status: pausing 7m32.009157141s before next simulated disk stall on n1
2024/11/22 11:22:54 disk_stall.go:117: test status: pausing 6m32.001622751s before next simulated disk stall on n1
2024/11/22 11:23:43 disk_stall.go:146: test status: exited stall loop
2024/11/22 11:23:44 cluster.go:2546: running cmd `systemctl show cockroach-sy...` on nodes [:1]; details in run_112344.046197675_n1_systemctl-show-cockr.log
2024/11/22 11:23:44 test_impl.go:455: test failure #1: full stack retained in failure_1.log: (disk_stall.go:151).runDiskStalledWALFailover: process exited unexpectedly
2024/11/22 11:23:44 cluster.go:2475: running cmd `sudo dmsetup resume data1` on nodes [:1-4]
2024/11/22 11:23:44 cluster.go:2477: details in run_112344.877552358_n1-4_sudo-dmsetup-resume-.log
2024/11/22 11:23:44 test_impl.go:455: test failure #2: full stack retained in failure_2.log: (cluster.go:2455).Run: context canceled
2024/11/22 11:23:44 test_runner.go:1321: test completed with failure(s)
cockroach exited with code 7: Fri Nov 22 11:20:55 UTC 2024
I241122 11:20:29.043039 1602 jobs/registry.go:1599 ⋮ [T1,Vsystem,n1] 1733  AUTO SPAN CONFIG RECONCILIATION job 1023017965946896385: stepping through state succeeded
E241122 11:20:29.043056 1026 server/server_sql.go:515 ⋮ [T1,Vsystem,n1] 1734  failed to run update of instance with new session ID: node unavailable; try another peer
E241122 11:20:29.043262 1602 jobs/registry.go:898 ⋮ [T1,Vsystem,n1] 1735  error getting live session: node unavailable; try another peer
W241122 11:20:29.043426 5688134 sql/stats/automatic_stats.go:963 ⋮ [T1,Vsystem,n1] 1736  failed to create statistics on table 106: create-stats: failed to read query result: query execution canceled
F241122 11:20:29.036844 5798465 storage/pebble.go:1584 ⋮ [n1,s1,pebble] 1728  disk stall detected: disk slowness detected: syncdata on file 049727.log has been ongoing for 62.0s

Looks like we took over a minute to resume the drive: the suspend completed at 11:19:26, but the resume wasn't issued until 11:20:53, and the disk stall detector had already fataled at 11:20:29, after the stall had been ongoing for 62.0s.

run_111926.045419763_n1_sudo-dmsetup-suspend: 2024/11/22 11:19:26 cluster.go:2479: > sudo dmsetup suspend --noflush --nolockfs data1
run_111926.045419763_n1_sudo-dmsetup-suspend: 2024/11/22 11:19:27 cluster.go:2504: > result: <ok>
run_112053.154557363_n1_sudo-dmsetup-resume-: 2024/11/22 11:20:53 cluster.go:2479: > sudo dmsetup resume data1
run_112053.154557363_n1_sudo-dmsetup-resume-: 2024/11/22 11:20:54 cluster.go:2504: > result: <ok>

jbowens removed the release-blocker label on Nov 25, 2024
jbowens self-assigned this on Nov 25, 2024

jbowens commented Nov 25, 2024

I wonder if the long duration between stalls is somehow related to the fact that the disk is stalled. Is the ssh server trying to write to the stalled disk? That doesn't really make sense because it's a separate mount, but it's unclear. We could refactor the test so that when we stall, we specify the duration up front and perform the sleep inside the server-side ssh session, as sketched below.
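A rough sketch of that refactor, reusing the same data1 device-mapper target and dmsetup commands seen in the logs above; the STALL_SECS value and the idea of bundling suspend/sleep/resume into one remote command are hypothetical, not the current test code:

# Run as a single server-side ssh command so the resume does not depend on a
# second round trip from the roachtest runner while the node's disk is stalled.
STALL_SECS=30  # stall duration chosen by the test before issuing the command
sudo dmsetup suspend --noflush --nolockfs data1
sleep "${STALL_SECS}"
sudo dmsetup resume data1

With the sleep and resume executed in the same remote session, the stall duration is bounded even if the runner's next SSH command is slow to start.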

@cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.3 @ 6b51fe4c1a5575d655bfc1e875932e252c7048d0:

(disk_stall.go:151).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2455).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • runtimeAssertionsBuild=true
  • ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!
