
roachtest: disk-stalled/fuse/log=false,data=false failed #99372

Closed

cockroach-teamcity opened this issue Mar 23, 2023 · 9 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team
Comments

cockroach-teamcity commented Mar 23, 2023

roachtest.disk-stalled/fuse/log=false,data=false failed with artifacts on master @ 5493fdfec4e1762c4502fb2f5d42fd28292c9c9d:

test artifacts and logs in: /artifacts/disk-stalled/fuse/log=false_data=false/run_1
(disk_stall.go:247).runDiskStalledDetection: no stall induced, but process exited
(cluster.go:1969).Run: cluster.RunE: context canceled
(cluster.go:1969).Run: cluster.RunE: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-25851

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 23, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 23, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label Mar 23, 2023
jbowens commented Mar 23, 2023

This is almost identical to #99215, which also saw an unexpected disk stall on n1 during a test run that isn't supposed to induce one. (cc #97968)

13:51:40 util.go:55: up-replication complete
13:51:50 disk_stall.go:148: test status: starting workload
13:51:50 disk_stall.go:162: test status: pausing 4m0s before inducing write stall
13:55:50 disk_stall.go:175: [w11] 331490.17 transactions completed before stall
13:55:50 disk_stall.go:177: test status: inducing write stall
13:55:51 disk_stall.go:187: test status: waiting 20s before checking SQL conn to n1
13:56:11 disk_stall.go:193: test status: pinging SQL connection to n1
13:56:11 disk_stall.go:195: [w11] pinging n1's connection: %!s(<nil>)
13:56:11 disk_stall.go:206: test status: letting workload continue for 5m39.07506426s with n1 stalled
14:01:50 disk_stall.go:223: [w11] 366388.33 total transactions committed after stall
14:01:50 disk_stall.go:224: [w11] pre-stall tps: 1381.21, post-stall tps: 1017.74
14:01:50 disk_stall.go:231: test status: counting kv rows
14:01:51 disk_stall.go:234: [w11] Scan found 555638 rows.
14:01:53 cluster.go:2057: > systemctl show cockroach.service --property=ActiveExitTimestampMonotonic
14:01:53 test_impl.go:344: test failure #1: full stack retained in failure_1.log: (disk_stall.go:247).runDiskStalledDetection: no stall induced, but process exited
14:01:53 test_impl.go:344: test failure #2: full stack retained in failure_2.log: (cluster.go:1969).Run: cluster.RunE: context canceled
14:01:53 test_impl.go:344: test failure #3: full stack retained in failure_3.log: (cluster.go:1969).Run: cluster.RunE: context canceled
14:01:53 test_runner.go:1004: tearing down after failure; see teardown.log
Mar 23 13:59:44 teamcity-9205663-1679548780-60-n4cpu4-0001 bash[37410]: cockroach exited with code 7: Thu Mar 23 13:59:44 UTC 2023
Mar 23 13:59:44 teamcity-9205663-1679548780-60-n4cpu4-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=7/NOTRUNNING
Mar 23 13:59:44 teamcity-9205663-1679548780-60-n4cpu4-0001 systemd[1]: cockroach.service: Failed with result 'exit-code'.
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106  file write stall detected: disk slowness detected: syncdata on file ‹/mnt/data1/cockroach/real/000661.sst› (0 bytes) has been ongoing for 12.0s
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !goroutine 349020 [running]:
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x1)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x89
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc000415aa0, {{{0xc000b4c630, 0x24}, {0x0, 0x0}, {0x5a7542f, 0x1}, {0x5a7542f, 0x1}}, 0x174f1118ccc07f8d, ...})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/cockroach/pkg/util/log/clog.go:262 +0x97
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x6f2ef60, 0xc000af7d10}, 0x2, 0x4, 0x0, 0x0?, {0x5a3fbd6, 0x1d}, {0xc001f97dc8, 0x1, ...})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x645
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/util/log/log_channels_generated.go:848
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).makeMetricEtcEventListener.func3({{0xc01028cc00, 0x24}, 0x3, 0x0, 0x2ca18a0b7})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1233 +0x215
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble.TeeEventListener.func4({{0xc01028cc00, 0x24}, 0x3, 0x0, 0x2ca18a0b7})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/event.go:672 +0x4d
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/cockroach/pkg/storage.wrapFilesystemMiddleware.func1({{0xc01028cc00, 0x24}, 0x3, 0x0, 0x2ca18a0b7})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/cockroach/pkg/storage/pebble.go:667 +0x97
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).Create.func2(0x4b?, 0x7e2fbb3404?, 0xb0f86c0?)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:631 +0x6b
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker.func1()
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:186 +0x1cf
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:166 +0x5d

...
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !goroutine 350474 [runnable]:
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !syscall.Syscall(0x57e4524f?, 0x333376c8?, 0x80000000000?, 0x7ffff80000000000?)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/syscall/syscall_linux.go:68 +0x27
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !syscall.write(0xc00699a780?, {0xc0033fe000?, 0x66280000?, 0x0?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/syscall/zsyscall_linux_amd64.go:924 +0x45
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !syscall.Write(...)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/syscall/syscall_unix.go:211
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !internal/poll.ignoringEINTRIO(...)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/internal/poll/fd_unix.go:794
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !internal/poll.(*FD).Write(0xc00699a780?, {0xc0033fe000?, 0x1000?, 0x1000?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/internal/poll/fd_unix.go:383 +0x36e
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !os.(*File).write(...)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/os/file_posix.go:48
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !os.(*File).Write(0xc006b96a50, {0xc0033fe000?, 0x1000, 0xc0ff34b40e6c009a?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/os/file.go:176 +0x65
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).Write.func1()
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:216 +0x43
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).timeDiskOp(0xc0077d4aa0, 0x1, 0x1000, 0xc00976e390)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:297 +0x12f
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).Write(0xb?, {0xc0033fe000?, 0x94d?, 0xb?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:215 +0x8a
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/vfs.(*enospcFile).Write(0xc011c0c888, {0xc0033fe000, 0x1000, 0x1000})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:369 +0x62
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/cockroach/pkg/ccl/storageccl/engineccl.(*encryptedFile).Write(0xc0033fc400, {0xc0033fe000, 0x1000, 0x1000})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/cockroach/pkg/ccl/storageccl/engineccl/encrypted_fs.go:84 +0x10f
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/vfs.(*syncingFile).Write(0xc0077d4af0, {0xc0033fe000, 0x1000, 0x1000})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file.go:65 +0x59
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !bufio.(*Writer).Flush(0xc0033fc440)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/bufio/bufio.go:629 +0x62
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !bufio.(*Writer).Write(0xc0033fc440, {0xc012e0e000?, 0x121c3bc?, 0xc00159a000?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/bufio/bufio.go:673 +0xd0
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/objstorage.(*fileBufferedWritable).Write(0x11dbd08?, {0xc012e0e000?, 0x2?, 0x297?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/objstorage/external/com_github_cockroachdb_pebble/objstorage/vfs_writable.go:36 +0x25
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble.(*compactionWritable).Write(0xc009f44de0, {0xc012e0e000?, 0x6117, 0x19e97?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:434 +0x33
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/sstable.(*Writer).writeCompressedBlock(0xc0036c2000, {0xc012e0e000?, 0x6117, 0x1fd17?}, {0xc0132ef2c0, 0xc00976e6e0?, 0x78})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:1673 +0x98
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/sstable.(*writeQueue).performWrite(0xc0033fc480, 0xc0093c8780)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/write_queue.go:66 +0x85
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/sstable.(*writeQueue).addSync(0xc0033fc480, 0xc0093c8780)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/write_queue.go:121 +0x45
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/sstable.(*Writer).flush(0xc0036c2000, {{0xc003d80840?, 0x0?, 0x0?}, 0xc001dd0b40?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:1348 +0x656
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/sstable.(*Writer).maybeFlush(0xc0036c2000, {{0xc003d80840?, 0xc00976e8b0?, 0x0?}, 0xc00976e8f0?}, 0x0?)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:1360 +0x75
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/sstable.(*Writer).addPoint(0xc0036c2000, {{0xc003d80840?, 0xc0077d72b0?, 0xc0077dce40?}, 0x7fd8d5a0e1a8?}, {0x7fd8c187ca7c, 0x0, 0x0})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:887 +0x272
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble/sstable.(*Writer).Add(0xc009f44d40?, {{0xc003d80840?, 0x9?, 0x10?}, 0xc01240cd80?}, {0x7fd8c187ca7c?, 0x0?, 0x0?})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:738 +0xbe
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble.(*DB).runCompaction(0xc001312f00, 0x131, 0xc007316000)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:3129 +0x27b8
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble.(*DB).compact1(0xc001312f00, 0xc007316000, 0x0)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2514 +0x1a5
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble.(*DB).compact.func1({0x6f2ef60, 0xc006bee930})
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2485 +0xad
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !runtime/pprof.Do({0x6f2eef0?, 0xc00007a088?}, {{0xc0002f4940?, 0x11c6900?, 0xc000bacb80?}}, 0xc00ee3bf88)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !github.com/cockroachdb/pebble.(*DB).compact(0xc00ee3bf90?, 0xc00ee3bf60?, 0xc00ee3bf70?)
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2482 +0x6b
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
F230323 13:59:44.244539 349020 storage/pebble.go:1233 ⋮ [T1,n1] 1106 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2203 +0x5ea

There's a goroutine dump from 13:59:27 (~17s prior to the fatal). There are some goroutines with I/O in progress at the time:

The WAL log writer:

goroutine 348561 [syscall]:
syscall.Syscall(0x11ca9fe?, 0xc0012a0b01?, 0x77ae2e?, 0xb0f86c0?)
	GOROOT/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Fdatasync(0x7b42ce276a?)
	golang.org/x/sys/unix/external/org_golang_x_sys/unix/zsyscall_linux.go:723 +0x30
github.com/cockroachdb/pebble/vfs.(*linuxFile).SyncData(0x7b42ce276a?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/default_linux_noarm.go:65 +0x1d
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData.func1()
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:256 +0x2e
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).timeDiskOp(0xc012640e10, 0x3, 0x0, 0xc0012a0c80)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:297 +0x12f
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData(0x315?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:255 +0x55
github.com/cockroachdb/pebble/vfs.(*enospcFile).SyncData(0x0?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:413 +0x22
github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync(0xc004098c00?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file.go:113 +0x4b
github.com/cockroachdb/pebble/record.(*LogWriter).syncWithLatency(0xc000091900)
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:545 +0x43
github.com/cockroachdb/pebble/record.(*LogWriter).flushPending(0xc000091900, {0xc01284fb55, 0x315, 0x44b3}, {0xc009f2b100, 0x0, 0x500c26?}, 0x604, 0x602)
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:532 +0x1dc
github.com/cockroachdb/pebble/record.(*LogWriter).flushLoop(0xc000091900, {0x51664c0, 0xb36e6d0})
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:466 +0x358
runtime/pprof.Do({0x6f2eef0?, 0xc00007a088?}, {{0xc0002f4880?, 0xc0033fb860?, 0xc001112000?}}, 0xc0012a0fc0)
	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
github.com/cockroachdb/pebble/record.NewLogWriter.func2()
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:351 +0x5c
created by github.com/cockroachdb/pebble/record.NewLogWriter
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:350 +0x456

A SyncData during a flush:

goroutine 348610 [syscall]:
syscall.Syscall(0xc005cec000?, 0xc00c7a0000?, 0xde7?, 0x1000?)
	GOROOT/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Fdatasync(0x7b6471664d?)
	golang.org/x/sys/unix/external/org_golang_x_sys/unix/zsyscall_linux.go:723 +0x30
github.com/cockroachdb/pebble/vfs.(*linuxFile).SyncData(0x7b6471664d?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/default_linux_noarm.go:65 +0x1d
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData.func1()
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:256 +0x2e
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).timeDiskOp(0xc005cec000, 0x3, 0x0, 0xc00628be50)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:297 +0x12f
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData(0xc005cec050?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:255 +0x55
github.com/cockroachdb/pebble/vfs.(*enospcFile).SyncData(0x1000?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:413 +0x22
github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync(0xc004c4d000?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file.go:113 +0x4b
github.com/cockroachdb/pebble/objstorage.(*fileBufferedWritable).Finish(0xc00fab90c8)
	github.com/cockroachdb/pebble/objstorage/external/com_github_cockroachdb_pebble/objstorage/vfs_writable.go:44 +0x42
github.com/cockroachdb/pebble/sstable.(*Writer).Close(0xc0025bd500)
	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:2003 +0x1673
github.com/cockroachdb/pebble.(*DB).runCompaction.func6({0xc006e6e680, 0x16, 0x16?})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2882 +0x2f6
github.com/cockroachdb/pebble.(*DB).runCompaction(0xc001312f00, 0x130, 0xc004cce400)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:3157 +0x227c
github.com/cockroachdb/pebble.(*DB).flush1(0xc001312f00)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1969 +0x2de
github.com/cockroachdb/pebble.(*DB).flush.func1({0x6f2ef60, 0xc0127386c0})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1797 +0xfa
runtime/pprof.Do({0x6f2eef0?, 0xc00007a088?}, {{0xc0002f4960?, 0x1000000011c6840?, 0x6f7c3a0?}}, 0xc0070f1fa8)
	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
github.com/cockroachdb/pebble.(*DB).flush(0xc0070f1fb8?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1790 +0x58
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleFlush
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1706 +0xea

The SyncData in the flush could be the one referenced in the panic, but conspicuously it's missing from the goroutine dump at panic time (just like in #99215). I'm nervous that we're missing something here.
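
To make the detection path in the stack above concrete: the vfs wrapper records when each write or sync starts, and a separate ticker goroutine raises the DiskSlow event (and, through CockroachDB's event listener, the fatal) once an operation has been in flight past a threshold. Below is a minimal, self-contained Go sketch of that watchdog pattern; it is not Pebble's actual diskHealthCheckingFile (the real implementation packs the start time and op type into a single atomic and covers every file operation), and the names and the 100ms threshold are illustrative only.

package main

import (
	"fmt"
	"os"
	"sync/atomic"
	"time"
)

// watchedFile wraps an *os.File and records, via an atomic timestamp, when a
// sync operation started. A background ticker compares "now" against that
// timestamp and reports any operation that has been in flight for longer than
// the stall threshold -- the same overall shape as Pebble's
// diskHealthCheckingFile, but greatly simplified.
type watchedFile struct {
	f       *os.File
	opStart atomic.Int64 // unix nanos of the in-flight op's start; 0 when idle
	onStall func(elapsed time.Duration)
	stop    chan struct{}
}

func newWatchedFile(f *os.File, threshold time.Duration, onStall func(time.Duration)) *watchedFile {
	w := &watchedFile{f: f, onStall: onStall, stop: make(chan struct{})}
	go func() {
		t := time.NewTicker(threshold / 4)
		defer t.Stop()
		for {
			select {
			case <-w.stop:
				return
			case now := <-t.C:
				if start := w.opStart.Load(); start != 0 {
					if elapsed := now.Sub(time.Unix(0, start)); elapsed > threshold {
						// In CockroachDB, the DiskSlow event listener is where
						// log.Fatalf eventually fires once the slowness exceeds
						// the configured maximum sync duration.
						w.onStall(elapsed)
					}
				}
			}
		}
	}()
	return w
}

// timeOp brackets an I/O call so the ticker goroutine can see how long it has
// been running.
func (w *watchedFile) timeOp(op func() error) error {
	w.opStart.Store(time.Now().UnixNano())
	defer w.opStart.Store(0)
	return op()
}

// Sync is the only wrapped operation in this sketch.
func (w *watchedFile) Sync() error { return w.timeOp(w.f.Sync) }

func main() {
	f, err := os.CreateTemp("", "stall-demo")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	w := newWatchedFile(f, 100*time.Millisecond, func(d time.Duration) {
		fmt.Printf("disk slowness detected: sync has been ongoing for %s\n", d)
	})
	defer close(w.stop)
	_ = w.Sync()
}

Because the watchdog fires from its own ticker goroutine, the fatal's stack ends at (*diskHealthCheckingFile).startTicker rather than at the stalled operation itself; the stalled goroutine only shows up if it happens to be captured in a separate dump, which lines up with it being missing at panic time here.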

@jbowens jbowens added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 23, 2023
@joshimhoff (Collaborator)

Some questions:

In #99215 you say:

Either way the earlier goroutine dump seems to indicate this was a legitimate disk stall.

I understand we have a goroutine dump from before the stall that shows we are doing a SyncData. How strong is that as evidence that the disk stall is legit, in your view? I guess if syncs take non-zero latency, we would expect to sometimes see them in goroutine dumps, even with healthy disks.

Related: For all the test failures we have seen, do we always see a goroutine dump that shows we are blocked in the expected operation some time before the stall is detected?

Is the charybdefs FUSE filesystem buggy in some way?

Are we seeing a relatively high rate of failure in tests that don't use that FUSE filesystem? I figure we are seeing failures without the FUSE filesystem, given that #97968 is open.

Do we only see disk stalls on GCP? Do we run this test on AWS?


I guess there are four possibilities that I can think of:

  1. The disk stall detector has bugs and is raising false positives. I'd note that Radu found such a bug in vfs: include size of write in DiskSlowInfo (pebble#2281), which I fixed pre-merge.
  2. vfs has a bug that leads to calls like SyncData blocking, but not in the actual syscall.
  3. Write stalls are actually happening.
  4. The charybdefs FUSE file system is buggy.

@joshimhoff (Collaborator)

I notice that we are using local SSDs, not proper block storage, in all these tests, and that most of the time it is calls to SyncData that fail. There are some perf notes re: slow syncing on local SSDs at https://cloud.google.com/compute/docs/disks/optimizing-local-ssd-performance#disable_flush. I sort of doubt it is relevant, but sharing anyway.
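
To sanity-check the local-SSD theory on a given VM, timing fdatasync directly — the same syscall the stalled goroutines are sitting in — is enough to see whether sync latency ever spikes. A minimal, Linux-only sketch; the path under /mnt/data1 mirrors the test's data mount but is otherwise an arbitrary choice:

// sync_probe.go: times fdatasync on a small file to check for latency spikes.
// Linux-only; the target path is illustrative and should point at the device
// under suspicion (here, the local SSD mounted at /mnt/data1).
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.OpenFile("/mnt/data1/sync-probe", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 4096)
	var worst time.Duration
	for i := 0; i < 1000; i++ {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		start := time.Now()
		// Same syscall as vfs.(*linuxFile).SyncData in the stacks above.
		if err := unix.Fdatasync(int(f.Fd())); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > worst {
			worst = d
		}
	}
	fmt.Printf("worst fdatasync latency over 1000 writes: %s\n", worst)
}

If the worst-case latency on a local SSD routinely lands in the hundreds of milliseconds or worse, that would support the theory above.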

@bananabrick (Contributor)

Is the charybdefs FUSE filesystem buggy in some way?

#98202 was also failing similarly, and it didn't use the FUSE filesystem.

jbowens commented Mar 23, 2023

I understand we have a goroutine dump from before the stall that shows we are doing a SyncData. How strong is that as evidence that the disk stall is legit, in your view? I guess if syncs take non-zero latency, we would expect to sometimes see them in goroutine dumps, even with healthy disks.

Yeah, it's true we'd reasonably expect to see in-progress syncs. The cockroach logs show the effects of the stall in advance of the fatal (e.g., node liveness heartbeats timing out), indicating that for some reason writes aren't completing. The fact that the goroutine count increased enough to trigger the goroutine dump is also an indication that things are queueing as a result of the stall.

joshimhoff commented Mar 23, 2023

Ack! So this is more likely to be either a real write stall or a crazy bug in vfs leading to a legit stall from CRDB's perspective, given the additional evidence that this happens without the charybdefs FUSE filesystem AND that there are various indications in the CRDB logs that a stall is in fact happening. It is also interesting that we only see this happening on GCP, IMHO.

@joshimhoff
Copy link
Collaborator

joshimhoff commented Mar 23, 2023

I ran the command below 30 times and didn't reproduce a stall during the period of the test where we don't expect one. Then I ran out of CPU quota. I'd like to dig more, but I'm oncall tomorrow and next week...

./bin/roachtest run disk-stalled/fuse/log=false,data=false --cockroach ~/Downloads/cockroach-v23.1.0-alpha.7.linux-amd64/cockroach

@joshimhoff (Collaborator)

One more note: I looked at all the issues linked from #97968. They are all either syncdata or sync (which calls fsync, IIUC), or have no info about the write type since the o11y change hadn't landed yet. At least five syncs! This increases my suspicion that infra is at least partly at fault… but it's just a hunch.

nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 28, 2023
Currently, the `disk-stall` tests use local SSDs. When run on GCE VMs, a
higher test flake rate is observed due to known issues with fsync
latency for local SSDs.

Switch the test to use persistent disks instead.

Touches: cockroachdb#99372.

Release note: None.
craig bot pushed a commit that referenced this issue Mar 28, 2023
…99752 #99774

99433: opt: fixup CTE stats on placeholder queries r=cucaroach a=cucaroach

During optbuilder phase we copy the initial expressions stats into the
fake-rel but this value can change when placeholders are assigned so add
code in AssignPlaceholders to rebuild the cte if the stats change.

Fixes: #99389
Epic: none
Release note: none

99516: metrics: improve ux around _status/vars output r=aadityasondhi a=dhartunian

Previously, the addition of the `tenant` metric label was applied uniformly and could result in confusion for customers who never enable multi-tenancy or c2c. The `tenant="system"` label carries little meaning when there's no tenancy in use.

This change modifies the system tenant label application to only happen when a non-system in-process tenant is created.

Additionally, an environment variable:
`COCKROACH_DISABLE_NODE_AND_TENANT_METRIC_LABELS` can be set to `true` to disable the new `tenant` and `node_id` labels. This can be used on single-process tenants to disable the `tenant` label.

Resolves: #94668

Epic: CRDB-18798

Release note (ops change): The
`COCKROACH_DISABLE_NODE_AND_TENANT_METRIC_LABELS` env var can be used to disable the newly introduced metric labels in the `_status/vars` output if they conflict with a customer's scrape configuration.

99522: jobsprofiler: store DistSQL diagram of jobs in job info r=dt a=adityamaru

This change teaches import, cdc, backup and restore
to store their DistSQL plans in the job_info table
under a timestamped info key. The generation and writing
of the plan diagram is done asynchronously so as to not
slow down the execution of the job. A new plan will be
stored every time the job sets up its DistSQL flow.

Release note: None
Epic: [CRDB-8964](https://cockroachlabs.atlassian.net/browse/CRDB-8964)
Informs: #99729

99574: streamingccl: skip acceptance/c2c on remote cluster setup r=stevendanna a=msbutler

acceptance/c2c currently fails when run on a remote cluster. This patch ensures the test gets skipped when run on a remote cluster. There's no need to run the test on a remote cluster because the other c2c roachtests provide sufficient coverage.

Fixes #99553

Release note: none

99691: codeowners: update sql obs to cluster obs r=maryliag a=maryliag

Update mentions of `sql-observability` to
`cluster-observability`.

Epic: none
Release note: None

99712: ui: connect metrics provider to metrics timescale object r=xinhaoz a=dhartunian

Previously, the `MetricsDataProvider` component queried the redux store for the `TimeScale` object which contained details of the currently active time window. This piece of state was assumed to update to account for the "live" moving window that metrics show when pre-set lookback time windows are selected.

A recent PR: #98331 removed the feature that polled new data from SQL pages, which also disabled polling on metrics pages due to the re-use of `TimeScale`.

This commit modifies the `MetricsDataProvider` to instead read the `metricsTime` field of the `TimeScaleState` object. This object was constructed for use by the `MetricsDataProvider` but was not wired up to the component.

Resolves #99524

Epic: None

Release note: None

99733: telemetry: add FIPS-specific channel r=knz a=rail

Previously, all official builds were reporting using the same telemetry channel.

This PR adds a new telemetry channel for the FIPS build target.

Fixes: CC-24110
Epic: DEVINF-478
Release note: None

99745: spanconfigsqlwatcher: deflake TestSQLWatcherOnEventError r=arulajmani a=arulajmani

Previously, this test was setting the no-op checkpoint duration to be
 every hour to effectively disable checkpoints. Doing so is integral to
what the test is testing. However, this was a lie, given how `util.Every` works -- A call to `ShouldProcess` returns true the very first time.

This patch achieves the original goal by introducing a new testing knob. Previously, the test would fail in < 40 runs locally. I've had this running strong for ~1000 runs.

Fixes #76765

Release note: None

99747: roachtest: use persistent disks for disk-stall tests r=jbowens a=nicktrav

Currently, the `disk-stall` tests use local SSDs. When run on GCE VMs, a higher test flake rate is observed due to known issues with fsync latency for local SSDs.

Switch the test to use persistent disks instead.

Touches: #99372.

Release note: None.

Epic: CRDB-20293

99752: kvserver: bump tolerance more r=ajwerner a=ajwerner

I'm not digging into this more, but the test is flakey.

Epic: none

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/9161972?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

Release note: None

99774: *: identify remaining uses of TODOSQLCodec r=stevendanna a=knz

The `TODOSQLCodec` was a bug waiting to happen.

The only reasonable remaining purpose is for use in tests. As such, this change moves its definition to a test-only package (we have a linter that verifies that `testutils` is not included in non-test code).

This change then identifies the one unreasonable remaining purpose and marks it properly as a bug linked to #48123.

Release note: None
Epic: None

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: ajwerner <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 28, 2023 (same commit message as above)

blathers-crl bot pushed a commit that referenced this issue Mar 28, 2023 (same commit message as above)

blathers-crl bot pushed a commit that referenced this issue Mar 28, 2023 (same commit message as above)

nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 28, 2023 (same commit message as above)

nicktrav added a commit that referenced this issue Mar 29, 2023 (same commit message as above)

@nicktrav (Collaborator)

Merged #99747 and associated backports that should help here, using PDs instead of SSDs.

I'm going to close this for now.
