roachtest: clearrange/checks=false failed #82924
Given this isn't consistently failing, I think that's good enough reason to remove the `release-blocker` label.
The relevant part is:

That's a new error that I've not seen before, but it's surprising we were almost out of storage. The timeseries metrics are likely to be useful here, so I will boot a local cluster and see what was going on.
This one looks pretty open-and-shut. Node 3 started seeing a number of disk issues around 7:20:

```
Jun 15 07:20:58 teamcity-5474915-1655270359-28-n10cpu16-0003 kernel: blk_update_request: critical medium error, dev nvme0n1, sector 223086016 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
```

Ultimately it crashed at 7:34 with code 134, presumably due to the bad disk:

```
Jun 15 07:34:02 teamcity-5474915-1655270359-28-n10cpu16-0003 kernel: blk_update_request: critical medium error, dev nvme0n1, sector 223086240 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 bash[12672]: ./cockroach.sh: line 70: 12680 Aborted (core dumped) "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr>
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 bash[12823]: cockroach exited with code 134: Wed Jun 15 07:34:24 UTC 2022
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=134/n/a
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 systemd[1]: cockroach.service: Failed with result 'exit-code'.
```

The Pebble logs are full of background errors due to not being able to perform compactions:

```
I220615 07:34:06.656688 171837 3@pebble/event.go:625 ⋮ [n3,pebble,s3] 103711 [JOB 30621] compacting(default) L4 [008104] (16 M) + L5 [008032] (16 M)
I220615 07:34:06.656868 171837 3@pebble/event.go:657 ⋮ [n3,pebble,s3] 103712 [JOB 30621] compacting: sstable created 030794
I220615 07:34:06.660084 171789 3@pebble/event.go:629 ⋮ [n3,pebble,s3] 103713 [JOB 30620] compaction(default) to L5 error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.660132 171789 3@pebble/event.go:621 ⋮ [n3,pebble,s3] 103714 background error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.660487 171792 3@pebble/event.go:625 ⋮ [n3,pebble,s3] 103715 [JOB 30622] compacting(default) L4 [008155] (16 M) + L5 [008098] (16 M)
I220615 07:34:06.660677 171792 3@pebble/event.go:657 ⋮ [n3,pebble,s3] 103716 [JOB 30622] compacting: sstable created 030795
I220615 07:34:06.673512 171792 3@pebble/event.go:629 ⋮ [n3,pebble,s3] 103717 [JOB 30622] compaction(default) to L5 error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.673574 171792 3@pebble/event.go:621 ⋮ [n3,pebble,s3] 103718 background error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
```

The cluster's storage capacity decreased by 10%, which meant that every node contained more data than in a typical run of this test. n4 had a little more data than the other nodes and ultimately fell below the 5% available-capacity threshold, which failed the import.

There's possibly a thread to pull on with the log spam on the Pebble side, and potentially an argument for bumping the capacity of each node. But given this is ultimately a hardware flake (i.e. a bad disk), there's not much we can do here, so I'm going to close this one out.
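The capacity reasoning above boils down to a simple fraction check: a node whose available disk space drops below 5% of its total capacity fails the import. A minimal sketch of that kind of check, with hypothetical helper names (this is not CockroachDB's actual code path):

```go
package main

import "fmt"

// belowCapacityThreshold reports whether a store's free-space fraction
// has dropped below minFraction (e.g. 0.05 for the 5% threshold described
// above). Hypothetical helper for illustration only.
func belowCapacityThreshold(availableBytes, totalBytes uint64, minFraction float64) bool {
	return float64(availableBytes)/float64(totalBytes) < minFraction
}

func main() {
	// A node with only 4% of its capacity free trips the check;
	// one with 12% free does not.
	fmt.Println(belowCapacityThreshold(40, 1000, 0.05))  // true
	fmt.Println(belowCapacityThreshold(120, 1000, 0.05)) // false
}
```

With a 10% shrink in usable capacity across the cluster, the node already carrying slightly more data than its peers is the one that crosses this line first, which matches what happened to n4.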
For the logging issue we have cockroachdb/pebble#270.
roachtest.clearrange/checks=false failed with artifacts on master @ aadbaf97b4e6092ad6978a28e2735715d64d9f10:

Parameters:
- `ROACHTEST_cloud=gce`
- `ROACHTEST_cpu=16`
- `ROACHTEST_ssd=0`
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-16734
Epic CRDB-16237