
roachtest: clearrange/checks=false failed #82924

Closed · cockroach-teamcity opened this issue Jun 15, 2022 · 4 comments
Labels
- branch-master: Failures and bugs on the master branch.
- C-test-failure: Broken test (automatically or manually discovered).
- O-roachtest
- O-robot: Originated from a bot.
- T-storage: Storage Team
- X-infra-flake: the automatically generated issue was closed due to an infrastructure problem, not a product issue

Comments

@cockroach-teamcity (Member) commented Jun 15, 2022

roachtest.clearrange/checks=false failed with artifacts on master @ aadbaf97b4e6092ad6978a28e2735715d64d9f10:

test artifacts and logs in: /artifacts/clearrange/checks=false/run_1
	cluster.go:1915,clearrange.go:70,clearrange.go:39,test_runner.go:884: output in run_071129.080360630_n1_cockroach_workload_fixtures_import_bank: ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned: COMMAND_PROBLEM: exit status 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).RunE
		  | 	main/pkg/cmd/roachtest/cluster.go:1949
		  | main.(*clusterImpl).Run
		  | 	main/pkg/cmd/roachtest/cluster.go:1913
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:70
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:884
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_071129.080360630_n1_cockroach_workload_fixtures_import_bank
		Wraps: (3) ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I220615 07:11:29.962045 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 1 tables
		  | Error: importing fixture: importing table bank: pq: pausing due to error; use RESUME JOB to try to proceed once the issue is resolved, or CANCEL JOB to rollback: store 4 has insufficient remaining capacity to ingest data (remaining: 16 GiB / 4.3%, min required: 5.0%)
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage


Jira issue: CRDB-16734

Epic CRDB-16237

@cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Jun 15, 2022
@blathers-crl (bot) added the T-storage label Jun 15, 2022
@nicktrav (Collaborator)

Given this isn't consistently failing, I think that's good enough reason to remove release-blocker for now. We can add it back if necessary.

@nicktrav removed the release-blocker label Jun 21, 2022
@nicktrav (Collaborator)

The relevant part is:

store 4 has insufficient remaining capacity to ingest data (remaining: 16 GiB / 4.3%, min required: 5.0%)

That's a new error I've not seen before, and it's surprising that we were almost out of storage.

The timeseries metrics will likely be useful here, so I'll boot a local cluster and see what was going on.
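
For context, the message comes from a guard that refuses ingestion once the fraction of free space on a store drops below a floor. A minimal sketch of that kind of check; the types, names, and 5% constant here are illustrative assumptions, not CockroachDB's actual code:

```go
// Sketch of a free-space gate on ingestion. Everything here (names, struct,
// the 5% floor) is illustrative, not CockroachDB's real implementation.
package main

import "fmt"

const minFreeFraction = 0.05 // matches the "min required: 5.0%" in the error

type storeCapacity struct {
	Total     int64 // bytes
	Available int64 // bytes
}

func canIngest(c storeCapacity) error {
	frac := float64(c.Available) / float64(c.Total)
	if frac < minFreeFraction {
		return fmt.Errorf(
			"store has insufficient remaining capacity to ingest data (remaining: %d GiB / %.1f%%, min required: %.1f%%)",
			c.Available>>30, frac*100, minFreeFraction*100)
	}
	return nil
}

func main() {
	// From the error above: 16 GiB remaining at 4.3% implies a ~372 GiB store.
	fmt.Println(canIngest(storeCapacity{Total: 372 << 30, Available: 16 << 30}))
}
```

Note that 16 GiB at 4.3% implies a store of roughly 372 GiB, so the 5% floor works out to about 18.6 GiB free.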

@nicktrav (Collaborator)

This one looks pretty open-and-shut. Node 3 started seeing a number of disk issues around 7:20:

Jun 15 07:20:58 teamcity-5474915-1655270359-28-n10cpu16-0003 kernel: blk_update_request: critical medium error, dev nvme0n1, sector 223086016 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0

Ultimately it crashed at 7:34 with exit code 134, presumably due to the bad disk:

Jun 15 07:34:02 teamcity-5474915-1655270359-28-n10cpu16-0003 kernel: blk_update_request: critical medium error, dev nvme0n1, sector 223086240 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 bash[12672]: ./cockroach.sh: line 70: 12680 Aborted                 (core dumped) "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr>
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 bash[12823]: cockroach exited with code 134: Wed Jun 15 07:34:24 UTC 2022
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=134/n/a
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 systemd[1]: cockroach.service: Failed with result 'exit-code'.
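
As an aside, shell exit codes above 128 encode a fatal signal as 128 + the signal number, so 134 is 128 + 6 (SIGABRT), consistent with the "Aborted (core dumped)" line above. A tiny illustrative decoder:

```go
// Illustrative only: decode a shell-style exit status such as 134 into the
// terminating signal (134 = 128 + 6 = SIGABRT). Unix-specific.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	code := 134
	if code > 128 {
		sig := syscall.Signal(code - 128)
		// On Linux this prints: exit 134 => killed by signal 6 (aborted)
		fmt.Printf("exit %d => killed by signal %d (%v)\n", code, code-128, sig)
	}
}
```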

The Pebble logs are full of background errors from compactions that couldn't read the affected sstables:

I220615 07:34:06.656688 171837 3@pebble/event.go:625 ⋮ [n3,pebble,s3] 103711  [JOB 30621] compacting(default) L4 [008104] (16 M) + L5 [008032] (16 M)
I220615 07:34:06.656868 171837 3@pebble/event.go:657 ⋮ [n3,pebble,s3] 103712  [JOB 30621] compacting: sstable created 030794
I220615 07:34:06.660084 171789 3@pebble/event.go:629 ⋮ [n3,pebble,s3] 103713  [JOB 30620] compaction(default) to L5 error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.660132 171789 3@pebble/event.go:621 ⋮ [n3,pebble,s3] 103714  background error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.660487 171792 3@pebble/event.go:625 ⋮ [n3,pebble,s3] 103715  [JOB 30622] compacting(default) L4 [008155] (16 M) + L5 [008098] (16 M)
I220615 07:34:06.660677 171792 3@pebble/event.go:657 ⋮ [n3,pebble,s3] 103716  [JOB 30622] compacting: sstable created 030795
I220615 07:34:06.673512 171792 3@pebble/event.go:629 ⋮ [n3,pebble,s3] 103717  [JOB 30622] compaction(default) to L5 error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.673574 171792 3@pebble/event.go:621 ⋮ [n3,pebble,s3] 103718  background error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error

Losing n3 cut the cluster's storage capacity by 10%, which meant that all nodes contained more data than in a typical run of this test. n4 had a little more than the other nodes and ultimately fell below the 5% free-space threshold, which failed the import.
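
Rough arithmetic, assuming data was spread evenly across the ten nodes: losing one node and re-replicating its ranges multiplies each survivor's fullness by 10/9, so a node that was ~86% full lands around 95.6%, under the 5% free-space floor. The starting fullness is an illustrative assumption:

```go
// Back-of-the-envelope: with data spread evenly over 10 nodes, losing one
// and re-replicating its ranges raises each survivor's fullness by 10/9.
// The 86% starting fullness is an illustrative assumption.
package main

import "fmt"

func main() {
	const nodes, usedFrac = 10.0, 0.86
	after := usedFrac * nodes / (nodes - 1) // fullness after losing one node
	fmt.Printf("fullness: %.1f%% -> %.1f%% (free: %.1f%%)\n",
		usedFrac*100, after*100, (1-after)*100)
	// prints: fullness: 86.0% -> 95.6% (free: 4.4%), under the 5% floor
}
```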

[screenshot from 2022-06-21: cluster storage metrics]

There's possibly a thread to pull on with the log spam from the Pebble side, and potentially an argument for bumping the capacity of each node.

Given this is ultimately a hardware flake (i.e. bad disk), there's not a ton we can do here, so I'm going to close this one out.

@nicktrav (Collaborator)

For the logging issue we have cockroachdb/pebble#270.
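
For anyone wondering what a mitigation could look like: one option is to rate-limit repeated background errors at the EventListener layer. A sketch against Pebble's EventListener hook; the wrapper and the one-log-per-interval policy are my own illustration, not necessarily what pebble#270 proposes, and the Options wiring differs across Pebble versions:

```go
// A sketch (not pebble#270's actual approach) of rate-limiting repeated
// background-error logs by wrapping Pebble's EventListener.
package main

import (
	"log"
	"sync"
	"time"

	"github.com/cockroachdb/pebble"
)

// rateLimitBackgroundErrors logs at most one background error per interval
// and counts how many were suppressed in between.
func rateLimitBackgroundErrors(base pebble.EventListener, every time.Duration) pebble.EventListener {
	var (
		mu         sync.Mutex
		last       time.Time
		suppressed int
	)
	inner := base.BackgroundError
	base.BackgroundError = func(err error) {
		mu.Lock()
		defer mu.Unlock()
		if time.Since(last) < every {
			suppressed++ // drop the log line, keep a count
			return
		}
		if suppressed > 0 {
			log.Printf("(%d similar background errors suppressed)", suppressed)
			suppressed = 0
		}
		last = time.Now()
		if inner != nil {
			inner(err)
		}
	}
	return base
}

func main() {
	// Wrap Pebble's stock logging listener; pass the result via
	// pebble.Options when opening the store.
	el := rateLimitBackgroundErrors(pebble.MakeLoggingEventListener(pebble.DefaultLogger), 10*time.Second)
	_ = el
}
```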

@nicktrav added the X-infra-flake label Dec 12, 2022