
roachtest: clearrange/checks=false failed #82924

Closed · cockroach-teamcity opened this issue Jun 15, 2022 · 4 comments
Labels
- branch-master: Failures and bugs on the master branch.
- C-test-failure: Broken test (automatically or manually discovered).
- O-roachtest
- O-robot: Originated from a bot.
- T-storage: Storage Team
- X-infra-flake: the automatically generated issue was closed due to an infrastructure problem, not a product issue

Comments

@cockroach-teamcity (Member) commented Jun 15, 2022

roachtest.clearrange/checks=false failed with artifacts on master @ aadbaf97b4e6092ad6978a28e2735715d64d9f10:

test artifacts and logs in: /artifacts/clearrange/checks=false/run_1
	cluster.go:1915,clearrange.go:70,clearrange.go:39,test_runner.go:884: output in run_071129.080360630_n1_cockroach_workload_fixtures_import_bank: ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned: COMMAND_PROBLEM: exit status 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).RunE
		  | 	main/pkg/cmd/roachtest/cluster.go:1949
		  | main.(*clusterImpl).Run
		  | 	main/pkg/cmd/roachtest/cluster.go:1913
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:70
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:884
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_071129.080360630_n1_cockroach_workload_fixtures_import_bank
		Wraps: (3) ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I220615 07:11:29.962045 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 1 tables
		  | Error: importing fixture: importing table bank: pq: pausing due to error; use RESUME JOB to try to proceed once the issue is resolved, or CANCEL JOB to rollback: store 4 has insufficient remaining capacity to ingest data (remaining: 16 GiB / 4.3%, min required: 5.0%)
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage


Jira issue: CRDB-16734

Epic CRDB-16237

@cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Jun 15, 2022
@blathers-crl (bot) added the T-storage label Jun 15, 2022
@nicktrav (Collaborator)

Given this isn't consistently failing, I think that's good enough reason to remove release-blocker for now. We can add it back if necessary.

@nicktrav removed the release-blocker label Jun 21, 2022
@nicktrav (Collaborator)

The relevant part is:

store 4 has insufficient remaining capacity to ingest data (remaining: 16 GiB / 4.3%, min required: 5.0%)

That's a new error I've not seen before, and it's surprising that we were almost out of storage.

The timeseries metrics will likely be useful here, so I'll boot a local cluster and see what was going on.
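
For context, the message comes from a guard that refuses ingestion once the fraction of free space on a store drops below a floor. A minimal sketch of that kind of check; the types, names, and 5% constant here are illustrative assumptions, not CockroachDB's actual code:

```go
// Sketch of a free-space gate on ingestion. Everything here (names, struct,
// the 5% floor) is illustrative, not CockroachDB's real implementation.
package main

import "fmt"

const minFreeFraction = 0.05 // matches the "min required: 5.0%" in the error

type storeCapacity struct {
	Total     int64 // bytes
	Available int64 // bytes
}

func canIngest(c storeCapacity) error {
	frac := float64(c.Available) / float64(c.Total)
	if frac < minFreeFraction {
		return fmt.Errorf(
			"store has insufficient remaining capacity to ingest data (remaining: %d GiB / %.1f%%, min required: %.1f%%)",
			c.Available>>30, frac*100, minFreeFraction*100)
	}
	return nil
}

func main() {
	// From the error above: 16 GiB remaining at 4.3% implies a ~372 GiB store.
	fmt.Println(canIngest(storeCapacity{Total: 372 << 30, Available: 16 << 30}))
}
```

Note that 16 GiB at 4.3% implies a store of roughly 372 GiB, so the 5% floor works out to about 18.6 GiB free.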

@nicktrav (Collaborator)

This one looks pretty open-and-shut. Node 3 started seeing a number of disk issues around 7:20:

Jun 15 07:20:58 teamcity-5474915-1655270359-28-n10cpu16-0003 kernel: blk_update_request: critical medium error, dev nvme0n1, sector 223086016 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0

Ultimately it crashed at 7:34 with exit code 134, presumably due to the bad disk:

Jun 15 07:34:02 teamcity-5474915-1655270359-28-n10cpu16-0003 kernel: blk_update_request: critical medium error, dev nvme0n1, sector 223086240 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 bash[12672]: ./cockroach.sh: line 70: 12680 Aborted                 (core dumped) "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr>
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 bash[12823]: cockroach exited with code 134: Wed Jun 15 07:34:24 UTC 2022
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=134/n/a
Jun 15 07:34:24 teamcity-5474915-1655270359-28-n10cpu16-0003 systemd[1]: cockroach.service: Failed with result 'exit-code'.
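
As an aside, shell exit codes above 128 encode a fatal signal as 128 + the signal number, so 134 is 128 + 6 (SIGABRT), consistent with the "Aborted (core dumped)" line above. A tiny illustrative decoder:

```go
// Illustrative only: decode a shell-style exit status such as 134 into the
// terminating signal (134 = 128 + 6 = SIGABRT). Unix-specific.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	code := 134
	if code > 128 {
		sig := syscall.Signal(code - 128)
		// On Linux this prints: exit 134 => killed by signal 6 (aborted)
		fmt.Printf("exit %d => killed by signal %d (%v)\n", code, code-128, sig)
	}
}
```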

The Pebble logs are full of background errors from compactions that couldn't read the affected sstables:

I220615 07:34:06.656688 171837 3@pebble/event.go:625 ⋮ [n3,pebble,s3] 103711  [JOB 30621] compacting(default) L4 [008104] (16 M) + L5 [008032] (16 M)
I220615 07:34:06.656868 171837 3@pebble/event.go:657 ⋮ [n3,pebble,s3] 103712  [JOB 30621] compacting: sstable created 030794
I220615 07:34:06.660084 171789 3@pebble/event.go:629 ⋮ [n3,pebble,s3] 103713  [JOB 30620] compaction(default) to L5 error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.660132 171789 3@pebble/event.go:621 ⋮ [n3,pebble,s3] 103714  background error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.660487 171792 3@pebble/event.go:625 ⋮ [n3,pebble,s3] 103715  [JOB 30622] compacting(default) L4 [008155] (16 M) + L5 [008098] (16 M)
I220615 07:34:06.660677 171792 3@pebble/event.go:657 ⋮ [n3,pebble,s3] 103716  [JOB 30622] compacting: sstable created 030795
I220615 07:34:06.673512 171792 3@pebble/event.go:629 ⋮ [n3,pebble,s3] 103717  [JOB 30622] compaction(default) to L5 error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error
I220615 07:34:06.673574 171792 3@pebble/event.go:621 ⋮ [n3,pebble,s3] 103718  background error: read ‹/mnt/data1/cockroach/008098.sst›: input/output error

Losing n3 cut the cluster's storage capacity by 10%, which meant that all nodes contained more data than in a typical run of this test. n4 had a little more than the other nodes and ultimately fell below the 5% free-space threshold, which failed the import.
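
Rough arithmetic, assuming data was spread evenly across the ten nodes: losing one node and re-replicating its ranges multiplies each survivor's fullness by 10/9, so a node that was ~86% full lands around 95.6%, under the 5% free-space floor. The starting fullness is an illustrative assumption:

```go
// Back-of-the-envelope: with data spread evenly over 10 nodes, losing one
// and re-replicating its ranges raises each survivor's fullness by 10/9.
// The 86% starting fullness is an illustrative assumption.
package main

import "fmt"

func main() {
	const nodes, usedFrac = 10.0, 0.86
	after := usedFrac * nodes / (nodes - 1) // fullness after losing one node
	fmt.Printf("fullness: %.1f%% -> %.1f%% (free: %.1f%%)\n",
		usedFrac*100, after*100, (1-after)*100)
	// prints: fullness: 86.0% -> 95.6% (free: 4.4%), under the 5% floor
}
```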

[screenshot from 2022-06-21: cluster storage metrics]

There's possibly a thread to pull on with the log spam from the Pebble side, and potentially an argument for bumping the capacity of each node.

Given this is ultimately a hardware flake (i.e. bad disk), there's not a ton we can do here, so I'm going to close this one out.

@nicktrav (Collaborator)

For the logging issue we have cockroachdb/pebble#270.
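
For anyone wondering what a mitigation could look like: one option is to rate-limit repeated background errors at the EventListener layer. A sketch against Pebble's EventListener hook; the wrapper and the one-log-per-interval policy are my own illustration, not necessarily what pebble#270 proposes, and the Options wiring differs across Pebble versions:

```go
// A sketch (not pebble#270's actual approach) of rate-limiting repeated
// background-error logs by wrapping Pebble's EventListener.
package main

import (
	"log"
	"sync"
	"time"

	"github.com/cockroachdb/pebble"
)

// rateLimitBackgroundErrors logs at most one background error per interval
// and counts how many were suppressed in between.
func rateLimitBackgroundErrors(base pebble.EventListener, every time.Duration) pebble.EventListener {
	var (
		mu         sync.Mutex
		last       time.Time
		suppressed int
	)
	inner := base.BackgroundError
	base.BackgroundError = func(err error) {
		mu.Lock()
		defer mu.Unlock()
		if time.Since(last) < every {
			suppressed++ // drop the log line, keep a count
			return
		}
		if suppressed > 0 {
			log.Printf("(%d similar background errors suppressed)", suppressed)
			suppressed = 0
		}
		last = time.Now()
		if inner != nil {
			inner(err)
		}
	}
	return base
}

func main() {
	// Wrap Pebble's stock logging listener; pass the result via
	// pebble.Options when opening the store.
	el := rateLimitBackgroundErrors(pebble.MakeLoggingEventListener(pebble.DefaultLogger), 10*time.Second)
	_ = el
}
```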

@nicktrav added the X-infra-flake label Dec 12, 2022