roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #89100

Closed
cockroach-teamcity opened this issue Sep 30, 2022 · 9 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Sep 30, 2022

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-22.2 @ c69571de022f48d95d76fb39e378cc9ab9a30afe:

test artifacts and logs in: /artifacts/tpccbench/nodes=9/cpu=4/multi-region/run_1
	monitor.go:127,tpcc.go:1113,tpcc.go:950,test_runner.go:919: monitor failure: monitor task failed: Non-zero exit code: 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1113
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:950
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor task failed
		Wraps: (5) Non-zero exit code: 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *install.NonZeroExitCode

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-20118

Epic CRDB-20293

@cockroach-teamcity cockroach-teamcity added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 30, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Sep 30, 2022
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Sep 30, 2022
@renatolabs renatolabs removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 30, 2022
@renatolabs
Contributor

Same as #86987.

@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-22.2 @ 600df9f5c387a07fd9b4ba5e54f8b8240645176f:

test artifacts and logs in: /artifacts/tpccbench/nodes=9/cpu=4/multi-region/run_1
	monitor.go:127,tpcc.go:1113,tpcc.go:950,test_runner.go:930: monitor failure: monitor task failed: Non-zero exit code: 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1113
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:950
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor task failed
		Wraps: (5) Non-zero exit code: 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *install.NonZeroExitCode

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-22.2 @ 36e4cb1677e055e67fa3cfdce81dea18cf10df33:

test artifacts and logs in: /artifacts/tpccbench/nodes=9/cpu=4/multi-region/run_1
	monitor.go:127,tpcc.go:1113,tpcc.go:950,test_runner.go:930: monitor failure: monitor task failed: Non-zero exit code: 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1113
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:950
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor task failed
		Wraps: (5) Non-zero exit code: 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *install.NonZeroExitCode

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-22.2 @ fe99dbf5702d1dedeb97d0bac7cb646dc2ec379c:

test artifacts and logs in: /artifacts/tpccbench/nodes=9/cpu=4/multi-region/run_1
	monitor.go:127,tpcc.go:1196,tpcc.go:1033,test_runner.go:930: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 7)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1196
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1033
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 6: dead (exit status 7)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


@renatolabs
Contributor

Disk stall detected on n6:

I221101 15:07:57.600520 11639 1@storage/pebble.go:1005 ⋮ [n5] 382  the server is terminating due to a fatal error (see the DEV channel for details)
F221101 15:07:57.600557 11639 storage/pebble.go:1005 ⋮ [n5] 383  disk stall detected: pebble unable to write to ‹/mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r562/i36.t7› in 20.01 seconds

This error is preceded by likely related timed-out heartbeat warnings:

W221101 15:07:55.156600 223 kv/kvserver/liveness/liveness.go:885 ⋮ [n5,liveness-hb] 376  slow heartbeat took 4.518066869s; err=context deadline exceeded
W221101 15:07:55.156650 223 kv/kvserver/liveness/liveness.go:787 ⋮ [n5,liveness-hb] 377  failed node liveness heartbeat: ‹operation "node liveness heartbeat" timed out after 4.518s (given timeout 4.5s)›: context deadline exceeded

This seems like a test flake (hardware problem?). I'll let @cockroachdb/storage take a quick look to confirm if that makes sense.

@nicktrav nicktrav added the T-storage Storage Team label Nov 1, 2022
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Nov 1, 2022
@nicktrav nicktrav self-assigned this Nov 1, 2022
@itsbilal
Member

itsbilal commented Nov 1, 2022

@renatolabs yes, that just looks like the hardware / infrastructure got overwhelmed and we were waiting 20s for a write. Likely nothing to do here as this usually happens due to things outside our control (AWS/GCP issues, noisy neighbour, something like that), and it seems like a one-off case too.

We could consider bumping the disk stall threshold back up (it used to be 60s), but maybe all it'd have done here is delayed the same outcome by about 40 seconds.

@nicktrav
Collaborator

nicktrav commented Nov 1, 2022

> We could consider bumping the disk stall threshold back up (it used to be 60s)

This was fairly heavily litigated over in #81075. I don't think it's worth going back to 60s, unless this is becoming very noisy / toilsome.

@exalate-issue-sync exalate-issue-sync bot removed the T-testeng TestEng Team label Nov 1, 2022
@itsbilal
Member

itsbilal commented Nov 1, 2022

Ah yeah. Didn't mean to suggest it should go all the way up to 60s, there could be a sweet spot in between - but since this isn't a noisy failure at all I think sticking with the status quo makes sense.

@nicktrav
Collaborator

nicktrav commented Nov 1, 2022

Looked in the Pebble logs for n6, and I see a bunch of stalled writes, to separate files.

Agree with Bilal's assessment - infra flake.

If we keep seeing these due to the 20s thresholds, let's reconsider.

@nicktrav nicktrav closed this as completed Nov 1, 2022
@nicktrav nicktrav added the X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue label Dec 12, 2022
4 participants