roachtest: tpcc/mixed-headroom/n5cpu16 failed #79765

Closed
cockroach-teamcity opened this issue Apr 11, 2022 · 7 comments
Labels
branch-release-22.1, C-test-failure, GA-blocker, O-roachtest, O-robot, T-kv

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 11, 2022

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on release-22.1 @ 12fac19acc2f05a1ec7c60e9a50e0c694c491657:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
	monitor.go:127,versionupgrade.go:690,versionupgrade.go:206,tpcc.go:427,test_runner.go:875: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.importLargeBankStep.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:690
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:206
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:427
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 3: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-15924

@cockroach-teamcity added the branch-release-22.1, C-test-failure, O-roachtest, O-robot, and release-blocker labels Apr 11, 2022
@blathers-crl bot added the T-kv label Apr 11, 2022
@tbg
Member

tbg commented Apr 11, 2022

[ 1985.186448] Out of memory: Killed process 14684 (cockroach) total-vm:18180576kB, anon-rss:10837820kB, file-rss:54896kB, shmem-rss:0kB, UID:1000 pgtables:32736kB oom_score_adj:0
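For readers who want to pull the numbers out of a kernel line like the one above, here is a minimal sketch in Go. It is not part of roachtest; `parseOOMKill`, `oomKill`, and `kibToGiB` are hypothetical names introduced for illustration, and the regular expression assumes the exact line format quoted in this thread (other kernel versions may word it differently). It also assumes the kernel's `kB` means 1024 bytes.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomKillRE matches the OOM-killer line format quoted in this thread.
// Assumption: other kernel versions may format this line differently.
var oomKillRE = regexp.MustCompile(
	`Out of memory: Killed process (\d+) \(([^)]+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB`)

// oomKill holds the fields extracted from one OOM-killer log line.
type oomKill struct {
	PID        int
	Comm       string
	TotalVMKiB int64
	AnonRSSKiB int64
}

// parseOOMKill returns the parsed fields and true if line looks like an
// OOM-killer report, or a zero value and false otherwise.
func parseOOMKill(line string) (oomKill, bool) {
	m := oomKillRE.FindStringSubmatch(line)
	if m == nil {
		return oomKill{}, false
	}
	pid, err := strconv.Atoi(m[1])
	if err != nil {
		return oomKill{}, false
	}
	vm, err := strconv.ParseInt(m[3], 10, 64)
	if err != nil {
		return oomKill{}, false
	}
	rss, err := strconv.ParseInt(m[4], 10, 64)
	if err != nil {
		return oomKill{}, false
	}
	return oomKill{PID: pid, Comm: m[2], TotalVMKiB: vm, AnonRSSKiB: rss}, true
}

// kibToGiB converts a size in KiB to GiB.
func kibToGiB(kib int64) float64 {
	return float64(kib) / (1 << 20)
}

func main() {
	line := "[ 1985.186448] Out of memory: Killed process 14684 (cockroach) " +
		"total-vm:18180576kB, anon-rss:10837820kB, file-rss:54896kB, " +
		"shmem-rss:0kB, UID:1000 pgtables:32736kB oom_score_adj:0"
	k, ok := parseOOMKill(line)
	if !ok {
		fmt.Println("not an OOM-killer line")
		return
	}
	fmt.Printf("pid=%d comm=%s anon-rss=%.2f GiB total-vm=%.2f GiB\n",
		k.PID, k.Comm, kibToGiB(k.AnonRSSKiB), kibToGiB(k.TotalVMKiB))
}
```

Under those assumptions, the anonymous RSS of the killed process works out to a little over 10 GiB.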

@tbg
Member

tbg commented Apr 11, 2022

cockroach exited with code 137: Mon Apr 11 13:21:03 UTC 2022
from ~80s before: memprof.2022-04-11T13_19_38.785.5028720464.pprof
https://share.polarsignals.com/794bf63/

image

Note that this OOM happened on the predecessor, i.e. v21.2, and tsdump was unable to download the timeseries for that reason.

@blathers-crl

blathers-crl bot commented Apr 11, 2022

cc @cockroachdb/bulk-io

@dt
Member

dt commented Apr 18, 2022

#78957 merged last week, which I think knocks out the big piece of this profile, but I don't know if that really explains the OOM. That big piece, while big, was still pretty much fixed in size based on the workload (row size) and GOMAXPROCS, so in most of these runs it was already as big as it was going to get. I suppose you could blame the OOM on it, since being so big meant we had little headroom remaining for something else (SSTs in raft?) to spike, but I don't think the workload generator memory would itself be the OOM-causing spike here.

@dt added the GA-blocker label and removed the release-blocker label Apr 18, 2022
@celiala
Collaborator

celiala commented Apr 20, 2022

Hi @shermanCRL, with the T-bulkio label removed, is there another team that is still looking into this (to remove the GA-blocker)? Thanks!

@shermanCRL
Contributor

Hi @celiala, I believe the GA blocking is on KV at this point. Accurate, @tbg?

@tbg
Member

tbg commented Apr 20, 2022

No, as far as KV is concerned this issue can be closed (which I will do now). The heap profile points at the workload causing the OOM and David says that was fixed. We don't understand everything here but there also isn't anything else to look at.

@tbg closed this as completed Apr 20, 2022