Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: kv95/enc=false/nodes=3/cpu=32/no-admission failed [context oom] #84801

Closed
cockroach-teamcity opened this issue Jul 21, 2022 · 9 comments · Fixed by #84851
Closed
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 21, 2022

roachtest.kv95/enc=false/nodes=3/cpu=32/no-admission failed with artifacts on master @ 457d724622e4fa2e62d6f4e7926509dbc7d18511:

		  |  1148.0s        0        94798.2        89682.9      0.9      6.3     10.5     32.5 read
		  |  1148.0s        0         5115.6         4719.9      2.0      7.1     11.5     21.0 write
		  |  1149.0s        0        86771.3        89680.4      1.4      6.0      9.4     24.1 read
		  |  1149.0s        0         4574.5         4719.8      2.6      7.6     12.1     24.1 write
		  |  1150.0s        0        77794.4        89670.1      1.6      5.8      8.4     19.9 read
		  |  1150.0s        0         4034.3         4719.2      3.1      7.3     10.0     14.2 write
		  | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  |  1151.0s        0            0.0        89592.1      0.0      0.0      0.0      0.0 read
		  |  1151.0s        0            0.0         4715.1      0.0      0.0      0.0      0.0 write
		  |  1152.0s        0            0.0        89514.4      0.0      0.0      0.0      0.0 read
		  |  1152.0s        0            0.0         4711.0      0.0      0.0      0.0      0.0 write
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 4. Command with error:
		  | ``````
		  | ./workload run kv --init --histograms=perf/stats.json --concurrency=192 --splits=1000 --duration=30m0s --read-percent=95 {pgurl:1-3}
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,kv.go:159,kv.go:289,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:159
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:289
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:896
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=32 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-17874

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 21, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jul 21, 2022
@erikgrinaker
Copy link
Contributor

OOM-killed. Wouldn't surprise me if many of the other roachtest failures were too.

[ 8685.470905] cockroach invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 8685.470908] CPU: 20 PID: 17965 Comm: cockroach Not tainted 5.4.0-1041-aws #43-Ubuntu
[ 8685.470909] Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
[ 8685.470909] Call Trace:
[ 8685.470916]  dump_stack+0x6d/0x8b
[ 8685.470919]  dump_header+0x4f/0x1eb
[ 8685.470920]  oom_kill_process.cold+0xb/0x10
[ 8685.470923]  out_of_memory.part.0+0x1df/0x3d0
[ 8685.470924]  out_of_memory+0x6d/0xd0
[ 8685.470927]  mem_cgroup_out_of_memory+0xbd/0xe0
[ 8685.470929]  try_charge+0x77c/0x810
[ 8685.470932]  ? __alloc_pages_nodemask+0x173/0x320
[ 8685.470934]  mem_cgroup_try_charge+0x71/0x190
[ 8685.470936]  mem_cgroup_try_charge_delay+0x22/0x50
[ 8685.470939]  do_anonymous_page+0x13c/0x650
[ 8685.470942]  ? sock_recvmsg+0x54/0x80
[ 8685.470944]  __handle_mm_fault+0x760/0x7a0
[ 8685.470946]  handle_mm_fault+0xca/0x200
[ 8685.470948]  do_user_addr_fault+0x1f9/0x450
[ 8685.470950]  __do_page_fault+0x58/0x90
[ 8685.470951]  do_page_fault+0x2c/0xe0
[ 8685.470953]  do_async_page_fault+0x39/0x70
[ 8685.470955]  async_page_fault+0x34/0x40
[ 8685.470957] RIP: 0033:0x47b709
[ 8685.470958] Code: 90 90 48 8b 44 24 70 e8 f5 3d fe ff 48 8b 74 24 40 48 8b 44 24 30 48 8b 5c 24 28 48 89 74 24 20 48 89 44 24 30 48 89 5c 24 28 <44> 0f 11 3e 48 c7 46 10 00 00 00 00 48 89 46 18 48 8b 4c 24 78 48
[ 8685.470959] RSP: 002b:00007f5ab91fed30 EFLAGS: 00010206
[ 8685.470960] RAX: 000000cf452fc000 RBX: 0000000000002000 RCX: 0000000000000001
[ 8685.470961] RDX: 000000c0000c3800 RSI: 00007f5904491060 RDI: 0000000000000fa8
[ 8685.470962] RBP: 00007f5ab91fed90 R08: 0000000000002000 R09: 0000000000000000
[ 8685.470962] R10: 0000000000000004 R11: 0000000000000010 R12: 00007f59050f3c08
[ 8685.470963] R13: 0000000000000000 R14: 000000c000e0a1a0 R15: 0000000000000040
[ 8685.470986] memory: usage 68330132kB, limit 68330132kB, failcnt 25526
[ 8685.470987] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 8685.470987] kmem: usage 152660kB, limit 9007199254740988kB, failcnt 0
[ 8685.470988] Memory cgroup stats for /system.slice/cockroach.service:
[ 8685.471059] anon 69811589120
               file 532480
               kernel_stack 258048
               slab 11276288
               sock 0
               shmem 0
               file_mapped 0
               file_dirty 0
               file_writeback 1486848
               anon_thp 2185232384
               inactive_anon 0
               active_anon 69811318784
               inactive_file 0
               active_file 782336
               unevictable 0
               slab_reclaimable 9551872
               slab_unreclaimable 1724416
               pgfault 19608831
               pgmajfault 0
               workingset_refault 0
               workingset_activate 0
               workingset_nodereclaim 0
               pgrefill 2673
               pgscan 100165
               pgsteal 99988
               pgactivate 33
               pgdeactivate 2673
               pglazyfree 0
               pglazyfreed 0
               thp_fault_alloc 297
               thp_collapse_alloc 0
[ 8685.471060] Tasks state (memory values in pages):
[ 8685.471061] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 8685.471064] [  17940]  1000 17940     2154      828    53248        0             0 bash
[ 8685.471066] [  17945]  1000 17945 18060893 17070684 142409728        0             0 cockroach
[ 8685.471068] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/system.slice/cockroach.service,task_memcg=/system.slice/cockroach.service,task=cockroach,pid=17945,uid=1000
[ 8685.471194] Memory cgroup out of memory: Killed process 17945 (cockroach) total-vm:72243572kB, anon-rss:68166864kB, file-rss:116692kB, shmem-rss:0kB, UID:1000 pgtables:139072kB oom_score_adj:0
[ 8687.591215] oom_reaper: reaped process 17945 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jul 21, 2022

Looks like most of this is due to context.WithValue and context.WithCancel, rooted in connExecutor. Might be a context leak or some such. Did a spot check of #84804, same issue.

@cockroachdb/sql-experience @cockroachdb/sql-queries Does any of this ring a bell?

memprof.2022-07-21T08_37_12.773.56908028848.pprof.gz

Screenshot 2022-07-21 at 11 01 57

Rooted here:

Screenshot 2022-07-21 at 11 15 40

@erikgrinaker erikgrinaker changed the title roachtest: kv95/enc=false/nodes=3/cpu=32/no-admission failed roachtest: kv95/enc=false/nodes=3/cpu=32/no-admission failed [oom] Jul 21, 2022
@erikgrinaker erikgrinaker changed the title roachtest: kv95/enc=false/nodes=3/cpu=32/no-admission failed [oom] roachtest: kv95/enc=false/nodes=3/cpu=32/no-admission failed [context oom] Jul 21, 2022
@cucaroach
Copy link
Contributor

I looked at it but didn't come up with anything, cf69a32 looks possibly related.

@cockroach-teamcity
Copy link
Member Author

roachtest.kv95/enc=false/nodes=3/cpu=32/no-admission failed with artifacts on master @ 457d724622e4fa2e62d6f4e7926509dbc7d18511:

		  |   657.0s        0        58638.1        62093.6      1.5     10.0     16.3     46.1 read
		  |   657.0s        0         3073.2         3267.8      3.8     12.1     18.9     37.7 write
		  |   658.0s        0        59960.2        62090.3      1.5     10.0     16.3     39.8 read
		  |   658.0s        0         3231.1         3267.7      4.1     11.5     16.8     37.7 write
		  |   659.0s        0        62016.9        62090.2      1.4      9.4     15.2     56.6 read
		  |   659.0s        0         3190.9         3267.6      3.9     11.0     16.3     27.3 write
		  |   660.0s        0        59945.0        62087.0      1.5     10.0     15.7     48.2 read
		  |   660.0s        0         3195.5         3267.5      3.8     13.6     21.0     37.7 write
		  | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  |   661.0s        0        33216.1        62043.3      1.4     10.5     19.9     44.0 read
		  |   661.0s        0         1785.1         3265.2      3.9     14.2     22.0     41.9 write
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 4. Command with error:
		  | ``````
		  | ./workload run kv --init --histograms=perf/stats.json --concurrency=192 --splits=1000 --duration=30m0s --read-percent=95 {pgurl:1-3}
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,kv.go:159,kv.go:289,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:159
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:289
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:896
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker
Copy link
Contributor

@yuzefovich Mind having a look?

@andreimatei
Copy link
Contributor

@yuzefovich it's not clear to me that we're always calling cancelQuery on the statement contexts.

@yuzefovich
Copy link
Member

Yeah, I don't think we always do - but is that required? Is this what Go compiler often complains about with "possible leak of a context"? I think we can just defer the cancellation and we should be good.

@erikgrinaker
Copy link
Contributor

Yeah, it will leak a goroutine if you don't call cancel.

@andreimatei
Copy link
Contributor

No goroutine is involved if you stick to only using stdlib contexts, but it will leak something.

This was referenced Jul 21, 2022
@craig craig bot closed this as completed in 79edfce Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants