Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: clearrange/checks=false failed #78408

Closed
cockroach-teamcity opened this issue Mar 24, 2022 · 22 comments
Closed

roachtest: clearrange/checks=false failed #78408

cockroach-teamcity opened this issue Mar 24, 2022 · 22 comments
Assignees
Labels
A-disaster-recovery branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot. T-disaster-recovery T-storage Storage Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Mar 24, 2022

roachtest.clearrange/checks=false failed with artifacts on master @ 8b367174769c89c0fcfe50986ed68d4650be7750:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/clearrange/checks=false/run_1
	cluster.go:1953,clearrange.go:70,clearrange.go:39,test_runner.go:875: output in run_071920.613645006_n1_cockroach_workload_fixtures_import_bank: ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned: COMMAND_PROBLEM: exit status 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).RunE
		  | 	main/pkg/cmd/roachtest/cluster.go:1987
		  | main.(*clusterImpl).Run
		  | 	main/pkg/cmd/roachtest/cluster.go:1951
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:70
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:875
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_071920.613645006_n1_cockroach_workload_fixtures_import_bank
		Wraps: (3) ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I220324 07:19:21.418232 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 1 tables
		  | Error: importing fixture: importing table bank: dial tcp 127.0.0.1:26257: connect: connection refused
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-14116

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 24, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Mar 24, 2022
@nicktrav
Copy link
Collaborator

nicktrav commented Mar 24, 2022

This one is a new failure mode. Node 1 died due to being OOM killed. This happened on the very step in the test setup - the fixture import.

[  742.764301] Tasks state (memory values in pages):
[  742.764301] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  742.764304] [  12532]  1000 12532     2154      487    53248        0             0 bash
[  742.764306] [  12539]  1000 12539  5685019  3471454 43855872        0             0 cockroach
[  742.764307] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/system.slice/cockroach.service,task_memcg=/system.slice/cockroach.service,task=cockroach,pid=12539,uid=1000
[  742.764475] Memory cgroup out of memory: Killed process 12539 (cockroach) total-vm:22740076kB, anon-rss:13885816kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:42828kB oom_score_adj:0
[  743.536731] oom_reaper: reaped process 12539 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@nicktrav
Copy link
Collaborator

nicktrav commented Mar 24, 2022

The workers are all n1-standard-4's, which have 15GB of RAM (source). Pulling the metrics, node 1 stands out as having different behavior:

Screen Shot 2022-03-24 at 8 14 42 AM

@nicktrav
Copy link
Collaborator

Digging into the memory regions, it's the Go allocated memory that's the dominant factor:

Screen Shot 2022-03-24 at 8 23 31 AM

@nicktrav
Copy link
Collaborator

I see a lot of warnings about node 1 being overloaded, starting right around when memory started increasing on that node:

I220324 07:21:26.707657 309 kv/kvserver/pkg/kv/kvserver/store_raft.go:515 ⋮ [n1,s1,r226/5:‹/Table/106/1/1{32725…-95350…}›,raft] 1551  handle raft ready: 1.6s [applied=4, batches=4, state_assertions=0]; node might be overloaded

Screenshot from 2022-03-24 10-50-22

I found the following error in the logs for node 1:

E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137  unable to relocate range to [n9,s9 n1,s1 n7,s7]: while carrying out changes [{ADD_VOTER n9,s9} {REMOVE_VOTER n4,s4}]: change replicas of r407 failed: descriptor changed: [expected] r407:‹/Table/106/1/3{3784918-9061000}› [(n4,s4):1, (n1,s1):2, (n7,s7):3, next=4, gen=94, sticky=1648110326.133963771,0] != [actual] r407:‹/Table/106/1/33{784918-835414}› [(n4,s4):1, (n1,s1):2, (n7,s7):3, next=4, gen=95, sticky=1648110326.133963771,0]
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +(1)
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | (opaque error wrapper)
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | reportable 0:
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  |
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).relocateReplicas
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:2851
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).AdminRelocateRange
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:2769
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeAdminBatch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:951
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithoutRangeID
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:177
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:100
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:197
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:191
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1006
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:344
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/server/node.go:989
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1058
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:554
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:209
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:191
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2060
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1608
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1210
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:831
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:222
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:968
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:951
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv.sendAndFill
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:830
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:853
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).AdminRelocateRange
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:677
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore.func2
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:370
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:369
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).Start.func1
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:225
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | runtime.goexit
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +  | 	GOROOT/src/runtime/asm_amd64.s:1581
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 â‹® [n1,s1,store-rebalancer] 2137 +Wraps: (2) while carrying out changes [{ADD_VOTER n9,s9} {REMOVE_VOTER n4,s4}]

Full error preserved here.

@nicktrav
Copy link
Collaborator

cc: @nvanbenschoten (current KV oncall) - reckon you'd be able to help me triage this - specifically the rebalancer error?

@cockroach-teamcity
Copy link
Member Author

roachtest.clearrange/checks=false failed with artifacts on master @ 32b45c4bcf1ab41f0ba3abd36cb670eea7f450fd:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/clearrange/checks=false/run_1
	cluster.go:1953,clearrange.go:70,clearrange.go:39,test_runner.go:875: output in run_064838.157064865_n1_cockroach_workload_fixtures_import_bank: ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned: COMMAND_PROBLEM: exit status 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).RunE
		  | 	main/pkg/cmd/roachtest/cluster.go:1987
		  | main.(*clusterImpl).Run
		  | 	main/pkg/cmd/roachtest/cluster.go:1951
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:70
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:875
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_064838.157064865_n1_cockroach_workload_fixtures_import_bank
		Wraps: (3) ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I220328 06:48:38.949165 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 1 tables
		  | Error: importing fixture: importing table bank: dial tcp 127.0.0.1:26257: connect: connection refused
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError
Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@jbowens
Copy link
Collaborator

jbowens commented Mar 28, 2022

This reproduces pretty easily. Here's a heap profile dumped shortly before a node OOM'd in my own run.

File: cockroach
Type: inuse_space
Time: Mar 28, 2022 at 2:17pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 3.11GB, 97.50% of 3.19GB total
Dropped 420 nodes (cum <= 0.02GB)
Showing top 20 nodes out of 84
      flat  flat%   sum%        cum   cum%
    1.36GB 42.55% 42.55%     1.36GB 42.55%  github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).ensureRawBytes
    1.23GB 38.41% 80.96%     1.23GB 38.41%  go.etcd.io/etcd/raft/v3/raftpb.(*Entry).Unmarshal
    0.17GB  5.31% 86.26%     0.17GB  5.31%  github.com/cockroachdb/cockroach/pkg/col/coldata.(*element).setNonInlined (inline)
    0.12GB  3.90% 90.16%     0.12GB  3.90%  github.com/cockroachdb/cockroach/pkg/kv/bulk.(*kvBuf).fits
    0.09GB  2.93% 93.09%     0.09GB  2.93%  github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvserverpb.(*ReplicatedEvalResult_AddSSTable).Unmarshal
    0.08GB  2.44% 95.53%     0.08GB  2.44%  google.golang.org/grpc.(*parser).recvMsg
    0.03GB  1.01% 96.54%     0.03GB  1.01%  bytes.makeSlice
    0.03GB  0.97% 97.50%     0.03GB  0.97%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).propose
         0     0% 97.50%     0.03GB  1.01%  bytes.(*Buffer).Write
         0     0% 97.50%     0.03GB  1.01%  bytes.(*Buffer).grow
         0     0% 97.50%     0.17GB  5.31%  github.com/cockroachdb/cockroach/pkg/col/coldata.(*Bytes).Set
         0     0% 97.50%     0.17GB  5.31%  github.com/cockroachdb/cockroach/pkg/col/coldata.(*element).set
         0     0% 97.50%     0.15GB  4.83%  github.com/cockroachdb/cockroach/pkg/kv/bulk.(*BufferingAdder).Add
         0     0% 97.50%     0.03GB  0.93%  github.com/cockroachdb/cockroach/pkg/kv/bulk.(*BufferingAdder).doFlush
         0     0% 97.50%     0.03GB  0.93%  github.com/cockroachdb/cockroach/pkg/kv/bulk.(*SSTBatcher).AddMVCCKey
         0     0% 97.50%     1.30GB 40.85%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1
         0     0% 97.50%     1.30GB 40.85%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1
         0     0% 97.50%     0.03GB  0.98%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send
         0     0% 97.50%     0.03GB     1%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose
         0     0% 97.50%     0.03GB  0.98%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries

Screen Shot 2022-03-28 at 10 36 25 AM

@blathers-crl
Copy link

blathers-crl bot commented Mar 28, 2022

cc @cockroachdb/bulk-io

@cockroach-teamcity

This comment was marked as off-topic.

@tbg
Copy link
Member

tbg commented Mar 29, 2022

https://share.polarsignals.com/a1940fb/ from #78408 (comment)

image

This isn't as obvious yet. @jbowens could you post some heap profiles from your local run since you probably captured at a better time. You can use scripts/pprof-post < file to share the profiles easily.

@sumeerbhola
Copy link
Collaborator

(just trying to follow along)

  • the profile has 2.51GB of memory. Is "This isn't as obvious yet" because that is far from the Go allocated at OOM time?
  • how is the memory for the import (workloadReader.readFiles) managed? 1.51GB in there is significant.
  • regarding the RaftTransport.RaftMessageBatch, I don't see any attempt to grab memory from a memory pool before calling stream.Recv(), which would allow us to limit memory (block for 1 unit of memory, then after recv, grab the remaining without blocking, allowing it to become negative). I do see replicaRequestQueueSize downstream but that is counting queue length (not queue bytes) and is per replica (not unified across the node). Is there something I am missing?

@dt
Copy link
Member

dt commented Mar 29, 2022

how is the memory for the import (workloadReader.readFiles) managed? 1.51GB in there is significant.

I took a brief look at the workload side of this.

The workload synthesizer uses the same common infra as all the other IMPORT "frontends" (e.g. csv, mysql, etc) namely a rowconverter, which can be handed rows (datums) to convert to KVs for it to send on a channel to the "backend" buffer/sort/ingest process part of an IMPORT. This rowconverter helper accumulates "small" batches of KVs to send on this channel rather than sending each KV individually, to reduce locking channel overhead. "Small" is currently defined as 5000 KVs. These batches are then pulled off the channel on the ingest process side where they go into a large, but memory-monitored buffer, (which appears to be well behaved) before being sorted, divvied into SSTables and sent. The channel connecting the front and back end sides is buffered with a capacity of 10 batches.

Workload starts runtime.GOMAXPROCs workers to synthesize rows, each of with have their own converter with its own batch. Thus on 16vCPU machine, we could have up to 16 full batches in converters plus 10 batches in the channel, so we could have 16+10*5000 = as many as 130k KVs across various batches on their way to the ingest side.

This specific roachtest is configuring the workload synthesizer to make 10240 byte payload fields in its rows. So if all those fixed size batches happen to be full at once, with entries that have this relatively large row size, we'd expect to see 130k * 10kb = ~1.3gb. Add in the key size and col-id / family encoding to that 10kb per pk along with the currently draining batch, and maybe a couple drained batches that haven't been been freed yet, and I can see getting to 1.5. That said, I would not expect this spiking much above that; it is big, and not hooked into accounting, but not unbounded that I can see.

While the long-term answer might be to plumb full memory monitoring throughout the IMPORT frontend, in the meantime I'll whip up a patch to just add a second limit to the batch size on aggregate key/val cap, in addition to the 5k count, so we're less susceptible to "big" row input blowing up their footprint.

@dt
Copy link
Member

dt commented Mar 29, 2022

#78945

@dt
Copy link
Member

dt commented Mar 29, 2022

I don't think the above batch sizes stuff explains the spike though, or that #78945 will fix this, since I suspect 1.3-1.5gb is about the biggest this would grow, and doesn't explain what spikes up and OOMs it

@nicktrav
Copy link
Collaborator

I ran the final heap profile from each node in the failed run through pprof (via polarsignals):

Node 1 is clearly different in that it has additional in-use memory tied up in a RaftMessageBatch call stack. I didn't see this on any of the other nodes (2-10).

Screenshot from 2022-03-29 08-39-21

I see the same 1.5G / 70% of in-use space from the workload reader across all the nodes 2-10.

Tagging onto what @sumeerbhola had mentioned, should we be looking at what's happening with RaftTransport.RaftMessageBatch instead?

@tbg
Copy link
Member

tbg commented Mar 30, 2022

Tagging onto what @sumeerbhola had mentioned, should we be looking at what's happening with RaftTransport.RaftMessageBatch instead?

We understand this already, start reading here: #71805 (comment)

I am hoping to look into a mitigation for this "soon" (original plan was for this week, but not sure there is enough week left now; I will try to get started at least). However, once this is mitigated it will likely still indicate an unhealthy node. We are falling behind applying raft commands (node is receiving commands, queueing them, but not pulling them from the queue for handing to raft in a timely fashion), and I would guess this happens due to slowness of the storage layer. This might be worth confirming in these runs if you can, @nicktrav, to get ahead of the "next stage" of problem.

@jbowens jbowens added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 30, 2022
@nicktrav
Copy link
Collaborator

We understand this already, start reading here: #71805 (comment)

Perfect. I was unaware of this until now. Thanks for linking it.

This might be worth confirming in these runs if you can, @nicktrav, to get ahead of the "next stage" of problem.

I think we see the same thing in these runs. The store on the node that crashes is in poor health, relative to the others.

Digging into the storage metrics for the last failed run, it's clear that n1 is different (it's the timeseries that stands out :) ).

Screen Shot 2022-03-30 at 4 09 05 PM

Screen Shot 2022-03-30 at 4 09 23 PM

Here's the same graphs for the first failed run in this ticket:

Screen Shot 2022-03-30 at 4 35 13 PM

Screen Shot 2022-03-30 at 4 35 22 PM

The causality is tricky. Was the disk latency increased due to something else, or is the disk latency causing issues elsewhere.

Read amp looked fine in both cases:

Screen Shot 2022-03-30 at 4 09 49 PM

Screen Shot 2022-03-30 at 4 38 10 PM

@mwang1026
Copy link

@nicktrav officially assigned you for actuarial purposes.

@cockroach-teamcity
Copy link
Member Author

roachtest.clearrange/checks=false failed with artifacts on master @ 771432d1099e516dbc11827c5458886c176e73e3:

		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:875
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_062023.870680673_n1_cockroach_workload_fixtures_import_bank
		Wraps: (3) ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  |  util/log/clog.go:386  [-] 2 +	github.com/lib/pq/external/com_github_lib_pq/conn.go:1854 +0x46
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +github.com/lib/pq.(*stmt).exec(0xc001370d20, {0xc000c76160, 0xb, 0xb})
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	github.com/lib/pq/external/com_github_lib_pq/conn.go:1446 +0x772
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +github.com/lib/pq.(*conn).query(0xc000354f00, {0xc000988a80, 0x53}, {0xc000c76160, 0xb, 0xb})
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	github.com/lib/pq/external/com_github_lib_pq/conn.go:901 +0x30f
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +github.com/lib/pq.(*conn).QueryContext(0x480df9, {0x62d9088, 0xc000074140}, {0xc000988a80, 0x53}, {0xc000b86700, 0xb, 0x1})
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	github.com/lib/pq/external/com_github_lib_pq/conn_go18.go:24 +0xd7
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +database/sql.ctxDriverQuery({0x62d9088, 0xc000074140}, {0x7f51f42bda78, 0xc000354f00}, {0x0, 0x0}, {0xc000988a80, 0x1cb8c691a1}, {0xc000b86700, 0xb, ...})
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	GOROOT/src/database/sql/ctxutil.go:48 +0x17d
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +database/sql.(*DB).queryDC.func1()
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	GOROOT/src/database/sql/sql.go:1722 +0x175
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +database/sql.withLock({0x6254388, 0xc000ad0bd0}, 0xc001085a28)
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	GOROOT/src/database/sql/sql.go:3396 +0x8c
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +database/sql.(*DB).queryDC(0x4a2201, {0x62d9088, 0xc000074140}, {0x0, 0x0}, 0xc000ad0bd0, 0xc0009d1750, {0xc000988a80, 0x53}, {0xc001337200, ...})
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	GOROOT/src/database/sql/sql.go:1717 +0x211
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +database/sql.(*DB).query(0x7f51f1346a88, {0x62d9088, 0xc000074140}, {0xc000988a80, 0x53}, {0xc001337200, 0xb, 0x10}, 0xd0)
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	GOROOT/src/database/sql/sql.go:1700 +0xfd
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +database/sql.(*DB).QueryContext(0xc0013b9170, {0x62d9088, 0xc000074140}, {0xc000988a80, 0x53}, {0xc001337200, 0xb, 0x10})
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	GOROOT/src/database/sql/sql.go:1674 +0xdf
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +database/sql.(*DB).Query(...)
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	GOROOT/src/database/sql/sql.go:1691
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +github.com/cockroachdb/cockroach/pkg/ccl/workloadccl.importFixtureTable({0x62d9050, 0xc000819380}, 0x38bd6e9ddb4915eb, {0x0, 0x0}, {{0x4e0bf34, 0x4}, {0x4faf916, 0x5b}, {0xfe51, ...}, ...}, ...)
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	github.com/cockroachdb/cockroach/pkg/ccl/workloadccl/fixture.go:454 +0x3bb
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +github.com/cockroachdb/cockroach/pkg/ccl/workloadccl.ImportFixture.func1({0x62d9050, 0xc000819380})
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	github.com/cockroachdb/cockroach/pkg/ccl/workloadccl/fixture.go:402 +0xab
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:166 +0x25
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +golang.org/x/sync/errgroup.(*Group).Go.func1()
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57 +0x67
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +created by golang.org/x/sync/errgroup.(*Group).Go
		  | I220416 12:49:41.214403 97 util/log/clog.go:386  [-] 2 +	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:54 +0x92
		  |
		  | stdout:
		Wraps: (4) SSH_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  | ``````
		Wraps: (6) exit status 255
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.SSH (5) *hintdetail.withDetail (6) *exec.ExitError

	test_runner.go:1006,test_runner.go:905: test timed out (6h30m0s)
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@mari-crl
Copy link
Contributor

ℹ️ Hello! I am a human and not at all a robot! Look at my very human username! 🤖 🎶
🤔 Although I tried very hard to figure out what to do with this issue, more powerful human brains will need to help me. (specifically: Both Github and Jira issues were modified after exalate problems) 😖
🔄 Please visit this issue's mirror at CRDB-14116 and try to sync the two sides up manually. 🌟
✅ When you're finished, comment saying as much asn a member of Developer Infrastructure will be along to finish linking. 🔗
🚫 Note that until this is done, this issue is not and will not be synced to Jira with Exalate. 🚫
😅 Feeling lost? Don't worry about it! A member of @cockroachdb/exalate-22-cleanup-team will be along shortly to help! 👍
👷 Developer Infrastructure members: when ready, open Exalate from the right-hand menu of the mirror issue in Jira, then choose Connect and enter this issue's URN: cockroachdb/cockroach-78408. Either way, delete this comment when you're done. 🔑
🙏 Thank you for your compliance, my fellow humans! 🤖 👋

@nicktrav
Copy link
Collaborator

@tbg - it's been a while and I've since paged out all the relevant info on this one. I believe we're also tracking in #80155 and #71805.

Thoughts on closing this one out, it the failure mode is tracked in the other issues?

@tbg
Copy link
Member

tbg commented May 17, 2022

Sounds good, we also don't see these as often any more right now, and I'm working on #79215 (comment) to guide solutions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot. T-disaster-recovery T-storage Storage Team
Projects
No open projects
Archived in project
Development

No branches or pull requests

8 participants