Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: kv/splits/nodes=3/quiesce=false failed #88678

Closed
cockroach-teamcity opened this issue Sep 25, 2022 · 1 comment · Fixed by #88745
Closed

roachtest: kv/splits/nodes=3/quiesce=false failed #88678

cockroach-teamcity opened this issue Sep 25, 2022 · 1 comment · Fixed by #88745
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Sep 25, 2022

roachtest.kv/splits/nodes=3/quiesce=false failed with artifacts on master @ 51c8aae748d338549400c047796c6c9b892527da:

		  | r12    0x6
		  | r13    0x2
		  | r14    0xc000102680
		  | r15    0x7f15679fbf40
		  | rip    0x49a101
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  |
		  | stdout:
		Wraps: (4) SSH_PROBLEM
		Wraps: (5) Node 4. Command with error:
		  | ``````
		  | ./workload run kv --init --max-ops=1 --concurrency=192 --splits=30000 {pgurl:1-3}
		  | ``````
		Wraps: (6) exit status 255
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.SSH (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,kv.go:729,test_runner.go:928: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKVSplits.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:729
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:928
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6340
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:233
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

	test_runner.go:1059,test_runner.go:958: test timed out (2h0m0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-19880

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 25, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Sep 25, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Sep 25, 2022
@nvanbenschoten
Copy link
Member

nvanbenschoten commented Sep 26, 2022

We saw the test time out after 2 hours while performing manual splits.

In logs/1.unredacted/cockroach.stderr.log, we can see a stalled goroutine attempting to allocate a new range ID:

* goroutine 564601 [select, 108 minutes]:
* github.com/cockroachdb/cockroach/pkg/kv/kvserver/idalloc.(*Allocator).Allocate(0x6673a88?, {0x6673a88, 0xc030c810e0})
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/idalloc/id_alloc.go:88 +0xd7
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).AllocateRangeID(...)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:2947
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).adminSplitWithDescriptor(0xc004d04000, {0x6673a88, 0xc030c810e0}, {{{0xc02ab19dd0, 0xb, 0x10}, {0x0, 0x0, 0x0}, 0x0}, ...}, ...)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:405 +0xa6d
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).AdminSplit.func1(0x171803366c7a9565?)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:80 +0xa5
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeAdminCommandWithDescriptor(0x1?, {0x6673a88, 0xc030c810e0}, 0xc0123ed568)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:549 +0x2e9
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).AdminSplit(0x0?, {0x6673a88?, 0xc030c810e0?}, {{{0xc02ab19dd0, 0xb, 0x10}, {0x0, 0x0, 0x0}, 0x0}, ...}, ...)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:78 +0x15a

In the idalloc package, we see the main goroutine in a retry loop:

* goroutine 2466 [select]:
* github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc012743e88)
* 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:127 +0x13e
* github.com/cockroachdb/cockroach/pkg/kv/kvserver/idalloc.(*Allocator).start.func1({0x6673a88, 0xc006d6cb10})
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/idalloc/id_alloc.go:109 +0x2f4
* github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489 +0x146
* created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:480 +0x43b

In the logs, we see warnings that mention the reason for the retries:

W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 ⋮ [n1,s1,idalloc] 76418  unable to allocate 10 ids from ‹/System/"range-idgen"›: replica unavailable: (n1,s1):1 unable to serve request to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1664084140.379435638,0 (2022-09-25 05:35:40); raft status: {"id":"1","term":824,"vote":"0","commit":3287,"lead":"0","raftState":"StateFollower","applied":3287,"progress":{},"leadtransferee":"0"}: encountered poisoned latch ‹/Systemrange-idgen›@1664084113.600101524,0
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 ⋮ [n1,s1,idalloc] 76418 +(1) replica unavailable: (n1,s1):1 unable to serve request to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1664084140.379435638,0 (2022-09-25 05:35:40); raft status: {"id":"1","term":824,"vote":"0","commit":3287,"lead":"0","raftState":"StateFollower","applied":3287,"progress":{},"leadtransferee":"0"}: encountered poisoned latch ‹/Systemrange-idgen›@1664084113.600101524,0
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +Wraps: (2)
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | (opaque error wrapper)
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | reportable 0:
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  |
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.replicaUnavailableError
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:276
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).replicaUnavailableError
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:285
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:467
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:179
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).SendWithWriteBytes
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:204
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).SendWithWriteBytes
W220925 07:29:36.577480 2466 kv/kvserver/idalloc/id_alloc.go:120 â‹® [n1,s1,idalloc] 76418 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:227

It's not clear to me why the replica circuit breaker was tripped and why it could not recover. Keeping this as a release blocker for now, until we can understand it better.

@craig craig bot closed this as completed in 6711a8b Sep 27, 2022
@erikgrinaker erikgrinaker self-assigned this Sep 27, 2022
@lunevalex lunevalex added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants