roachtest: kv/restart/nodes=12 failed #102655

cockroach-teamcity · 2023-04-29T06:33:48Z

roachtest.kv/restart/nodes=12 failed with artifacts on master @ 4619767bb0b75ac85ddcb76e33c73211e369afed:

test artifacts and logs in: /artifacts/kv/restart/nodes=12/run_1
(assertions.go:262).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:1045
	            				main/pkg/cmd/roachtest/test_runner.go:992
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	"0.2652998776009792" is not greater than "0.5"
	Test:       	kv/restart/nodes=12
(cluster.go:2028).Run: output in run_060424.609219422_n13_cockroach-workload-r: ./cockroach workload run kv --min-block-bytes=8192 --max-block-bytes=8192 --tolerate-errors --read-percent=50  --duration=30m0s --concurrency=64 {pgurl:1-11} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage _{This test on roachdash | Improve this report!

Jira issue: CRDB-27571}

The text was updated successfully, but these errors were encountered:

aliher1911 · 2023-05-02T17:32:24Z

It looks like the same #98928 failure.

I'm looking on the n12 which was restarted and it goes to overload

immediately after restart.
At the same time other nodes shed 125 leases as soon as node comes back. Node itself is struggling with stalling disks and unhealthy storage while processing snapshots. Senders show raft send queue to n12 is full and its logs are full of raft ready handling: 0.98s [append=0.00s, apply=0.00s, non-blocking-sync=0.98s, other=0.00s], wrote [append-batch=25 KiB, append-ent=24 KiB (6), ]; node might be overloaded until it is killed.

One thing I noticed is that DescriptorTable was moved to that node. So maybe it is causing more outage compared to range with kv table data being moved to overloaded node.

We removed release blocker from it last time, do we need to revisit this?

aliher1911 · 2023-05-02T17:33:06Z

@kvoli assigning to you as an owner of previous instance of this failure to make a call.

aliher1911 · 2023-05-03T08:29:35Z

Removed release blocker as this is a known issue which affects previous versions as well.

kvoli · 2023-05-05T23:52:58Z

The lease transfer towards n12 occur approx. 7s after it rejoins the cluster, until the node becomes IO overloaded.

Details

Rejoining gossip

I230429 06:26:05.853633 445 gossip/client.go:124 ⋮ [T1,n12] 23  started gossip client to n0 (‹34.132.147.1:26257›)

Lease transfers towards n12 from n1

I230429 06:26:15.541366 3383900 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r715/1:‹/Table/106/1/305{5089…-6011…}›] 31767  transferring lease to s12
I230429 06:26:17.610008 3387693 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r10274/2:‹/Table/106/1/-544{4745…-3823…}›] 31768  transferring lease to s12
I230429 06:26:20.525492 3391582 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r13184/1:‹/Table/106/1/-449{1096…-0174…}›] 31769  transferring lease to s12
I230429 06:26:24.661096 3397429 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r1091/4:‹/Table/106/1/629{2330…-3253…}›] 31770  transferring lease to s12
I230429 06:26:29.768146 3401531 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r10380/4:‹/Table/106/1/363{2443…-3365…}›] 31771  transferring lease to s12
I230429 06:26:31.225278 3403627 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r15251/4:‹/Table/106/1/-216{1389…-0466…}›] 31772  transferring lease to s12
I230429 06:26:39.130465 3409897 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r9154/2:‹/Table/106/1/-630{8009…-7087…}›] 31773  transferring lease to s12
I230429 06:26:42.659058 3412547 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r6365/2:‹/Table/106/1/167{4419…-5341…}›] 31774  transferring lease to s12
I230429 06:26:45.579395 3414001 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r10647/5:‹/Table/106/1/146{6903…-7826…}›] 31775  transferring lease to s12
I230429 06:26:47.163267 3414110 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r14051/3:‹/Table/106/1/-699{4194…-3272…}›] 31776  transferring lease to s12
I230429 06:26:51.904053 3417816 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r18019/5:‹/Table/106/1/-632{2766…-1844…}›] 31777  transferring lease to s12
I230429 06:26:55.075423 3418271 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r18458/1:‹/Table/106/1/-659{4842…-3920…}›] 31778  transferring lease to s12
I230429 06:26:56.293519 3418451 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r17854/4:‹/Table/106/1/-761{6740…-5818…}›] 31779  transferring lease to s12

However, that shouldn't be possible since we consider re-joining stores as suspect for 30s after they rejoin the cluster (#97532).

The leases stop attempting to transfer towards n12 once it goes over the threshold (0.5), around 50s after it joins.

It seems as though even if those leases didn't transfer during the suspect period (like the shouldn't have), the store would have still been getting a couple leases due to the IO overload being slow.

I haven't dug into why the store's didn't consider it suspect after startup. I'll keep looking Monday. @andrewbaptist.

The `kv/restart/nodes=12` relies on correct allocation decisions to pass. This commit increases the logging verbosity of related components so that if/when the test fails, a root cause is easier to establish. Informs: cockroachdb#102655 Epic: none Release note: None

kvoli · 2023-05-08T23:03:44Z

Stressing now with increased verbosity.

kvoli · 2023-05-09T13:09:58Z

Stressed overnight (12 times) without failure on master. We may have just gotten unlucky here and the test may have failed even with the suspect being respected - (IO overload didn't occur until 50s after node restarted, so there's a 20s gap where it could receive leases regardless).

I've opened #102915 to increase the logging verbosity so that if this fails again we have more material to investigate.

Going to let this sit for now and wait for future failures.

102915: roachtest: increase kv/restart/nodes=12 logging verbosity r=andrewbaptist a=kvoli The `kv/restart/nodes=12` relies on correct allocation decisions to pass. This commit increases the logging verbosity of related components so that if/when the test fails, a root cause is easier to establish. Informs: #102655 Epic: none Release note: None Co-authored-by: Austen McClernon <[email protected]>

cockroach-teamcity · 2023-05-27T06:44:21Z

roachtest.kv/restart/nodes=12 failed with artifacts on master @ 9d6e7baebdc0d8b1a57ce8dc158fec35b48e9e2a:

test artifacts and logs in: /artifacts/kv/restart/nodes=12/run_1
(cluster.go:2095).Run: output in run_064203.009141900_n13_cockroach-workload-i: ./cockroach workload init kv  --splits=20000 {pgurl:1} returned: COMMAND_PROBLEM: exit status 1
(test_runner.go:1109).func1: 1 dead node(s) detected

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

_{This test on roachdash | Improve this report!}

kvoli · 2023-06-01T13:38:54Z

Most recent failure is an unrelated panic due to #104007. Fixed by #104082.

* ERROR: a panic has occurred!
* panic: iterator with constraint=2 is being used with key /Min that has constraint=1
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:884
*   | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).checkConstraint
*   | 	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:574
*   | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekGE
*   | 	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:480
*   | github.com/cockroachdb/cockroach/pkg/storage.mvccMinSplitKey
*   | 	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:5904
*   | github.com/cockroachdb/cockroach/pkg/storage.MVCCFirstSplitKey
*   | 	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:5955

Full Stack

*
* ERROR: a panic has occurred!
* panic: iterator with constraint=2 is being used with key /Min that has constraint=1
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:884
*   | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).checkConstraint
*   | 	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:574
*   | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekGE
*   | 	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:480
*   | github.com/cockroachdb/cockroach/pkg/storage.mvccMinSplitKey
*   | 	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:5904
*   | github.com/cockroachdb/cockroach/pkg/storage.MVCCFirstSplitKey
*   | 	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:5955
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).adminSplitWithDescriptor
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:357
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*splitQueue).processAttempt
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/split_queue.go:321
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*splitQueue).process
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/split_queue.go:225
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplica.func1
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:1020
*   | github.com/cockroachdb/cockroach/pkg/util/timeutil.RunWithTimeout
*   | 	github.com/cockroachdb/cockroach/pkg/util/timeutil/timeout.go:29
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplica
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:979
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processLoop.func2.1
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:890
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
*   | runtime.goexit
*   | 	GOROOT/src/runtime/asm_amd64.s:1594
* Wraps: (2) panic: iterator with constraint=2 is being used with key /Min that has constraint=1
* Error types: (1) *withstack.withStack (2) *errutil.leafError
*
cockroach exited with code 134: Sat May 27 06:42:50 UTC 2023

kvoli · 2023-06-27T14:42:24Z

There have been no failures since Apr 29th (70 days). The last failure on May 27th was due to an unrelated bug mentioned in the above comment.

Closing.

cockroach-teamcity added this to the 23.1 milestone Apr 29, 2023

aliher1911 assigned kvoli May 2, 2023

aliher1911 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 2, 2023

kvoli mentioned this issue May 8, 2023

roachtest: increase kv/restart/nodes=12 logging verbosity #102915

Merged

kvoli closed this as completed Jun 27, 2023

erikgrinaker mentioned this issue Jun 29, 2023

roachtest: kv/restart/nodes=12 failed #105793

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: kv/restart/nodes=12 failed #102655

roachtest: kv/restart/nodes=12 failed #102655

cockroach-teamcity commented Apr 29, 2023 •

edited by cockroach-jira-scripts

Loading

aliher1911 commented May 2, 2023

aliher1911 commented May 2, 2023

aliher1911 commented May 3, 2023

kvoli commented May 5, 2023

kvoli commented May 8, 2023

kvoli commented May 9, 2023

cockroach-teamcity commented May 27, 2023

kvoli commented Jun 1, 2023

kvoli commented Jun 27, 2023

roachtest: kv/restart/nodes=12 failed #102655

roachtest: kv/restart/nodes=12 failed #102655

Comments

cockroach-teamcity commented Apr 29, 2023 • edited by cockroach-jira-scripts Loading

aliher1911 commented May 2, 2023

aliher1911 commented May 2, 2023

aliher1911 commented May 3, 2023

kvoli commented May 5, 2023

kvoli commented May 8, 2023

kvoli commented May 9, 2023

cockroach-teamcity commented May 27, 2023

kvoli commented Jun 1, 2023

kvoli commented Jun 27, 2023

cockroach-teamcity commented Apr 29, 2023 •

edited by cockroach-jira-scripts

Loading