Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: kv/restart/nodes=12 failed #102655

Closed
cockroach-teamcity opened this issue Apr 29, 2023 · 9 comments
Closed

roachtest: kv/restart/nodes=12 failed #102655

cockroach-teamcity opened this issue Apr 29, 2023 · 9 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Apr 29, 2023

roachtest.kv/restart/nodes=12 failed with artifacts on master @ 4619767bb0b75ac85ddcb76e33c73211e369afed:

test artifacts and logs in: /artifacts/kv/restart/nodes=12/run_1
(assertions.go:262).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:1045
	            				main/pkg/cmd/roachtest/test_runner.go:992
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	"0.2652998776009792" is not greater than "0.5"
	Test:       	kv/restart/nodes=12
(cluster.go:2028).Run: output in run_060424.609219422_n13_cockroach-workload-r: ./cockroach workload run kv --min-block-bytes=8192 --max-block-bytes=8192 --tolerate-errors --read-percent=50  --duration=30m0s --concurrency=64 {pgurl:1-11} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-27571

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Apr 29, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Apr 29, 2023
@aliher1911
Copy link
Contributor

It looks like the same #98928 failure.

I'm looking on the n12 which was restarted and it goes to overload
Screenshot 2023-05-02 at 17 48 48
immediately after restart.
At the same time other nodes shed 125 leases as soon as node comes back. Node itself is struggling with stalling disks and unhealthy storage while processing snapshots. Senders show raft send queue to n12 is full and its logs are full of raft ready handling: 0.98s [append=0.00s, apply=0.00s, non-blocking-sync=0.98s, other=0.00s], wrote [append-batch=25 KiB, append-ent=24 KiB (6), ]; node might be overloaded until it is killed.

One thing I noticed is that DescriptorTable was moved to that node. So maybe it is causing more outage compared to range with kv table data being moved to overloaded node.

We removed release blocker from it last time, do we need to revisit this?

@aliher1911
Copy link
Contributor

@kvoli assigning to you as an owner of previous instance of this failure to make a call.

@aliher1911 aliher1911 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 2, 2023
@aliher1911
Copy link
Contributor

Removed release blocker as this is a known issue which affects previous versions as well.

@kvoli
Copy link
Collaborator

kvoli commented May 5, 2023

The lease transfer towards n12 occur approx. 7s after it rejoins the cluster, until the node becomes IO overloaded.

Details

Rejoining gossip

I230429 06:26:05.853633 445 gossip/client.go:124 ⋮ [T1,n12] 23  started gossip client to n0 (‹34.132.147.1:26257›)

Lease transfers towards n12 from n1

I230429 06:26:15.541366 3383900 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r715/1:‹/Table/106/1/305{5089…-6011…}›] 31767  transferring lease to s12
I230429 06:26:17.610008 3387693 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r10274/2:‹/Table/106/1/-544{4745…-3823…}›] 31768  transferring lease to s12
I230429 06:26:20.525492 3391582 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r13184/1:‹/Table/106/1/-449{1096…-0174…}›] 31769  transferring lease to s12
I230429 06:26:24.661096 3397429 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r1091/4:‹/Table/106/1/629{2330…-3253…}›] 31770  transferring lease to s12
I230429 06:26:29.768146 3401531 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r10380/4:‹/Table/106/1/363{2443…-3365…}›] 31771  transferring lease to s12
I230429 06:26:31.225278 3403627 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r15251/4:‹/Table/106/1/-216{1389…-0466…}›] 31772  transferring lease to s12
I230429 06:26:39.130465 3409897 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r9154/2:‹/Table/106/1/-630{8009…-7087…}›] 31773  transferring lease to s12
I230429 06:26:42.659058 3412547 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r6365/2:‹/Table/106/1/167{4419…-5341…}›] 31774  transferring lease to s12
I230429 06:26:45.579395 3414001 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r10647/5:‹/Table/106/1/146{6903…-7826…}›] 31775  transferring lease to s12
I230429 06:26:47.163267 3414110 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r14051/3:‹/Table/106/1/-699{4194…-3272…}›] 31776  transferring lease to s12
I230429 06:26:51.904053 3417816 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r18019/5:‹/Table/106/1/-632{2766…-1844…}›] 31777  transferring lease to s12
I230429 06:26:55.075423 3418271 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r18458/1:‹/Table/106/1/-659{4842…-3920…}›] 31778  transferring lease to s12
I230429 06:26:56.293519 3418451 13@kv/kvserver/replicate_queue.go:2034 ⋮ [T1,n1,replicate,s1,r17854/4:‹/Table/106/1/-761{6740…-5818…}›] 31779  transferring lease to s12

However, that shouldn't be possible since we consider re-joining stores as suspect for 30s after they rejoin the cluster (#97532).

The leases stop attempting to transfer towards n12 once it goes over the threshold (0.5), around 50s after it joins.

It seems as though even if those leases didn't transfer during the suspect period (like the shouldn't have), the store would have still been getting a couple leases due to the IO overload being slow.

I haven't dug into why the store's didn't consider it suspect after startup. I'll keep looking Monday. @andrewbaptist.

kvoli added a commit to kvoli/cockroach that referenced this issue May 8, 2023
The `kv/restart/nodes=12` relies on correct allocation decisions to
pass. This commit increases the logging verbosity of related components
so that if/when the test fails, a root cause is easier to establish.

Informs: cockroachdb#102655

Epic: none

Release note: None
@kvoli
Copy link
Collaborator

kvoli commented May 8, 2023

Stressing now with increased verbosity.

@kvoli
Copy link
Collaborator

kvoli commented May 9, 2023

Stressed overnight (12 times) without failure on master. We may have just gotten unlucky here and the test may have failed even with the suspect being respected - (IO overload didn't occur until 50s after node restarted, so there's a 20s gap where it could receive leases regardless).

I've opened #102915 to increase the logging verbosity so that if this fails again we have more material to investigate.

Going to let this sit for now and wait for future failures.

craig bot pushed a commit that referenced this issue May 10, 2023
102915: roachtest: increase kv/restart/nodes=12 logging verbosity r=andrewbaptist a=kvoli

The `kv/restart/nodes=12` relies on correct allocation decisions to pass. This commit increases the logging verbosity of related components so that if/when the test fails, a root cause is easier to establish.

Informs: #102655

Epic: none

Release note: None

Co-authored-by: Austen McClernon <[email protected]>
@cockroach-teamcity
Copy link
Member Author

roachtest.kv/restart/nodes=12 failed with artifacts on master @ 9d6e7baebdc0d8b1a57ce8dc158fec35b48e9e2a:

test artifacts and logs in: /artifacts/kv/restart/nodes=12/run_1
(cluster.go:2095).Run: output in run_064203.009141900_n13_cockroach-workload-i: ./cockroach workload init kv  --splits=20000 {pgurl:1} returned: COMMAND_PROBLEM: exit status 1
(test_runner.go:1109).func1: 1 dead node(s) detected

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@kvoli
Copy link
Collaborator

kvoli commented Jun 1, 2023

Most recent failure is an unrelated panic due to #104007. Fixed by #104082.

* ERROR: a panic has occurred!
* panic: iterator with constraint=2 is being used with key /Min that has constraint=1
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:884
*   | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).checkConstraint
*   | 	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:574
*   | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekGE
*   | 	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:480
*   | github.com/cockroachdb/cockroach/pkg/storage.mvccMinSplitKey
*   | 	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:5904
*   | github.com/cockroachdb/cockroach/pkg/storage.MVCCFirstSplitKey
*   | 	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:5955
Full Stack
*
* ERROR: a panic has occurred!
* panic: iterator with constraint=2 is being used with key /Min that has constraint=1
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:884
*   | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).checkConstraint
*   | 	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:574
*   | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekGE
*   | 	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:480
*   | github.com/cockroachdb/cockroach/pkg/storage.mvccMinSplitKey
*   | 	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:5904
*   | github.com/cockroachdb/cockroach/pkg/storage.MVCCFirstSplitKey
*   | 	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:5955
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).adminSplitWithDescriptor
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:357
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*splitQueue).processAttempt
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/split_queue.go:321
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*splitQueue).process
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/split_queue.go:225
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplica.func1
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:1020
*   | github.com/cockroachdb/cockroach/pkg/util/timeutil.RunWithTimeout
*   | 	github.com/cockroachdb/cockroach/pkg/util/timeutil/timeout.go:29
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplica
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:979
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processLoop.func2.1
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:890
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
*   | runtime.goexit
*   | 	GOROOT/src/runtime/asm_amd64.s:1594
* Wraps: (2) panic: iterator with constraint=2 is being used with key /Min that has constraint=1
* Error types: (1) *withstack.withStack (2) *errutil.leafError
*
cockroach exited with code 134: Sat May 27 06:42:50 UTC 2023

@kvoli
Copy link
Collaborator

kvoli commented Jun 27, 2023

There have been no failures since Apr 29th (70 days). The last failure on May 27th was due to an unrelated bug mentioned in the above comment.

Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

3 participants