
roachtest: kv/restart/nodes=12 failed #105793

Closed
cockroach-teamcity opened this issue Jun 29, 2023 · 7 comments
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 29, 2023

roachtest.kv/restart/nodes=12 failed with artifacts on master @ 7fd4c21157221eae9e7d5892d89d2b5a671aba3e:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:1061
	            				main/pkg/cmd/roachtest/test_runner.go:1060
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	"0" is not greater than "0.5"
	Test:       	kv/restart/nodes=12
test artifacts and logs in: /artifacts/kv/restart/nodes=12/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-29197

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jun 29, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jun 29, 2023
@erikgrinaker
Contributor

erikgrinaker commented Jun 29, 2023

Jinxed it @kvoli

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 29, 2023
@kvoli
Collaborator

kvoli commented Jun 29, 2023

> Jinxed it @kvoli

Yup! I'm hoping it is something new, at least.

@kvoli
Collaborator

kvoli commented Jun 29, 2023

The rejoining node (n12) gets some leases, which tanks throughput.

[grafana screenshot: throughput drops as n12 receives leases]

n12 starts @ 01:08:36.955742981:

I230629 01:08:36.955837 64 1@cli/start.go:1248 ⋮ [T1,n12] 64 +CockroachDB node starting at 2023-06-29 01:08:36.955742981 +0000 UTC m=+2.909471849 (took 2.5s)

Then, lease transfers to n12 occur. Note that n12 should be treated as suspect for 30s after joining [1], so any transfer before start+30s, i.e. before 01:09:06, shouldn't have occurred (not exact, because of gossip delay).

# it is included in this list, meaning it wasn't filtered out
# due to being suspect.
I230629 01:08:51.659708 4288194 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2518 ⋮ [T1,n8,replicate,s8,r19571/6:‹/Table/106/1/870{8733…-9655…}›] 145580 +‹  12: ranges=4889 leases=0 disk-usage=8.2 GiB queries-per-second=7.30 store-cpu-per-second=488ms io-overload=0.00›
# no IO overload (somewhat expected, the node just started and there's delay).
I230629 01:08:51.661145 4288152 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n8,replicate,s8,r19571/6:‹/Table/106/1/870{8733…-9655…}›] 145588  transferring lease to s12

So why is n12 not treated as suspect here? I think this may be due to the recent liveness status changes (cc @andrewbaptist).

Transfers to n12:
I230629 01:08:42.624393 4213098 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r17906/6:‹/Table/106/1/-466{3564-2642…}›] 129836  transferring lease to s12
I230629 01:08:45.126051 4217994 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r4350/4:‹/Table/106/1/667{3237-4159…}›] 129865  transferring lease to s12
I230629 01:08:51.048358 4224416 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r7741/1:‹/Table/106/1/-81{2077-1155…}›] 129939  transferring lease to s12
I230629 01:08:52.232525 4224737 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r3075/4:‹/Table/106/1/496{4231-5154…}›] 129948  transferring lease to s12
I230629 01:08:54.128219 4225868 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r17506/6:‹/Table/106/1/-492{4573-3650…}›] 129977  transferring lease to s12
I230629 01:09:05.595716 4242419 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r1849/5:‹/Table/106/1/477{7928-8851…}›] 130246  transferring lease to s12
I230629 01:08:41.946334 4122834 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n5,replicate,s5,r1550/1:‹/Table/106/1/4955{0088-9311…}›] 139226  transferring lease to s12
I230629 01:08:47.798536 4133392 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n5,replicate,s5,r16697/6:‹/Table/106/1/-907{7650-6727…}›] 139290  transferring lease to s12
I230629 01:08:53.304101 4136309 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n5,replicate,s5,r13100/6:‹/Table/106/1/-687{7063-6141…}›] 139369  transferring lease to s12
I230629 01:08:57.841742 4142780 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n5,replicate,s5,r6809/3:‹/Table/106/1/332{3475-4398…}›] 139443  transferring lease to s12
I230629 01:08:41.980060 4036141 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n2,replicate,s2,r6726/5:‹/Table/106/1/101{5903-6825…}›] 124586  transferring lease to s12
I230629 01:08:48.128606 4047227 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n2,replicate,s2,r1584/1:‹/Table/106/1/433{7996-8918…}›] 124701  transferring lease to s12
I230629 01:08:52.625474 4049273 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n2,replicate,s2,r17648/3:‹/Table/106/1/8634{0280-9503…}›] 124740  transferring lease to s12
I230629 01:09:05.157093 4065928 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n2,replicate,s2,r14647/3:‹/Table/106/1/-539{5864-4941…}›] 124929  transferring lease to s12
I230629 01:08:50.175970 3932213 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n11,replicate,s11,r12461/3:‹/Table/106/1/-37{1222-0299…}›] 130277  transferring lease to s12
I230629 01:08:56.199048 3936391 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n11,replicate,s11,r18595/1:‹/Table/106/1/712{0548-1470…}›] 130356  transferring lease to s12
I230629 01:08:58.792535 3940176 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n11,replicate,s11,r63/6:‹/Table/{62-106}›] 130399  transferring lease to s12
I230629 01:09:04.216466 3947253 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n11,replicate,s11,r11729/2:‹/Table/106/1/428{5425-6347…}›] 130451  transferring lease to s12
I230629 01:08:51.661145 4288152 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n8,replicate,s8,r19571/6:‹/Table/106/1/870{8733-9655…}›] 145588  transferring lease to s12
I230629 01:08:59.440751 4297656 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n8,replicate,s8,r19828/4:‹/Table/106/1/801{3326-4248…}›] 145680  transferring lease to s12
I230629 01:09:03.608351 4389906 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n6,replicate,s6,r5855/2:‹/Table/106/1/452{6143-7065…}›] 135238  transferring lease to s12
I230629 01:08:46.603084 3971154 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r9857/4:‹/Table/106/1/-8515{974-052…}›] 135083  transferring lease to s12
I230629 01:08:49.572995 3975119 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r5004/6:‹/Table/106/1/-5429{988-066…}›] 135102  transferring lease to s12
I230629 01:08:52.072284 3975858 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r13258/5:‹/Table/106/1/-681{8036-7114…}›] 135126  transferring lease to s12
I230629 01:08:57.895658 3982369 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r9204/2:‹/Table/106/1/-625{3594-2672…}›] 135215  transferring lease to s12
I230629 01:09:00.036174 3985322 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r3928/2:‹/Table/106/1/486{6468-7391…}›] 135244  transferring lease to s12
I230629 01:09:04.665534 3991489 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r1962/4:‹/Table/106/1/26{89861-90784…}›] 135318  transferring lease to s12
I230629 01:08:46.932986 3985799 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r16515/3:‹/Table/106/1/-504{5393-4471…}›] 130403  transferring lease to s12
I230629 01:08:48.964438 3988774 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r17288/4:‹/Table/106/1/69{59147-60069…}›] 130427  transferring lease to s12
I230629 01:08:54.580159 3991461 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r16875/6:‹/Table/106/1/-47{1060-0967…}›] 130516  transferring lease to s12
I230629 01:08:55.893007 3993406 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r17949/6:‹/Table/106/1/-470{4145-3223…}›] 130530  transferring lease to s12
I230629 01:09:00.428441 4000289 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r2556/5:‹/Table/106/1/463{7740-8663…}›] 130590  transferring lease to s12
I230629 01:08:45.865801 4224001 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r10838/3:‹/Table/106/1/-381{4134-3212…}›] 134619  transferring lease to s12
I230629 01:08:48.683979 4228253 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r8479/5:‹/Table/106/1/-138{8509-7586…}›] 134653  transferring lease to s12
I230629 01:08:52.788120 4229770 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r1950/5:‹/Table/106/1/-117{4537-3615…}›] 134702  transferring lease to s12
I230629 01:08:54.426289 4231087 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r8394/4:‹/Table/106/1/-881{8486-7563…}›] 134716  transferring lease to s12
I230629 01:08:59.919016 4239613 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r13228/4:‹/Table/106/1/-682{2648-1726…}›] 134775  transferring lease to s12
I230629 01:09:04.828546 4246877 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r4266/7:‹/Table/106/1/-694{8079-7157…}›] 134834  transferring lease to s12
I230629 01:08:42.762650 4151863 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r2589/2:‹/Table/106/1/-309{2903-1980…}›] 121008  transferring lease to s12
I230629 01:08:46.695411 4158617 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r8876/4:‹/Table/106/1/-478{6229-5307…}›] 121061  transferring lease to s12
I230629 01:08:47.884063 4160479 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r1000/6:‹/Table/106/1/-4{38088-28865…}›] 121071  transferring lease to s12
I230629 01:08:55.607401 4165943 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r6596/4:‹/Table/106/1/525{4753-5675…}›] 121202  transferring lease to s12
I230629 01:08:57.740894 4169455 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r6455/3:‹/Table/106/1/350{4244-5167…}›] 121226  transferring lease to s12
I230629 01:09:01.056669 4173932 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r2106/4:‹/Table/106/1/-34{5398-4475…}›] 121272  transferring lease to s12
I230629 01:09:02.241352 4175358 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r6513/1:‹/Table/106/1/-113{2112-1190…}›] 121281  transferring lease to s12
I230629 01:09:04.252963 4178290 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r2002/6:‹/Table/106/1/-791{5563-4640…}›] 121317  transferring lease to s12

On a side note, this test could benefit from being a performance test instead of pass/fail.

Footnotes

[1] https://github.com/cockroachdb/cockroach/pull/97532/files#diff-be49fafc762f9521f8722968bad7a5e02a2f457725d35d4db20852d06e0c78ab

@kvoli
Collaborator

kvoli commented Jul 6, 2023

Added some more logging (c31a80d) around suspect state and stressing now for a failure.

@kvoli
Collaborator

kvoli commented Jul 7, 2023

The repro didn't work due to a manual split problem.

I'll try again soon. In the meantime, I sanity-checked that rejoining nodes progress through the expected storeStatus states.

When a node is stopped then rejoins, it should transition:

available -> unknown (for up to 5 mins) -> dead (if >5 mins) -> suspect -> available.

I see that transition when stopping a node for less than 5 minutes:

I230707 22:31:26.011777 82158 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 10229 s4 status change from=available to=unknown
I230707 22:32:43.100287 413 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 10363 s4 status change from=unknown to=suspect
I230707 22:33:13.055259 89719 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 10413 s4 status change from=suspect to=available

When I stop the node for longer than 5 minutes, it also looks good:

I230707 22:41:13.960406 118517 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 11586 s4 status change from=available to=unknown
I230707 22:45:45.256061 134331 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 12076 s4 status change from=unknown to=dead
I230707 22:46:04.073707 136631 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 12345 s4 status change from=dead to=suspect
I230707 22:46:33.535107 138061 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 12403 s4 status change from=suspect to=available

@kvoli
Collaborator

kvoli commented Jul 10, 2023

Passed stress x10 on repro branch with:

./pkg/cmd/roachtest/roachstress.sh -c 10 -b 'kv/restart/nodes=12' -- --cpu-quota 330

@andrewbaptist
Collaborator

Marking this as closed. It is passing consistently (10 runs) on master and has not failed since Jun 28.
