-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: kv/restart/nodes=12 failed #105793
Comments
The rejoining node (n12) gets some leases, this tanks throughput. n12 starts @
Then, lease transfers to n12 occur, note it should be suspect for 30s after joining1. Any transfer before start+30s shouldn't have occurred (not exact because of gossip). So before
So why is transfers to n12I230629 01:08:42.624393 4213098 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r17906/6:‹/Table/106/1/-466{3564…-2642…}›] 129836 transferring lease to s12
I230629 01:08:45.126051 4217994 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r4350/4:‹/Table/106/1/667{3237…-4159…}›] 129865 transferring lease to s12
I230629 01:08:51.048358 4224416 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r7741/1:‹/Table/106/1/-81{2077…-1155…}›] 129939 transferring lease to s12
I230629 01:08:52.232525 4224737 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r3075/4:‹/Table/106/1/496{4231…-5154…}›] 129948 transferring lease to s12
I230629 01:08:54.128219 4225868 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r17506/6:‹/Table/106/1/-492{4573…-3650…}›] 129977 transferring lease to s12
I230629 01:09:05.595716 4242419 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n7,replicate,s7,r1849/5:‹/Table/106/1/477{7928…-8851…}›] 130246 transferring lease to s12
I230629 01:08:41.946334 4122834 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n5,replicate,s5,r1550/1:‹/Table/106/1/4955{0088…-9311…}›] 139226 transferring lease to s12
I230629 01:08:47.798536 4133392 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n5,replicate,s5,r16697/6:‹/Table/106/1/-907{7650…-6727…}›] 139290 transferring lease to s12
I230629 01:08:53.304101 4136309 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n5,replicate,s5,r13100/6:‹/Table/106/1/-687{7063…-6141…}›] 139369 transferring lease to s12
I230629 01:08:57.841742 4142780 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n5,replicate,s5,r6809/3:‹/Table/106/1/332{3475…-4398…}›] 139443 transferring lease to s12
I230629 01:08:41.980060 4036141 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n2,replicate,s2,r6726/5:‹/Table/106/1/101{5903…-6825…}›] 124586 transferring lease to s12
I230629 01:08:48.128606 4047227 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n2,replicate,s2,r1584/1:‹/Table/106/1/433{7996…-8918…}›] 124701 transferring lease to s12
I230629 01:08:52.625474 4049273 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n2,replicate,s2,r17648/3:‹/Table/106/1/8634{0280…-9503…}›] 124740 transferring lease to s12
I230629 01:09:05.157093 4065928 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n2,replicate,s2,r14647/3:‹/Table/106/1/-539{5864…-4941…}›] 124929 transferring lease to s12
:I230629 01:08:50.175970 3932213 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n11,replicate,s11,r12461/3:‹/Table/106/1/-37{1222…-0299…}›] 130277 transferring lease to s12
:I230629 01:08:56.199048 3936391 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n11,replicate,s11,r18595/1:‹/Table/106/1/712{0548…-1470…}›] 130356 transferring lease to s12
:I230629 01:08:58.792535 3940176 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n11,replicate,s11,r63/6:‹/Table/{62-106}›] 130399 transferring lease to s12
:I230629 01:09:04.216466 3947253 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n11,replicate,s11,r11729/2:‹/Table/106/1/428{5425…-6347…}›] 130451 transferring lease to s12
I230629 01:08:51.661145 4288152 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n8,replicate,s8,r19571/6:‹/Table/106/1/870{8733…-9655…}›] 145588 transferring lease to s12
I230629 01:08:59.440751 4297656 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n8,replicate,s8,r19828/4:‹/Table/106/1/801{3326…-4248…}›] 145680 transferring lease to s12
I230629 01:09:03.608351 4389906 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n6,replicate,s6,r5855/2:‹/Table/106/1/452{6143…-7065…}›] 135238 transferring lease to s12
:I230629 01:08:46.603084 3971154 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r9857/4:‹/Table/106/1/-8515{974…-052…}›] 135083 transferring lease to s12
:I230629 01:08:49.572995 3975119 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r5004/6:‹/Table/106/1/-5429{988…-066…}›] 135102 transferring lease to s12
:I230629 01:08:52.072284 3975858 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r13258/5:‹/Table/106/1/-681{8036…-7114…}›] 135126 transferring lease to s12
:I230629 01:08:57.895658 3982369 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r9204/2:‹/Table/106/1/-625{3594…-2672…}›] 135215 transferring lease to s12
:I230629 01:09:00.036174 3985322 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r3928/2:‹/Table/106/1/486{6468…-7391…}›] 135244 transferring lease to s12
:I230629 01:09:04.665534 3991489 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n10,replicate,s10,r1962/4:‹/Table/106/1/26{89861…-90784…}›] 135318 transferring lease to s12
I230629 01:08:46.932986 3985799 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r16515/3:‹/Table/106/1/-504{5393…-4471…}›] 130403 transferring lease to s12
I230629 01:08:48.964438 3988774 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r17288/4:‹/Table/106/1/69{59147…-60069…}›] 130427 transferring lease to s12
I230629 01:08:54.580159 3991461 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r16875/6:‹/Table/106/1/-47{1060…-0967…}›] 130516 transferring lease to s12
I230629 01:08:55.893007 3993406 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r17949/6:‹/Table/106/1/-470{4145…-3223…}›] 130530 transferring lease to s12
I230629 01:09:00.428441 4000289 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n3,replicate,s3,r2556/5:‹/Table/106/1/463{7740…-8663…}›] 130590 transferring lease to s12
I230629 01:08:45.865801 4224001 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r10838/3:‹/Table/106/1/-381{4134…-3212…}›] 134619 transferring lease to s12
I230629 01:08:48.683979 4228253 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r8479/5:‹/Table/106/1/-138{8509…-7586…}›] 134653 transferring lease to s12
I230629 01:08:52.788120 4229770 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r1950/5:‹/Table/106/1/-117{4537…-3615…}›] 134702 transferring lease to s12
I230629 01:08:54.426289 4231087 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r8394/4:‹/Table/106/1/-881{8486…-7563…}›] 134716 transferring lease to s12
I230629 01:08:59.919016 4239613 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r13228/4:‹/Table/106/1/-682{2648…-1726…}›] 134775 transferring lease to s12
I230629 01:09:04.828546 4246877 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n9,replicate,s9,r4266/7:‹/Table/106/1/-694{8079…-7157…}›] 134834 transferring lease to s12
I230629 01:08:42.762650 4151863 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r2589/2:‹/Table/106/1/-309{2903…-1980…}›] 121008 transferring lease to s12
I230629 01:08:46.695411 4158617 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r8876/4:‹/Table/106/1/-478{6229…-5307…}›] 121061 transferring lease to s12
I230629 01:08:47.884063 4160479 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r1000/6:‹/Table/106/1/-4{38088…-28865…}›] 121071 transferring lease to s12
I230629 01:08:55.607401 4165943 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r6596/4:‹/Table/106/1/525{4753…-5675…}›] 121202 transferring lease to s12
I230629 01:08:57.740894 4169455 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r6455/3:‹/Table/106/1/350{4244…-5167…}›] 121226 transferring lease to s12
I230629 01:09:01.056669 4173932 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r2106/4:‹/Table/106/1/-34{5398…-4475…}›] 121272 transferring lease to s12
I230629 01:09:02.241352 4175358 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r6513/1:‹/Table/106/1/-113{2112…-1190…}›] 121281 transferring lease to s12
I230629 01:09:04.252963 4178290 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r2002/6:‹/Table/106/1/-791{5563…-4640…}›] 121317 transferring lease to s12 On a side note, this test could benefit from being a perf test, instead of pass fail. Footnotes |
Added some more logging (c31a80d) around suspect state and stressing now for a failure. |
Repro didn't work due to manual split problem. I'm trying out again soon. I sanity checked that rejoining nodes progress through the expected When a node is stopped then rejoins, it should transition:
I see that transition for stopping a node for less than 5 mins: I230707 22:31:26.011777 82158 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 10229 s4 status change from=‹available› to=‹unknown›
I230707 22:32:43.100287 413 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 10363 s4 status change from=‹unknown› to=‹suspect›
I230707 22:33:13.055259 89719 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 10413 s4 status change from=‹suspect› to=‹available› When I stop the node for longer than 5 minutes it also looks good: I230707 22:41:13.960406 118517 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 11586 s4 status change from=‹available› to=‹unknown›
I230707 22:45:45.256061 134331 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 12076 s4 status change from=‹unknown› to=‹dead›
I230707 22:46:04.073707 136631 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 12345 s4 status change from=‹dead› to=‹suspect›
I230707 22:46:33.535107 138061 13@kv/kvserver/allocator/storepool/store_pool.go:141 ⋮ [T1] 12403 s4 status change from=‹suspect› to=‹available› |
Passed stress x10 on repro branch with:
|
Marking this as closed. It is passing consistently (10 runs) on |
roachtest.kv/restart/nodes=12 failed with artifacts on master @ 7fd4c21157221eae9e7d5892d89d2b5a671aba3e:
Parameters:
ROACHTEST_arch=amd64
,ROACHTEST_cloud=gce
,ROACHTEST_cpu=8
,ROACHTEST_encrypted=false
,ROACHTEST_fs=ext4
,ROACHTEST_localSSD=true
,ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-29197
The text was updated successfully, but these errors were encountered: