kv/kvserver: TestPromoteNonVoterInAddVoter failed #101519
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 4fb894e0398f21a3f7d0ea9f289718ed3ac65fdd:
Tried reproducing:
No success so far.
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 4bbe2faae22d21a650fbdbc280e0f3b2ad81aaf6:
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ adad57859a5281e2846082761c8232890b3d2a68:
Tried again and no success.
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 0d890b56341ca40b5ffb37c51d6b63bc322e80f4:
My bad - the failure wasn't reproducing because of... (cockroach/pkg/kv/kvserver/replicate_queue_test.go, line 2216 at 7a8be52).
Re-stressing now - I'm going to try and get to the bottom of this one.
Stressed again for 60 minutes (sequentially, not actually stressing) and no failure.
Got a failure
Relevant logs from the replicate queue: ...
The behavior the test expects: promote two non-voters. The rebalancing activity moves a replica from ... (log excerpt collapsed under Details),
which believes the range is over-replicated, despite the RF being set to 5 and there being 5 voters. This is then amended (Details collapsed):
which is where the add voter ... So the question is why did ...
We are running into a stale range config here, where the current config was set a while ago. Set here (Details collapsed):
Old config used to remove voters incorrectly on ... (Details collapsed).
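To make the stale-config diagnosis concrete, here is a toy illustration of the decision the store is effectively making. This is not the allocator's real API, and the stale num_replicas value of 3 is only an assumed pre-update default:

```go
package allocsketch

import "fmt"

// replicationStatus is a toy classification of a range's voter count against
// whatever span config the store currently holds, which may be stale.
func replicationStatus(numVoters, configNumReplicas int) string {
	switch {
	case numVoters > configNumReplicas:
		return "over-replicated"
	case numVoters < configNumReplicas:
		return "under-replicated"
	default:
		return "correctly replicated"
	}
}

func illustrate() {
	// With the up-to-date config (RF=5), a range with 5 voters is fine...
	fmt.Println(replicationStatus(5, 5)) // correctly replicated
	// ...but a store still acting on a stale config (assumed here to be a
	// pre-update default of 3) decides the same range is over-replicated and
	// removes voters, producing the extra range log events the test trips on.
	fmt.Println(replicationStatus(5, 3)) // over-replicated
}
```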
Sounds like a mux rangefeed issue. Is ...
We are ... All metamorphic constants: ...
I'm leaning towards skipping this test, and all other allocator-related tests which use a test cluster, under metamorphic constants, like #101882. I was stressing on 506a55a, which has the fixes in #102094 - so this problem doesn't appear solved (if it is the mux rangefeed).
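For reference, a sketch of what the skip approach might look like. This assumes (and I may be wrong) that setting COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true is how metamorphic constants get turned off, so its absence means they may be active; the helper itself is hypothetical:

```go
package kvserver_test

import (
	"os"
	"testing"
)

// skipUnderMetamorphic is a hypothetical helper, not an existing one: it
// skips the test unless metamorphic constants have been explicitly disabled
// via the (assumed) environment variable used by the test harness.
func skipUnderMetamorphic(t *testing.T) {
	if os.Getenv("COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING") != "true" {
		t.Skip("flaky under metamorphic constants; see #101882")
	}
}
```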
CC @miretskiy, this appears to reproduce rather easily.
Btw, the test could also consider just overriding the setting in question (always disabling mux) instead of skipping all metamorphic tests entirely, with a TODO to remove the override once mux rangefeeds are fixed.
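A minimal sketch of that override, assuming the test keeps using a testcluster.TestCluster; the cluster setting name below is a placeholder rather than the real mux rangefeed setting:

```go
package kvserver_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

// startClusterWithMuxPinned starts the test cluster the same way the test
// already does, then pins the rangefeed implementation so the metamorphic
// constant can't flip it.
func startClusterWithMuxPinned(t *testing.T) *testcluster.TestCluster {
	ctx := context.Background()
	tc := testcluster.StartTestCluster(t, 5, base.TestClusterArgs{})
	t.Cleanup(func() { tc.Stopper().Stop(ctx) })

	db := sqlutils.MakeSQLRunner(tc.ServerConn(0))
	// TODO: drop this override once mux rangefeeds are deflaked.
	// NB: the setting name is a placeholder; substitute the actual one.
	db.Exec(t, `SET CLUSTER SETTING kv.rangefeed.use_mux_rangefeeds = false`)
	return tc
}
```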
Yeah, that sounds like a better solution - these failures have created a lot of noise / wasted time spent debugging, but if we're shipping these on by default in 23.2 we need to re-enable them eventually. If these ranges were larger, you could have under-replication for a dangerous amount of time.
Yeah, sorry about that. @miretskiy, do we need to disable this metamorphism again until we've fixed the problems? These constant flakes cause a fair bit of additional work for test triagers.
It would be very sad to disable mux on master -- we would have no signal or tests to know if the problems are fixed. Regardless, this failure happens very infrequently, it seems; 60+ minutes of stressing.
SGTM - I'm stressing atm with mux rangefeed disabled. I'm slightly suspicious that it is mux rangefeed, since these failures began a month after earlier flakes we traced to mux rangefeed in #99207, and this test has failed at least twice a week since the initial failure.
It looks like mux rangefeeds were wrongly being blamed, afaict. I've tested with setting ... Side note: stressing without metamorphic constants is 6x faster.
The issue that comes up is that the existing leaseholder receives the span config update at ... I'm surprised we haven't seen this more in tests which involve span configuration changes. I'm also not sure how we easily get around this without the current LH and new LH agreeing on a min span config sequence number before transferring. I'm sorry we jumped at mux as the root cause here @miretskiy! I've opened #103083 to deflake the test.
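One way a test could dodge this race, sketched under the assumption that each store's applied span config is observable somehow; spanConfigReader below is a placeholder interface, not an existing API:

```go
package kvserver_test

import (
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// spanConfigReader stands in for whatever handle the test has to each store's
// currently applied span config (e.g. a testing knob or a store accessor).
type spanConfigReader interface {
	StoreID() int32
	AppliedNumReplicas() int32
}

// waitForSpanConfigRF blocks until every store reports the expected
// replication factor, so that no leaseholder acts on a stale config while the
// test is asserting on the range log.
func waitForSpanConfigRF(t testing.TB, stores []spanConfigReader, want int32) {
	testutils.SucceedsSoon(t, func() error {
		for _, s := range stores {
			if got := s.AppliedNumReplicas(); got != want {
				return fmt.Errorf("s%d still sees num_replicas=%d, want %d",
					s.StoreID(), got, want)
			}
		}
		return nil
	})
}
```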
Filed #103086 (comment). Fixing the problem seems like it comes with its own issues, since it requires blocking a new leaseholder (or selecting one to begin with) from performing replication changes until it has at least the old leaseholder's config version or greater. Perhaps not a material issue if we still don't allow removing replicas during the period and constrain up-replication to only handle dead nodes? In any case, not a high-priority problem to fix since we expect it to eventually resolve itself.
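Roughly what that mitigation could look like; everything here is hypothetical, including the idea of a span config sequence number handed over at lease transfer, which doesn't exist today:

```go
package kvserver_sketch

// replicationAction is a hypothetical summary of what a new leaseholder would
// be allowed to do while its span config may still lag the outgoing
// leaseholder's.
type replicationAction int

const (
	allowAllChanges              replicationAction = iota // config known to be current
	allowOnlyDeadNodeReplacement                          // defer removals/rebalances
)

// gateReplicationChanges compares a (hypothetical) local config sequence
// number against the minimum sequence number the outgoing leaseholder acted
// on. Until the local config catches up, only up-replication to replace
// replicas on dead nodes is permitted, which is safe under either config.
func gateReplicationChanges(localConfSeq, minConfSeqFromOutgoingLH uint64) replicationAction {
	if localConfSeq >= minConfSeqFromOutgoingLH {
		return allowAllChanges
	}
	return allowOnlyDeadNodeReplacement
}
```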
It is possible for span config updates to arrive at different times between stores. `TestPromoteNonVoterInAddVoter` was flaking when the incoming leaseholder would act upon a stale span config, before receiving the updated one which the outgoing leaseholder used.

This resulted in the test failing as more than just the two expected promotion events appeared in the range log, as the incoming leaseholder removed voters, then subsequently added them back upon receiving the up to date span configuration.

This commit checks on the prefix of the range log add voter events, to avoid failing the test when this untimely behavior occurs.

Fixes: cockroachdb#101519

Release note: None
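A minimal sketch of what such a prefix check can look like; the event struct and field names are invented for illustration, since the real test decodes structured range log entries:

```go
package kvserver_sketch

import "fmt"

// addVoterEvent is a stand-in for a parsed "add voter" range log entry.
type addVoterEvent struct {
	StoreID int32
	Reason  string
}

// checkAddVoterPrefix requires the expected promotion events to appear as a
// prefix of the observed add-voter events, so that later add-voter activity
// caused by a store briefly acting on a stale span config (voters removed,
// then re-added) no longer fails the test.
func checkAddVoterPrefix(observed, expected []addVoterEvent) error {
	if len(observed) < len(expected) {
		return fmt.Errorf("expected at least %d add voter events, found %d",
			len(expected), len(observed))
	}
	for i, want := range expected {
		if observed[i] != want {
			return fmt.Errorf("add voter event %d: got %+v, want %+v",
				i, observed[i], want)
		}
	}
	return nil
}
```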
103077: docgen: add EXPLAIN, EXPLAIN ANALYZE options to diagrams r=michae2 a=taroface

Add various `EXPLAIN` and `EXPLAIN ANALYZE` options that were documented but not exposed in the SQL diagrams. Also add `REDACT`, which will be documented with cockroachdb/docs#16929.

Epic: none
Release note: none
Release justification: non-production code change

103083: kvserver: deflake TestPromoteNonVoterInAddVoter r=andrewbaptist a=kvoli

It is possible for span config updates to arrive at different times between stores. `TestPromoteNonVoterInAddVoter` was flaking when the incoming leaseholder would act upon a stale span config, before receiving the updated one which the outgoing leaseholder used.

This resulted in the test failing as more than just the two expected promotion events appeared in the range log, as the incoming leaseholder removed voters, then subsequently added them back upon receiving the up to date span configuration. #103086 tracks this issue.

This PR checks on the prefix of the range log add voter events, to avoid failing the test when this untimely behavior occurs.

Stressed overnight, removing the skip under stress flag:

```
dev test pkg/kv/kvserver -f TestPromoteNonVoterInAddVoter -v --stress --stress-args="-p 4"
...
27158 runs so far, 0 failures, over 12h56m55s
```

This PR also adds additional (v=6) logging of the range descriptor and span config, as these come in handy when debugging failures such as this.

Fixes: #101519
Release note: None

103332: cloud/kubernetes: bump version to v23.1.0 r=ZhouXing19 a=ZhouXing19

Epic: None
Release note: None

103335: util/version: update roachtest version for 23.2 to 23.1.0 r=ZhouXing19 a=ZhouXing19

Epic: None
Release note: None

103338: ci: additionally build `workload` and `dev` for all Unix systems r=rail,srosenberg a=rickystewart

This includes ARM machines and macOS systems.

Epic: none
Release note: None

Co-authored-by: Ryan Kuo <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 8e6cdbd8b3d12ba5d65db101a3abfeb2cb1bf470:
Help
See also: How To Investigate a Go Test Failure (internal)
Jira issue: CRDB-26986