kv/kvserver: TestPromoteNonVoterInAddVoter failed #101519

Closed
cockroach-teamcity opened this issue Apr 14, 2023 · 20 comments · Fixed by #103083
Labels: A-kv-distribution Relating to rebalancing and leasing. · branch-master Failures and bugs on the master branch. · C-test-failure Broken test (automatically or manually discovered). · O-robot Originated from a bot. · T-kv KV Team

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 14, 2023

kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 8e6cdbd8b3d12ba5d65db101a3abfeb2cb1bf470:

=== RUN   TestPromoteNonVoterInAddVoter
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/d486233d291fa43f53db749aea077e75/logTestPromoteNonVoterInAddVoter3018638515
    test_log_scope.go:79: use -show-logs to present logs inline
    replicate_queue_test.go:2349: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:2349
        	Error:      	Expected to promote non-voter, instead added voter
        	Test:       	TestPromoteNonVoterInAddVoter
        	Messages:   	Added voter store ID: 6
        	            	Add voter events: [{r63:/{Table/104-Max} [(n3,s3):1, (n2,s2):2, (n1,s1):7, (n6,s6):8NON_VOTER, (n4,s4):9, next=10, gen=32] <nil> <nil> (n4,s4):9 <nil> range under-replicated {"Target":"s4, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:62"}} {r63:/{Table/104-Max} [(n3,s3):1, (n2,s2):2, (n1,s1):7, (n6,s6):8, (n4,s4):9, next=10, gen=33] <nil> <nil> (n6,s6):8 <nil> range under-replicated {"Target":"s6, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:62"}} {r63:/{Table/104-Max} [(n5,s5):10, (n2,s2):2, (n1,s1):7, (n6,s6):11LEARNER, next=12, gen=44] <nil> <nil> (n6,s6):11LEARNER <nil> range under-replicated {"Target":"s6, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:61"}} {r63:/{Table/104-Max} [(n5,s5):10, (n2,s2):2, (n1,s1):7, (n6,s6):11, next=12, gen=45] <nil> <nil> (n6,s6):11 <nil> range under-replicated {"Target":"s6, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:61"}} {r63:/{Table/104-Max} [(n5,s5):10, (n2,s2):2, (n1,s1):7, (n6,s6):11, (n4,s4):12LEARNER, next=13, gen=46] <nil> <nil> (n4,s4):12LEARNER <nil> range under-replicated {"Target":"s4, necessary:true, diversity:0.88, converges:0, balance:0, rangeCount:61"}} {r63:/{Table/104-Max} [(n5,s5):10, (n2,s2):2, (n1,s1):7, (n6,s6):11, (n4,s4):12, next=13, gen=47] <nil> <nil> (n4,s4):12 <nil> range under-replicated {"Target":"s4, necessary:true, diversity:0.88, converges:0, balance:0, rangeCount:61"}}]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/d486233d291fa43f53db749aea077e75/logTestPromoteNonVoterInAddVoter3018638515
--- FAIL: TestPromoteNonVoterInAddVoter (30.90s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-26986

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Apr 14, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Apr 14, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Apr 14, 2023
@erikgrinaker erikgrinaker added the A-kv-distribution Relating to rebalancing and leasing. label Apr 14, 2023
@cockroach-teamcity
Member Author

kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 4fb894e0398f21a3f7d0ea9f289718ed3ac65fdd:

=== RUN   TestPromoteNonVoterInAddVoter
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestPromoteNonVoterInAddVoter71289121
    test_log_scope.go:79: use -show-logs to present logs inline
    replicate_queue_test.go:2350: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:2350
        	Error:      	Expected to promote non-voter, instead added voter
        	Test:       	TestPromoteNonVoterInAddVoter
        	Messages:   	Added voter store ID: 7
        	            	Add voter events: [{r64:/{Table/104-Max} [(n1,s1):1, (n3,s3):7, (n2,s2):4, (n5,s5):8, (n7,s7):9NON_VOTER, next=10, gen=28] <nil> <nil> (n5,s5):8 <nil> range under-replicated {"Target":"s5, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}} {r64:/{Table/104-Max} [(n1,s1):1, (n3,s3):7, (n2,s2):4, (n5,s5):8, (n7,s7):9, next=10, gen=29] <nil> <nil> (n7,s7):9 <nil> range under-replicated {"Target":"s7, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}} {r64:/{Table/104-Max} [(n4,s4):10, (n3,s3):7, (n2,s2):4, (n5,s5):8, (n7,s7):11LEARNER, next=12, gen=37] <nil> <nil> (n7,s7):11LEARNER <nil> range under-replicated {"Target":"s7, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:62"}} {r64:/{Table/104-Max} [(n4,s4):10, (n3,s3):7, (n2,s2):4, (n5,s5):8, (n7,s7):11, next=12, gen=38] <nil> <nil> (n7,s7):11 <nil> range under-replicated {"Target":"s7, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:62"}}]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestPromoteNonVoterInAddVoter71289121
--- FAIL: TestPromoteNonVoterInAddVoter (32.55s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@kvoli
Collaborator

kvoli commented Apr 26, 2023

Tried reproducing:

dev test pkg/kv/kvserver -f TestPromoteNonVoterInAddVoter --stress -v --vmodule=allocator=2,replicate_queue=2,allocator_scorer=2
...
20270 runs so far, 0 failures, over 14m30s
dev test pkg/kv/kvserver -f TestPromoteNonVoterInAddVoter --stress --race -v --vmodule=allocator=2,replicate_queue=2,allocator_scorer=2
...
3908 runs so far, 0 failures, over 14m0s

No success so far.

@cockroach-teamcity
Member Author

kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 4bbe2faae22d21a650fbdbc280e0f3b2ad81aaf6:

=== RUN   TestPromoteNonVoterInAddVoter
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/b5b16660ca668d68f7803e1ac963747e/logTestPromoteNonVoterInAddVoter3656602792
    test_log_scope.go:79: use -show-logs to present logs inline
    replicate_queue_test.go:2350: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:2350
        	Error:      	Expected to promote non-voter, instead added voter
        	Test:       	TestPromoteNonVoterInAddVoter
        	Messages:   	Added voter store ID: 5
        	            	Add voter events: [{r64:/{Table/104-Max} [(n1,s1):1, (n2,s2):4, (n3,s3):7, (n7,s7):8NON_VOTER, (n4,s4):9, next=10, gen=28] <nil> <nil> (n4,s4):9 <nil> range under-replicated {"Target":"s4, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}} {r64:/{Table/104-Max} [(n1,s1):1, (n2,s2):4, (n3,s3):7, (n7,s7):8, (n4,s4):9, next=10, gen=29] <nil> <nil> (n7,s7):8 <nil> range under-replicated {"Target":"s7, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}} {r64:/{Table/104-Max} [(n4,s4):9, (n2,s2):4, (n3,s3):7, (n7,s7):8, (n5,s5):11LEARNER, next=12, gen=37] <nil> <nil> (n5,s5):11LEARNER <nil> range under-replicated {"Target":"s5, necessary:true, diversity:0.88, converges:0, balance:0, rangeCount:62"}} {r64:/{Table/104-Max} [(n4,s4):9, (n2,s2):4, (n3,s3):7, (n7,s7):8, (n5,s5):11, next=12, gen=38] <nil> <nil> (n5,s5):11 <nil> range under-replicated {"Target":"s5, necessary:true, diversity:0.88, converges:0, balance:0, rangeCount:62"}}]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/b5b16660ca668d68f7803e1ac963747e/logTestPromoteNonVoterInAddVoter3656602792
--- FAIL: TestPromoteNonVoterInAddVoter (32.09s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ adad57859a5281e2846082761c8232890b3d2a68:

=== RUN   TestPromoteNonVoterInAddVoter
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/8af8a6c853877a083ec84004541c621b/logTestPromoteNonVoterInAddVoter1191499411
    test_log_scope.go:79: use -show-logs to present logs inline
    replicate_queue_test.go:2350: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:2350
        	Error:      	Expected to promote non-voter, instead added voter
        	Test:       	TestPromoteNonVoterInAddVoter
        	Messages:   	Added voter store ID: 7
        	            	Add voter events: [{r64:/{Table/104-Max} [(n3,s3):1, (n2,s2):4, (n1,s1):7, (n5,s5):8, (n7,s7):9NON_VOTER, next=10, gen=32] <nil> <nil> (n5,s5):8 <nil> range under-replicated {"Target":"s5, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}} {r64:/{Table/104-Max} [(n3,s3):1, (n2,s2):4, (n1,s1):7, (n5,s5):8, (n7,s7):9, next=10, gen=33] <nil> <nil> (n7,s7):9 <nil> range under-replicated {"Target":"s7, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}} {r64:/{Table/104-Max} [(n4,s4):10, (n2,s2):4, (n1,s1):7, (n5,s5):8, (n7,s7):11LEARNER, next=12, gen=41] <nil> <nil> (n7,s7):11LEARNER <nil> range under-replicated {"Target":"s7, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:62"}} {r64:/{Table/104-Max} [(n4,s4):10, (n2,s2):4, (n1,s1):7, (n5,s5):8, (n7,s7):11, next=12, gen=42] <nil> <nil> (n7,s7):11 <nil> range under-replicated {"Target":"s7, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:62"}}]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/8af8a6c853877a083ec84004541c621b/logTestPromoteNonVoterInAddVoter1191499411
--- FAIL: TestPromoteNonVoterInAddVoter (30.29s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@kvoli
Collaborator

kvoli commented May 8, 2023

Tried again with no success:

dev test pkg/kv/kvserver -f TestPromoteNonVoterInAddVoter --stress --race -v --vmodule=allocator=2,replicate_queue=2,allocator_scorer=2
...
14030 runs so far, 0 failures, over 49m15s

@cockroach-teamcity
Member Author

kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 0d890b56341ca40b5ffb37c51d6b63bc322e80f4:

=== RUN   TestPromoteNonVoterInAddVoter
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestPromoteNonVoterInAddVoter1212694418
    test_log_scope.go:79: use -show-logs to present logs inline
    replicate_queue_test.go:2357: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:2357
        	Error:      	Expected to promote non-voter, instead added voter
        	Test:       	TestPromoteNonVoterInAddVoter
        	Messages:   	Added voter store ID: 5
        	            	Add voter events: [{r64:/{Table/104-Max} [(n1,s1):1, (n2,s2):4, (n3,s3):7, (n6,s6):8, (n5,s5):9NON_VOTER, next=10, gen=28] <nil> <nil> (n6,s6):8 <nil> range under-replicated {"Target":"s6, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}} {r64:/{Table/104-Max} [(n1,s1):1, (n2,s2):4, (n3,s3):7, (n6,s6):8, (n5,s5):9, next=10, gen=29] <nil> <nil> (n5,s5):9 <nil> range under-replicated {"Target":"s5, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}} {r64:/{Table/104-Max} [(n4,s4):10, (n2,s2):4, (n3,s3):7, (n6,s6):8, (n5,s5):11LEARNER, next=12, gen=37] <nil> <nil> (n5,s5):11LEARNER <nil> range under-replicated {"Target":"s5, necessary:true, diversity:0.88, converges:0, balance:0, rangeCount:62"}} {r64:/{Table/104-Max} [(n4,s4):10, (n2,s2):4, (n3,s3):7, (n6,s6):8, (n5,s5):11, next=12, gen=38] <nil> <nil> (n5,s5):11 <nil> range under-replicated {"Target":"s5, necessary:true, diversity:0.88, converges:0, balance:0, rangeCount:62"}}]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestPromoteNonVoterInAddVoter1212694418
--- FAIL: TestPromoteNonVoterInAddVoter (31.10s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@kvoli
Collaborator

kvoli commented May 9, 2023

My bad - the failure wasn't reproducing because of...

Re-stressing now - I'm going to try and get to the bottom of this one.

@kvoli
Collaborator

kvoli commented May 10, 2023

Stressed again for 60 minutes (running sequentially, not actually under the stress tool) with no failure.

@kvoli
Collaborator

kvoli commented May 10, 2023

Got a failure

     replicate_queue_test.go:2355:
                 Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:2355
                 Error:          Expected to promote non-voter, instead added voter
                 Test:           TestPromoteNonVoterInAddVoter
                 Messages:       Added voter store ID: 5
                                 Add voter events: [
 {r64:/{Table/104-Max} [(n1,s1):1, (n2,s2):7, (n3,s3):4, (n5,s5):8NON_VOTER, (n6,s6):9, next=10, gen=28] <nil> <nil> (n6,s6):9 <nil> range under-replicated {"Target":"s6, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}}
 
 {r64:/{Table/104-Max} [(n1,s1):1, (n2,s2):7, (n3,s3):4, (n5,s5):8, (n6,s6):9, next=10, gen=29] <nil> <nil> (n5,s5):8 <nil> range under-replicated {"Target":"s5, necessary:true, diversity:1.00, converges:0, balance:0, rangeCount:63"}}
 
 {r64:/{Table/104-Max} [(n4,s4):10, (n2,s2):7, (n3,s3):4, (n6,s6):9, (n5,s5):11LEARNER, next=12, gen=37] <nil> <nil> (n5,s5):11LEARNER <nil> range under-replicated {"Target":"s5, necessary:true, diversity:0.88, converges:0, balance:0, rangeCount:62"}}
 
 {r64:/{Table/104-Max} [(n4,s4):10, (n2,s2):7, (n3,s3):4, (n6,s6):9, (n5,s5):11, next=12, gen=38] <nil> <nil> (n5,s5):11 <nil> range under-replicated {"Target":"s5, necessary:true, diversity:0.88, converges:0, balance:0, rangeCount:62"}}]
Relevant logs from the replicate queue:
I230510 15:24:34.127486 45413 13@kv/kvserver/allocator/plan/replicate.go:153 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12703  repair needed (‹add voter›), enqueuing
I230510 15:24:34.127681 45414 13@kv/kvserver/allocator/plan/replicate.go:260 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12704  next replica action: ‹add voter›
I230510 15:24:34.128003 45414 13@kv/kvserver/allocator/plan/replicate.go:440 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12705  adding voter n6,s6: ‹[1*:41, 7:41, 4:41]›
I230510 15:24:34.128814 45414 13@kv/kvserver/replica_command.go:2327 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12706  change replicas (add [(n6,s6):9] remove []): existing descriptor r64:‹/{Table/104-Max}› [(n1,s1):1, (n2,s2):7, (n3,s3):4, (n5,s5):8NON_VOTER, (n6,s6):9NON_VOTER, next=10, gen=27]
I230510 15:24:34.134447 45414 13@kv/kvserver/replica_raft.go:380 ⋮ [T1,n1,s1,r64/1:‹/{Table/104-Max}›] 12707  proposing SIMPLE(v9) [(n6,s6):9]: after=[(n1,s1):1 (n2,s2):7 (n3,s3):4 (n5,s5):8NON_VOTER (n6,s6):9] next=10
I230510 15:24:34.135525 45414 13@kv/kvserver/replicate_queue.go:677 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12708  re-processing
I230510 15:24:34.135555 45414 13@kv/kvserver/allocator/plan/replicate.go:153 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12709  repair needed (‹add voter›), enqueuing
I230510 15:24:34.135577 45414 13@kv/kvserver/allocator/plan/replicate.go:153 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12710  repair needed (‹add voter›), enqueuing
I230510 15:24:34.135676 45384 13@kv/kvserver/allocator/plan/replicate.go:260 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12711  next replica action: ‹add voter›
I230510 15:24:34.135847 45384 13@kv/kvserver/allocator/plan/replicate.go:440 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12712  adding voter n5,s5: ‹[1*:44, 7:44, 4:44, 9:44]›
I230510 15:24:34.136429 45384 13@kv/kvserver/replica_command.go:2327 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12713  change replicas (add [(n5,s5):8] remove []): existing descriptor r64:‹/{Table/104-Max}› [(n1,s1):1, (n2,s2):7, (n3,s3):4, (n5,s5):8NON_VOTER, (n6,s6):9, next=10, gen=28]
I230510 15:24:34.145106 45384 13@kv/kvserver/replica_raft.go:380 ⋮ [T1,n1,s1,r64/1:‹/{Table/104-Max}›] 12714  proposing SIMPLE(v8) [(n5,s5):8]: after=[(n1,s1):1 (n2,s2):7 (n3,s3):4 (n5,s5):8 (n6,s6):9] next=10
I230510 15:24:34.146192 45384 13@kv/kvserver/replicate_queue.go:677 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12715  re-processing
I230510 15:24:34.146290 45384 13@kv/kvserver/allocator/allocatorimpl/allocator_scorer.go:1650 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12716  s1: should-rebalance(necessary/diversity=s5): oldNecessary:false, newNecessary:true, oldDiversity:0.850000, newDiversity:0.850000, locality:‹"region=2,node=5"›
I230510 15:24:34.146456 45384 13@kv/kvserver/allocator/plan/replicate.go:174 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12717  rebalance target found for voter, enqueuing
I230510 15:24:34.146516 45384 13@kv/kvserver/allocator/allocatorimpl/allocator_scorer.go:1650 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12718  s1: should-rebalance(necessary/diversity=s7): oldNecessary:false, newNecessary:false, oldDiversity:0.850000, newDiversity:0.900000, locality:‹"region=3,node=7"›
I230510 15:24:34.146650 45384 13@kv/kvserver/allocator/plan/replicate.go:174 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12719  rebalance target found for voter, enqueuing
I230510 15:24:34.146742 45200 13@kv/kvserver/allocator/plan/replicate.go:260 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12720  next replica action: ‹consider rebalance›
I230510 15:24:34.146816 45200 13@kv/kvserver/allocator/allocatorimpl/allocator_scorer.go:1650 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/104-Max}›] 12721  s1: should-rebalance(necessary/diversity=s4): oldNecessary:false, newNecessary:true, oldDiversity:0.850000, newDiversity:0.900000, locality:‹"region=2,node=4"›

The behavior the test expects, promoting the two non-voters s5 and s6, does occur initially. The problem is that there is subsequent rebalancing activity on r64, followed by an incorrect removal of voters (after the rebalance transfers the lease away), which are then added back soon after.

The rebalancing activity moves a replica (and the lease) from s1 to s4; however, something goes wrong shortly afterwards on s4, which filters [(n4,s4):10 (n2,s2):7 (n3,s3):4 (n5,s5):8 (n6,s6):9] down to [(n2,s2):7 (n5,s5):8] as candidates for removal: ‹[10*:57, 7:0, 4:57, 8:0, 9:57]›:

Details
I230510 15:24:34.192881 45309 13@kv/kvserver/allocator/plan/replicate.go:153 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12733  repair needed (‹remove voter›), enqueuing                                                                                                                      
I230510 15:24:34.193017 45310 13@kv/kvserver/allocator/plan/replicate.go:260 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12734  next replica action: ‹remove voter›                                                                                                                             
I230510 15:24:34.193050 45310 13@kv/kvserver/allocator/plan/replicate.go:569 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12735  filtered unremovable replicas from [(n4,s4):10 (n2,s2):7 (n3,s3):4 (n5,s5):8 (n6,s6):9] to get [(n2,s2):7 (n5,s5):8] as candidates for removal: ‹[10*:57, 7:0, 4:57, 8:0, 9:57]›

That is, s4 believes the range is over-replicated, despite the replication factor being set to 5 and there being 5 voters.

This is then amended:

Details
I230510 15:24:34.221861 45310 13@kv/kvserver/replica_raft.go:380 ⋮ [T1,n4,s4,r64/10:‹/{Table/104-Max}›] 12743  proposing SIMPLE(r8) [(n5,s5):8LEARNER]: after=[(n4,s4):10 (n2,s2):7 (n3,s3):4 (n6,s6):9] next=11
I230510 15:24:34.223015 45310 13@kv/kvserver/replicate_queue.go:677 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12744  re-processing
I230510 15:24:34.223048 45310 13@kv/kvserver/allocator/plan/replicate.go:153 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12745  repair needed (‹add voter›), enqueuing
I230510 15:24:34.223079 45310 13@kv/kvserver/allocator/plan/replicate.go:153 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12746  repair needed (‹add voter›), enqueuing
I230510 15:24:34.223180 45587 13@kv/kvserver/allocator/plan/replicate.go:260 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12747  next replica action: ‹add voter›
I230510 15:24:34.223350 45587 13@kv/kvserver/allocator/plan/replicate.go:440 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12748  adding voter n5,s5: ‹[10*:63, 7:63, 4:63, 9:63]›
I230510 15:24:34.223958 45587 13@kv/kvserver/replica_command.go:2327 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12749  change replicas (add [(n5,s5):11LEARNER] remove []): existing descriptor r64:‹/{Table/104-Max}› [(n4,s4):10, (n2,s2):7, (n3,s3):4, (n6,s6):9, next=11, gen=36]
I230510 15:24:34.232586 45607 13@kv/kvserver/allocator/plan/replicate.go:192 ⋮ [T1,n3,replicate,s3,r19/6:‹/Table/1{7-8}›] 12750  no rebalance target found, not enqueuing
I230510 15:24:34.232979 45587 13@kv/kvserver/replica_raft.go:380 ⋮ [T1,n4,s4,r64/10:‹/{Table/104-Max}›] 12751  proposing SIMPLE(l11) [(n5,s5):11LEARNER]: after=[(n4,s4):10 (n2,s2):7 (n3,s3):4 (n6,s6):9 (n5,s5):11LEARNER] next=12
I230510 15:24:34.235381 45700 13@kv/kvserver/store_snapshot.go:1573 ⋮ [T1,n4,s4,r64/10:‹/{Table/104-Max}›] 12752  streamed INITIAL snapshot d0cc6aa9 at applied index 65 to (n5,s5):11LEARNER with 5.4 KiB in 0.00s @ 20 MiB/s: kvs=41 rangeKVs=0, rate-limit: 32 MiB/s, queued: 0.00s
I230510 15:24:34.235912 45703 13@kv/kvserver/replica_raftstorage.go:503 ⋮ [T1,n5,s5,r64/11:{-}] 12753  applying INITIAL snapshot d0cc6aa9 from (n4,s4):10 at applied index 65
I230510 15:24:34.239854 45587 13@kv/kvserver/replica_command.go:2327 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12754  change replicas (add [(n5,s5):11] remove []): existing descriptor r64:‹/{Table/104-Max}› [(n4,s4):10, (n2,s2):7, (n3,s3):4, (n6,s6):9, (n5,s5):11LEARNER, next=12, gen=37]
I230510 15:24:34.251319 45587 13@kv/kvserver/replica_raft.go:380 ⋮ [T1,n4,s4,r64/10:‹/{Table/104-Max}›] 12755  proposing SIMPLE(v11) [(n5,s5):11]: after=[(n4,s4):10 (n2,s2):7 (n3,s3):4 (n6,s6):9 (n5,s5):11] next=12
I230510 15:24:34.253674 45587 13@kv/kvserver/replicate_queue.go:677 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12756  re-processing
I230510 15:24:34.254056 45587 13@kv/kvserver/allocator/plan/replicate.go:192 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12757  no rebalance target found, not enqueuing

This is where the added voter s5 in the failure message comes from.

So the question is why s4 believed that r64 was over-replicated (it looks like it was acting on RF=3 when the actual RF=5). This could possibly be due to mux rangefeeds, which have popped up in other replicate queue tests.

@kvoli
Collaborator

kvoli commented May 10, 2023

We are running into a stale range config here, even though the current config was set a while earlier.

The config was set with CONFIGURE ZONE USING num_replicas = 5, num_voters = 5 @ 16:44:26:

Details
I230510 16:44:26.037911 27746 1@util/log/event_log.go:32  [T1,n1,client=127.0.0.1:59986,user=root] 992 ={"Timestamp":1683737066029271371,"EventType":"set_zone_config","Statement":"ALTER TABLE \"\".\"\".t CONFIGURE ZONE USING num_replicas = 5, num_voters = 5, constraints = '{}', voter_constraints = '{\"+region=1\": 2, \"+region=2\": 2, \"+region=3\": 1}'","Tag":"CONFIGURE ZONE","User":"root","DescriptorID":104,"Target":"TABLE defaultdb.public.t","Options":["constraints = '{}'","num_replicas = 5","num_voters = 5","voter_constraints = '{\"+region=1\": 2, \"+region=2\": 2, \"+region=3\": 1}'"]}

The old config was then used to incorrectly remove voters on s4 @ 16:44:32 (+6s): num_replicas:5 num_voters:3, with descriptor ‹r64:/{Table/104-Max} [(n4,s4):10, (n2,s2):2, (n1,s1):7, (n5,s5):8, (n6,s6):9, next=11, gen=37]›:

Details
I230510 16:44:32.634078 45336 13@kv/kvserver/allocator/plan/replicate.go:153 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12739  repair needed (‹remove voter›), enqueuing  
I230510 16:44:32.634107 45336 13@kv/kvserver/allocator/plan/replicate.go:153 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12740  repair needed (‹remove voter›), enqueuing
I230510 16:44:32.634262 45052 13@kv/kvserver/allocator/plan/replicate.go:260 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12741  next replica action: ‹remove voter›, conifg: ‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 > num_replicas:5 num_voters:3 constraints:<num_replicas:1 constraints:<key:"region" value:"2" > > constraints:<num_replicas:1 constraints:<key:"region" value:"3" > > voter_constraints:<num_replicas:3 constraints:<key:"region" value:"1" > >›, descriptor: ‹r64:/{Table/104-Max} [(n4,s4):10, (n2,s2):2, (n1,s1):7, (n5,s5):8,(n6,s6):9, next=11, gen=37]›
I230510 16:44:32.634316 45052 13@kv/kvserver/allocator/plan/replicate.go:569 ⋮ [T1,n4,replicate,s4,r64/10:‹/{Table/104-Max}›] 12742  filtered unremovable replicas from [(n4,s4):10 (n2,s2):2 (n1,s1):7 (n5,s5):8 (n6,s6):9] to get [(n4,s4):10 (n2,s2):2 (n1,s1):7 (n5,s5):8 (n6,s6):9] as candidates for removal: ‹[10*:58, 2:58, 7:58, 8:58, 9:58]›
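
To make the failure mode concrete, here is a minimal sketch in Go of the comparison the replicate queue is effectively making (illustrative only, not the actual allocator code): with the stale config's num_voters:3 and five voters present, the range looks over-replicated and a `remove voter` repair is chosen.

```go
package main

import "fmt"

// action is a simplified stand-in for the allocator's repair decision.
type action string

const (
	addVoter    action = "add voter"
	removeVoter action = "remove voter"
	noop        action = "no action"
)

// computeAction mirrors, very roughly, how the replicate queue compares the
// number of current voters against the span config's target voter count.
func computeAction(configuredVoters, currentVoters int) action {
	switch {
	case currentVoters < configuredVoters:
		return addVoter
	case currentVoters > configuredVoters:
		return removeVoter
	default:
		return noop
	}
}

func main() {
	// Up-to-date config (num_voters=5) with 5 voters: nothing to repair.
	fmt.Println(computeAction(5, 5)) // no action
	// Stale config (num_voters=3) still applied on s4: the same 5-voter
	// range looks over-replicated, so the queue removes voters.
	fmt.Println(computeAction(3, 5)) // remove voter
}
```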

@erikgrinaker
Contributor

Sounds like a mux rangefeed issue. Is `changefeed.mux_rangefeed.enabled` enabled metamorphically when this fails?

@kvoli
Collaborator

kvoli commented May 10, 2023

We are: the log shows `initialized metamorphic constant "changefeed.mux_rangefeed.enabled" with value true`.

All metamorphic constants
initialized metamorphic constant "coldata-batch-size" with value 1114
initialized metamorphic constant "span-reuse-rate" with value 89
initialized metamorphic constant "storage.value_blocks.enabled" with value false
initialized metamorphic constant "mvcc-max-iters-before-seek" with value 1
initialized metamorphic constant "addsst-rewrite-concurrency" with value 10
initialized metamorphic constant "kv-batch-size" with value 1
initialized metamorphic constant "inverted-joiner-batch-size" with value 1
initialized metamorphic constant "ColIndexJoin-using-streamer-batch-size" with value 7634381
initialized metamorphic constant "parallel-scan-result-threshold" with value 8567
initialized metamorphic constant "copy-batch-size" with value 45962
initialized metamorphic constant "async-IE-result-channel-buffer-size" with value 23
initialized metamorphic constant "use-index-lookup-for-descriptors-in-database" with value false
initialized metamorphic constant "copy-fast-path-enabled-default" with value false
initialized metamorphic constant "backup-merge-file-buffer-size" with value 7243737
initialized metamorphic constant "restore-worker-concurrency" with value 1
initialized metamorphic constant "changefeed.mux_rangefeed.enabled" with value true
initialized metamorphic constant "parse-json-impl" with value 1
initialized metamorphic constant "coldata-batch-size" with value 2466
initialized metamorphic constant "COCKROACH_ENABLE_HDR_HISTOGRAMS" with value true
initialized metamorphic constant "mvcc-incremental-iter-tbi" with value false
initialized metamorphic constant "storage.value_blocks.enabled" with value false
initialized metamorphic constant "mvcc-max-iters-before-seek" with value 1
initialized metamorphic constant "disable-checksstconflicts-range-key-masking" with value true
initialized metamorphic constant "addsst-rewrite-concurrency" with value 6
initialized metamorphic constant "addsst-rewrite-forced" with value true
initialized metamorphic constant "raft-log-truncation-clearrange-threshold" with value 748694
initialized metamorphic constant "kv-batch-size" with value 1
initialized metamorphic constant "datum-row-converter-batch-size" with value 1
initialized metamorphic constant "row-container-rows-per-chunk-shift" with value 1
initialized metamorphic constant "inverted-joiner-batch-size" with value 1
initialized metamorphic constant "spilling-queue-initial-len" with value 6
initialized metamorphic constant "merge-joiner-groups-buffer" with value 6
initialized metamorphic constant "direct-scans-enabled" with value true
initialized metamorphic constant "ColIndexJoin-batch-size" with value 1455778
initialized metamorphic constant "max-batch-size" with value 6639
initialized metamorphic constant "resolve-membership-single-scan-enabled" with value false
initialized metamorphic constant "async-IE-result-channel-buffer-size" with value 13
initialized metamorphic constant "multiple_active_portals_enabled" with value true
initialized metamorphic constant "backup-merge-file-buffer-size" with value 9949933
initialized metamorphic constant "write-metadata-sst" with value true
initialized metamorphic constant "backup-read-manifest" with value true
initialized metamorphic constant "changefeed.event_consumer_worker_queue_size" with value 1
initialized metamorphic constant "changefeed.balance_range_distribution.enable" with value true
initialized metamorphic constant "changefeed.fast_gzip.enabled" with value false
initialized metamorphic constant "changefeed.new_webhook_sink_enabled" with value true
initialized metamorphic constant "changefeed.new_pubsub_sink_enabled" with value true
I230510 16:44:03.008018 1 (gostd) rand.go:243  [-] 1  random seed: -2224482614678722634

I'm leaning towards skipping this test, and all other allocator-related tests which use a test cluster, under metamorphic constants, like #101882.

I was stressing on 506a55a, which has the fixes from #102094, so this problem doesn't appear to be solved (if it is the mux rangefeed).

x-ref #100783
x-ref #99207

@erikgrinaker
Contributor

CC @miretskiy, this appears to reproduce rather easily.

@erikgrinaker
Contributor

Btw, the test could also consider just overriding the setting in question (always disabling mux) instead of skipping all metamorphic tests entirely, with a TODO to remove the override once mux rangefeeds are fixed.
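
For illustration, a sketch of what such an override could look like, assuming the test already has a `*gosql.DB` handle to the cluster; the helper name `disableMuxRangefeed` is hypothetical, not an existing function:

```go
package kvserver_test

import (
	gosql "database/sql"
	"testing"
)

// disableMuxRangefeed pins the metamorphically-varied setting to false so the
// test is not affected by mux rangefeeds. Hypothetical helper; db is assumed
// to be a SQL connection to the test cluster opened by the test harness.
// TODO: remove the override once mux rangefeed issues are fixed.
func disableMuxRangefeed(t *testing.T, db *gosql.DB) {
	t.Helper()
	if _, err := db.Exec(
		`SET CLUSTER SETTING changefeed.mux_rangefeed.enabled = false`,
	); err != nil {
		t.Fatal(err)
	}
}
```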

@kvoli
Collaborator

kvoli commented May 10, 2023

> Btw, the test could also consider just overriding the setting in question (always disabling mux) instead of skipping all metamorphic tests entirely, with a TODO to remove the override once mux rangefeeds are fixed.

Yeah, that sounds like a better solution. These failures have created a lot of noise and wasted debugging time, but if we're shipping these on by default in 23.2, we need to re-enable the metamorphic setting eventually.

If these ranges were larger, you could have under-replication for a dangerous amount of time.

@erikgrinaker
Contributor

erikgrinaker commented May 10, 2023

Yeah, sorry about that.

@miretskiy Do we need to disable this metamorphism again until we've fixed the problems? These constant flakes cause a fair bit of additional work for test triagers.

@miretskiy
Contributor

It would be very sad to disable mux on master; we would have no signal or tests to know whether the problems are fixed. Note that seeing mux enabled doesn't mean that it's actually enabled (I believe there was a recent fix for this, #101895 (comment), but it doesn't look like it has merged).

Regardless, this failure seems to happen very infrequently; 60+ minutes of stressing to reproduce.
I wonder if we can try with the mux rangefeed setting simply set to false and see if we still see the flake.
And of course, then flip it to true to see if we see it again.

@kvoli
Collaborator

kvoli commented May 10, 2023

> I wonder if we can try with the mux rangefeed setting simply set to false and see if we still see the flake.

SGTM - I'm stressing atm with mux rangefeed disabled.

I'm slightly suspicious that it is mux rangefeed, since these failures began a month after the earlier flakes we traced to mux rangefeed in #99207, and this test has failed at least twice a week since the initial failure.

@kvoli
Collaborator

kvoli commented May 11, 2023

It looks like mux rangefeeds were being wrongly blamed, AFAICT.

I've tested both with `SET CLUSTER SETTING changefeed.mux_rangefeed.enabled = false` and with metamorphic disabled (`skip.UnderMetamorphic(t)`); in both cases the failure still occurs after stressing for around 100 runs.

Side note: stressing without metamorphic constants is about 6x faster.
# w/ metamorphic enabled
97 runs completed, 1 failure, over 12m57s
# w/o metamorphic enabled
118 runs completed, 1 failure, over 2m52s

The issue that comes up is that the existing leaseholder receives the span config update at t1, does the correct promotions, and then rebalances itself away, including the lease, at t2 to a new leaseholder. The new leaseholder hasn't yet received the new span config and proceeds to wreak havoc until it receives the span config update at t3.

I'm surprised we haven't seen this more often in tests that involve span configuration changes. I'm also not sure how we could easily get around it without the current LH and the new LH agreeing on a minimum span config sequence number before transferring.

I'm sorry we jumped at mux as the root cause here, @miretskiy! I've opened #103083 to deflake the test.
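
The deflake is roughly of this shape: only assert on a prefix of the add-voter events, so later noise from a leaseholder acting on a stale config doesn't fail the test. A simplified sketch, not the actual test code:

```go
package main

import (
	"fmt"
	"reflect"
)

// checkAddVoterPrefix verifies that the first add-voter events match the
// expected promotions while tolerating later, unrelated events caused by a
// leaseholder acting on a stale span config. Simplified sketch only; the
// real test inspects range log events.
func checkAddVoterPrefix(events, expected []string) error {
	if len(events) < len(expected) {
		return fmt.Errorf("expected at least %d add-voter events, got %d",
			len(expected), len(events))
	}
	if !reflect.DeepEqual(events[:len(expected)], expected) {
		return fmt.Errorf("unexpected prefix: got %v, want %v",
			events[:len(expected)], expected)
	}
	return nil
}

func main() {
	// Trailing "add voter s5" is the stale-config noise seen in the failure.
	got := []string{"promote non-voter s6", "promote non-voter s5", "add voter s5"}
	want := []string{"promote non-voter s6", "promote non-voter s5"}
	fmt.Println(checkAddVoterPrefix(got, want)) // <nil>
}
```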

@kvoli
Collaborator

kvoli commented May 11, 2023

Filed #103086 (comment).

Fixing the problem comes with its own issues, since it requires blocking a new leaseholder (or the selection of one to begin with) from performing replication changes until it has at least the outgoing leaseholder's config version or newer.

Perhaps it's not a material issue if we still disallow removing replicas during that window and constrain up-replication to only handle dead nodes?

In any case, this isn't a high-priority problem to fix, since we expect the situation to eventually resolve itself.
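
Purely as a hypothetical illustration of the guard discussed above (nothing like this exists in the codebase), the idea would be for the incoming leaseholder to defer replication changes until its span config sequence catches up with the one handed over by the outgoing leaseholder:

```go
package main

import "fmt"

// storeView is a toy model of a store's view of the span config, which
// carries a sequence number that only moves forward as updates arrive.
type storeView struct {
	configSeq int64
}

// canPerformReplicationChanges is a hypothetical guard: the incoming
// leaseholder holds off on up/down-replication until its span config is at
// least as new as the one the outgoing leaseholder acted on (minSeq, handed
// over alongside the lease). This is the idea sketched above, not an
// existing API.
func canPerformReplicationChanges(s storeView, minSeq int64) bool {
	return s.configSeq >= minSeq
}

func main() {
	outgoingSeq := int64(42) // config the outgoing leaseholder used
	incoming := storeView{configSeq: 37}

	if !canPerformReplicationChanges(incoming, outgoingSeq) {
		fmt.Println("defer replication changes: span config is stale")
	}
	// Once the incoming leaseholder catches up, changes are allowed again.
	incoming.configSeq = 42
	fmt.Println(canPerformReplicationChanges(incoming, outgoingSeq)) // true
}
```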

kvoli added a commit to kvoli/cockroach that referenced this issue May 11, 2023
It is possible for span config updates to arrive at different times
between stores. `TestPromoteNonVoterInAddVoter` was flaking when the
incoming leaseholder would act upon a stale span config, before
receiving the updated one which the outgoing leaseholder used.

This resulted in the test failing as more than just the two expected
promotion events appeared in the range log, as the incoming leaseholder
removed voters, then subsequently added them back upon receiving the
up to date span configuration.

This commit checks on the prefix of the range log add voter events, to
avoid failing the test when this untimely behavior occurs.

Fixes: cockroachdb#101519

Release note: None
craig bot pushed a commit that referenced this issue May 15, 2023
103077: docgen: add EXPLAIN, EXPLAIN ANALYZE options to diagrams r=michae2 a=taroface

Add various `EXPLAIN` and `EXPLAIN ANALYZE` options that were documented but not exposed in the SQL diagrams. Also add `REDACT`, which will be documented with cockroachdb/docs#16929.

Epic: none

Release note: none

Release justification: non-production code change

103083: kvserver: deflake TestPromoteNonVoterInAddVoter r=andrewbaptist a=kvoli

It is possible for span config updates to arrive at different times
between stores. `TestPromoteNonVoterInAddVoter` was flaking when the
incoming leaseholder would act upon a stale span config, before
receiving the updated one which the outgoing leaseholder used.

This resulted in the test failing as more than just the two expected
promotion events appeared in the range log, as the incoming leaseholder
removed voters, then subsequently added them back upon receiving the
up to date span configuration. 

#103086 tracks this issue.

This PR checks on the prefix of the range log add voter events, to
avoid failing the test when this untimely behavior occurs.

Stressed overnight, removing the skip under stress flag:

```
dev test pkg/kv/kvserver -f TestPromoteNonVoterInAddVoter -v --stress --stress-args="-p 4"
...
27158 runs so far, 0 failures, over 12h56m55s
```

This PR also adds additional (v=6) logging of the range descriptor and span config,
as these come in handy when debugging failures such as this.

Fixes: #101519

Release note: None

103332: cloud/kubernetes: bump version to v23.1.0 r=ZhouXing19 a=ZhouXing19

Epic: None
Release note: None

103335: util/version: update roachtest version for 23.2 to 23.1.0 r=ZhouXing19 a=ZhouXing19

Epic: None
Release note: None

103338: ci: additionally build `workload` and `dev` for all Unix systems r=rail,srosenberg a=rickystewart

This includes ARM machines and macOS systems.

Epic: none
Release note: None

Co-authored-by: Ryan Kuo <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@craig craig bot closed this as completed in cf0faad May 15, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue Jun 30, 2023
It is possible for span config updates to arrive at different times
between stores. `TestPromoteNonVoterInAddVoter` was flaking when the
incoming leaseholder would act upon a stale span config, before
receiving the updated one which the outgoing leaseholder used.

This resulted in the test failing as more than just the two expected
promotion events appeared in the range log, as the incoming leaseholder
removed voters, then subsequently added them back upon receiving the
up to date span configuration.

This commit checks on the prefix of the range log add voter events, to
avoid failing the test when this untimely behavior occurs.

Fixes: cockroachdb#101519

Release note: None