cli: TestRemoveDeadReplicas failed #75133

Closed
Tracked by #75639
cockroach-teamcity opened this issue Jan 19, 2022 · 28 comments · Fixed by #89150
Labels: A-kv, branch-master, C-test-failure, O-robot, S-3, skipped-test

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jan 19, 2022

cli.TestRemoveDeadReplicas failed with artifacts on master @ 912964e02ddd951c77d4f71981ae18b3894e9084:

replica has lost quorum, recovering: r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r2:/System/NodeLiveness{-Max} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r3:/System/{NodeLivenessMax-tsd} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r3:/System/{NodeLivenessMax-tsd} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r4:/System{/tsd-tse} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r4:/System{/tsd-tse} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r5:/{Systemtse-Table/SystemConfigSpan/Start} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r5:/{Systemtse-Table/SystemConfigSpan/Start} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r6:/Table/{SystemConfigSpan/Start-11} [(n3,s3):4, (n4,s4):2, (n2,s2):3, (n1,s1):5, next=6, gen=10] -> r6:/Table/{SystemConfigSpan/Start-11} [(n2,s2):6, next=7, gen=10]
replica has lost quorum, recovering: r7:/Table/1{1-2} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r7:/Table/1{1-2} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r8:/Table/1{2-3} [(n3,s3):4, (n2,s2):2, (n4,s4):3, (n1,s1):5, next=6, gen=10] -> r8:/Table/1{2-3} [(n2,s2):6, next=7, gen=10]
replica has lost quorum, recovering: r9:/Table/1{3-4} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r9:/Table/1{3-4} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r10:/Table/1{4-5} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r10:/Table/1{4-5} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r11:/Table/1{5-6} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r11:/Table/1{5-6} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r12:/Table/1{6-7} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r12:/Table/1{6-7} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r13:/Table/1{7-8} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r13:/Table/1{7-8} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r14:/Table/1{8-9} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r14:/Table/1{8-9} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r15:/Table/{19-20} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r15:/Table/{19-20} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r16:/Table/2{0-1} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r16:/Table/2{0-1} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r17:/Table/2{1-2} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r17:/Table/2{1-2} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r18:/Table/2{2-3} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r18:/Table/2{2-3} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r19:/Table/2{3-4} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r19:/Table/2{3-4} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r20:/Table/2{4-5} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r20:/Table/2{4-5} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r21:/Table/2{5-6} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r21:/Table/2{5-6} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r22:/Table/2{6-7} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r22:/Table/2{6-7} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r23:/Table/2{7-8} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r23:/Table/2{7-8} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r24:/Table/2{8-9} [(n3,s3):4, (n2,s2):2, (n4,s4):3, (n1,s1):5, next=6, gen=10] -> r24:/Table/2{8-9} [(n2,s2):6, next=7, gen=10]
replica has lost quorum, recovering: r25:/{Table/29-NamespaceTable/30} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r25:/{Table/29-NamespaceTable/30} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r26:/NamespaceTable/{30-Max} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r26:/NamespaceTable/{30-Max} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r27:/{NamespaceTable/Max-Table/32} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r27:/{NamespaceTable/Max-Table/32} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r28:/Table/3{2-3} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r28:/Table/3{2-3} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r29:/Table/3{3-4} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r29:/Table/3{3-4} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r30:/Table/3{4-5} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r30:/Table/3{4-5} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r31:/Table/3{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r31:/Table/3{5-6} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r32:/Table/3{6-7} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r32:/Table/3{6-7} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r33:/Table/3{7-8} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r33:/Table/3{7-8} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r34:/Table/3{8-9} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r34:/Table/3{8-9} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r35:/Table/{39-40} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r35:/Table/{39-40} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r36:/Table/4{0-1} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r36:/Table/4{0-1} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r37:/Table/4{1-2} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r37:/Table/4{1-2} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r38:/Table/4{2-3} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r38:/Table/4{2-3} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r39:/Table/4{3-4} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r39:/Table/4{3-4} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r40:/Table/4{4-5} [(n2,s2):4, (n4,s4):2, (n3,s3):3, (n1,s1):5, next=6, gen=10] -> r40:/Table/4{4-5} [(n2,s2):6, next=7, gen=10]
replica has lost quorum, recovering: r41:/Table/4{5-6} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r41:/Table/4{5-6} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r42:/Table/4{6-7} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r42:/Table/4{6-7} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r43:/{Table/47-Max} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r43:/{Table/47-Max} [(n2,s2):5, next=6, gen=6]
aborting intent: /Local/Range/System/tsd/RangeDescriptor (txn 4435d938-2f83-426c-a270-6aff4c966650)
Scanning replicas on store cluster_id:00578a7c-aef7-4181-b427-3f8d8aaa401a node_id:2 store_id:2  for dead peers []

id	is_live	replicas	is_decommissioning	membership	is_draining
3	false	43	true	decommissioning	false
4	false	43	true	decommissioning	false
    debug_test.go:391: expected replicas on {1,2,5,6} but got {1,2,5}
    --- FAIL: TestRemoveDeadReplicas/2/4/r=4 (26.39s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-parallel=4

/cc @cockroachdb/server @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-12480

cockroach-teamcity added the branch-master, C-test-failure, and O-robot labels on Jan 19, 2022
blathers-crl bot added the T-server-and-security label on Jan 19, 2022
@tbg
Member

tbg commented Jan 19, 2022

@irfansharif could this be fallout from the zonecfg change? I don't think we've touched this test. @aliher1911 has been working on LOQ recovery but I don't think he touched this test (?)

irfansharif self-assigned this on Jan 20, 2022
@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ e1068d77afbd39b162978281c9da7cbea49c1c3a:

replica has lost quorum, recovering: r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r2:/System/NodeLiveness{-Max} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r3:/System/{NodeLivenessMax-tsd} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r3:/System/{NodeLivenessMax-tsd} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r4:/System{/tsd-tse} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r4:/System{/tsd-tse} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r5:/{Systemtse-Table/SystemConfigSpan/Start} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r5:/{Systemtse-Table/SystemConfigSpan/Start} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r6:/Table/{SystemConfigSpan/Start-11} [(n3,s3):4, (n4,s4):2, (n2,s2):3, (n1,s1):5, next=6, gen=10] -> r6:/Table/{SystemConfigSpan/Start-11} [(n2,s2):6, next=7, gen=10]
replica has lost quorum, recovering: r7:/Table/1{1-2} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r7:/Table/1{1-2} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r8:/Table/1{2-3} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r8:/Table/1{2-3} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r9:/Table/1{3-4} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r9:/Table/1{3-4} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r10:/Table/1{4-5} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r10:/Table/1{4-5} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r11:/Table/1{5-6} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r11:/Table/1{5-6} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r12:/Table/1{6-7} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r12:/Table/1{6-7} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r13:/Table/1{7-8} [(n2,s2):4, (n3,s3):2, (n4,s4):3, (n1,s1):5, next=6, gen=10] -> r13:/Table/1{7-8} [(n2,s2):6, next=7, gen=10]
replica has lost quorum, recovering: r14:/Table/1{8-9} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r14:/Table/1{8-9} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r15:/Table/{19-20} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r15:/Table/{19-20} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r16:/Table/2{0-1} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r16:/Table/2{0-1} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r17:/Table/2{1-2} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r17:/Table/2{1-2} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r18:/Table/2{2-3} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r18:/Table/2{2-3} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r19:/Table/2{3-4} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r19:/Table/2{3-4} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r20:/Table/2{4-5} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r20:/Table/2{4-5} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r21:/Table/2{5-6} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r21:/Table/2{5-6} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r22:/Table/2{6-7} [(n4,s4):4, (n3,s3):2, (n2,s2):3, (n1,s1):5, next=6, gen=10] -> r22:/Table/2{6-7} [(n2,s2):6, next=7, gen=10]
replica has lost quorum, recovering: r23:/Table/2{7-8} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r23:/Table/2{7-8} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r24:/Table/2{8-9} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r24:/Table/2{8-9} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r25:/{Table/29-NamespaceTable/30} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r25:/{Table/29-NamespaceTable/30} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r26:/NamespaceTable/{30-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r26:/NamespaceTable/{30-Max} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r27:/{NamespaceTable/Max-Table/32} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r27:/{NamespaceTable/Max-Table/32} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r28:/Table/3{2-3} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r28:/Table/3{2-3} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r29:/Table/3{3-4} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r29:/Table/3{3-4} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r30:/Table/3{4-5} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r30:/Table/3{4-5} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r31:/Table/3{5-6} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5, gen=6] -> r31:/Table/3{5-6} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r32:/Table/3{6-7} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r32:/Table/3{6-7} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r33:/Table/3{7-8} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r33:/Table/3{7-8} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r34:/Table/3{8-9} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r34:/Table/3{8-9} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r35:/Table/{39-40} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r35:/Table/{39-40} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r36:/Table/4{0-1} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=6] -> r36:/Table/4{0-1} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r37:/Table/4{1-2} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r37:/Table/4{1-2} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r38:/Table/4{2-3} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r38:/Table/4{2-3} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r39:/Table/4{3-4} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=6] -> r39:/Table/4{3-4} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r40:/Table/4{4-5} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):4, next=5, gen=6] -> r40:/Table/4{4-5} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r41:/Table/4{5-6} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r41:/Table/4{5-6} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r42:/Table/4{6-7} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=6] -> r42:/Table/4{6-7} [(n2,s2):5, next=6, gen=6]
replica has lost quorum, recovering: r43:/{Table/47-Max} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=6] -> r43:/{Table/47-Max} [(n2,s2):5, next=6, gen=6]
aborting intent: /Local/Range/System/tsd/RangeDescriptor (txn 1f622e19-4a93-4ede-9ed3-202a9adde8ef)
Scanning replicas on store cluster_id:1f678502-228d-44a8-9d7a-f791bb7fb4ed node_id:2 store_id:2  for dead peers []

id	is_live	replicas	is_decommissioning	membership	is_draining
3	false	43	true	decommissioning	false
4	false	43	true	decommissioning	false
    debug_test.go:391: expected replicas on {1,2,5,6} but got {1,2,6}
    --- FAIL: TestRemoveDeadReplicas/2/4/r=4 (24.82s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-parallel=4

This test on roachdash | Improve this report!

@irfansharif
Contributor

I'll try looking today, though dreading the 25s runtime for the test.

@irfansharif
Contributor

PS: I'm not actively investigating this. Letting it sit on the server board to remind ourselves to address it during stability at the latest. My earlier stress attempts were unfruitful.

jtsiros added the A-kv label and removed the T-server-and-security label on Feb 17, 2022
@tbg
Member

tbg commented Feb 24, 2022

F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1  attempted to change replica's ID from 11 to 2
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !goroutine 818600 [running]:
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x1)
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x8a
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc0054f0300, {{{0xc00875c510, 0x24}, {0x4a2d69d, 0x1}, {0x0, 0x0}, {0x0, 0x0}}, 0x16d6b712f0ed60fd, ...})
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:239 +0x97
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x5d728e8, 0xc0073853e0}, 0x1, 0x4, 0x0, {0x4a75cc3, 0x2e}, {0xc001558258, 0x2, 0x2})
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:60 +0x385
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:834
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked(0xc00710b500, {0x5d728e8, 0xc0073853e0}, 0xc000fbe540)
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_init.go:350 +0x3d5
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescRaftMuLocked(0xc00710b500, {0x5d728e8, 0xc0073853e0}, 0xc0064d4958)
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_init.go:318 +0x97
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleDescResult(...)
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_result.go:268
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc00710b610, {0x5d728e8, 0xc0073853e0}, 0xc003d54040)
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_state_machine.go:1324 +0x5d2
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc00710b610, {0x5d728e8, 0xc0073853e0}, {0x5de21a0, 0xc003d54008})
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_state_machine.go:1183 +0x655
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7f71cdee61b0, 0xc00710b880}, 0xc001559320)
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:206 +0x158
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc0015598a8, {0x5d728e8, 0xc0073853e0}, {0x5da59c0, 0xc00710b820})
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:291 +0x205
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc0015598a8, {0x5d728e8, 0xc0073853e0})
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:247 +0x9a
F220224 12:00:05.829878 818600 kv/kvserver/replica_init.go:350  [n2,s2,r1/11:/{Min-System/NodeL…},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(0xc00710b500, {0x5d728e8, 0xc0073853e0}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})

In https://teamcity.cockroachdb.com/viewLog.html?buildId=4448409&buildTypeId=Cockroach_UnitTests

@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ d84f07e856dbb3d62f0336f4e8439280e2a018ed:

=== RUN   TestRemoveDeadReplicas
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/511bb7a3a798948c9c27fde89a1fae1e/logTestRemoveDeadReplicas1858003926
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestRemoveDeadReplicas
    debug_test.go:414: -- test log scope end --
--- FAIL: TestRemoveDeadReplicas (24.79s)
=== RUN   TestRemoveDeadReplicas/2/4/r=4
    debug_test.go:234: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD - WriteTooOld flag converted to WriteTooOldError): "unnamed" meta={id=7fd6ce0d key=/Local/Range/System/tsd/RangeDescriptor pri=0.01740530 epo=0 ts=1647670503.023559277,1 min=1647670503.002154173,0 seq=1} lock=true stat=PENDING rts=1647670503.002154173,0 wto=false gul=1647670503.502154173,0
    --- FAIL: TestRemoveDeadReplicas/2/4/r=4 (6.72s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ e02793d3e214c5b716149963369aa72f9338ac0d:

replica has not lost quorum, skipping: r10:/Table/1{4-5} [(n1,s1):7, (n4,s4):2, (n2,s2):3, next=8, gen=20]
replica has not lost quorum, skipping: r11:/Table/1{5-6} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has lost quorum, recovering: r12:/Table/1{6-7} [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=8] -> r12:/Table/1{6-7} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r15:/Table/{19-20} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has lost quorum, recovering: r16:/Table/2{0-1} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=8] -> r16:/Table/2{0-1} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r17:/Table/2{1-2} [(n4,s4):4, (n2,s2):2, (n3,s3):14, next=15, gen=48] -> r17:/Table/2{1-2} [(n2,s2):15, next=16, gen=48]
replica has lost quorum, recovering: r18:/Table/2{2-3} [(n3,s3):6VOTER_DEMOTING_LEARNER, (n4,s4):2, (n2,s2):3, (n1,s1):7VOTER_INCOMING, next=8, gen=18] -> r18:/Table/2{2-3} [(n2,s2):8, next=9, gen=18]
replica has not lost quorum, skipping: r19:/Table/2{3-4} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r21:/Table/2{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has lost quorum, recovering: r22:/Table/2{6-7} [(n3,s3):6, (n4,s4):2, (n2,s2):3, next=7, gen=16] -> r22:/Table/2{6-7} [(n2,s2):7, next=8, gen=16]
replica has lost quorum, recovering: r23:/Table/2{7-8} [(n2,s2):6, (n4,s4):2, (n3,s3):3, next=7, gen=16] -> r23:/Table/2{7-8} [(n2,s2):7, next=8, gen=16]
replica has not lost quorum, skipping: r26:/NamespaceTable/{30-Max} [(n2,s2):4, (n4,s4):2, (n1,s1):5, next=6, gen=12]
replica has not lost quorum, skipping: r27:/{NamespaceTable/Max-Table/32} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r28:/Table/3{2-3} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r29:/Table/3{3-4} [(n1,s1):7, (n2,s2):2, (n3,s3):8, next=9, gen=24]
replica has lost quorum, recovering: r31:/Table/3{5-6} [(n3,s3):10, (n2,s2):6, (n4,s4):8, next=11, gen=32] -> r31:/Table/3{5-6} [(n2,s2):11, next=12, gen=32]
replica has not lost quorum, skipping: r32:/Table/3{6-7} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r33:/Table/3{7-8} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r34:/Table/3{8-9} [(n1,s1):5, (n3,s3):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r36:/Table/4{0-1} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has lost quorum, recovering: r38:/Table/4{2-3} [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=8] -> r38:/Table/4{2-3} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r39:/Table/4{3-4} [(n1,s1):1, (n3,s3):2, (n2,s2):4, next=5, gen=8]
replica has not lost quorum, skipping: r42:/Table/4{6-7} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r44:/{Table/50-Max} [(n2,s2):4, (n4,s4):2, (n3,s3):8, next=9, gen=24] -> r44:/{Table/50-Max} [(n2,s2):9, next=10, gen=24]
aborting intent: /Local/Range/Table/SystemConfigSpan/Start/RangeDescriptor (txn 1121bdb7-e748-4666-a1cd-96f4f7e6497e)
aborting intent: /Local/Range/Table/22/RangeDescriptor (txn 7bbd52c8-a839-47b0-a043-9ba233dde995)
Scanning replicas on store cluster_id:a9e9f2a7-899a-44e5-8e2f-60e134fa8c40 node_id:2 store_id:2  for dead peers []
replica has not lost quorum, skipping: r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r4:/System{/tsd-tse} [(n1,s1):7, (n3,s3):2, (n2,s2):6, next=8, gen=20]
replica has not lost quorum, skipping: r7:/Table/1{1-2} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r10:/Table/1{4-5} [(n1,s1):7, (n4,s4):2, (n2,s2):3, next=8, gen=20]
replica has not lost quorum, skipping: r11:/Table/1{5-6} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r15:/Table/{19-20} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r19:/Table/2{3-4} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r21:/Table/2{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r26:/NamespaceTable/{30-Max} [(n2,s2):4, (n4,s4):2, (n1,s1):5, next=6, gen=12]
replica has not lost quorum, skipping: r27:/{NamespaceTable/Max-Table/32} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r28:/Table/3{2-3} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r29:/Table/3{3-4} [(n1,s1):7, (n2,s2):2, (n3,s3):8, next=9, gen=24]
replica has not lost quorum, skipping: r32:/Table/3{6-7} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r33:/Table/3{7-8} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r34:/Table/3{8-9} [(n1,s1):5, (n3,s3):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r36:/Table/4{0-1} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r39:/Table/4{3-4} [(n1,s1):1, (n3,s3):2, (n2,s2):4, next=5, gen=8]
replica has not lost quorum, skipping: r42:/Table/4{6-7} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
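
For readers following the log above, here is a minimal sketch of the majority arithmetic that appears to separate "has lost quorum, recovering" from "has not lost quorum, skipping". It is not the code the tool runs. The `Replica` and `ReplicaType` types and the `lostQuorum` helper are hypothetical, and the dead set {n3, n4} is an assumption inferred from the node status tables in the earlier reports. For the joint configuration on r18, the sketch assumes a majority is required in both the outgoing and the incoming voter sets.

```go
package main

import "fmt"

// ReplicaType is a simplified, hypothetical stand-in for the replica types
// printed in the descriptors above. It is not the real CockroachDB enum.
type ReplicaType int

const (
	VoterFull            ReplicaType = iota // plain voter, e.g. (n2,s2):3
	VoterIncoming                           // e.g. (n1,s1):7VOTER_INCOMING
	VoterDemotingLearner                    // e.g. (n3,s3):6VOTER_DEMOTING_LEARNER
)

// Replica is a hypothetical, stripped-down replica descriptor.
type Replica struct {
	NodeID int
	Type   ReplicaType
}

// hasMajority reports whether strictly more than half of the voters are live.
func hasMajority(voters []int, dead map[int]bool) bool {
	live := 0
	for _, n := range voters {
		if !dead[n] {
			live++
		}
	}
	return live >= len(voters)/2+1
}

// lostQuorum reports whether the range cannot reach a majority. In a joint
// configuration it checks the outgoing and incoming voter sets separately
// and treats a failure in either one as a loss of quorum.
func lostQuorum(replicas []Replica, dead map[int]bool) bool {
	var outgoing, incoming []int
	for _, r := range replicas {
		switch r.Type {
		case VoterFull:
			outgoing = append(outgoing, r.NodeID)
			incoming = append(incoming, r.NodeID)
		case VoterIncoming:
			incoming = append(incoming, r.NodeID)
		case VoterDemotingLearner:
			outgoing = append(outgoing, r.NodeID)
		}
	}
	return !hasMajority(outgoing, dead) || !hasMajority(incoming, dead)
}

func main() {
	// Assumption: n3 and n4 are the dead nodes.
	dead := map[int]bool{3: true, 4: true}

	r11 := []Replica{{1, VoterFull}, {4, VoterFull}, {2, VoterFull}}
	r12 := []Replica{{4, VoterFull}, {3, VoterFull}, {2, VoterFull}}
	r18 := []Replica{
		{3, VoterDemotingLearner}, {4, VoterFull},
		{2, VoterFull}, {1, VoterIncoming},
	}

	fmt.Println("r11 lost quorum:", lostQuorum(r11, dead)) // false: 2 of 3 voters live
	fmt.Println("r12 lost quorum:", lostQuorum(r12, dead)) // true: 1 of 3 voters live
	fmt.Println("r18 lost quorum:", lostQuorum(r18, dead)) // true: outgoing set has 1 of 3 live
}
```

Under the same assumption, the four-voter descriptors in the earlier reports (n1 through n4) have two live voters where three are needed, which matches every range there being reported as having lost quorum.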
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ 63ea9139e2ca996e38b5fe7c7b43a97e625242f5:

replica has not lost quorum, skipping: r9:/Table/1{3-4} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r10:/Table/1{4-5} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has lost quorum, recovering: r11:/Table/1{5-6} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r11:/Table/1{5-6} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r12:/Table/1{6-7} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has lost quorum, recovering: r14:/Table/1{8-9} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r14:/Table/1{8-9} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r15:/Table/{19-20} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r18:/Table/2{2-3} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r19:/Table/2{3-4} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r19:/Table/2{3-4} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r21:/Table/2{5-6} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r21:/Table/2{5-6} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r22:/Table/2{6-7} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r23:/Table/2{7-8} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has lost quorum, recovering: r24:/Table/2{8-9} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r24:/Table/2{8-9} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r25:/{Table/29-NamespaceTable/30} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r25:/{Table/29-NamespaceTable/30} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r26:/NamespaceTable/{30-Max} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r28:/Table/3{2-3} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r28:/Table/3{2-3} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r32:/Table/3{6-7} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r33:/Table/3{7-8} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r34:/Table/3{8-9} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r35:/Table/{39-40} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r35:/Table/{39-40} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has lost quorum, recovering: r38:/Table/4{2-3} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r38:/Table/4{2-3} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r39:/Table/4{3-4} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r39:/Table/4{3-4} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r40:/Table/4{4-5} [(n1,s1):1, (n3,s3):4, (n2,s2):3, next=5, gen=8]
replica has not lost quorum, skipping: r41:/Table/4{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r42:/Table/4{6-7} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r43:/Table/{47-50} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r44:/{Table/50-Max} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
aborting intent: /Local/Range/System/"tse"/RangeDescriptor (txn e67ba556-0e13-475b-ac22-d58303a3c0e3)
Scanning replicas on store cluster_id:3eaad126-edd6-4258-bfc6-6699e140f6d2 node_id:2 store_id:2  for dead peers []
replica has not lost quorum, skipping: r4:/System{/tsd-tse} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r8:/Table/1{2-3} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r9:/Table/1{3-4} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r10:/Table/1{4-5} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r12:/Table/1{6-7} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r15:/Table/{19-20} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r18:/Table/2{2-3} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r22:/Table/2{6-7} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r23:/Table/2{7-8} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r26:/NamespaceTable/{30-Max} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r32:/Table/3{6-7} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r33:/Table/3{7-8} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r34:/Table/3{8-9} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r40:/Table/4{4-5} [(n1,s1):1, (n3,s3):4, (n2,s2):3, next=5, gen=8]
replica has not lost quorum, skipping: r41:/Table/4{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r42:/Table/4{6-7} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r43:/Table/{47-50} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r44:/{Table/50-Max} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
Parameters in this failure:

  • TAGS=bazel,gss


@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ 43457fc6e2f5324823cca4156d9799453753b9cc:

replica has lost quorum, recovering: r8:/Table/1{2-3} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=8] -> r8:/Table/1{2-3} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r9:/Table/1{3-4} [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=8] -> r9:/Table/1{3-4} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r10:/Table/1{4-5} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=8] -> r10:/Table/1{4-5} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r12:/Table/1{6-7} [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=8] -> r12:/Table/1{6-7} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r15:/Table/{19-20} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r16:/Table/2{0-1} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r17:/Table/2{1-2} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has lost quorum, recovering: r18:/Table/2{2-3} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=8] -> r18:/Table/2{2-3} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r19:/Table/2{3-4} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r21:/Table/2{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r23:/Table/2{7-8} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r25:/{Table/29-NamespaceTable/30} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r27:/{NamespaceTable/Max-Table/32} [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=8] -> r27:/{NamespaceTable/Max-Table/32} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r31:/Table/3{5-6} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has lost quorum, recovering: r32:/Table/3{6-7} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=8] -> r32:/Table/3{6-7} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r33:/Table/3{7-8} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r34:/Table/3{8-9} [(n4,s4):4, (n2,s2):2, (n1,s1):5, next=6, gen=12]
replica has not lost quorum, skipping: r35:/Table/{39-40} [(n4,s4):4, (n1,s1):5, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r38:/Table/4{2-3} [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=8] -> r38:/Table/4{2-3} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r39:/Table/4{3-4} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=8] -> r39:/Table/4{3-4} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r40:/Table/4{4-5} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=8] -> r40:/Table/4{4-5} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r41:/Table/4{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r42:/Table/4{6-7} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r43:/Table/{47-50} [(n1,s1):5, (n2,s2):2, (n3,s3):3, next=6, gen=12]
replica has lost quorum, recovering: r44:/{Table/50-Max} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=8] -> r44:/{Table/50-Max} [(n2,s2):5, next=6, gen=8]
aborting intent: /Local/Range/System/tsd/RangeDescriptor (txn adb2fca2-d682-4427-9120-471b2a85af50)
aborting intent: /Local/Range/Table/16/RangeDescriptor (txn affa94ff-f3b2-4385-8159-204b5b60abe8)
Scanning replicas on store cluster_id:b41cc9ff-0801-477d-b7b3-835ce446cc54 node_id:2 store_id:2  for dead peers []
replica has not lost quorum, skipping: r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r3:/System/{NodeLivenessMax-tsd} [(n1,s1):5, (n3,s3):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r5:/{Systemtse-Table/SystemConfigSpan/Start} [(n4,s4):4, (n1,s1):5, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r6:/Table/{SystemConfigSpan/Start-11} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r15:/Table/{19-20} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r16:/Table/2{0-1} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r17:/Table/2{1-2} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r19:/Table/2{3-4} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r21:/Table/2{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r23:/Table/2{7-8} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r25:/{Table/29-NamespaceTable/30} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r31:/Table/3{5-6} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r33:/Table/3{7-8} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r34:/Table/3{8-9} [(n4,s4):4, (n2,s2):2, (n1,s1):5, next=6, gen=12]
replica has not lost quorum, skipping: r35:/Table/{39-40} [(n4,s4):4, (n1,s1):5, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r41:/Table/4{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r42:/Table/4{6-7} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r43:/Table/{47-50} [(n1,s1):5, (n2,s2):2, (n3,s3):3, next=6, gen=12]
Parameters in this failure:

  • TAGS=bazel,gss,deadlock


@tbg
Member

tbg commented Apr 14, 2022

@aliher1911 interesting failure mode:

F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1  attempted to change replica's ID from 5 to 3
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !goroutine 17411736 [running]:
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x0)
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !	github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x8a
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc0065d0480, {{{0xc013156db0, 0x24}, {0x4b8ea20, 0x1}, {0x0, 0x0}, {0x0, 0x0}}, 0x16e55f5dfbdd3dbb, ...})
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !	github.com/cockroachdb/cockroach/pkg/util/log/clog.go:237 +0xb8
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x5f703a8, 0xc01cbb1230}, 0x1, 0x4, 0x0, {0x4bd9165, 0x2e}, {0xc018812200, 0x2, 0x2})
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:60 +0x385
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !	github.com/cockroachdb/cockroach/bazel-out/k8-dbg/bin/pkg/util/log/log_channels_generated.go:834
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked(0xc009578000, {0x5f703a8, 0xc01cbb1230}, 0xc0101ea9a0)
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:344 +0x3d5
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescRaftMuLocked(0xc009578000, {0x5f703a8, 0xc01cbb1230}, 0x9655a9)
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:312 +0xcc
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleDescResult(...)
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:268
F220413 06:01:35.763332 17411736 kv/kvserver/pkg/kv/kvserver/replica_init.go:344  [n2,s2,r12/5:/Table/1{6-7},raft] 1 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc009578110, {0x5f703a8, 0xc01cbb1230}, 0xc003e31040)

@tbg added the GA-blocker label Apr 14, 2022
@tbg
Member

tbg commented Apr 14, 2022

I'm not sure this is anything new, and it happens in a test that exercises recovery from loss of quorum, but it's potentially worth a second look before GA.
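
For readers following the log output in this thread, here is a minimal Go sketch of the "has lost quorum" decision the tool prints for each range. It is an illustration only: the `replica` type and `hasLostQuorum` function are hypothetical names, not the CockroachDB implementation. It counts every replica as a voter and ignores joint configurations. The dead-node set {n3, n4} is an assumption inferred from which descriptors the logs report as having lost quorum.

```go
package main

import "fmt"

// replica is a hypothetical, simplified stand-in for one entry of a range
// descriptor such as (n3,s3):4. Every replica is treated as a voter.
type replica struct {
	NodeID    int
	ReplicaID int
}

// hasLostQuorum reports whether the replicas on live nodes no longer form
// a strict majority of the range's replicas.
func hasLostQuorum(replicas []replica, dead map[int]bool) bool {
	live := 0
	for _, r := range replicas {
		if !dead[r.NodeID] {
			live++
		}
	}
	return live < len(replicas)/2+1
}

func main() {
	// r11 as logged: [(n3,s3):4, (n4,s4):2, (n2,s2):3].
	r11 := []replica{{3, 4}, {4, 2}, {2, 3}}

	// Assumed dead nodes n3 and n4: only n2 survives, so quorum is lost.
	fmt.Println(hasLostQuorum(r11, map[int]bool{3: true, 4: true})) // true

	// Second pass with "dead peers []": nothing is considered dead.
	fmt.Println(hasLostQuorum(r11, map[int]bool{})) // false
}
```

Under these assumptions the sketch matches the two passes visible in the logs: ranges whose remaining replicas sit only on n2 are recovered, and the later scan with an empty dead-peer list skips everything.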

@erikgrinaker
Contributor

Let's make sure this gets timely attention and doesn't hold up the RCs.

@aliher1911 Have you had a look at this, and/or would you like a hand?

@erikgrinaker
Contributor

Stressed this for a couple of hours, no hits. I'm mostly interested in the "attempted to change replica's ID" failure mode here. Will have a look at some possibly related PRs, maybe some of the replica ID work that Sumeer did.

@erikgrinaker
Contributor

I suspect #75761, which fits time-wise. #76248 seems unlikely, but possible.

CC @sumeerbhola.

@erikgrinaker
Contributor

Tracked by #79074.

@erikgrinaker
Contributor

I think this could be as simple as removeDeadReplicas not writing the new replica ID to RaftReplicaIDKey here:

cockroach/pkg/cli/debug.go

Lines 1248 to 1252 in 87610fe

replicas := []roachpb.ReplicaDescriptor{{
	NodeID:    storeIdent.NodeID,
	StoreID:   storeIdent.StoreID,
	ReplicaID: desc.NextReplicaID,
}}

We're changing the replica ID in the local range descriptor, but not updating RaftReplicaIDKey. Seems plausible that it would lead to this failure.
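
A toy Go model of the hypothesis above, to make the mismatch concrete. Everything here is a simplified assumption: `replicaDesc`, `rangeDesc`, `store`, `recoverDesc` and `checkConsistent` are hypothetical names, and `PersistedReplicaID` merely stands in for whatever is stored under RaftReplicaIDKey. It does not reproduce the actual fatal code path in `replica_init.go`. It only shows how rewriting the descriptor alone leaves two different replica IDs on the surviving store.

```go
package main

import "fmt"

// replicaDesc and rangeDesc are simplified, hypothetical stand-ins for the
// descriptor types; only the fields needed for the illustration are modeled.
type replicaDesc struct {
	NodeID, StoreID, ReplicaID int
}

type rangeDesc struct {
	Replicas      []replicaDesc
	NextReplicaID int
}

// store models the two places a replica ID lives on the surviving store:
// the local range descriptor and a separately persisted value.
type store struct {
	NodeID, StoreID    int
	Desc               rangeDesc
	PersistedReplicaID int
}

// newR12Store builds n2's view of r12 before recovery, following the log
// line [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5].
func newR12Store() *store {
	return &store{
		NodeID:  2,
		StoreID: 2,
		Desc: rangeDesc{
			Replicas:      []replicaDesc{{4, 4, 4}, {3, 3, 2}, {2, 2, 3}},
			NextReplicaID: 5,
		},
		PersistedReplicaID: 3,
	}
}

// recoverDesc makes the survivor the only replica, using NextReplicaID as
// its new ID (as in the quoted snippet) and bumping NextReplicaID (as the
// logs show: next=5 becomes next=6). The persisted ID is only updated when
// alsoPersistID is set.
func recoverDesc(s *store, alsoPersistID bool) {
	newID := s.Desc.NextReplicaID
	s.Desc.Replicas = []replicaDesc{{NodeID: s.NodeID, StoreID: s.StoreID, ReplicaID: newID}}
	s.Desc.NextReplicaID = newID + 1
	if alsoPersistID {
		s.PersistedReplicaID = newID
	}
}

// checkConsistent returns an error if the descriptor and the persisted
// value disagree about this store's replica ID.
func checkConsistent(s *store) error {
	for _, r := range s.Desc.Replicas {
		if r.StoreID == s.StoreID && r.ReplicaID != s.PersistedReplicaID {
			return fmt.Errorf("replica ID mismatch: persisted %d, descriptor %d",
				s.PersistedReplicaID, r.ReplicaID)
		}
	}
	return nil
}

func main() {
	s := newR12Store()
	recoverDesc(s, false)
	fmt.Println(checkConsistent(s)) // replica ID mismatch: persisted 3, descriptor 5

	s = newR12Store()
	recoverDesc(s, true)
	fmt.Println(checkConsistent(s)) // <nil>
}
```

The IDs 3 and 5 are the same pair that appears in the fatal error for r12, which is why the hypothesis looks plausible. The model does not establish it as the root cause.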

@aliher1911 We'll probably have to handle this in the new LoQ tooling too.

@erikgrinaker removed the blocks-22.1.0-rc.1, GA-blocker, branch-release-22.1 (Used to mark GA and release blockers, technical advisories, and bugs for 22.1), deprecated-branch-release-22.1.0, and S-1 (High impact: many users impacted, serious risk of high unavailability or data loss) labels Apr 28, 2022
@erikgrinaker
Contributor

Removing GA blocker, since we've merged the 22.1.0 backport in #80626. Will look into addressing the test flake here when I have time.

@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ d63a3693480d2b62aebd15c11799d2ccdc8564a4:

replica has lost quorum, recovering: r10:/Table/1{4-5} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r10:/Table/1{4-5} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r11:/Table/1{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r12:/Table/1{6-7} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r14:/Table/1{8-9} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r16:/Table/2{0-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has lost quorum, recovering: r17:/Table/2{1-2} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r17:/Table/2{1-2} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r18:/Table/2{2-3} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r19:/Table/2{3-4} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r19:/Table/2{3-4} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r20:/Table/2{4-5} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r22:/Table/2{6-7} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r23:/Table/2{7-8} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r25:/{Table/29-NamespaceTable/30} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r26:/NamespaceTable/{30-Max} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r27:/{NamespaceTable/Max-Table/32} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has lost quorum, recovering: r28:/Table/3{2-3} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r28:/Table/3{2-3} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r30:/Table/3{4-5} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has lost quorum, recovering: r33:/Table/3{7-8} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r33:/Table/3{7-8} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r34:/Table/3{8-9} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r34:/Table/3{8-9} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r35:/Table/{39-40} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r35:/Table/{39-40} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r36:/Table/4{0-1} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r38:/Table/4{2-3} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has lost quorum, recovering: r39:/Table/4{3-4} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r39:/Table/4{3-4} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r40:/Table/4{4-5} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r40:/Table/4{4-5} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r41:/Table/4{5-6} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has lost quorum, recovering: r43:/Table/{47-50} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r43:/Table/{47-50} [(n2,s2):5, next=6, gen=8]
aborting intent: /Local/Range/System/tsd/RangeDescriptor (txn 7da2fbdc-7fe6-4fe2-bdb9-b4256d01f71f)
Scanning replicas on store cluster_id:4fe8e867-8fa4-47d2-bd6c-13ae3023e11f node_id:2 store_id:2  for dead peers []
replica has not lost quorum, skipping: r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r5:/{Systemtse-Table/SystemConfigSpan/Start} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r7:/Table/1{1-2} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r9:/Table/1{3-4} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r11:/Table/1{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r12:/Table/1{6-7} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r14:/Table/1{8-9} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r16:/Table/2{0-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r18:/Table/2{2-3} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r20:/Table/2{4-5} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r22:/Table/2{6-7} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r23:/Table/2{7-8} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r25:/{Table/29-NamespaceTable/30} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r26:/NamespaceTable/{30-Max} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r27:/{NamespaceTable/Max-Table/32} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r30:/Table/3{4-5} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r36:/Table/4{0-1} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r38:/Table/4{2-3} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r41:/Table/4{5-6} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
Parameters in this failure:

  • TAGS=bazel,gss


@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ 1a91d7cb7b93dfef5dcaf872125875cefa3e0190:

=== RUN   TestRemoveDeadReplicas
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/511bb7a3a798948c9c27fde89a1fae1e/logTestRemoveDeadReplicas2130869374
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestRemoveDeadReplicas
    debug_test.go:383: -- test log scope end --
--- FAIL: TestRemoveDeadReplicas (88.89s)
=== RUN   TestRemoveDeadReplicas/2/4/r=4
    debug_test.go:203: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD - WriteTooOld flag converted to WriteTooOldError): "unnamed" meta={id=38a1d885 key=/Local/Range/System/tsd/RangeDescriptor pri=0.02782531 epo=0 ts=1652163432.185033059,1 min=1652163432.109470599,0 seq=1} lock=true stat=PENDING rts=1652163432.109470599,0 wto=false gul=1652163432.609470599,0
    --- FAIL: TestRemoveDeadReplicas/2/4/r=4 (46.09s)
Parameters in this failure:

  • TAGS=bazel,gss,deadlock


@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ 16c484aa84d3718bfc82557f8e935ab78e6753b6:

=== RUN   TestRemoveDeadReplicas
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/511bb7a3a798948c9c27fde89a1fae1e/logTestRemoveDeadReplicas74826838
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestRemoveDeadReplicas
    debug_test.go:383: -- test log scope end --
--- FAIL: TestRemoveDeadReplicas (125.39s)
=== RUN   TestRemoveDeadReplicas/2/4/r=4
    testcluster.go:141: condition failed to evaluate within 45s: unexpectedly found 1 active spans:
             0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch _unfinished:1 span.kind:server
        goroutine 12259496 [running]:
        runtime/debug.Stack()
        	GOROOT/src/runtime/debug/stack.go:24 +0x65
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x62aa9f8, 0xc001b3e820}, 0x4b37540, 0x6277e00)
        	github.com/cockroachdb/cockroach/pkg/testutils/soon.go:60 +0x5f
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x62aa9f8, 0xc001b3e820}, 0xc001b3e820)
        	github.com/cockroachdb/cockroach/pkg/testutils/soon.go:41 +0x4a
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers(0xc004434580, {0x6277e18, 0xc00012e028})
        	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:141 +0x352
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func2()
        	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:356 +0x2b
        github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close(0xc01199c090)
        	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:107 +0x1a
        github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc01199c090, {0x6277e18, 0xc00012e020})
        	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:563 +0x2cc
        github.com/cockroachdb/cockroach/pkg/cli.TestRemoveDeadReplicas.func1.1(0xc001b3e820, 0xc006844288, 0xc000000004, {0x6277e18, 0xc00012e020})
        	github.com/cockroachdb/cockroach/pkg/cli/debug_test.go:220 +0x49e
        github.com/cockroachdb/cockroach/pkg/cli.TestRemoveDeadReplicas.func1(0xc001b3e820)
        	github.com/cockroachdb/cockroach/pkg/cli/debug_test.go:220 +0x413
        testing.tRunner(0xc001b3e820, 0xc011e8a080)
        	GOROOT/src/testing/testing.go:1259 +0x102
        created by testing.(*T).Run
        	GOROOT/src/testing/testing.go:1306 +0x35a
    --- FAIL: TestRemoveDeadReplicas/2/4/r=4 (75.94s)
Parameters in this failure:

  • TAGS=bazel,gss,deadlock


@tbg added the S-3 label (Medium-low impact: incurs increased costs for some users, incl lower avail, recoverable bad data) May 30, 2022
@cockroach-teamcity
Member Author

cli.TestRemoveDeadReplicas failed with artifacts on master @ 8d34ef1ea15850ee1c70470610b6652df4c317de:

replica has lost quorum, recovering: r12:/Table/1{6-7} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r12:/Table/1{6-7} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r14:/Table/1{8-9} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has lost quorum, recovering: r15:/Table/{19-20} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r15:/Table/{19-20} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r16:/Table/2{0-1} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r17:/Table/2{1-2} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r17:/Table/2{1-2} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r18:/Table/2{2-3} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r20:/Table/2{4-5} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r21:/Table/2{5-6} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r21:/Table/2{5-6} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r22:/Table/2{6-7} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has lost quorum, recovering: r23:/Table/2{7-8} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r23:/Table/2{7-8} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r24:/Table/2{8-9} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r24:/Table/2{8-9} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r26:/NamespaceTable/{30-Max} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has lost quorum, recovering: r27:/{NamespaceTable/Max-Table/32} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r27:/{NamespaceTable/Max-Table/32} [(n2,s2):5, next=6, gen=8]
replica has lost quorum, recovering: r28:/Table/3{2-3} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r28:/Table/3{2-3} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r29:/Table/3{3-4} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has lost quorum, recovering: r30:/Table/3{4-5} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r30:/Table/3{4-5} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r31:/Table/3{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has lost quorum, recovering: r33:/Table/3{7-8} [(n3,s3):4, (n2,s2):2, (n4,s4):3, next=5, gen=8] -> r33:/Table/3{7-8} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r35:/Table/{39-40} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
replica has not lost quorum, skipping: r36:/Table/4{0-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r38:/Table/4{2-3} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r39:/Table/4{3-4} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r40:/Table/4{4-5} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has lost quorum, recovering: r42:/Table/4{6-7} [(n3,s3):4, (n4,s4):2, (n2,s2):3, next=5, gen=8] -> r42:/Table/4{6-7} [(n2,s2):5, next=6, gen=8]
replica has not lost quorum, skipping: r43:/Table/{47-50} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r44:/{Table/50-Max} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
aborting intent: /Local/Range/NamespaceTable/Max/RangeDescriptor (txn 3016be28-7881-4acd-bc96-5812eb316aee)
Scanning replicas on store cluster_id:d8008f3f-ed85-4b0e-bb7f-8adce3231ffb node_id:2 store_id:2  for dead peers []
replica has not lost quorum, skipping: r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r5:/{Systemtse-Table/SystemConfigSpan/Start} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r11:/Table/1{5-6} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
replica has not lost quorum, skipping: r13:/Table/1{7-8} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r14:/Table/1{8-9} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r16:/Table/2{0-1} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r18:/Table/2{2-3} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r20:/Table/2{4-5} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r22:/Table/2{6-7} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r26:/NamespaceTable/{30-Max} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r29:/Table/3{3-4} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=4]
replica has not lost quorum, skipping: r31:/Table/3{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r35:/Table/{39-40} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
replica has not lost quorum, skipping: r36:/Table/4{0-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r37:/Table/4{1-2} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r38:/Table/4{2-3} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r39:/Table/4{3-4} [(n1,s1):5, (n4,s4):2, (n2,s2):3, next=6, gen=12]
replica has not lost quorum, skipping: r40:/Table/4{4-5} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]
replica has not lost quorum, skipping: r43:/Table/{47-50} [(n1,s1):1, (n4,s4):2, (n2,s2):3, next=4, gen=4]
replica has not lost quorum, skipping: r44:/{Table/50-Max} [(n1,s1):5, (n2,s2):2, (n4,s4):3, next=6, gen=12]
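
The recover/skip decisions in the log above are consistent with a simple majority rule: a range has lost quorum when fewer than half-plus-one of its replicas sit on live stores. The following is a minimal sketch of that rule, not the tool's actual implementation. The `Replica` type and `LostQuorum` function are hypothetical names, the dead-store set {3, 4} is inferred from the first scan's output rather than stated in it, and the sketch ignores learners and joint configurations.

```go
package main

import "fmt"

// Replica identifies one member of a range by node and store ID.
// Hypothetical type for illustration only.
type Replica struct {
	NodeID, StoreID int
}

// LostQuorum reports whether fewer than a strict majority of the
// replicas live on stores that are absent from deadStores.
func LostQuorum(replicas []Replica, deadStores map[int]bool) bool {
	live := 0
	for _, r := range replicas {
		if !deadStores[r.StoreID] {
			live++
		}
	}
	quorum := len(replicas)/2 + 1
	return live < quorum
}

func main() {
	// Assumed dead stores for the first scan (inferred from the log).
	dead := map[int]bool{3: true, 4: true}

	r30 := []Replica{{3, 3}, {2, 2}, {4, 4}} // logged as "lost quorum, recovering"
	r31 := []Replica{{1, 1}, {3, 3}, {2, 2}} // logged as "not lost quorum, skipping"

	fmt.Println(LostQuorum(r30, dead)) // true
	fmt.Println(LostQuorum(r31, dead)) // false

	// Second scan: "dead peers []", so nothing has lost quorum.
	fmt.Println(LostQuorum(r30, nil)) // false
}
```

With an empty dead-peer list, as in the second scan, every replica counts as live, so every range is skipped.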

Parameters: TAGS=bazel,gss


tbg added a commit to tbg/cockroach that referenced this issue Jun 27, 2022
Refs: cockroachdb#75133

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
craig bot pushed a commit that referenced this issue Jun 29, 2022
83389: testutils: add `storageutils` test utilities r=nicktrav a=erikgrinaker

This patch adds a bunch of test utilities to `storageutils`, replacing
the old `sstutil` package. This is done to ease testing of MVCC range
keys in tests outside the `storage` package.

Unfortunately, these are mostly duplicates of utilities in `storage`.
Storage tests use the `storage` package rather than `storage_test`, and
can't make use of `storageutils` yet because it causes an import cycle
with `storage`. This will (hopefully) be addressed separately.

Release note: None

83448: cli: skip TestRemoveDeadReplicas r=erikgrinaker a=tbg

Refs: #75133

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

83561: backupccl: deflake TestMetadataSST r=stevendanna a=stevendanna

This is a temporary fix for failures in TestMetadataSST.

First,

    backup_metadata_test.go:49: error executing 'BACKUP TO $1': pq: a
    CCL binary is required to use this statement type: *tree.Backup

is solved by moving the test back into the backupccl package so that
the plan hook for BACKUP is definitely registered.

Second,

    backup_metadata_test.go:89: file /0/BACKUP_MANIFEST does not exist
    in the UserFileTableSystem: external_storage: file doesn't exist

is solved by disabling tenants by default in the setup functions in
backuputils. There is duplication between these functions and the
functions in backupccl, where we already disable tenants. The error
above is likely the result of us directly using the internal executor
of the server to query the userfile storage, which may be incorrect if
we were actually talking to a tenant.

Release note: None

Co-authored-by: Erik Grinaker
Co-authored-by: Tobias Grieger
Co-authored-by: Steven Danna
@erikgrinaker
Contributor

We'll remove this test when we remove the tool itself over in #84807.
