roachtest: failover/chaos/read-write failed #104709

Closed
cockroach-teamcity opened this issue Jun 11, 2023 · 13 comments · Fixed by #106104
Assignees: erikgrinaker
Labels: branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone: 23.1

Comments

@cockroach-teamcity (Member)

cockroach-teamcity commented Jun 11, 2023

roachtest.failover/chaos/read-write failed with artifacts on master @ 6e38f676273d08b3a111213d6c9f7945629228d7:

test artifacts and logs in: /artifacts/failover/chaos/read-write/run_1
(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1379
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:288
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	pq: replica unavailable: (n4,s4):11 unable to serve request to r792:/Table/106/1/-3{22495525764153008-04067210006201408} [(n4,s4):11, (n8,s8):10, (n3,s3):9, (n7,s7):6, (n6,s6):7, next=12, gen=76, sticky=9223372036.854775807,2147483647]: closed timestamp: 1686489678.462266747,0 (2023-06-11 13:21:18); raft status: {"id":"b","term":14,"vote":"b","commit":1205,"lead":"b","raftState":"StateLeader","applied":1205,"progress":{"6":{"match":1205,"next":1206,"state":"StateReplicate"},"7":{"match":1205,"next":1206,"state":"StateReplicate"},"9":{"match":1205,"next":1206,"state":"StateReplicate"},"a":{"match":1205,"next":1206,"state":"StateReplicate"},"b":{"match":1205,"next":1206,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /Local/Range/Table/106/1/-322495525764153008/RangeDescriptor@0,0
	Test:       	failover/chaos/read-write
(require.go:1360).NoError: FailNow called
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called
(cluster.go:2276).Run: cluster.RunE: context canceled
(cluster.go:2276).Run: cluster.RunE: context canceled
(cluster.go:2276).Run: cluster.RunE: context canceled
(cluster.go:2276).Run: cluster.RunE: context canceled

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-28685

Epic CRDB-27234

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jun 11, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 11, 2023
@erikgrinaker (Contributor)

Something seems a bit messed up here. We failed on a deadlockFailer.Ready() while executing a SHOW CLUSTER RANGES WITH DETAILS following a concurrent disk stall and deadlock failure, due to a replica circuit breaker on r792. However, r792 wasn't involved in the deadlock failure, and we shouldn't have lost quorum on it here either. The range had a leader, and all followers appeared caught up.

I'll dig into this later.

13:19:27 failover.go:312: failing n8 (disk-stall)
13:19:28 failover.go:312: failing n6 (deadlock)
13:19:28 failover.go:1424: locked r364 on n6
13:19:28 failover.go:1424: locked r1086 on n6
13:20:28 failover.go:320: recovering n6 (deadlock)
13:20:28 failover.go:1455: unlocked r364 on n6
13:20:28 failover.go:1455: unlocked r1086 on n6
13:20:28 failover.go:320: recovering n8 (disk-stall)
13:20:29 cluster.go:859: test status: stopping nodes :8
teamcity-10484309-1686462469-66-n10cpu2: stopping
13:20:30 cluster.go:859: test status: starting nodes :8
13:20:30 cockroach.go:184: teamcity-10484309-1686462469-66-n10cpu2: starting nodes
13:21:36 test_impl.go:349: test failure #1: full stack retained in failure_1.log: (assertions.go:333).Fail:
        Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1379
                                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:288
                                                main/pkg/cmd/roachtest/monitor.go:105
                                                golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
                                                GOROOT/src/runtime/asm_amd64.s:1594
        Error:          Received unexpected error:
                        pq: replica unavailable: (n4,s4):11 unable to serve request to r792:/Table/106/1/-3{22495525764153008-04067210006201408} [(n4,s4):11, (n8,s8):10, (n3,s3):9, (n7,s7):6, (n6,s6):7, next=12, gen=76, sticky=9223372036.854775807,2147483647]: closed timestamp: 1686489678.462266747,0 (2023-06-11 13:21:18); raft status: {"id":"b","term":14,"vote":"b","commit":1205,"lead":"b","raftState":"StateLeader","applied":1205,"progress":{"6":{"match":1205,"next":1206,"state":"StateReplicate"},"7":{"match":1205,"next":1206,"state":"StateReplicate"},"9":{"match":1205,"next":1206,"state":"StateReplicate"},"a":{"match":1205,"next":1206,"state":"StateReplicate"},"b":{"match":1205,"next":1206,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /Local/Range/Table/106/1/-322495525764153008/RangeDescriptor@0,0
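
For reference, the readiness check that failed at failover.go:1379 was executing the SHOW CLUSTER RANGES WITH DETAILS statement mentioned above. A minimal sketch of issuing that statement from Go (connection details and structure are illustrative, not the actual roachtest helper):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // the "pq:" prefix on the error indicates lib/pq
)

func main() {
	// Placeholder connection string; the real test connects to the roachprod cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// This is the statement that surfaced the "replica unavailable ... poisoned
	// latch" error above: it scans range metadata cluster-wide, so a single
	// range with a tripped circuit breaker fails the whole statement.
	rows, err := db.Query(`SHOW CLUSTER RANGES WITH DETAILS`)
	if err != nil {
		log.Fatalf("ranges query failed: %v", err)
	}
	defer rows.Close()

	n := 0
	for rows.Next() {
		n++
	}
	if err := rows.Err(); err != nil {
		log.Fatalf("ranges scan failed: %v", err)
	}
	fmt.Printf("scanned %d ranges\n", n)
}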

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 12, 2023
@erikgrinaker erikgrinaker self-assigned this Jun 12, 2023
@exalate-issue-sync exalate-issue-sync bot added T-kv-replication and removed T-kv KV Team labels Jun 12, 2023
@blathers-crl (bot)

blathers-crl bot commented Jun 12, 2023

cc @cockroachdb/replication

@cockroach-teamcity (Member Author)

roachtest.failover/chaos/read-write failed with artifacts on master @ 9633594f139d80d1555127a09f1ecd12dd1d9b11:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1379
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:288
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	pq: replica unavailable: (n7,s7):6 unable to serve request to r766:/Table/106/1/28{65603100361473792-84031416119425392} [(n9,s9):7, (n5,s5):2, (n4,s4):13, (n6,s6):12, (n7,s7):6, next=14, gen=88, sticky=9223372036.854775807,2147483647]: closed timestamp: 1687179324.822085781,0 (2023-06-19 12:55:24); raft status: {"id":"6","term":15,"vote":"6","commit":1887,"lead":"6","raftState":"StateLeader","applied":1887,"progress":{"6":{"match":1887,"next":1888,"state":"StateReplicate"},"7":{"match":1887,"next":1888,"state":"StateReplicate"},"c":{"match":1887,"next":1888,"state":"StateReplicate"},"d":{"match":1887,"next":1888,"state":"StateReplicate"},"2":{"match":1887,"next":1888,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /Local/Range/Table/106/1/2865603100361473792/RangeDescriptor@0,0
	Test:       	failover/chaos/read-write
(require.go:1360).NoError: FailNow called
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-write/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker (Contributor)

erikgrinaker commented Jun 20, 2023

There's a lot of other stuff going on here in the latest failure. The async rangelog updates we added in #102813 appear to be failing reliably during a scatter, with hundreds of these:

W230619 12:37:02.939045 45952 kv/kvserver/range_log.go:254 ⋮ [n7,s7,r1064/2:‹/Table/106/1/20{17900…-36328…}›,*kvpb.AdminScatterRequest] 1685  error logging to system.rangelog: operation ‹"rangelog-timeout"› timed out after 5s (given timeout 5s): aborted during DistSender.Send: context deadline exceeded
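
A minimal sketch of the failure mode, assuming a helper shaped roughly like the async rangelog write from #102813 (the names here are illustrative, not the actual kvserver code):

package main

import (
	"context"
	"log"
	"time"
)

// rangeLogTimeout mirrors the "given timeout 5s" in the log line above.
const rangeLogTimeout = 5 * time.Second

// logRangeEventAsync is an illustrative stand-in for the async rangelog write:
// the write to system.rangelog runs in the background under a fixed timeout,
// and a slow KV write (e.g. while raft leadership is moving around during
// chaos) is abandoned with only a warning, which is what fills the logs above.
func logRangeEventAsync(ctx context.Context, writeEvent func(context.Context) error) {
	go func() {
		ctx, cancel := context.WithTimeout(ctx, rangeLogTimeout)
		defer cancel()
		if err := writeEvent(ctx); err != nil {
			log.Printf("error logging to system.rangelog: %v", err)
		}
	}()
}

func main() {
	// Simulate a rangelog write that takes longer than the 5s budget.
	slowWrite := func(ctx context.Context) error {
		select {
		case <-time.After(10 * time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	logRangeEventAsync(context.Background(), slowWrite)
	time.Sleep(6 * time.Second) // let the background write hit its deadline
}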

We're also seeing tons of campaign failures, which may either be related to the above or be fallout from #104969:

W230619 12:37:03.185264 156 go.etcd.io/raft/v3/raft.go:917 ⋮ [T1,n7,s7,r364/3:‹/Table/106/1/-32{8945…-7102…}›] 1760  3 cannot campaign at term 9 since there are still pending configuration changes to apply

And there are tons of txn heartbeat failures due to use of a canceled context:

I230619 12:36:59.982466 46378 server/node.go:1317 ⋮ [T1,n7] 1352  batch request ‹HeartbeatTxn [/Local/Range/Table/106/1/-8578380985326469808/RangeDescriptor,/Min), [txn: c9232c77]› failed with error: ‹result is ambiguous: after 0.02s of attempting command: context canceled›
I230619 12:36:59.982466 46378 server/node.go:1317 ⋮ [T1,n7] 1352 +trace:
I230619 12:36:59.982466 46378 server/node.go:1317 ⋮ [T1,n7] 1352 +‹<empty recording>›

And a conf change that's been stalled for over 10 minutes:

W230619 12:53:41.960507 184 kv/kvserver/replica_raft.go:1528 ⋮ [T1,n7,s7,r766/6:‹/Table/106/1/28{65603…-84031…}›,raft] 24131  have been waiting 974.50s for slow proposal EndTxn(commit change-replicas) [/Local/Range/Table/106/1/‹2865603100361473792›/‹RangeDescriptor›], [txn: d335bbe7]

I'm going to try to disentangle this mess.

@erikgrinaker (Contributor)

erikgrinaker commented Jun 20, 2023

I'm seeing both the campaign failures and HeartbeatTxn failures prior to both of those PRs, so it's not new. This would also go some way towards explaining why the rangelog writes fail, if we're temporarily losing Raft leadership. Bumping the timeout to 10 seconds would likely help; I'll give that a try.

However, I'm not seeing the "pending configuration changes" error on 23.1, but I am seeing a few similar "9 is unpromotable and can not campaign" errors. I'm not seeing any of the HeartbeatTxn errors on 23.1.

@cockroach-teamcity (Member Author)

roachtest.failover/chaos/read-write failed with artifacts on master @ 1eb628e8c8a7eb1cbf9bfa2bd6c31982c25cbba0:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1379
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:288
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	pq: error getting span statistics - number of spans in request payload (1059) exceeds 'server.span_stats.span_batch_limit' cluster setting limit (500)
	Test:       	failover/chaos/read-write
(require.go:1360).NoError: FailNow called
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-write/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker (Contributor)

Last failure is #105274.
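
As a workaround sketch only (not the fix tracked in #105274): the limit named in that error is an ordinary cluster setting and could be raised from SQL, e.g.:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder connection string.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raise the batch limit named in the error above; the value 2000 is
	// arbitrary, just comfortably above the ~1000-span requests seen here.
	if _, err := db.Exec(`SET CLUSTER SETTING server.span_stats.span_batch_limit = 2000`); err != nil {
		log.Fatal(err)
	}
}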

@erikgrinaker (Contributor)

Split out #105378 for the HeartbeatTxn trace logging; it's fallout from #102793.

@cockroach-teamcity (Member Author)

roachtest.failover/chaos/read-write failed with artifacts on master @ cb6dd27fdfbfc8a9d0d4753809f4588bae8060d0:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1408
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:299
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	pq: error getting span statistics - number of spans in request payload (1064) exceeds 'server.span_stats.span_batch_limit' cluster setting limit (500)
	Test:       	failover/chaos/read-write
(require.go:1360).NoError: FailNow called
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-write/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker (Contributor)

Opened #105413 for the rangelog timeouts.

@cockroach-teamcity (Member Author)

roachtest.failover/chaos/read-write failed with artifacts on master @ 7fd4c21157221eae9e7d5892d89d2b5a671aba3e:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1408
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:299
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	pq: error getting span statistics - number of spans in request payload (1064) exceeds 'server.span_stats.span_batch_limit' cluster setting limit (500)
	Test:       	failover/chaos/read-write
(require.go:1360).NoError: FailNow called
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-write/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity (Member Author)

roachtest.failover/chaos/read-write failed with artifacts on master @ 0b0a212ecbe514b55301f4a963edc92b16de9bab:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1429
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:302
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	pq: error requesting iterating nodes for span stats from node 8 (NODE_STATUS_LIVE): rpc error: code = Unknown desc = replica unavailable: (n6,s6):7 unable to serve request to r1039:/Table/106/1/65{32837936193842192-51266251951793792} [(n3,s3):13, (n6,s6):7, (n8,s8):11, (n4,s4):14, (n9,s9):5, next=15, gen=92, sticky=9223372036.854775807,2147483647]: closed timestamp: 1687871114.188750889,0 (2023-06-27 13:05:14); raft status: {"id":"7","term":13,"vote":"7","commit":959,"lead":"7","raftState":"StateLeader","applied":959,"progress":{"5":{"match":959,"next":960,"state":"StateReplicate"},"7":{"match":959,"next":960,"state":"StateReplicate"},"b":{"match":959,"next":960,"state":"StateReplicate"},"d":{"match":959,"next":960,"state":"StateReplicate"},"e":{"match":959,"next":960,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /Local/Range/Table/106/1/6532837936193842192/RangeDescriptor@0,0
	Test:       	failover/chaos/read-write
(require.go:1360).NoError: FailNow called
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
(cluster.go:2279).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-write/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker (Contributor)

erikgrinaker commented Jun 29, 2023

The last failure is caused in part by circuit breakers tripping on a reproposal that never goes through and staying tripped, even though the Raft group is fully functional: #105797.

We also see it seemingly poisoning a meta2 latch outside of the replica's keyspan: #105798.

@erikgrinaker erikgrinaker added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 29, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jul 4, 2023
Because etcd/raft activates configuration changes when they are applied, but checks new proposed configs before they are considered for adding to the log (or forwarding to the leader), the following can happen:

- conf change 1 gets evaluated on a leaseholder n1
- lease changes
- new leaseholder evaluates and commits conf change 2
- n1 receives and applies conf change 2
- conf change 1 gets added to the proposal buffer and flushed; RawNode rejects
  it because conf change 1 is not compatible on top of conf change 2

Prior to this commit, because raft silently replaces the conf change with an
empty entry, we would never see the proposal below raft (where it would be
rejected due to the lease change). In effect, this caused replica unavailability
because the proposal and the associated latch would stick around forever, and the
replica circuit breaker would trip.

This commit provides a targeted fix: when the proposal buffer flushes a conf
change to raft, we check if it got replaced with an empty entry. If so, we
properly finish the proposal. To be conservative, we signal it with an ambiguous
result: it seems conceivable that the rejection would only occur on a
reproposal, while the original proposal made it into raft before the lease
change, and the local replica is in fact behind on conf changes rather than
ahead (which can happen if it's a follower). The only "customer" here is the
replicate queue (and scatter, etc) so this is acceptable; any choice here would
necessarily be a "hard error" anyway.

Epic: CRDB-25287
Release note (bug fix): under rare circumstances, a replication change could get
stuck when proposed near lease/leadership changes (and likely under overload),
and the replica circuit breakers could trip. This problem has been addressed.

Fixes cockroachdb#105797.
Closes cockroachdb#104709.
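
A rough sketch of the check this commit message describes, assuming a hypothetical hook in the proposal-buffer flush path; apart from the go.etcd.io/raft/v3/raftpb types, the names below are illustrative and not the actual CockroachDB code:

package main

import (
	"errors"
	"fmt"

	"go.etcd.io/raft/v3/raftpb"
)

// confChangeWasReplaced reports whether raft swapped a proposed conf change
// for an empty normal entry, which is how etcd/raft drops a conf change that
// is no longer compatible with the currently applied configuration.
func confChangeWasReplaced(ent raftpb.Entry) bool {
	return ent.Type == raftpb.EntryNormal && len(ent.Data) == 0
}

// errStaleConfChange stands in for the ambiguous result the commit message
// describes; the real CockroachDB error type differs.
var errStaleConfChange = errors.New(
	"conf change rejected by raft, possibly superseded by a newer config; result is ambiguous")

// maybeFailStaleConfChange is a hypothetical hook for the point where the
// proposal buffer has just flushed a conf change to raft: if raft replaced it
// with an empty entry, finish the proposal now instead of leaving it (and its
// latches) pending until the replica circuit breaker trips.
func maybeFailStaleConfChange(flushed raftpb.Entry, finish func(error)) {
	if confChangeWasReplaced(flushed) {
		finish(errStaleConfChange)
	}
}

func main() {
	// An entry that raft emptied out after rejecting the conf change.
	rejected := raftpb.Entry{Type: raftpb.EntryNormal, Index: 1205, Term: 14}
	maybeFailStaleConfChange(rejected, func(err error) {
		fmt.Println("finishing proposal early:", err)
	})
}
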
tbg added a commit to tbg/cockroach that referenced this issue Jul 4, 2023, with the same commit message as above.

craig bot pushed a commit that referenced this issue Jul 5, 2023
102248: kvserver: tighten and de-fatal replicaGC checks r=erikgrinaker a=tbg

Now we return most failed checks as errors and report errors to sentry (but
without crashing).

Closes #94813.

Epic: None
Release note: None


106104: kvserver: fail stale ConfChange when rejected by raft r=erikgrinaker a=tbg

(Same commit message as the tbg commit above; Fixes #105797, Closes #104709.)

Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 93117db Jul 5, 2023
tbg added two commits to tbg/cockroach that referenced this issue Jul 5, 2023, with the same commit message as above.