roachtest: failover/chaos/read-write failed #104709
Something seems a bit messed up here. We failed on a [...]; I'll dig into this later.
cc @cockroachdb/replication
roachtest.failover/chaos/read-write failed with artifacts on master @ 9633594f139d80d1555127a09f1ecd12dd1d9b11:
Parameters:
There's a lot of other stuff going on here in the latest failure. The async rangelog updates we added in #102813 appear to be failing reliably during a scatter, with hundreds of these:
We're also seeing tons of campaign failures, which may either be related to the above, or fallout from #104969:
And there's tons of txn heartbeat failures due to using a canceled context (a minimal illustration of this failure mode follows this comment):
And a conf change that's been stalled for over 10 minutes:
I'm going to try to disentangle this mess.
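To make the canceled-context heartbeat failures above concrete, here is a minimal, self-contained illustration in plain Go. This is not CockroachDB code: `heartbeat` is a hypothetical stand-in for a periodic RPC such as the txn heartbeat. The point is only that any operation driven by a context that has already been canceled fails immediately with `context.Canceled`, which matches the errors in the logs.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// heartbeat stands in for a periodic RPC such as a txn heartbeat; the
// name and behavior are illustrative only.
func heartbeat(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // returns context.Canceled if the caller canceled
	case <-time.After(10 * time.Millisecond):
		return nil
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the parent operation finished or failed and canceled its context

	// Any heartbeat issued afterwards fails immediately.
	if err := heartbeat(ctx); err != nil {
		fmt.Println("heartbeat failed:", err) // prints "context canceled"
	}
}
```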
I'm seeing both the campaign failures and the HeartbeatTxn failures prior to both of those PRs, so this isn't new. It would also go some way towards explaining why the rangelog writes fail, if we're temporarily losing Raft leadership. Bumping the timeout to 10 seconds would likely help; I'll give that a try. However, I'm not seeing the "pending configuration changes" error on 23.1, but I am seeing a few similar "9 is unpromotable and can not campaign" errors. I'm not seeing any of the HeartbeatTxn errors on 23.1.
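A rough sketch of that mitigation, assuming the async rangelog write can simply be given a longer deadline so that a transient loss of Raft leadership doesn't cause it to time out. The names here (`withRangeLogTimeout`, `rangeLogWriteTimeout`) are hypothetical and not taken from the CockroachDB source:

```go
package rangelog

import (
	"context"
	"time"
)

// rangeLogWriteTimeout is the proposed, bumped deadline (10s).
const rangeLogWriteTimeout = 10 * time.Second

// withRangeLogTimeout runs an async rangelog write under the longer
// deadline instead of inheriting a shorter one from its caller.
func withRangeLogTimeout(ctx context.Context, write func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, rangeLogWriteTimeout)
	defer cancel()
	return write(ctx)
}
```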
roachtest.failover/chaos/read-write failed with artifacts on master @ 1eb628e8c8a7eb1cbf9bfa2bd6c31982c25cbba0:
Parameters:
Last failure is #105274.
roachtest.failover/chaos/read-write failed with artifacts on master @ cb6dd27fdfbfc8a9d0d4753809f4588bae8060d0:
Parameters:
Opened #105413 for the rangelog timeouts.
roachtest.failover/chaos/read-write failed with artifacts on master @ 7fd4c21157221eae9e7d5892d89d2b5a671aba3e:
Parameters:
roachtest.failover/chaos/read-write failed with artifacts on master @ 0b0a212ecbe514b55301f4a963edc92b16de9bab:
Parameters:
Because etcd/raft activates configuration changes when they are applied, but checks new proposed configs before they are considered for adding to the log (or forwarding to the leader), the following can happen:

- conf change 1 gets evaluated on a leaseholder n1
- lease changes
- new leaseholder evaluates and commits conf change 2
- n1 receives and applies conf change 2
- conf change 1 gets added to the proposal buffer and flushed; RawNode rejects it because conf change 1 is not compatible on top of conf change 2

Prior to this commit, because raft silently replaces the conf change with an empty entry, we would never see the proposal below raft (where it would be rejected due to the lease change). In effect, this caused replica unavailability because the proposal and the associated latch would stick around forever, and the replica circuit breaker would trip.

This commit provides a targeted fix: when the proposal buffer flushes a conf change to raft, we check if it got replaced with an empty entry. If so, we properly finish the proposal. To be conservative, we signal it with an ambiguous result: it seems conceivable that the rejection would only occur on a reproposal, while the original proposal made it into raft before the lease change, and the local replica is in fact behind on conf changes rather than ahead (which can happen if it's a follower). The only "customer" here is the replicate queue (and scatter, etc.), so this is acceptable; any choice here would necessarily be a "hard error" anyway.

Epic: CRDB-25287

Release note (bug fix): under rare circumstances, a replication change could get stuck when proposed near lease/leadership changes (and likely under overload), and the replica circuit breakers could trip. This problem has been addressed.

Fixes cockroachdb#105797.
Closes cockroachdb#104709.
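A hedged sketch of the idea behind this fix; the types and names below are illustrative only and do not mirror the actual proposal-buffer code. The key observation is that etcd/raft substitutes an empty normal entry for a conf change it rejects, so the proposal buffer can detect that substitution and finish the proposal with an ambiguous result instead of leaving it (and its latches) pending forever:

```go
package propbuf

import "errors"

// entry mirrors the shape of a raft log entry for this sketch.
type entry struct {
	isConfChange bool
	data         []byte
}

// errAmbiguousConfChange stands in for the ambiguous-result error used to
// signal that the conf change may or may not have been applied elsewhere.
var errAmbiguousConfChange = errors.New("conf change rejected by raft; result ambiguous")

// finishIfReplaced inspects the entry that raft actually appended for a
// proposal that was submitted as a conf change. Because etcd/raft replaces
// a rejected conf change with an empty normal entry, an appended entry that
// is no longer a conf change (and carries no data) means the proposal will
// never be applied and must be finished now.
func finishIfReplaced(proposedConfChange bool, appended entry, finish func(error)) bool {
	if proposedConfChange && !appended.isConfChange && len(appended.data) == 0 {
		finish(errAmbiguousConfChange)
		return true
	}
	return false
}
```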
102248: kvserver: tighten and de-fatal replicaGC checks r=erikgrinaker a=tbg

Now we return most failed checks as errors and report errors to sentry (but without crashing).

Closes #94813.
Epic: None
Release note: None

106104: kvserver: fail stale ConfChange when rejected by raft r=erikgrinaker a=tbg

(Commit message as quoted above.)

Fixes #105797.
Closes #104709.

Co-authored-by: Tobias Grieger <[email protected]>
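For the 102248 change, a minimal sketch of the "de-fatal" pattern it describes, assuming a generic error-reporting hook. The names (`reportToSentry`, `checkReplicaDestroyed`) are hypothetical and not the actual replicaGC code; the point is that a failed invariant check is returned as an error and reported rather than crashing the process:

```go
package replicagc

import (
	"context"
	"fmt"
)

// reportToSentry is a stand-in for whatever error-reporting hook the real
// code uses; it only needs to record the error without crashing.
func reportToSentry(ctx context.Context, err error) {
	_ = ctx
	_ = err // a real implementation would send the error to Sentry here
}

// checkReplicaDestroyed illustrates the de-fatal pattern: the invariant
// violation becomes an error the caller can handle, and is also reported.
func checkReplicaDestroyed(ctx context.Context, destroyed bool) error {
	if !destroyed {
		err := fmt.Errorf("replicaGC: replica unexpectedly not destroyed")
		reportToSentry(ctx, err) // previously this would have been a fatal error
		return err
	}
	return nil
}
```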
roachtest.failover/chaos/read-write failed with artifacts on master @ 6e38f676273d08b3a111213d6c9f7945629228d7:
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_cpu=2
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=false
- ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-28685
Epic CRDB-27234