-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: clearrange/checks=false failed #38772
Comments
Tons of RocksDB stalls like:
@ajkr noticed this in #38095 (comment). I closed that issue because part of it was fixed, but this still needs to be tracked. |
SHA: https://github.com/cockroachdb/cockroach/commits/1ca35fc4a0e2665e7f6efd945e65a0db97984fa7 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1396096&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/22d48caaa7d39efdcef7b3c87a99fc421e1473af Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1397412&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/7111a67b2ea3a19c2f312f8d214b8823f431cac0 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1400942&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/9078c4e63c1bff1c3d220ee216000b0903dd4d65 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1406479&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/cfdaadc3514e7e8660f6c009ba159fdfd604f0a8 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1409070&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/65055d6c16bf9386d8c4f4f9cd23e0a848814dc9 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1411157&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/92fef12128c997233d985d1c19e11faac005073f Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1413388&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/da56c792e968574b8f1d9ef3fdb45d56a530221a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1415578&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/5bd37e8eb58ca66b9293c234bc572411057fec3a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417287&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/175c5ada040fd0cbbf178636b1c551d5c2229ec4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417597&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/3b9a95bd7eb2cfa6d544fe7217852a85ec3b76f4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1422703&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/3db89b230b0c41e399354cbeb78c1e82c8e30004 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1424320&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/51a6fdedf0ce1d1329d40d801a7deaf8206b6b07 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1428934&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/51a6fdedf0ce1d1329d40d801a7deaf8206b6b07 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1436116&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/01ee0704865391599abef3bbc89f462117f8007a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1445527&tab=buildLog
|
I backported cockroachdb/rocksdb#43 and cockroachdb/rocksdb#42 to RocksDB 5.17.2 and the import is chugging along now. #43 in particular could explain the consistency checker issue on 5.17.2. I'll need to run this a bunch of times to be sure, though. |
It's possible that there was some surprising interaction between ingested ssts and ssts that came originally from the memtable, but note that neither here nor in #40213 did the dropped range deletion tombstone come from an ingested sst. |
I think I might have confused the consistencyChecker complaining about stats needing to be refreshed with an actual consistency failure. |
SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1488785&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1489712&tab=buildLog
|
Picks up cockroachdb/rocksdb#56. Release justification: This feature can cause a corruption where keys deleted by range tombstones reappear (see cockroachdb#38772 and cockroachdb#40213), so it's important we revert it. Release note: None
40899: c-deps: bump rocksdb to revert compaction snapshot refresh r=ajkr a=ajkr Picks up cockroachdb/rocksdb#56. Release justification: This feature can cause a corruption where keys deleted by range tombstones reappear (see #38772 and #40213), so it's important we revert it. Release note: None Co-authored-by: Andrew Kryczka <[email protected]>
Thanks for narrowing down the consistency check failure to a range deletion bug, @nvanbenschoten. I never would have figured that out. |
Thank you for taking it the rest of the way! It feels good to knock down two release blockers with one stone. |
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Fix crashes by preventing replica ID change.
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Fix crashes by preventing replica ID change.
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Fix crashes by preventing replica ID change.
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
SHA: https://github.com/cockroachdb/cockroach/commits/8c6fdc64908a13291e4ddc5d233bbbaa379e71a2
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1378458&tab=buildLog
The text was updated successfully, but these errors were encountered: