Fix Snapshot Corruption in Edge Case #47552
Conversation
This fixes shard snapshots not being marked as failures when multiple data nodes are lost during the snapshot process, or when shard snapshot failures have occurred before a node left the cluster. The problem was that we were simply not adding any shard entries for completed shards on node-left events. This has no effect for a successful shard, but for a failed shard it would lead to that shard not being marked as failed during snapshot finalization. Fixed by correctly keeping track of all previously completed shard states in this case as well. Also added an assertion that, without this fix, would trip on almost every run of the resiliency tests, and adjusted the serialization of SnapshotsInProgress.Entry so that we have a proper assertion message.

As far as I can tell, this bug exists way back to at least v6.5. In practice it is not so severe, as it only corrupts PARTIAL snapshots. But if the shards incorrectly marked SUCCESS make up a complete index, then the snapshot gives the false impression of being restorable. Also, snapshot status APIs that load the snap- blob for these shards will fail because the snap- blob isn't there as expected. Not sure this justifies a backport to 6.8.x, but it seems like it's severe enough? Backporting this would be a little tricky, since ideally we'd add a new test for this using a real IT, because we don't have the resiliency tests in 6.8.x.

Relates #47550 (not closing since the issue that the test isn't 100% deterministic remains)
Also relates #46250, for which this issue would be catastrophic.
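As a rough illustration of the fix described above, here is a minimal, self-contained Java sketch with hypothetical ShardStatus/ShardState stand-ins (not the actual SnapshotsService/SnapshotsInProgress code, which uses different types): entries that are already completed, whether SUCCESS or FAILED, are carried over unchanged when the shards map is rebuilt on a node-left event, instead of being dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for the real SnapshotsInProgress internals.
enum ShardState {
    INIT, WAITING, SUCCESS, FAILED;
    boolean completed() { return this == SUCCESS || this == FAILED; }
}

record ShardStatus(String nodeId, ShardState state) {}

class NodeLeftSketch {
    /**
     * Rebuild the per-shard snapshot status map after a node left the cluster.
     * The bug was equivalent to only handling shards that were still in progress
     * and silently dropping already-completed (notably FAILED) entries; the fix
     * is to copy completed entries over so finalization still sees the failures.
     */
    static Map<String, ShardStatus> shardsAfterNodeLeft(Map<String, ShardStatus> shards, String removedNode) {
        Map<String, ShardStatus> updated = new HashMap<>();
        for (Map.Entry<String, ShardStatus> entry : shards.entrySet()) {
            ShardStatus status = entry.getValue();
            if (status.state().completed()) {
                // Fix: keep previously completed shard states (SUCCESS *and* FAILED).
                updated.put(entry.getKey(), status);
            } else if (removedNode.equals(status.nodeId())) {
                // Shard was still running on the lost node: mark it failed.
                updated.put(entry.getKey(), new ShardStatus(null, ShardState.FAILED));
            } else {
                // Shard is unaffected by the node leaving.
                updated.put(entry.getKey(), status);
            }
        }
        return updated;
    }
}
```

Before the fix, the completed-but-failed entries were effectively missing from the rebuilt map, so snapshot finalization saw no failures to record for those shards.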
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Jenkins run elasticsearch-ci/1
great find
@@ -829,6 +832,8 @@ public void onFailure(Exception e) {
    }
}, updatedSnapshot.getRepositoryStateId(), false);
}
assert updatedSnapshot.shards().size() == snapshot.shards().size()
Can we assert in the constructor of SnapshotsInProgress.Entry that for every entry in "indices" there is at least one entry in the shards map and vice versa?
Yeah, that sounds good. Maybe do it in a follow-up, since it requires a bit of a rewrite of org.elasticsearch.snapshots.SnapshotsInProgressSerializationTests#randomSnapshot (it creates all kinds of bogus instances where the indices don't match up), and it might be nice to have this change isolated?
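For illustration, a minimal sketch of what such a constructor assertion could look like, using simplified plain-Java types rather than the real SnapshotsInProgress.Entry signature (which works with IndexId/ShardId and different collections):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified sketch of the proposed consistency assertion; not the actual
// SnapshotsInProgress.Entry implementation.
final class EntrySketch {
    private final List<String> indices;
    private final Map<String, String> shards; // key: "index/shardId", value: shard status

    EntrySketch(List<String> indices, Map<String, String> shards) {
        // Runs only when assertions are enabled, mirroring the usual assert-helper pattern.
        assert assertShardsConsistent(indices, shards);
        this.indices = indices;
        this.shards = shards;
    }

    private static boolean assertShardsConsistent(List<String> indices, Map<String, String> shards) {
        Set<String> declaredIndices = Set.copyOf(indices);
        Set<String> indicesInShards = shards.keySet().stream()
            .map(key -> key.substring(0, key.indexOf('/')))
            .collect(Collectors.toSet());
        // Every declared index must have at least one shard entry and vice versa.
        assert declaredIndices.equals(indicesInShards)
            : "indices in shards " + indicesInShards + " differ from declared indices " + declaredIndices;
        return true;
    }
}
```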
ok
I also think this should go to 6.8
Jenkins run elasticsearch-ci/1 (failed to build some Docker image)
Thanks @ywelsch!
Assert that the given input shards and indices are consistent. Also, fixed the equality check for SnapshotsInProgress. Before this change the tests never had more than a single waiting shard per index, so they never failed as a result of the waiting shards list not being ordered. Follow-up to elastic#47552
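The ordering pitfall mentioned above can be illustrated with a small, self-contained sketch (hypothetical shard numbers, not the actual SnapshotsInProgress fields): plain list equality is order-sensitive, so two logically equivalent collections of waiting shards compare as unequal unless the lists are built in a deterministic order or compared order-insensitively.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the ordering issue in an equality check:
// the same waiting shards, collected in a different order.
public class WaitingShardsEqualitySketch {
    public static void main(String[] args) {
        List<Integer> waitingA = List.of(0, 1, 2);
        List<Integer> waitingB = List.of(2, 1, 0);

        // Plain list equality is order-sensitive and reports a difference
        // even though the same shards are waiting.
        System.out.println(waitingA.equals(waitingB)); // false

        // Comparing as sets (or sorting both lists first) is order-insensitive.
        System.out.println(Set.copyOf(waitingA).equals(Set.copyOf(waitingB))); // true
    }
}
```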
Adding a specific integration test that reproduces the problem fixed in elastic#47552. The issue otherwise only reproduces in the snapshot resiliency tests, which are not available in 6.8, where the fix is being backported as well.
This pull request is a backport of two closely related pull requests: elastic/elasticsearch#47552 and elastic/elasticsearch#47598. The purpose of this pull request is to track shard snapshots and mark them as failed when either multiple data nodes are lost during the snapshot process or shard snapshot failures occur before a node leaves the cluster. Before, this was not the case, so a failed shard would not have been marked as failed during snapshot finalization. The problem is fixed by correctly keeping track of all previously completed shard states in this case as well, and by adding a consistency assertion to SnapshotsInProgress. More details can be found in the original PRs mentioned above. (cherry picked from commit ed3aea5)