Fix Snapshot Corruption in Edge Case #47552
Conversation
This fixes shard snapshots not being marked as failures when multiple data nodes are lost during the snapshot process, or when shard snapshot failures have occurred before a node left the cluster. The problem was that we were simply not adding any shard entries for completed shards on node-left events. This has no effect for a successful shard, but for a failed shard it would lead to that shard not being marked as failed during snapshot finalization. Fixed by correctly keeping track of all previously completed shard states in this case as well. Also added an assertion that, without this fix, would trip on almost every run of the resiliency tests, and adjusted the serialization of SnapshotsInProgress.Entry so that we have a proper assertion message.

As far as I can tell, this bug exists way back to at least v6.5. In practice it is not so severe, as it only corrupts PARTIAL snapshots. But if the shards incorrectly marked SUCCESS make up a complete index, then the snapshot gives the false impression of being restorable. Also, snapshot status APIs that load the snap- blob for these shards will fail because the snap- blob isn't there as expected. Not sure this justifies a backport to 6.8.x, but it seems like it's severe enough? Backporting this would be a little tricky, since ideally we'd add a new test for this using a real IT, because we don't have the resiliency tests in 6.8.x.

Relates #47550 (not closing since the issue that the test isn't 100% deterministic remains)
Also relates #46250, for which this issue would be catastrophic.
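As a rough illustration of the fix described above, here is a minimal, self-contained Java sketch with hypothetical ShardStatus/ShardState stand-ins (not the actual SnapshotsService/SnapshotsInProgress code, which uses different types): entries that are already completed, whether SUCCESS or FAILED, are carried over unchanged when the shards map is rebuilt on a node-left event, instead of being dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for the real SnapshotsInProgress internals.
enum ShardState {
    INIT, WAITING, SUCCESS, FAILED;
    boolean completed() { return this == SUCCESS || this == FAILED; }
}

record ShardStatus(String nodeId, ShardState state) {}

class NodeLeftSketch {
    /**
     * Rebuild the per-shard snapshot status map after a node left the cluster.
     * The bug was equivalent to only handling shards that were still in progress
     * and silently dropping already-completed (notably FAILED) entries; the fix
     * is to copy completed entries over so finalization still sees the failures.
     */
    static Map<String, ShardStatus> shardsAfterNodeLeft(Map<String, ShardStatus> shards, String removedNode) {
        Map<String, ShardStatus> updated = new HashMap<>();
        for (Map.Entry<String, ShardStatus> entry : shards.entrySet()) {
            ShardStatus status = entry.getValue();
            if (status.state().completed()) {
                // Fix: keep previously completed shard states (SUCCESS *and* FAILED).
                updated.put(entry.getKey(), status);
            } else if (removedNode.equals(status.nodeId())) {
                // Shard was still running on the lost node: mark it failed.
                updated.put(entry.getKey(), new ShardStatus(null, ShardState.FAILED));
            } else {
                // Shard is unaffected by the node leaving.
                updated.put(entry.getKey(), status);
            }
        }
        return updated;
    }
}
```

Before the fix, the completed-but-failed entries were effectively missing from the rebuilt map, so snapshot finalization saw no failures to record for those shards.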
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Jenkins run elasticsearch-ci/1
great find
@@ -829,6 +832,8 @@ public void onFailure(Exception e) {
    }
}, updatedSnapshot.getRepositoryStateId(), false);
}
assert updatedSnapshot.shards().size() == snapshot.shards().size()
Can we assert in the constructor of SnapshotsInProgress.Entry that for every entry in "indices" there is at least one entry in the shards map and vice versa?
Yeah, that sounds good. Maybe do it in a follow-up, since it requires a bit of a rewrite of org.elasticsearch.snapshots.SnapshotsInProgressSerializationTests#randomSnapshot (it creates all kinds of bogus instances where the indices don't match up), and it might be nice to have this change isolated?
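For illustration, a minimal sketch of what such a constructor assertion could look like, using simplified plain-Java types rather than the real SnapshotsInProgress.Entry signature (which works with IndexId/ShardId and different collections):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified sketch of the proposed consistency assertion; not the actual
// SnapshotsInProgress.Entry implementation.
final class EntrySketch {
    private final List<String> indices;
    private final Map<String, String> shards; // key: "index/shardId", value: shard status

    EntrySketch(List<String> indices, Map<String, String> shards) {
        // Runs only when assertions are enabled, mirroring the usual assert-helper pattern.
        assert assertShardsConsistent(indices, shards);
        this.indices = indices;
        this.shards = shards;
    }

    private static boolean assertShardsConsistent(List<String> indices, Map<String, String> shards) {
        Set<String> declaredIndices = Set.copyOf(indices);
        Set<String> indicesInShards = shards.keySet().stream()
            .map(key -> key.substring(0, key.indexOf('/')))
            .collect(Collectors.toSet());
        // Every declared index must have at least one shard entry and vice versa.
        assert declaredIndices.equals(indicesInShards)
            : "indices in shards " + indicesInShards + " differ from declared indices " + declaredIndices;
        return true;
    }
}
```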
ok
I also think this should go to 6.8
Jenkins run elasticsearch-ci/1 (failed to build some Docker image)
Thanks @ywelsch!
Assert that the given input shards and indices are consistent. Also, fixed the equality check for SnapshotsInProgress. Before this change the tests never had more than a single waiting shard per index, so they never failed as a result of the waiting shards list not being ordered. Follow-up to elastic#47552
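The ordering pitfall mentioned above can be illustrated with a small, self-contained sketch (hypothetical shard numbers, not the actual SnapshotsInProgress fields): plain list equality is order-sensitive, so two logically equivalent collections of waiting shards compare as unequal unless the lists are built in a deterministic order or compared order-insensitively.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the ordering issue in an equality check:
// the same waiting shards, collected in a different order.
public class WaitingShardsEqualitySketch {
    public static void main(String[] args) {
        List<Integer> waitingA = List.of(0, 1, 2);
        List<Integer> waitingB = List.of(2, 1, 0);

        // Plain list equality is order-sensitive and reports a difference
        // even though the same shards are waiting.
        System.out.println(waitingA.equals(waitingB)); // false

        // Comparing as sets (or sorting both lists first) is order-insensitive.
        System.out.println(Set.copyOf(waitingA).equals(Set.copyOf(waitingB))); // true
    }
}
```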
Adding a specific integration test that reproduces the problem fixed in elastic#47552. The issue otherwise only reproduces in the snapshot resiliency tests, which are not available in 6.8, where the fix is being backported as well.
This pull request is a backport of two closely related pull requests: elastic/elasticsearch#47552 and elastic/elasticsearch#47598. The purpose of this pull request is to track shard snapshots and mark them as failed when either multiple data nodes are lost during the snapshot process or shard snapshot failures occur before a node leaves the cluster. Before, this was not the case, so a failed shard would not have been marked as failed during snapshot finalization. The problem is fixed by correctly keeping track of all previously completed shard states in this case as well, and by adding a consistency assertion to SnapshotsInProgress. More details can be found in the original PRs mentioned above. (cherry picked from commit ed3aea5)