
Fix Snapshot Corruption in Edge Case #47552

Merged
merged 1 commit into elastic:master from the 47550 branch on Oct 4, 2019

Conversation

@original-brownbear (Member) commented Oct 4, 2019

This fixes a bug where shard snapshots were not marked as failures when
multiple data nodes are lost during the snapshot process or
shard snapshot failures have occurred before a node left the cluster.

The problem was that we were simply not adding any shard entries for completed
shards on node-left events. This has no effect for a successful shard, but
for a failed shard it would lead to that shard not being marked as failed during
snapshot finalization. Fixed by correctly keeping track of all previously completed
shard states as well in this case.
Also, added an assertion that, without this fix, would trip on almost every run of the
resiliency tests, and adjusted the serialization of SnapshotsInProgress.Entry so
we have a proper assertion message.
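
To make the fix concrete, here is a minimal, hypothetical sketch of the pattern described above (simplified stand-in types, not the actual SnapshotsInProgress code): when recomputing shard states after nodes leave the cluster, completed entries, including failed ones, are carried over instead of being dropped.

```java
// Illustrative sketch only: hypothetical, simplified types, not the real
// Elasticsearch implementation. It shows the pattern from the description:
// completed shard entries (SUCCESS and FAILED) must survive a node-left
// update so snapshot finalization still sees the failures.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class NodeLeftShardUpdateSketch {

    enum State { INIT, SUCCESS, FAILED, ABORTED }

    record ShardStatus(String nodeId, State state) {
        boolean completed() {
            return state == State.SUCCESS || state == State.FAILED;
        }
    }

    /** Recomputes shard statuses after the given nodes have left the cluster. */
    static Map<String, ShardStatus> shardsAfterNodeLeft(Map<String, ShardStatus> shards,
                                                        Set<String> departedNodes) {
        Map<String, ShardStatus> updated = new HashMap<>();
        for (Map.Entry<String, ShardStatus> entry : shards.entrySet()) {
            ShardStatus status = entry.getValue();
            if (status.completed()) {
                // The bug: completed shards were previously not re-added here, so a
                // FAILED shard silently vanished and the snapshot was finalized as if
                // that shard had succeeded.
                updated.put(entry.getKey(), status);
            } else if (departedNodes.contains(status.nodeId())) {
                // Shard was still running on a node that is now gone: mark it failed.
                updated.put(entry.getKey(), new ShardStatus(status.nodeId(), State.FAILED));
            } else {
                updated.put(entry.getKey(), status);
            }
        }
        return updated;
    }
}
```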

As far as I can tell, this bug goes back to at least v6.5.
In practice it is not too severe, since it only corrupts PARTIAL snapshots. But if the shards incorrectly marked SUCCESS make up a complete index, then the snapshot gives the false impression of being restorable. Also, snapshot status APIs that load the snap- blob for these shards will fail, because the snap- blob isn't there as expected.

Not sure this justifies a backport to 6.8.x, but it seems severe enough? Back-porting would be a little tricky, since ideally we'd add a new test for this using a real IT, given that we don't have the resiliency tests in 6.8.x.

Relates #47550 (not closing it, since the issue that the test isn't 100% deterministic remains)

Also relates to #46250, for which this issue would be catastrophic.

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear (Member Author)

Jenkins run elasticsearch-ci/1

@ywelsch (Contributor) left a comment

great find

@@ -829,6 +832,8 @@ public void onFailure(Exception e) {
}
}, updatedSnapshot.getRepositoryStateId(), false);
}
assert updatedSnapshot.shards().size() == snapshot.shards().size()
Contributor:

Can we assert in the constructor of SnapshotsInProgress.Entry that for every entry in "indices" there is at least one entry in the shards map and vice versa?

Member Author:

Yeah, that sounds good. Maybe do it in a follow-up, since it requires a bit of a rewrite of org.elasticsearch.snapshots.SnapshotsInProgressSerializationTests#randomSnapshot (it creates all kinds of bogus instances where the indices don't match up), and it might be nice to keep this change isolated?

Contributor:

ok
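
For illustration, a minimal sketch of the constructor invariant suggested above, using hypothetical, simplified types rather than the real SnapshotsInProgress.Entry API:

```java
// Hypothetical sketch of the suggested consistency assertion: every index named
// in "indices" must have at least one entry in the shards map, and vice versa.
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class EntrySketch {

    record ShardKey(String index, int shardId) {}

    private final List<String> indices;
    private final Map<ShardKey, String> shards; // shard -> state name

    EntrySketch(List<String> indices, Map<ShardKey, String> shards) {
        this.indices = indices;
        this.shards = shards;
        assert assertShardsConsistent(indices, shards);
    }

    private static boolean assertShardsConsistent(List<String> indices, Map<ShardKey, String> shards) {
        Set<String> indexNames = Set.copyOf(indices);
        Set<String> indexNamesInShards = shards.keySet().stream()
            .map(ShardKey::index)
            .collect(Collectors.toSet());
        // An assertion message makes the mismatch visible when the invariant trips.
        assert indexNames.equals(indexNamesInShards)
            : "indices " + indexNames + " are inconsistent with shards " + indexNamesInShards;
        return true;
    }
}
```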


@ywelsch (Contributor) commented Oct 4, 2019

I also think this should go to 6.8

@original-brownbear (Member Author)

Jenkins run elasticsearch-ci/1 (failed to build some Docker image)

@original-brownbear (Member Author)

Thanks @ywelsch!

@original-brownbear original-brownbear merged commit e244d65 into elastic:master Oct 4, 2019
@original-brownbear original-brownbear deleted the 47550 branch October 4, 2019 08:42
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 4, 2019
Assert that the given input shards and indices are consistent.
Also, fixed the equality check for SnapshotsInProgress.
Before this change, the tests never had more than a single waiting
shard per index, so they never failed as a result of the
waiting shards list not being ordered.

Follow up to elastic#47552
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 5, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 5, 2019
original-brownbear added a commit that referenced this pull request Oct 5, 2019
original-brownbear added a commit that referenced this pull request Oct 5, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 6, 2019
Adding a specific integration test that reproduces the problem
fixed in elastic#47552. The issue otherwise only reproduces in the snapshot
resiliency tests, which are not available in 6.8, where the
fix is also being backported.
original-brownbear added a commit that referenced this pull request Oct 7, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 7, 2019
original-brownbear added a commit that referenced this pull request Oct 7, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 7, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 7, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 7, 2019
original-brownbear added a commit that referenced this pull request Oct 7, 2019
original-brownbear added a commit that referenced this pull request Oct 7, 2019
* Add IT for Snapshot Issue in 47552 (#47627)

original-brownbear added a commit that referenced this pull request Oct 7, 2019
original-brownbear added a commit that referenced this pull request Oct 7, 2019
mkleen added a commit to crate/crate that referenced this pull request Nov 27, 2019
This pull request is a backport of elastic/elasticsearch#47552

The purpose of this pull request is to track shard snapshots and mark
them as failures when multiple data nodes are lost during the snapshot
process or shard snapshot failures have occurred before a node left the
cluster.
mkleen added a commit to crate/crate that referenced this pull request Nov 27, 2019
This pull request is a backport of elastic/elasticsearch#47552
combined with the related follow-up backport of
elastic/elasticsearch#47598

The purpose of this pull request is to track shard snapshots and mark
them as failures when multiple data nodes are lost during the snapshot
process or shard snapshot failures have occurred before a node left the
cluster.
mkleen added a commit to crate/crate that referenced this pull request Nov 27, 2019
This pull request is a backport of two closely related pull requests:
elastic/elasticsearch#47552
elastic/elasticsearch#47598

The purpose of this pull request is to track shard snapshots and mark
them as failed when either multiple data nodes are lost during the snapshot
process or shard snapshot failures occur before a node left the
cluster. Before, this was not the case, so a failed shard would not have been
marked as failed during snapshot finalization.

The problem is fixed by correctly keeping track of all previously completed
shard states in this case as well, and by adding a consistency assertion to
SnapshotsInProgress. More details can be found in the original PRs
mentioned above.
mkleen added a commit to crate/crate that referenced this pull request Nov 27, 2019
mkleen added a commit to crate/crate that referenced this pull request Nov 27, 2019
mkleen added a commit to crate/crate that referenced this pull request Nov 27, 2019
mkleen added a commit to crate/crate that referenced this pull request Nov 28, 2019
mkleen added a commit to crate/crate that referenced this pull request Dec 2, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Dec 2, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Dec 2, 2019
(cherry picked from commit ed3aea5)
mkleen added a commit to crate/crate that referenced this pull request Dec 5, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Dec 5, 2019
@original-brownbear original-brownbear restored the 47550 branch August 6, 2020 18:27