
Fix Snapshot State Machine Issues around Failed Clones #76419

Merged

Conversation

original-brownbear (Member)

With recent fixes it is never correct to simply remove a snapshot from the cluster state without updating the other snapshot entries: if an entry contains any successful shards, queued operations may depend on it. This change reproduces two issues that result from removing a snapshot without regard for other queued operations, and fixes them by making all removals of snapshots from the cluster state go through the same code path.
Also, it moves the point at which a snapshot is tracked as "ending" up a few lines, to satisfy an assertion that requires finishing snapshots to be in that collection.
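To make the invariant concrete, here is a minimal sketch of what "a single removal code path" means; the types and method names below are invented for illustration and are not the actual Elasticsearch code. The point is that dropping an entry always goes through one method that also re-checks entries queued behind it, instead of deleting the entry in place.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch of the invariant this PR enforces: every removal of a
    // snapshot entry from the (simplified) cluster state funnels through one
    // method that also updates entries queued behind the removed one.
    final class SnapshotStateSketch {

        record Entry(String name, boolean hasSuccessfulShards, String queuedAfter) {}

        private final Map<String, Entry> entries = new LinkedHashMap<>();

        // The single removal path. Never call entries.remove(...) anywhere else:
        // an entry with successful shards may be the data source for queued
        // operations, so dependents must be re-pointed in the same update.
        void removeSnapshot(String name) {
            Entry removed = entries.remove(name);
            if (removed == null) {
                return; // idempotent no-op if the entry is already gone
            }
            for (Map.Entry<String, Entry> e : entries.entrySet()) {
                if (name.equals(e.getValue().queuedAfter())) {
                    // A queued operation was waiting on the removed entry; it
                    // must be re-based on the remaining entries, not left
                    // referencing a snapshot that no longer exists.
                    e.setValue(new Entry(e.getValue().name(), e.getValue().hasSuccessfulShards(), null));
                }
            }
        }
    }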

elasticmachine added the Team:Distributed (Obsolete) label on Aug 12, 2021
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

    if (entry.isClone() && entry.state() == State.FAILED) {
        logger.debug("Removing failed snapshot clone [{}] from cluster state", entry);
-       removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
+       if (newFinalization) {
+           removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
original-brownbear (Member Author)
A possible follow-up here would be to also remove the data of partially failed clones from the repo. Since clones only really fail like this due to IO exceptions, and IO exceptions are unlikely unless something is seriously broken, I'm not sure it's worth the effort: the cleanup would probably fail as well (and it would happen on a subsequent delete run anyway). Since we aren't writing a new index-N (so this isn't really relevant to the scalability of the repo) and clones by their very nature add almost no bytes to the repo, I think this is good enough for now.

        blockOnWriteShardLevelMeta = true;
    }

    public void setBlockAndFailOnWriteShardLevelMeta() {
original-brownbear (Member Author)

It's not great to add yet another method here, but this case turns out to be important. Reproducing this the way we used to, by blocking and then restarting the master, simply did not cover this obviously broken spot.
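For context, a deterministic reproduction built on such a hook might look roughly like the sketch below. setBlockAndFailOnWriteShardLevelMeta() is the hook added in this PR; the surrounding helpers (getMockRepository, startClone) are invented here for illustration and assume an AbstractSnapshotIntegTestCase-style test.

    // Hypothetical test sketch: fail a clone deterministically at the
    // shard-level metadata write, instead of blocking and restarting master.
    public void testFailedCloneIsRemovedViaSharedCodePath() throws Exception {
        MockRepository repository = getMockRepository("test-repo");       // invented helper
        repository.setBlockAndFailOnWriteShardLevelMeta();                // hook from this PR
        ActionFuture<AcknowledgedResponse> clone =
                startClone("test-repo", "source-snap", "target-snap");    // invented helper
        // The clone hits the injected failure and must leave the cluster
        // state through the shared removal path, with no stale entry left.
        expectThrows(SnapshotException.class, clone::actionGet);
    }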

DaveCTurner (Contributor) left a comment

LGTM but a question about test coverage

    if (entry.isClone() && entry.state() == State.FAILED) {
        logger.debug("Removing failed snapshot clone [{}] from cluster state", entry);
-       removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
+       if (newFinalization) {
DaveCTurner (Contributor)

Do we have a test for this new condition? I deleted it and ran a few likely-looking suites, and also all of :server:internalClusterTest, but didn't see any failures.

original-brownbear (Member Author)

We do not; I'm not sure how to deterministically reproduce this at the moment. You can technically get here in scenarios where the master fails over twice in a row, though that would need to happen concurrently with finalizing a snapshot that was queued before the clone ... I'll think about a way to set this up; for now I added this to be defensive and to avoid pointless CS updates (technically these updates should be idempotent anyway, because they become no-ops as soon as the snapshot to be removed isn't in the CS).
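The idempotency argument can be sketched as follows; the helper names are invented for illustration and are not the actual SnapshotsService code. A redundant removal update returns the input state unchanged, so even a rare double master failover only costs a no-op cluster-state update.

    // Hypothetical sketch: a cluster-state update that removes a snapshot is
    // safe to submit more than once because it degrades to a no-op.
    ClusterState executeRemoval(ClusterState currentState, String snapshotToRemove) {
        if (snapshotsInProgress(currentState).contains(snapshotToRemove) == false) { // invented helper
            return currentState; // already gone, e.g. removed after an earlier failover
        }
        return stateWithoutSnapshot(currentState, snapshotToRemove); // invented helper
    }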

original-brownbear (Member Author)

Thanks David!

@original-brownbear original-brownbear merged commit 1f080e3 into elastic:master Aug 12, 2021
@original-brownbear original-brownbear deleted the fix-failure-edge-case branch August 12, 2021 18:41
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 17, 2021
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 17, 2021
original-brownbear added a commit that referenced this pull request Aug 17, 2021
original-brownbear added a commit that referenced this pull request Aug 17, 2021
@original-brownbear original-brownbear restored the fix-failure-edge-case branch April 18, 2023 20:39
Labels
>bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), v7.14.1, v7.15.0, v8.0.0-alpha1, v8.0.0-alpha2