
Fix Snapshot State Machine Issues around Failed Clones #76419

Merged

Conversation

original-brownbear (Member)

With recent fixes it is never correct to simply remove a snapshot from the cluster state without updating the other snapshot entries: if an entry contains any successful shards, queued operations may depend on it. This change reproduces two issues that result from removing a snapshot without regard for other queued operations, and fixes them by making all removals of snapshots from the cluster state go through the same code path.
Also, it moves the point at which a snapshot is tracked as "ending" up a few lines, to satisfy an assertion that requires finishing snapshots to be in that collection.
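To make the invariant concrete, here is a minimal sketch of what "a single removal code path" means; the types and method names below are invented for illustration and are not the actual Elasticsearch code. The point is that dropping an entry always goes through one method that also re-checks entries queued behind it, instead of deleting the entry in place.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch of the invariant this PR enforces: every removal of a
    // snapshot entry from the (simplified) cluster state funnels through one
    // method that also updates entries queued behind the removed one.
    final class SnapshotStateSketch {

        record Entry(String name, boolean hasSuccessfulShards, String queuedAfter) {}

        private final Map<String, Entry> entries = new LinkedHashMap<>();

        // The single removal path. Never call entries.remove(...) anywhere else:
        // an entry with successful shards may be the data source for queued
        // operations, so dependents must be re-pointed in the same update.
        void removeSnapshot(String name) {
            Entry removed = entries.remove(name);
            if (removed == null) {
                return; // idempotent no-op if the entry is already gone
            }
            for (Map.Entry<String, Entry> e : entries.entrySet()) {
                if (name.equals(e.getValue().queuedAfter())) {
                    // A queued operation was waiting on the removed entry; it
                    // must be re-based on the remaining entries, not left
                    // referencing a snapshot that no longer exists.
                    e.setValue(new Entry(e.getValue().name(), e.getValue().hasSuccessfulShards(), null));
                }
            }
        }
    }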

elasticmachine added the Team:Distributed (Obsolete) label on Aug 12, 2021
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

    if (entry.isClone() && entry.state() == State.FAILED) {
        logger.debug("Removing failed snapshot clone [{}] from cluster state", entry);
-       removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
+       if (newFinalization) {
+           removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
original-brownbear (Member Author)
A possible follow-up here would be to also remove the data of partially failed clones from the repo. Since clones only really fail like this due to IO exceptions, and IO exceptions are unlikely unless something is seriously broken, I'm not sure it's worth the effort: the cleanup would probably fail as well (and it would happen on a subsequent delete run anyway). Since we aren't writing a new index-N (so this isn't really relevant to the scalability of the repo) and clones by their very nature add almost no bytes to the repo, I think this is good enough for now.

        blockOnWriteShardLevelMeta = true;
    }

    public void setBlockAndFailOnWriteShardLevelMeta() {
original-brownbear (Member Author)

It's not great to add yet another method here, but this case turns out to be important. Reproducing this the way we used to, by blocking and then restarting the master, simply did not cover this obviously broken spot.
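For context, a deterministic reproduction built on such a hook might look roughly like the sketch below. setBlockAndFailOnWriteShardLevelMeta() is the hook added in this PR; the surrounding helpers (getMockRepository, startClone) are invented here for illustration and assume an AbstractSnapshotIntegTestCase-style test.

    // Hypothetical test sketch: fail a clone deterministically at the
    // shard-level metadata write, instead of blocking and restarting master.
    public void testFailedCloneIsRemovedViaSharedCodePath() throws Exception {
        MockRepository repository = getMockRepository("test-repo");       // invented helper
        repository.setBlockAndFailOnWriteShardLevelMeta();                // hook from this PR
        ActionFuture<AcknowledgedResponse> clone =
                startClone("test-repo", "source-snap", "target-snap");    // invented helper
        // The clone hits the injected failure and must leave the cluster
        // state through the shared removal path, with no stale entry left.
        expectThrows(SnapshotException.class, clone::actionGet);
    }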

DaveCTurner (Contributor) left a comment

LGTM but a question about test coverage

    if (entry.isClone() && entry.state() == State.FAILED) {
        logger.debug("Removing failed snapshot clone [{}] from cluster state", entry);
-       removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
+       if (newFinalization) {
DaveCTurner (Contributor)

Do we have a test for this new condition? I deleted it and ran a few likely-looking suites, and also all of :server:internalClusterTest, but didn't see any failures.

original-brownbear (Member Author)

We do not; I'm not sure how to deterministically reproduce this at the moment. You can technically get here in scenarios where the master fails over twice in a row, though that would need to happen concurrently with finalizing a snapshot that was queued before the clone ... I'll think about a way to set this up; for now I added this to be defensive and to avoid pointless CS updates (technically these updates should be idempotent anyway, because they become no-ops as soon as the snapshot to be removed isn't in the CS).
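The idempotency argument can be sketched as follows; the helper names are invented for illustration and are not the actual SnapshotsService code. A redundant removal update returns the input state unchanged, so even a rare double master failover only costs a no-op cluster-state update.

    // Hypothetical sketch: a cluster-state update that removes a snapshot is
    // safe to submit more than once because it degrades to a no-op.
    ClusterState executeRemoval(ClusterState currentState, String snapshotToRemove) {
        if (snapshotsInProgress(currentState).contains(snapshotToRemove) == false) { // invented helper
            return currentState; // already gone, e.g. removed after an earlier failover
        }
        return stateWithoutSnapshot(currentState, snapshotToRemove); // invented helper
    }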

original-brownbear (Member Author)

Thanks David!

@original-brownbear original-brownbear merged commit 1f080e3 into elastic:master Aug 12, 2021
@original-brownbear original-brownbear deleted the fix-failure-edge-case branch August 12, 2021 18:41
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 17, 2021
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 17, 2021
original-brownbear added a commit that referenced this pull request Aug 17, 2021
original-brownbear added a commit that referenced this pull request Aug 17, 2021
@original-brownbear original-brownbear restored the fix-failure-edge-case branch April 18, 2023 20:39
Labels
>bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), v7.14.1, v7.15.0, v8.0.0-alpha1, v8.0.0-alpha2