Wait for snapshot completion in SLM snapshot invocation #47051

dakrone · 2019-09-24T21:20:57Z

This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.

For example, previously a snapshot would be created and then registered
as a "success" even though it may have been aborted, or may have had a
subset of its shards fail. These cases are now handled by inspecting the
response to the `CreateSnapshotRequest` and ensuring that there are no
failures. If any failures are present, the history store now stores the
action as a failure instead of a success.
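
To make the check described above concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `SnapshotResult`, `State`, `Outcome`, and `classify` are hypothetical stand-ins for the real response types, and treating any non-`SUCCESS` state as a failure is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SlmCompletionCheckSketch {

    /** Hypothetical subset of the final states a completed snapshot can report. */
    enum State { SUCCESS, PARTIAL, FAILED }

    /** Hypothetical, simplified stand-in for the info returned once the snapshot has completed. */
    static final class SnapshotResult {
        final String name;
        final State state;
        final List<Exception> shardFailures;

        SnapshotResult(String name, State state, List<Exception> shardFailures) {
            this.name = name;
            this.state = state;
            this.shardFailures = shardFailures;
        }
    }

    /** What would be handed to the history store: success, or a failure with its cause. */
    static final class Outcome {
        final boolean success;
        final Exception error; // null when success is true

        Outcome(boolean success, Exception error) {
            this.success = success;
            this.error = error;
        }
    }

    /** Inspect the completed snapshot and decide whether to record a success or a failure. */
    static Outcome classify(SnapshotResult result) {
        if (result.state == State.SUCCESS && result.shardFailures.isEmpty()) {
            return new Outcome(true, null);
        }
        Exception e = new IllegalStateException(
            "snapshot [" + result.name + "] finished as " + result.state
                + " with " + result.shardFailures.size() + " failed shard(s)");
        // Attach each shard-level failure as a suppressed exception so the details are not lost.
        for (Exception shardFailure : result.shardFailures) {
            e.addSuppressed(shardFailure);
        }
        return new Outcome(false, e);
    }

    public static void main(String[] args) {
        Outcome clean = classify(
            new SnapshotResult("snap-1", State.SUCCESS, Collections.<Exception>emptyList()));
        Outcome partial = classify(
            new SnapshotResult("snap-2", State.PARTIAL,
                Arrays.<Exception>asList(new RuntimeException("shard 0 failed"))));
        System.out.println("snap-1 recorded as success: " + clean.success);
        System.out.println("snap-2 recorded as success: " + partial.success);
        System.out.println("snap-2 error: " + partial.error.getMessage());
    }
}
```

The point of the sketch is only that the decision is made after completion, from the final result, rather than at the moment the snapshot request is accepted.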

This also fixes an issue where the logging was reporting something
incorrectly, and fixes the case where the SLM duration was not
dynamically updatable.

Relates to #38461 and #43663


elasticmachine · 2019-09-24T21:20:59Z

Pinging @elastic/es-core-features

gwbrown

Left some comments, all either minor or thoughts about the future that we don't need to address right now. To be clear, I'm good with merging this as-is.

gwbrown · 2019-09-24T22:53:25Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotRetentionTask.java

-                        elapsedDeletionTime, maximumTime, deleted, count, failed);
-                    slmStats.deletionTime(elapsedDeletionTime);
+                        totalDeletionTime, maximumTime, deleted, count, failed);
+                    slmStats.deletionTime(totalDeletionTime);


These changes aren't really related to the main purpose of this PR. They're very minor so I think it's fine, but going forward I think we should try to keep PRs focused on one conceptual change - this case bugs me a little because this PR otherwise doesn't really touch retention.

dakrone · Sep 25, 2019

Yeah, it bugs me a little too; I'll separate them in the future.


gwbrown · 2019-09-24T23:14:48Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java

+                        // Add each failed shard's exception as suppressed
+                        snapInfo.shardFailures().forEach(failure -> e.addSuppressed(failure.getCause()));
+                        // Call the failure handler to register this as a failure and persist it
+                        onFailure(e);


Given what Armin was saying on #46988 about partial snapshots sometimes being more like successes, treating them here as a failure may be awkward for some users. That said, I think we can go with it for now and tweak the behavior later if we get strong feedback about it.

dakrone · Sep 25, 2019

I think in order to handle it better, we may have to change the history store to record every state a snapshot can be in; that way we could still tell if there were errors (PARTIAL snapshots). For now though, I think it's safer to treat PARTIAL snapshots as failures.
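
As a rough illustration of the idea in this reply, a history entry could carry the snapshot's final state rather than a plain success/failure flag. This is a sketch only: `HistoryItem`, `SnapshotState`, `hadErrors`, and `countsAsFailure` are hypothetical names and do not reflect the actual history store schema.

```java
import java.util.ArrayList;
import java.util.List;

class SnapshotHistorySketch {

    /** Hypothetical subset of final snapshot states. */
    enum SnapshotState { SUCCESS, PARTIAL, FAILED }

    /** Hypothetical history entry that keeps the final state instead of a boolean. */
    static final class HistoryItem {
        final String snapshotName;
        final SnapshotState state;
        final String errorDetails; // null when there is nothing to report

        HistoryItem(String snapshotName, SnapshotState state, String errorDetails) {
            this.snapshotName = snapshotName;
            this.state = state;
            this.errorDetails = errorDetails;
        }

        /** True whenever something went wrong, including PARTIAL snapshots. */
        boolean hadErrors() {
            return state != SnapshotState.SUCCESS;
        }

        /** Lets the caller decide later whether PARTIAL should count as a failure. */
        boolean countsAsFailure(boolean partialIsFailure) {
            switch (state) {
                case SUCCESS:
                    return false;
                case PARTIAL:
                    return partialIsFailure;
                default:
                    return true;
            }
        }
    }

    /** Count failures in a list of entries under the chosen PARTIAL policy. */
    static long countFailures(List<HistoryItem> items, boolean partialIsFailure) {
        long failures = 0;
        for (HistoryItem item : items) {
            if (item.countsAsFailure(partialIsFailure)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<HistoryItem> items = new ArrayList<>();
        items.add(new HistoryItem("snap-1", SnapshotState.SUCCESS, null));
        items.add(new HistoryItem("snap-2", SnapshotState.PARTIAL, "1 shard failed"));
        items.add(new HistoryItem("snap-3", SnapshotState.FAILED, "snapshot aborted"));
        System.out.println("failures (PARTIAL as failure): " + countFailures(items, true));
        System.out.println("failures (PARTIAL tolerated): " + countFailures(items, false));
    }
}
```

With the state stored, the "PARTIAL is a failure" choice made in this PR becomes a read-time policy that could be revisited without losing the error information.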

* Wait for snapshot completion in SLM snapshot invocation

dakrone added :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.0.0 v7.5.0 labels Sep 24, 2019

dakrone requested a review from gwbrown September 24, 2019 21:20

gwbrown approved these changes Sep 24, 2019


Enhance comment about exceptions added

7fbcc32

dakrone merged commit 4527824 into elastic:master Sep 25, 2019

dakrone deleted the slm-wait-for-completion branch September 25, 2019 20:24

jimczi added the >enhancement label Nov 12, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

joegallo added a commit to joegallo/elasticsearch that referenced this pull request Sep 15, 2022

Update this comment, see elastic#47051

780c6ce


