Wait for snapshot completion in SLM snapshot invocation #47051
This changes the snapshots internally invoked by SLM to wait for completion, which allows us to capture more snapshot failure scenarios. Previously, a snapshot would be created and then registered as a "success" even though it may have been aborted or had a subset of its shards fail. These cases are now handled by inspecting the response to the `CreateSnapshotRequest` and ensuring that there are no failures. If any failures are present, the history store now records the action as a failure instead of a success. Relates to elastic#38461 and elastic#43663
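As a rough illustration of the check described above, here is a minimal, self-contained sketch. It does not use the Elasticsearch classes: `SnapshotState`, `ShardFailure`, `SnapshotResult`, and `HistoryRecorder` are hypothetical stand-ins invented for this example, and treating any non-`SUCCESS` state as a failure is an assumption rather than the exact condition used in the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Elasticsearch classes.
enum SnapshotState { SUCCESS, PARTIAL, FAILED }

final class ShardFailure {
    final String index;
    final int shardId;
    final Exception cause;

    ShardFailure(String index, int shardId, Exception cause) {
        this.index = index;
        this.shardId = shardId;
        this.cause = cause;
    }
}

final class SnapshotResult {
    final String name;
    final SnapshotState state;
    final List<ShardFailure> shardFailures;

    SnapshotResult(String name, SnapshotState state, List<ShardFailure> shardFailures) {
        this.name = name;
        this.state = state;
        this.shardFailures = shardFailures;
    }
}

// Records the outcome only after the snapshot has completed,
// so aborted or partially failed snapshots are not counted as successes.
final class HistoryRecorder {
    final List<String> successes = new ArrayList<>();
    final List<Exception> failures = new ArrayList<>();

    void onResponse(SnapshotResult result) {
        // Assumption: only a SUCCESS state with zero failed shards counts as a success.
        if (result.state == SnapshotState.SUCCESS && result.shardFailures.isEmpty()) {
            successes.add(result.name);
            return;
        }
        Exception e = new IllegalStateException(
            "snapshot [" + result.name + "] finished as " + result.state
                + " with " + result.shardFailures.size() + " failed shard(s)");
        // Attach each failed shard's exception as suppressed so the details are kept.
        for (ShardFailure failure : result.shardFailures) {
            e.addSuppressed(failure.cause);
        }
        failures.add(e);
    }
}
```

With this shape, a completed snapshot with one failed shard ends up in `failures` with the shard's exception attached as a suppressed exception, instead of being recorded in `successes`.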
Pinging @elastic/es-core-features
Left some comments, all either minor or thoughts about the future that we don't need to address right now. To be clear, I'm good with merging this as-is.
elapsedDeletionTime, maximumTime, deleted, count, failed);
slmStats.deletionTime(elapsedDeletionTime);
totalDeletionTime, maximumTime, deleted, count, failed);
slmStats.deletionTime(totalDeletionTime);
These changes aren't really related to the main purpose of this PR. They're very minor so I think it's fine, but going forward I think we should try to keep PRs focused on one conceptual change - this case bugs me a little because this PR otherwise doesn't really touch retention.
Yeah, it bugs me a little too, I'll separate them in the future.
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java
// Add each failed shard's exception as suppressed
snapInfo.shardFailures().forEach(failure -> e.addSuppressed(failure.getCause()));
// Call the failure handler to register this as a failure and persist it
onFailure(e);
Given what Armin was saying on #46988 about partial snapshots sometimes being more like successes, treating them here as a failure may be awkward for some users. That said, I think we can go with it for now and tweak the behavior later if we get strong feedback about it.
I think in order to better handle it, we may have to change the history store to store all the states that a snapshot can be in, that way we could still tell if there were errors (PARTIAL snapshots). For now though, I think it's safer to treat PARTIAL snapshots as failures.
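One way to picture the idea of storing every snapshot state in the history store is sketched below. This is purely illustrative: `RecordedState`, `HistoryItem`, and `HistoryStats` are hypothetical names made up for this example, not the SLM history store's actual types or fields.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical types for illustration; not the actual SLM history store classes.
enum RecordedState { SUCCESS, PARTIAL, FAILED }

final class HistoryItem {
    final String policyId;
    final String snapshotName;
    final RecordedState state;
    final String errorDetails; // null when nothing went wrong

    HistoryItem(String policyId, String snapshotName, RecordedState state, String errorDetails) {
        this.policyId = policyId;
        this.snapshotName = snapshotName;
        this.state = state;
        this.errorDetails = errorDetails;
    }

    // PARTIAL stays distinguishable from FAILED, but both still report errors.
    boolean hasErrors() {
        return state != RecordedState.SUCCESS;
    }
}

final class HistoryStats {
    private final Map<RecordedState, Integer> counts = new EnumMap<>(RecordedState.class);

    void record(HistoryItem item) {
        counts.merge(item.state, 1, Integer::sum);
    }

    int count(RecordedState state) {
        return counts.getOrDefault(state, 0);
    }
}
```

Keeping the state as an enum rather than a success/failure boolean would let partial snapshots be reported separately later without losing the fact that errors occurred.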
This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.
For example, previously a snapshot would be created and then registered
as a "success" even though it may have been aborted or had a subset of
its shards fail. These cases are now handled by inspecting the response
to the `CreateSnapshotRequest` and ensuring that there are no failures.
If any failures are present, the history store now stores the action as
a failure instead of a success.
This also fixes an issue where the logging was reporting something
incorrectly, and fixes the case where the SLM duration was not
dynamically updateable.
Relates to #38461 and #43663