
Wait for snapshot completion in SLM snapshot invocation #47051

Merged: 2 commits into elastic:master from slm-wait-for-completion on Sep 25, 2019

Conversation

@dakrone (Member) commented on Sep 24, 2019

This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.

For example, previously a snapshot would be created and then registered
as a "success" even though the snapshot may have been aborted, or may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now records the action as a failure instead of a success.
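
For illustration, a minimal sketch of that wait-and-inspect flow (a sketch only, not the actual SLM code: `CreateSnapshotRequest`, `SnapshotInfo`, and `ActionListener` are real Elasticsearch classes, but the `client` wiring and the `onSuccess`/`onFailure` helpers here are hypothetical):

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.cluster.snapshots.create.CreateSnapshotRequest;
import org.elasticsearch.snapshots.SnapshotInfo;

// Ask for the snapshot and hold the response until it completes,
// so the final per-shard results are available for inspection.
CreateSnapshotRequest request = new CreateSnapshotRequest("my_repository", "my_snapshot")
    .waitForCompletion(true);

client.admin().cluster().createSnapshot(request, ActionListener.wrap(
    response -> {
        SnapshotInfo snapInfo = response.getSnapshotInfo();
        if (snapInfo.failedShards() > 0) {
            // Some shards failed: record the run as a failure, carrying
            // each shard failure along as a suppressed exception.
            Exception e = new RuntimeException(snapInfo.failedShards() + " shards failed");
            snapInfo.shardFailures().forEach(failure -> e.addSuppressed(failure.getCause()));
            onFailure(e); // hypothetical helper that persists a failure record
        } else {
            onSuccess(snapInfo); // hypothetical helper that persists a success record
        }
    },
    e -> onFailure(e)));
```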

This also fixes an issue where the logging reported incorrect
information, and fixes the case where the SLM duration was not
dynamically updatable.
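
As a rough sketch of what making a duration setting dynamically updatable involves (the setting key and default below are made up; `Setting.timeSetting` and `ClusterSettings.addSettingsUpdateConsumer` are the real Elasticsearch settings APIs):

```java
import org.elasticsearch.common.settings.ClusterSettings;
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.TimeValue;

class DurationHolder {
    // Hypothetical key and default; Property.Dynamic is what allows
    // the value to be changed at runtime via the cluster-settings API.
    static final Setting<TimeValue> DURATION_SETTING = Setting.timeSetting(
        "slm.example.duration", TimeValue.timeValueHours(1),
        Setting.Property.Dynamic, Setting.Property.NodeScope);

    private volatile TimeValue duration = DURATION_SETTING.getDefault(Settings.EMPTY);

    DurationHolder(ClusterSettings clusterSettings) {
        // Without registering this consumer, a cluster-settings update
        // would never reach the in-memory value.
        clusterSettings.addSettingsUpdateConsumer(DURATION_SETTING, d -> this.duration = d);
    }
}
```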

Relates to #38461 and #43663

@dakrone added the :Data Management/ILM+SLM (Index and Snapshot lifecycle management), v8.0.0, and v7.5.0 labels on Sep 24, 2019
@dakrone requested a review from @gwbrown on September 24, 2019 21:20
@elasticmachine (Collaborator): Pinging @elastic/es-core-features

@gwbrown (Contributor) left a comment:
Left some comments, all either minor or thoughts about the future that we don't need to address right now. To be clear, I'm good with merging this as-is.

-    elapsedDeletionTime, maximumTime, deleted, count, failed);
-    slmStats.deletionTime(elapsedDeletionTime);
+    totalDeletionTime, maximumTime, deleted, count, failed);
+    slmStats.deletionTime(totalDeletionTime);
@gwbrown (Contributor) commented on the diff above:
These changes aren't really related to the main purpose of this PR. They're very minor so I think it's fine, but going forward I think we should try to keep PRs focused on one conceptual change - this case bugs me a little because this PR otherwise doesn't really touch retention.

@dakrone (Member, Author) replied:
Yeah, it bugs me a little too, I'll separate them in the future.

// Add each failed shard's exception as suppressed
snapInfo.shardFailures().forEach(failure -> e.addSuppressed(failure.getCause()));
// Call the failure handler to register this as a failure and persist it
onFailure(e);
@gwbrown (Contributor) commented on the code above:
Given what Armin was saying on #46988 about partial snapshots sometimes being more like successes, treating them here as a failure may be awkward for some users. That said, I think we can go with it for now and tweak the behavior later if we get strong feedback about it.

@dakrone (Member, Author) replied:
I think in order to better handle it, we may have to change the history store to store all the states that a snapshot can be in, that way we could still tell if there were errors (PARTIAL snapshots). For now though, I think it's safer to treat PARTIAL snapshots as failures.
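
As a rough sketch of that idea (the `SnapshotState` enum is the real Elasticsearch type; the mapping itself is purely illustrative, not what the history store does today):

```java
import org.elasticsearch.snapshots.SnapshotState;

// Illustrative only: if the history store kept the full SnapshotState,
// PARTIAL could later be reported distinctly instead of being folded
// into "failure" as it is for now.
static String historyStatusFor(SnapshotState state) {
    switch (state) {
        case SUCCESS:
            return "success";
        case PARTIAL: // some shards failed; treated as a failure for now
        case FAILED:
        default:
            return "failure";
    }
}
```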

@dakrone merged commit 4527824 into elastic:master on Sep 25, 2019
@dakrone deleted the slm-wait-for-completion branch on September 25, 2019 20:24
dakrone added a commit that referenced this pull request Sep 25, 2019
* Wait for snapshot completion in SLM snapshot invocation

joegallo added a commit to joegallo/elasticsearch that referenced this pull request Sep 15, 2022