Retention for Snapshot Lifecycle Management #43663
dakrone added the `>feature`, `:Data Management/ILM+SLM` (Index and Snapshot lifecycle management), and `7x` labels on Jun 26, 2019
Pinging @elastic/es-core-features
/cc @cjcenizal @jen-huang
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Jun 28, 2019:
This commit adds the `SnapshotRetentionConfiguration` class and its HLRC counterpart to encapsulate the configuration for SLM retention. Currently only a single parameter is supported as an example (we still need to discuss the different options we want to support and their names) to keep the size of the PR down. It also does not yet include version serialization checks since the original SLM branch has not yet been merged. Relates to elastic#43663
dakrone added a commit that referenced this issue on Jul 15, 2019:
* Add SnapshotRetentionConfiguration for retention configuration

This commit adds the `SnapshotRetentionConfiguration` class and its HLRC counterpart to encapsulate the configuration for SLM retention. Currently only a single parameter is supported as an example (we still need to discuss the different options we want to support and their names) to keep the size of the PR down. It also does not yet include version serialization checks since the original SLM branch has not yet been merged. Relates to #43663

* Fix REST tests
* Fix more documentation
* Use Objects.equals to avoid NPE
* Put `randomSnapshotLifecyclePolicy` in only one place
* Occasionally return retention with no configuration
dakrone added a commit that referenced this issue on Jul 17, 2019:
* Add SnapshotRetentionConfiguration for retention configuration

This commit adds the `SnapshotRetentionConfiguration` class and its HLRC counterpart to encapsulate the configuration for SLM retention. Currently only a single parameter is supported as an example (we still need to discuss the different options we want to support and their names) to keep the size of the PR down. It also does not yet include version serialization checks since the original SLM branch has not yet been merged. Relates to #43663

* Fix REST tests
* Fix more documentation
* Use Objects.equals to avoid NPE
* Put `randomSnapshotLifecyclePolicy` in only one place
* Occasionally return retention with no configuration
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Jul 23, 2019:
This commit implements the snapshot filtering and deletion for `SnapshotRetentionTask`. Currently only the expire-after age is used for determining whether a snapshot is eligible for deletion. Relates to elastic#43663
dakrone added a commit that referenced this issue on Jul 25, 2019:
* Implement SnapshotRetentionTask's snapshot filtering and deletion

This commit implements the snapshot filtering and deletion for `SnapshotRetentionTask`. Currently only the expire-after age is used for determining whether a snapshot is eligible for deletion. Relates to #43663

* Fix deletes running on the wrong thread
* Handle missing or null policy in snap metadata differently
* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>
* Use the `OriginSettingClient` to work with security, enhance logging
* Prevent NPE in test by mocking Client
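The expire-after check described in this commit can be sketched in Python (hypothetical names and signature; the real logic lives in `SnapshotRetentionTask` in Java):

```python
from datetime import datetime, timedelta

def eligible_for_deletion(snapshot_start: datetime,
                          expire_after: timedelta,
                          now: datetime) -> bool:
    """A snapshot becomes eligible for deletion once its age exceeds expire_after."""
    return (now - snapshot_start) > expire_after

# A 30-day expiry: a June snapshot is eligible in late July, yesterday's is not.
now = datetime(2019, 7, 25)
assert eligible_for_deletion(datetime(2019, 6, 1), timedelta(days=30), now)
assert not eligible_for_deletion(datetime(2019, 7, 24), timedelta(days=30), now)
```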
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Jul 26, 2019:
This adds the configuration options for `min_count` and `max_count` as well as the logic for determining whether a snapshot meets these criteria to SLM's retention feature. These options are optional and one, two, or all three can be specified in an SLM policy. Relates to elastic#43663
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Jul 30, 2019:
Semi-related to elastic#44465, this allows the `"retention"` configuration map to be missing. Relates to elastic#43663
dakrone added a commit that referenced this issue on Jul 31, 2019:
dakrone added a commit that referenced this issue on Jul 31, 2019:
This adds the configuration options for `min_count` and `max_count` as well as the logic for determining whether a snapshot meets these criteria to SLM's retention feature. These options are optional and one, two, or all three can be specified in an SLM policy. Relates to #43663
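How the three optional criteria might combine can be sketched in Python (a hypothetical function approximating the behavior described above, not the actual implementation: `min_count` protects snapshots from deletion even when expired, while `max_count` removes the oldest snapshots beyond the cap):

```python
def snapshots_to_delete(snapshots, expired, min_count=None, max_count=None):
    """snapshots: snapshot ids ordered oldest-first.
    expired: set of ids whose age exceeds expire_after.
    Returns the ids to delete, oldest first."""
    delete = []
    remaining = len(snapshots)
    for snap in snapshots:  # oldest candidates considered first
        over_max = max_count is not None and remaining > max_count
        if (over_max or snap in expired) and \
                (min_count is None or remaining > min_count):
            delete.append(snap)
            remaining -= 1
    return delete

# min_count=4 keeps s2 even though it is expired; only s1 can go.
assert snapshots_to_delete(["s1", "s2", "s3", "s4", "s5"],
                           {"s1", "s2"}, min_count=4) == ["s1"]
# max_count=3 trims the two oldest even though none are expired.
assert snapshots_to_delete(["s1", "s2", "s3", "s4", "s5"],
                           set(), max_count=3) == ["s1", "s2"]
```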
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Jul 31, 2019:
With a cluster that has a large number of snapshots, it's possible that snapshot deletion can take a very long time (especially since deletes currently have to happen in a serial fashion). To prevent snapshot deletion from taking forever in a cluster and blocking other operations, this commit adds a setting to allow configuring a maximum time to spend deleting snapshots during retention. This dynamic setting defaults to 1 hour and is best-effort, meaning that it doesn't hard-stop a deletion at the hour mark, but ensures that once the time has passed, all subsequent deletions are deferred until the next retention cycle. Relates to elastic#43663
This was referenced Jul 31, 2019
dakrone added a commit that referenced this issue on Aug 7, 2019:
* Time-bound deletion of snapshots in retention delete function

With a cluster that has a large number of snapshots, it's possible that snapshot deletion can take a very long time (especially since deletes currently have to happen in a serial fashion). To prevent snapshot deletion from taking forever in a cluster and blocking other operations, this commit adds a setting to allow configuring a maximum time to spend deleting snapshots during retention. This dynamic setting defaults to 1 hour and is best-effort, meaning that it doesn't hard-stop a deletion at the hour mark, but ensures that once the time has passed, all subsequent deletions are deferred until the next retention cycle. Relates to #43663

* Wow snapshots suuuure can take a long time.
* Use a LongSupplier instead of actually sleeping
* Remove TestLogging annotation
* Remove rate limiting
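The best-effort time bound can be sketched as follows, with an injected time source instead of a real clock (echoing the "Use a LongSupplier instead of actually sleeping" follow-up; names are hypothetical):

```python
import itertools

def run_retention(snapshots, delete_fn, time_supplier, max_millis):
    """Delete snapshots serially; once elapsed time exceeds max_millis,
    defer the remainder to the next retention cycle (best-effort: an
    in-flight delete is never interrupted)."""
    start = time_supplier()
    deleted, deferred = [], []
    for snap in snapshots:
        if time_supplier() - start > max_millis:
            deferred.append(snap)  # deadline passed: skip remaining deletes
            continue
        delete_fn(snap)
        deleted.append(snap)
    return deleted, deferred

# Fake clock advancing 40 "ms" per call: the deadline of 100 ms is
# crossed after two deletes, so the last two snapshots are deferred.
clock = itertools.count(0, 40)
deleted, deferred = run_retention(["a", "b", "c", "d"], lambda s: None,
                                  lambda: next(clock), max_millis=100)
assert deleted == ["a", "b"]
assert deferred == ["c", "d"]
```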
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Aug 8, 2019:
This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The stats stored include the number of snapshots taken, failed, deleted, the number of retention runs, as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount of time spent deleting snapshots from SLM retention. This commit also adds an endpoint for retrieving all stats (further commits will expose this in the SLM get-policy API) that looks like:

```
GET /_slm/stats

{
  "retention_runs" : 13,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "1.4s",
  "retention_deletion_time_millis" : 1404,
  "policy_metrics" : {
    "daily-snapshots2" : {
      "snapshots_taken" : 7,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 6,
      "snapshot_deletion_failures" : 0
    },
    "daily-snapshots" : {
      "snapshots_taken" : 12,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 12,
      "snapshot_deletion_failures" : 6
    }
  },
  "total_snapshots_taken" : 19,
  "total_snapshots_failed" : 0,
  "total_snapshots_deleted" : 18,
  "total_snapshot_deletion_failures" : 6
}
```

This does not yet include HLRC for this, as this commit is quite large on its own. That will be added in a subsequent commit. Relates to elastic#43663
dakrone added a commit that referenced this issue on Aug 13, 2019:
* Add SLM metrics gathering and endpoint

This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The stats stored include the number of snapshots taken, failed, deleted, the number of retention runs, as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount of time spent deleting snapshots from SLM retention. This commit also adds an endpoint for retrieving all stats (further commits will expose this in the SLM get-policy API) that looks like:

```
GET /_slm/stats

{
  "retention_runs" : 13,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "1.4s",
  "retention_deletion_time_millis" : 1404,
  "policy_metrics" : {
    "daily-snapshots2" : {
      "snapshots_taken" : 7,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 6,
      "snapshot_deletion_failures" : 0
    },
    "daily-snapshots" : {
      "snapshots_taken" : 12,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 12,
      "snapshot_deletion_failures" : 6
    }
  },
  "total_snapshots_taken" : 19,
  "total_snapshots_failed" : 0,
  "total_snapshots_deleted" : 18,
  "total_snapshot_deletion_failures" : 6
}
```

This does not yet include HLRC for this, as this commit is quite large on its own. That will be added in a subsequent commit. Relates to #43663

* Version qualify serialization
* Initialize counters outside constructor
* Use computeIfAbsent instead of being too verbose
* Move part of XContent generation into subclass
* Fix REST action for master merge
* Unused import
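As a sanity check on the response shape above, the `total_*` fields are sums of the corresponding per-policy metrics. A short Python illustration of that relationship, using the numbers from the example response:

```python
# Per-policy metrics copied from the example /_slm/stats response above.
policy_metrics = {
    "daily-snapshots2": {"snapshots_taken": 7, "snapshots_failed": 0,
                         "snapshots_deleted": 6, "snapshot_deletion_failures": 0},
    "daily-snapshots": {"snapshots_taken": 12, "snapshots_failed": 0,
                        "snapshots_deleted": 12, "snapshot_deletion_failures": 6},
}

def total(metric):
    """Aggregate one metric across all policies."""
    return sum(stats[metric] for stats in policy_metrics.values())

assert total("snapshots_taken") == 19            # total_snapshots_taken
assert total("snapshots_deleted") == 18          # total_snapshots_deleted
assert total("snapshot_deletion_failures") == 6  # total_snapshot_deletion_failures
```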
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Oct 4, 2019:
This adds a default for the `slm.retention_schedule` setting, setting it to `0 30 1 * * ?`, which is 1:30am every day. Having retention unset meant that it would never be invoked to clean up snapshots. We determined it would be better to have a default than never to be run. When coming to a decision, we weighed the option of an absolute time (such as 1:30am) versus a periodic invocation (like every 12 hours). In the end we decided on the absolute time because it has better predictability and consistency than a periodic invocation, which would rely on when the master node was elected or restarted. Relates to elastic#43663
dakrone added a commit that referenced this issue on Oct 4, 2019:
This adds a default for the `slm.retention_schedule` setting, setting it to `0 30 1 * * ?`, which is 1:30am every day. Having retention unset meant that it would never be invoked to clean up snapshots. We determined it would be better to have a default than never to be run. When coming to a decision, we weighed the option of an absolute time (such as 1:30am) versus a periodic invocation (like every 12 hours). In the end we decided on the absolute time because it has better predictability and consistency than a periodic invocation, which would rely on when the master node was elected or restarted. Relates to #43663
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Oct 4, 2019:
This enhances the existing SLM test using users/roles/etc to also test that SLM retention works when security is enabled. Relates to elastic#43663
dakrone added a commit that referenced this issue on Oct 4, 2019:
This adds a default for the `slm.retention_schedule` setting, setting it to `0 30 1 * * ?`, which is 1:30am every day. Having retention unset meant that it would never be invoked to clean up snapshots. We determined it would be better to have a default than never to be run. When coming to a decision, we weighed the option of an absolute time (such as 1:30am) versus a periodic invocation (like every 12 hours). In the end we decided on the absolute time because it has better predictability and consistency than a periodic invocation, which would rely on when the master node was elected or restarted. Relates to #43663
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Oct 7, 2019:
This separates a start/stop/status API for SLM from being tied to ILM's operation mode. These APIs look like:

```
POST /_slm/stop
POST /_slm/start
GET /_slm/status
```

This allows administrators to have fine-grained control over preventing periodic snapshots and deletions while performing cluster maintenance. Relates to elastic#43663
dakrone added a commit that referenced this issue on Oct 8, 2019:
This enhances the existing SLM test using users/roles/etc to also test that SLM retention works when security is enabled. Relates to #43663
dakrone added a commit that referenced this issue on Oct 8, 2019:
This enhances the existing SLM test using users/roles/etc to also test that SLM retention works when security is enabled. Relates to #43663
dakrone added a commit that referenced this issue on Oct 8, 2019:
* Add Snapshot Lifecycle Retention documentation

This commit adds API and general purpose documentation for SLM retention. Relates to #43663

* Fix docs tests
* Update default now that #47604 has been merged
* Update docs/reference/ilm/apis/slm-api.asciidoc (Co-Authored-By: Gordon Brown <[email protected]>)
* Update docs/reference/ilm/apis/slm-api.asciidoc (Co-Authored-By: Gordon Brown <[email protected]>)
* Update docs with feedback
dakrone added a commit that referenced this issue on Oct 8, 2019:
* Add Snapshot Lifecycle Retention documentation

This commit adds API and general purpose documentation for SLM retention. Relates to #43663

* Fix docs tests
* Update default now that #47604 has been merged
* Update docs/reference/ilm/apis/slm-api.asciidoc (Co-Authored-By: Gordon Brown <[email protected]>)
* Update docs/reference/ilm/apis/slm-api.asciidoc (Co-Authored-By: Gordon Brown <[email protected]>)
* Update docs with feedback
dakrone added a commit that referenced this issue on Oct 8, 2019:
* Separate SLM stop/start/status API from ILM

This separates a start/stop/status API for SLM from being tied to ILM's operation mode. These APIs look like:

```
POST /_slm/stop
POST /_slm/start
GET /_slm/status
```

This allows administrators to have fine-grained control over preventing periodic snapshots and deletions while performing cluster maintenance. Relates to #43663

* Allow going from RUNNING to STOPPED
* Align with the OperationMode rules
* Fix slmStopping method
* Make OperationModeUpdateTask constructor private
* Wipe snapshots better in test
dakrone added a commit that referenced this issue on Oct 8, 2019:
* Separate SLM stop/start/status API from ILM

This separates a start/stop/status API for SLM from being tied to ILM's operation mode. These APIs look like:

```
POST /_slm/stop
POST /_slm/start
GET /_slm/status
```

This allows administrators to have fine-grained control over preventing periodic snapshots and deletions while performing cluster maintenance. Relates to #43663

* Allow going from RUNNING to STOPPED
* Align with the OperationMode rules
* Fix slmStopping method
* Make OperationModeUpdateTask constructor private
* Wipe snapshots better in test
This was referenced Oct 9, 2019
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Oct 15, 2019:
This adds the missing xpack usage and info information into the `/_xpack` and `/_xpack/usage` APIs. The output now looks like:

```
GET /_xpack/usage

{
  ...
  "slm" : {
    "available" : true,
    "enabled" : true,
    "policy_count" : 1,
    "policy_stats" : {
      "retention_runs" : 0,
      ...
    }
  }
}
```

and

```
GET /_xpack

{
  ...
  "features" : {
    ...
    "slm" : {
      "available" : true,
      "enabled" : true
    },
    ...
  }
}
```

Relates to elastic#43663
dakrone added a commit that referenced this issue on Oct 16, 2019:
* Add SLM support to xpack usage and info APIs

This adds the missing xpack usage and info information into the `/_xpack` and `/_xpack/usage` APIs. The output now looks like:

```
GET /_xpack/usage

{
  ...
  "slm" : {
    "available" : true,
    "enabled" : true,
    "policy_count" : 1,
    "policy_stats" : {
      "retention_runs" : 0,
      ...
    }
  }
}
```

and

```
GET /_xpack

{
  ...
  "features" : {
    ...
    "slm" : {
      "available" : true,
      "enabled" : true
    },
    ...
  }
}
```

Relates to #43663

* Fix test expectation
* Fix docs test
dakrone added a commit to dakrone/elasticsearch that referenced this issue on Oct 16, 2019:
This is a backport of elastic#48096. This adds the missing xpack usage and info information into the `/_xpack` and `/_xpack/usage` APIs. The output now looks like:

```
GET /_xpack/usage

{
  ...
  "slm" : {
    "available" : true,
    "enabled" : true,
    "policy_count" : 1,
    "policy_stats" : {
      "retention_runs" : 0,
      ...
    }
  }
}
```

and

```
GET /_xpack

{
  ...
  "features" : {
    ...
    "slm" : {
      "available" : true,
      "enabled" : true
    },
    ...
  }
}
```

Relates to elastic#43663
dakrone added a commit that referenced this issue on Oct 17, 2019:
* Add SLM support to xpack usage and info APIs

This is a backport of #48096. This adds the missing xpack usage and info information into the `/_xpack` and `/_xpack/usage` APIs. The output now looks like:

```
GET /_xpack/usage

{
  ...
  "slm" : {
    "available" : true,
    "enabled" : true,
    "policy_count" : 1,
    "policy_stats" : {
      "retention_runs" : 0,
      ...
    }
  }
}
```

and

```
GET /_xpack

{
  ...
  "features" : {
    ...
    "slm" : {
      "available" : true,
      "enabled" : true
    },
    ...
  }
}
```

Relates to #43663

* Fix missing license
dakrone added a commit that referenced this issue on Oct 17, 2019:
* Add SLM support to xpack usage and info APIs

This is a backport of #48096. This adds the missing xpack usage and info information into the `/_xpack` and `/_xpack/usage` APIs. The output now looks like:

```
GET /_xpack/usage

{
  ...
  "slm" : {
    "available" : true,
    "enabled" : true,
    "policy_count" : 1,
    "policy_stats" : {
      "retention_runs" : 0,
      ...
    }
  }
}
```

and

```
GET /_xpack

{
  ...
  "features" : {
    ...
    "slm" : {
      "available" : true,
      "enabled" : true
    },
    ...
  }
}
```

Relates to #43663

* Fix missing license
I'm going to close this for now as this has been merged and backported for release in 7.5+. We can track further work in separate issues.
This was referenced Feb 3, 2020
SLM as a standalone snapshot taking tool is taking shape as described in #38461. However, to fully utilize SLM, we should implement retention for the snapshots that SLM takes.
Policy definition would change to something like:
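The issue's original example is not reproduced here; a sketch of a retention-enabled policy, using the options discussed in this issue (`expire_after`, `min_count`, `max_count` — names and values illustrative only):

```
PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my_repository",
  "config": { "indices": ["*"] },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
```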
Snapshot retention would kick in based on a schedule (supporting cron expressions), configured with the newly introduced `slm.retention_schedule` cluster setting. This would allow administrators to configure when snapshots are deleted (so as not to interfere with other cluster operations). Potentially, SLM retention would need to cap the amount of time spent deleting snapshots (probably with another cluster setting) so long-running deletes don't cause issues with other cluster operations.
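Since `slm.retention_schedule` was later made a dynamic setting, it can be updated at runtime through the standard cluster settings API, for example (value shown is the default that was eventually chosen):

```
PUT /_cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 30 1 * * ?"
  }
}
```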
Potential list of snapshot conditions:
Some things to work out:
* For the first release, treating PARTIAL as failed and not eligible for retention
* Oldest snapshots will be deleted first
Task Checklist:
* Support `_meta` in `CreateSnapshotRequest` (@gwbrown) — Add custom metadata to snapshots #41281
* Use `_meta` to associate each snapshot with the policy that created it (@gwbrown) — Include SLM policy name in Snapshot metadata #43132
* Add the base framework for retention (feature branch `slm-retention`) (@dakrone) — Add base framework for snapshot retention #43605
* Change `SnapshotLifecyclePolicy` to support retention configuration (@dakrone) — Add SnapshotRetentionConfiguration for retention configuration #43777
* Implement `SnapshotRetentionTask` snapshot deletion (@dakrone) — Implement SnapshotRetentionTask's snapshot filtering and deletion #44764
* Add `min_count` and `max_count` `SnapshotRetentionConfiguration` predicates (@dakrone) — Add min_count and max_count as SLM retention predicates #44926
* Respect ILM's `OperationMode` (@dakrone) — Skip SLM retention if ILM is STOPPING or STOPPED #45869
* Investigate retention of data in snapshots based on document/data age (put into snap meta?) instead of snapshot age — see: Implement retention of snapshots based on the document's timestamp date #45252
* Handle retention of `FAILURE` and `PARTIAL` snapshots — Handle retention of failed and partial snapshots in SLM #46988; (@gwbrown) Manage retention of failed snapshots in SLM #47617
* Add cooldown period in between SLM operations — Add a configurable cooldown period between SLM operations #47520 (@dakrone)