Retention for Snapshot Lifecycle Management #43663

dakrone · 2019-06-26T21:17:42Z

SLM as a standalone snapshot-taking tool is taking shape as described in #38461. However, to fully utilize SLM, we should implement retention for the snapshots that SLM takes.

Policy definition would change to something like:

PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}

Snapshot retention would run on a schedule (a cron expression) configured with the newly introduced `slm.retention_schedule` cluster setting. This would let administrators control when snapshots are deleted, so that deletion does not interfere with other cluster operations.

Potentially, SLM retention would need to cap the amount of time spent deleting snapshots (probably with another cluster setting) so long-running deletes don't cause issues with other cluster operations.

Potential list of snapshot conditions:

age-based retention (delete snapshots after N days)
minimum number of snapshots to keep
maximum number of snapshots to allow (delete oldest if there are too many)
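
A minimal Python sketch of how the three conditions above could combine. The function name, the `Snapshot` type, and the precedence rules (the `min_count` newest snapshots are always kept; anything beyond `max_count` or older than `expire_after` is deleted) are assumptions made for illustration, not the SLM implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Snapshot:
    # Hypothetical stand-in for a snapshot's metadata.
    name: str
    taken_at: datetime


def snapshots_to_delete(
    snapshots: List[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta] = None,
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> List[Snapshot]:
    """Return the snapshots a retention run would delete, oldest first."""
    newest_first = sorted(snapshots, key=lambda s: s.taken_at, reverse=True)
    doomed = []
    for rank, snap in enumerate(newest_first):
        # Assumption: the `min_count` most recent snapshots are never deleted.
        if min_count is not None and rank < min_count:
            continue
        too_many = max_count is not None and rank >= max_count
        too_old = expire_after is not None and now - snap.taken_at > expire_after
        if too_many or too_old:
            doomed.append(snap)
    doomed.reverse()  # oldest snapshots come first in the result
    return doomed
```

With one snapshot per day for 40 days and the example policy values (`expire_after` of 14 days, `max_count` of 30, `min_count` of 4), this sketch selects the 25 snapshots that are more than 14 days old.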

Some things to work out

What should we do with FAILED/PARTIAL snapshots? Should they be subject to retention, or have separate retention?

For the first release, PARTIAL snapshots are treated as failed and are not eligible for retention.

Are there retry policies for deletion, or should we wait for the next invocation of the retention task?

Does the order of old snapshot deletion matter?

Oldest snapshots will be deleted first.

elasticmachine · 2019-06-26T21:17:45Z

Pinging @elastic/es-core-features

dakrone · 2019-06-26T21:18:21Z

/cc @cjcenizal @jen-huang

* Add SnapshotRetentionConfiguration for retention configuration

  This commit adds the `SnapshotRetentionConfiguration` class and its HLRC counterpart to encapsulate the configuration for SLM retention. Currently only a single parameter is supported as an example (we still need to discuss the different options we want to support and their names) to keep the size of the PR down. It also does not yet include version serialization checks since the original SLM branch has not yet been merged.

  Relates to #43663

* Fix REST tests
* Fix more documentation
* Use Objects.equals to avoid NPE
* Put `randomSnapshotLifecyclePolicy` in only one place
* Occasionally return retention with no configuration

* Implement SnapshotRetentionTask's snapshot filtering and deletion

  This commit implements the snapshot filtering and deletion for `SnapshotRetentionTask`. Currently only the expire-after age is used for determining whether a snapshot is eligible for deletion.

  Relates to #43663

* Fix deletes running on the wrong thread
* Handle missing or null policy in snap metadata differently
* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>
* Use the `OriginSettingClient` to work with security, enhance logging
* Prevent NPE in test by mocking Client

This adds the configuration options for `min_count` and `max_count`, as well as the logic for determining whether a snapshot meets these criteria, to SLM's retention feature. All three retention options (`expire_after`, `min_count`, and `max_count`) are optional; one, two, or all three can be specified in an SLM policy. Relates to elastic#43663

Semi-related to elastic#44465, this allows the `"retention"` configuration map to be missing. Relates to elastic#43663

* Time-bound deletion of snapshots in retention delete function

  With a cluster that has a large number of snapshots, it's possible that snapshot deletion can take a very long time (especially since deletes currently have to happen in a serial fashion). To prevent snapshot deletion from taking forever in a cluster and blocking other operations, this commit adds a setting to allow configuring a maximum time to spend deleting snapshots during retention. This dynamic setting defaults to 1 hour and is best-effort, meaning that it doesn't hard stop a deletion at the one-hour mark, but ensures that once the time has passed, all subsequent deletions are deferred until the next retention cycle.

  Relates to #43663

* Wow snapshots suuuure can take a long time.
* Use a LongSupplier instead of actually sleeping
* Remove TestLogging annotation
* Remove rate limiting
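
The best-effort time bound described above can be sketched in Python as follows. The function name and parameters are hypothetical; the injected `clock` plays the role the commit gives to a `LongSupplier`, so a test can advance time without sleeping.

```python
import time
from typing import Callable, Iterable, List


def delete_with_time_budget(
    snapshots: Iterable[str],
    delete: Callable[[str], None],
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Delete snapshots serially until the time budget is used up.

    Returns the names that were skipped and are left for the next run.
    """
    started = clock()
    deferred = []
    for name in snapshots:
        # The budget is checked only between deletions, so a delete that is
        # already running is never interrupted (best-effort bound).
        if clock() - started >= max_seconds:
            deferred.append(name)
            continue
        delete(name)
    return deferred
```

If each deletion takes 40 minutes and the budget is one hour, the second deletion starts at the 40-minute mark and finishes after the hour has passed; every later snapshot is deferred.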

* Add SLM metrics gathering and endpoint

  This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster takes. These actions are stored in `SnapshotLifecycleStats` and persisted in cluster state.

  The stats stored include the number of snapshots taken, failed, deleted, the number of retention runs, as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount of time spent deleting snapshots from SLM retention.

  This commit also adds an endpoint for retrieving all stats (further commits will expose this in the SLM get-policy API) that looks like:

  ```
  GET /_slm/stats
  {
    "retention_runs" : 13,
    "retention_failed" : 0,
    "retention_timed_out" : 0,
    "retention_deletion_time" : "1.4s",
    "retention_deletion_time_millis" : 1404,
    "policy_metrics" : {
      "daily-snapshots2" : {
        "snapshots_taken" : 7,
        "snapshots_failed" : 0,
        "snapshots_deleted" : 6,
        "snapshot_deletion_failures" : 0
      },
      "daily-snapshots" : {
        "snapshots_taken" : 12,
        "snapshots_failed" : 0,
        "snapshots_deleted" : 12,
        "snapshot_deletion_failures" : 6
      }
    },
    "total_snapshots_taken" : 19,
    "total_snapshots_failed" : 0,
    "total_snapshots_deleted" : 18,
    "total_snapshot_deletion_failures" : 6
  }
  ```

  This does not yet include HLRC for this, as this commit is quite large on its own. That will be added in a subsequent commit.

  Relates to #43663

* Version qualify serialization
* Initialize counters outside constructor
* Use computeIfAbsent instead of being too verbose
* Move part of XContent generation into subclass
* Fix REST action for master merge
* Unused import
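
As a rough illustration of the per-policy counters, here is a small Python sketch. The class and method names are hypothetical, and deriving the `total_*` values by summing the per-policy counters is an assumption that merely agrees with the sample response above; `SnapshotLifecycleStats` may track totals differently.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict

# Counter names taken from the sample response.
COUNTERS = (
    "snapshots_taken",
    "snapshots_failed",
    "snapshots_deleted",
    "snapshot_deletion_failures",
)


class SlmStatsSketch:
    """Hypothetical in-memory model of per-policy SLM counters."""

    def __init__(self) -> None:
        self.policy_metrics: DefaultDict[str, Counter] = defaultdict(Counter)

    def record(self, policy: str, counter: str, n: int = 1) -> None:
        if counter not in COUNTERS:
            raise ValueError(f"unknown counter: {counter}")
        self.policy_metrics[policy][counter] += n

    def to_dict(self) -> Dict[str, object]:
        out: Dict[str, object] = {
            "policy_metrics": {
                policy: {c: metrics[c] for c in COUNTERS}
                for policy, metrics in self.policy_metrics.items()
            }
        }
        # Assumption: each total is the sum of that counter over all policies.
        for c in COUNTERS:
            out["total_" + c] = sum(m[c] for m in self.policy_metrics.values())
        return out
```

Recording the per-policy numbers from the sample (7 and 12 snapshots taken, 6 and 12 deleted, 0 and 6 deletion failures) yields totals of 19, 18, and 6.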

This adds a default for the `slm.retention_schedule` setting, setting it to `0 30 1 * * ?`, which is 1:30am every day. Having retention unset meant that it would never be invoked to clean up snapshots. We determined it would be better to have a default than for retention never to run. When coming to a decision, we weighed the option of an absolute time (such as 1:30am) versus a periodic invocation (like every 12 hours). In the end we decided on the absolute time because it has better predictability and consistency than a periodic invocation, which would depend on when the master node was elected or restarted. Relates to elastic#43663
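
The trade-off can be made concrete with a small Python sketch. Both helper functions are hypothetical, they do not parse cron expressions, and they use naive datetimes because the time zone of the schedule is not stated here.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now: datetime, at: time = time(1, 30)) -> datetime:
    """Next run of an absolute daily schedule such as 1:30am."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def next_periodic_run(started: datetime, now: datetime, every: timedelta) -> datetime:
    """Next run of a periodic schedule anchored at `started`
    (for example, the moment the master node was elected)."""
    periods = (now - started) // every + 1
    return started + periods * every
```

For the absolute schedule, the next run at 09:00 on any day is always 01:30 the following day. For a 12-hour periodic schedule, the next run at 09:00 is 15:17 if the master was elected at 03:17 but 17:00 if it was elected at 05:00, which is the unpredictability the description refers to.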

This enhances the existing SLM test using users/roles/etc to also test that SLM retention works when security is enabled. Relates to elastic#43663

* Add Snapshot Lifecycle Retention documentation

  This commit adds API and general purpose documentation for SLM retention.

  Relates to #43663

* Fix docs tests
* Update default now that #47604 has been merged
* Update docs/reference/ilm/apis/slm-api.asciidoc

  Co-Authored-By: Gordon Brown

* Update docs/reference/ilm/apis/slm-api.asciidoc

  Co-Authored-By: Gordon Brown

* Update docs with feedback

* Separate SLM stop/start/status API from ILM

  This separates a start/stop/status API for SLM from being tied to ILM's operation mode. These APIs look like:

  ```
  POST /_slm/stop
  POST /_slm/start
  GET /_slm/status
  ```

  This allows administrators to have fine-grained control over preventing periodic snapshots and deletions while performing cluster maintenance.

  Relates to #43663

* Allow going from RUNNING to STOPPED
* Align with the OperationMode rules
* Fix slmStopping method
* Make OperationModeUpdateTask constructor private
* Wipe snapshots better in test

* Add SLM support to xpack usage and info APIs

  This adds the missing xpack usage and info information into the `/_xpack` and `/_xpack/usage` APIs. The output now looks like:

  ```
  GET /_xpack/usage
  {
    ...
    "slm" : {
      "available" : true,
      "enabled" : true,
      "policy_count" : 1,
      "policy_stats" : {
        "retention_runs" : 0,
        ...
      }
    }
  }
  ```

  and

  ```
  GET /_xpack
  {
    ...
    "features" : {
      ...
      "slm" : {
        "available" : true,
        "enabled" : true
      },
      ...
    }
  }
  ```

  Relates to #43663

* Fix test expectation
* Fix docs test

* Add SLM support to xpack usage and info APIs

  This is a backport of #48096 and carries the same description and example output.

  Relates to #43663

* Fix missing license

dakrone · 2019-10-21T14:41:57Z

I'm going to close this for now as this has been merged and backported for release in 7.5+. We can track further work in separate issues.

dakrone added >feature :Data Management/ILM+SLM Index and Snapshot lifecycle management 7x labels Jun 26, 2019

dakrone assigned dakrone and gwbrown Jun 26, 2019

dakrone mentioned this issue Jun 26, 2019

Snapshot lifecycle management #38461

Closed

19 tasks

dakrone mentioned this issue Jun 28, 2019

Add SnapshotRetentionConfiguration for retention configuration #43777

Merged

gwbrown mentioned this issue Jul 5, 2019

Add Snapshot Lifecycle Management #43934

Merged

dakrone mentioned this issue Jul 23, 2019

Implement SnapshotRetentionTask's snapshot filtering and deletion #44764

Merged

dakrone mentioned this issue Jul 26, 2019

Add min_count and max_count as SLM retention predicates #44926

Merged

dakrone added a commit to dakrone/elasticsearch that referenced this issue Jul 30, 2019

Allow empty/missing SLM retention configuration

5ba2fcc

Semi-related to elastic#44465, this allows the `"retention"` configuration map to be missing. Relates to elastic#43663

dakrone mentioned this issue Jul 30, 2019

Allow empty/missing SLM retention configuration #45018

Merged

dakrone added a commit that referenced this issue Jul 31, 2019

Allow empty/missing SLM retention configuration (#45018)

67ff2ef

Semi-related to #44465, this allows the `"retention"` configuration map to be missing. Relates to #43663

This was referenced Jul 31, 2019

Time-bound deletion of snapshots in retention delete function #45065

Merged

Implement retention of snapshots based on the document's timestamp date #45252

Closed

dakrone mentioned this issue Aug 8, 2019

Add SLM metrics gathering and endpoint #45362

Merged

gwbrown mentioned this issue Aug 13, 2019

Record history of SLM retention actions #45513

Merged

dakrone mentioned this issue Oct 3, 2019

Add Snapshot Lifecycle Retention documentation #47545

Merged

dakrone mentioned this issue Oct 4, 2019

Set default SLM retention invocation time #47604

Merged

dakrone added a commit to dakrone/elasticsearch that referenced this issue Oct 4, 2019

Add a test for SLM retention with security enabled

03c86e9

This enhances the existing SLM test using users/roles/etc to also test that SLM retention works when security is enabled. Relates to elastic#43663

dakrone mentioned this issue Oct 4, 2019

Add a test for SLM retention with security enabled #47608

Merged

dakrone mentioned this issue Oct 7, 2019

Separate SLM stop/start/status API from ILM #47710

Merged

dakrone added a commit that referenced this issue Oct 8, 2019

Add a test for SLM retention with security enabled (#47608)

e0c2ac1

This enhances the existing SLM test using users/roles/etc to also test that SLM retention works when security is enabled. Relates to #43663

dakrone added a commit that referenced this issue Oct 8, 2019

Add a test for SLM retention with security enabled (#47608)

906be45

This enhances the existing SLM test using users/roles/etc to also test that SLM retention works when security is enabled. Relates to #43663

This was referenced Oct 9, 2019

Manage retention of partial snapshots in SLM (Simple version) #47833

Merged

SLM Start/Stop HLRC and docs #47966

Merged

dakrone mentioned this issue Oct 15, 2019

Add SLM support to xpack usage and info APIs #48096

Merged

dakrone mentioned this issue Oct 16, 2019

Add SLM support to xpack usage and info APIs #48149

Merged

dakrone mentioned this issue Oct 16, 2019

Add SLM support to xpack usage and info APIs #48150

Merged

dakrone closed this as completed Oct 21, 2019

Mpdreamz mentioned this issue Nov 19, 2019

[meta] 7.5 release elastic/elasticsearch-net#4232

Closed

24 tasks

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed
