
Snapshot lifecycle management #38461

Closed
19 tasks done
dakrone opened this issue Feb 5, 2019 · 15 comments
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >feature release highlight v7.4.0 v8.0.0-alpha1

Comments

@dakrone
Member

dakrone commented Feb 5, 2019

ILM has been included in Elasticsearch and allows us to manage the lifecycle
of an index; however, this lifecycle management does not currently include
periodic snapshots of the index.

In order to provide a full replacement for other periodic cluster management
tools out there (such as Curator), we should add snapshot management to
Elasticsearch.

Ideally this would fall under the same sort of management that ILM provides; the
difference, however, is that snapshots are multi-index, whereas index lifecycle
policies are applied to a single index (and all actions are executed on a single
index).

We need a way of specifying periodic and/or scheduled snapshots of a given set
of indices using a specific repository, perhaps something like this (the entire
API below is made up):

PUT /_slm/policy/snapshot-every-day
{
  // Run this every day at 2:30am
  "schedule": "0 30 2 * * ?",

  // What the snapshot should be named, supporting date-math
  "name": "<production-snap-{now/d}>",

  // Which snapshot repository to use for the snapshot
  "repository": "my-s3-repository",

  // "config" is a map of all the options that the regular snapshot API takes
  "config": {
    "indices": ["foo-*", "important"],
    "ignore_unavailable": true,
    "include_global_state": false
  }
}
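
For reference, the "config" map above corresponds to the body that the existing
create-snapshot API already takes, so a manually triggered equivalent would look
roughly like this (using a concrete snapshot name in place of the date-math
expression):

PUT /_snapshot/my-s3-repository/production-snap-2019.02.05?wait_for_completion=true
{
  "indices": "foo-*,important",
  "ignore_unavailable": true,
  "include_global_state": false
}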

Elasticsearch will then manage taking snapshots of the given indices for the
repository on the schedule specified. The status of the snapshots would have to
be stored somewhere, likely in an index (.tasks perhaps?)

Some other things that would be nice (but not required) to support:

  • Snapshots every N minutes, where N only starts counting from the completion of
    the previous snapshot (for example, with a snapshot every 30 minutes that takes 4
    minutes to complete, one snapshot would start at 00:00 and the next at
    00:34, 30 minutes after the completion of the previous snapshot).
  • Retention of snapshots, specifying something like "max_count": 10 to keep the
    last 10 snapshots, or "max_age": "7d" to keep a week's worth of snapshots; the
    deletion of old snapshots would be managed by ES (see the sketch after this list).
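
As a sketch of the retention idea (reusing the made-up policy API above; none of
these option names are final), it could be expressed as an extra block in the
policy:

PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"],
    "ignore_unavailable": true,
    "include_global_state": false
  },

  // Hypothetical retention options
  "retention": {
    // Keep at most the last 10 snapshots
    "max_count": 10,
    // Delete snapshots older than a week
    "max_age": "7d"
  }
}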

Task Checklist

@dakrone dakrone added >feature :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Feb 5, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@jakelandis jakelandis added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Feb 5, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dakrone dakrone self-assigned this Feb 26, 2019
dakrone added a commit to dakrone/elasticsearch that referenced this issue Mar 12, 2019
This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling snapshots
based on each policy's defined schedule.

This also includes the get, put, and delete APIs for these policies.

Relates to elastic#38461
dakrone added a commit to dakrone/elasticsearch that referenced this issue Mar 14, 2019
This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.

Relates to elastic#38461
dakrone added a commit that referenced this issue Mar 18, 2019
(Note this is a PR against the `snapshot-lifecycle-management` feature branch)

This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.

Relates to #38461
@AlexP-Elastic

AlexP-Elastic commented Mar 19, 2019

@dakrone asked the cloud team to discuss our requirements, with a view to potentially replacing our snapshot logic down the road. I'm going to talk about how it currently works and then very quickly summarize in a list of (high level) requirements at the bottom:

  • We take snapshots at user-specified periods
    • actually an oft-requested user requirement (that we haven't implemented, but should definitely be part of any "blank sheet of paper" approach) is the ability to make it a schedule, e.g. "*:30" instead of "every 30 minutes"
  • Our retention policy is quite (over!) complicated; a simplified summary is:
    • The user decides how many snapshots to keep (default is 100) and the snapshot period (30 mins)
    • If a snapshot fails or partially fails, a minimum number of good snapshots is retained, regardless of any other considerations
    • (there are some other retention params, but these are not currently exposed to the cluster admin)
    • Purging of "expired" snapshots is handled as follows:
      • After a snapshot has been taken (i.e. at the user-specified interval), we look for the oldest snapshots that do not satisfy the "retain" rules and delete one or more of them, up to an admin-configurable maximum, which defaults to 2
        • (the reason for this maximum is that you can't cancel a snapshot deletion, which results in some complications described below)
        • (there's some logic to increase this number where it is safe to do so that I haven't looked at yet!)
  • The previous two operations (snapshot and delete) are performed by (a sidecar attached to) one of the nodes
  • One of the complications is that when a user (or admin or bot) performs a configuration change to an Elasticsearch cluster (handled by a service called the "constructor"), it takes a "safety" snapshot first.
    • If the "snapshotter service" described above is taking a snapshot, the "constructor" will wait for that to complete and use that instead.
    • If the "snapshotter service" is in the middle of deleting a snapshot then the "safety snapshot" will fail after a few (3) retries, causing the entire configuration to fail (this is obviously highly undesirable)
    • Before starting, the "constructor service" will write a flag telling the "snapshotter service" to pause
  • We don't currently have any ability to control which indices are included in the snapshot (though that gets asked for a decent amount)

The Cloud requirements I infer:

  • Configurable snapshot schedule and retention policy
  • The ability to pause the SLM process during cluster configuration (a rough sketch of what this could look like follows this list)
  • The ability to control the time spent deleting snapshots (i.e. to avoid conflicts with cluster configuration)
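
As a rough sketch of the pause requirement (the endpoints below are purely
hypothetical; nothing like this exists at the time of writing), an explicit
stop/start pair that the constructor could call around configuration changes
might be enough:

// Pause scheduled snapshots and retention before a configuration change (hypothetical endpoint)
POST /_slm/stop

// Resume once the configuration change has completed (hypothetical endpoint)
POST /_slm/start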

cc @nordbergm / @paulcoghlan - feel free to add/correct/amend anything you think is useful. Don't reply, just edit the comment directly.

@nordbergm

nordbergm commented Mar 19, 2019

Snapshot Resiliency

One of the major things for me is the snapshot resiliency work we've been doing for the past 6 months or more. This effectively boils down to the challenges S3 eventual consistency has caused us with corrupt snapshots, and the cool down periods we've had to introduce as a result.

  • Every write operation to the repository (create snapshot, delete snapshot) has to be followed by a cool down period (currently 10 minutes; we might experiment with reducing this to 5) to allow S3 caches to expire between operations.
  • The constructor also adheres to these cool down periods and won't create a safety snapshot until the period has passed (this is still behind a feature flag, but will be enabled once we've worked out a couple of issues).
  • We now use a ZK leader latch to coordinate this as well as shared snapshot task state in ZK. This allows the constructor and snapshot sidecar to coordinate operations against the repository and adhere to the cool down period. The flag @AlexP-Elastic mentioned will be superseded by this once the feature flag is enabled.
  • Because of cool down periods, a strict *:30 schedule is difficult to guarantee (unless you're willing to skip an interval if you run over)

These are very specific Cloud/S3 challenges. I don't believe GCP necessarily has the same issues, because GCS is way more consistent. Still, if we don't consider them, I worry we'll suffer snapshot corruption again.

Access Control

Another area I've been thinking about is access control. Snapshots in Cloud today are controlled by cloud admins and can't easily be meddled with by the cluster admin. Cluster admins can reduce the retention to a minimum of 2 snapshots, and they can disable snapshots if they go through support and understand the risks, etc.

With ILM/SLM it would be good to understand what kind of access cluster admins would have to the configuration and how we could restrict it. In case of disaster we want to be sure the cluster has snapshots and that the cluster admin hasn't broken the configuration by accident.
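
For illustration only: if SLM configuration ends up gated behind dedicated cluster privileges, a read-only role for cluster admins could look something like this (the privilege name here is an assumption, not something that exists today):

PUT /_security/role/slm_read_only
{
  // Hypothetical privilege allowing SLM policies to be viewed but not changed
  "cluster": [ "read_slm" ]
}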

dakrone added a commit to dakrone/elasticsearch that referenced this issue Mar 22, 2019
This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.

This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.

Relates to elastic#38461
dakrone added a commit that referenced this issue Mar 26, 2019
(This is a PR for the `snapshot-lifecycle-management` branch)

This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.

This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.

Relates to #38461
@yaronp68

@dakrone Would it be possible to add some metadata argument to the create policy API that would help the user figure out which policy created a snapshot? Two options come to mind:

  1. Policies have a unique name, and each snapshot gets a "created by" field containing either "API_call" or the policy name
  2. Alternatively, a metadata field with a maximum of X characters that the user fills in as an optional description when creating the policy, and which the policy adds to each snapshot as a metadata field (a rough sketch follows)
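
A rough sketch of option 2, reusing the made-up policy API from the issue description (the metadata field is hypothetical):

PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"],
    // Hypothetical free-form description that the policy would attach
    // to every snapshot it takes
    "metadata": {
      "description": "nightly production snapshot"
    }
  }
}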

@dakrone
Member Author

dakrone commented Mar 28, 2019

@yaronp68 interesting suggestion, I do think that would be useful.

@original-brownbear what do you think about us adding something like that to the CreateSnapshotRequest? I'm not sure exactly what the backwards compatibility issues could be (I assume nothing too difficult to work around).

@original-brownbear
Member

@dakrone @yaronp68

Technically speaking, there is no reason not to add a metadata field to the snapshot in the cluster state and store it in the repository, as far as I can see.
The question is whether it's worth the added complexity, I guess :) I'm not against it, but if we can do it without adding more complexity to the cluster state and repository, that may be better.
=> my question: If you're already planning to "Persist a history of successful/failed snapshots in an ES index", why not just add the metadata for each snapshot to the history in that index?

@yaronp68

@dakrone maybe it's possible to persist to a metadata file in the repository and not in cluster state to avoid changes to cluster state

@dakrone
Member Author

dakrone commented Mar 29, 2019

my question: If you're already planning to "Persist a history of successful/failed snapshots in an ES index", why not just add the metadata for each snapshot to the history in that index?

@original-brownbear I believe the idea is that when listing snapshots for a repository, you could then tell which snapshot came from what (manually triggered, triggered via policyA, policyB, etc). We will have something on the other side (for a policy, what's the last snapshot taken); I think the desire was for something in the other direction.

@dakrone maybe it's possible to persist to a metadata file in the repository and not in cluster state to avoid changes to cluster state

@yaronp68 we need to persist at least one end state in the cluster state, because in the event of a snapshot failure we wouldn't be able to persist it in a metadata file in the repo (the snapshot failed :) ), so we need somewhere to put something like "your snapshot failed because of XYZ" for users to see.

@original-brownbear
Member

@yaronp68

maybe it's possible to persist to a metadata file in the repository and not in cluster state to avoid changes to cluster state

I would rather we not do this, sorry. The repository is currently undergoing some redesign to resolve issues like #38941.
If we start putting custom blobs in the repo, that's going to be one more thing to worry about when we make changes there. Plus, the eventually consistent nature of some blob stores like S3 will also create problems for a metadata file/blob that would be read and updated, I would assume?

-> the private index for the snapshot history seems like the safest bet to me still. If that's not an option for some reason the cluster state is still the better option compared to a custom repository blob.

@dakrone
Member Author

dakrone commented Mar 29, 2019

the private index for the snapshot history seems like the safest bet to me still. If that's not an option for some reason the cluster state is still the better option compared to a custom repository blob.

We are planning to store the latest success and failure in the cluster state (only one of each), and to store the result of every snapshot invocation in an index for history/alerting purposes.
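
Purely as an illustration of that split, the per-policy view could then surface only the most recent outcomes, along these lines (the field names are assumptions):

GET /_slm/policy/snapshot-every-day
{
  "snapshot-every-day": {
    "policy": { ... },
    // Only the latest success and latest failure live in the cluster state;
    // the full record of every run goes to the history index
    "last_success": {
      "snapshot_name": "production-snap-2019.03.29",
      "time": 1553826600000
    },
    "last_failure": {
      "snapshot_name": "production-snap-2019.03.28",
      "details": "...reason the snapshot failed...",
      "time": 1553740200000
    }
  }
}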

@original-brownbear
Member

I believe the idea is that when listing snapshots for a repository, you could then tell which snapshot came from what (manually triggered, triggered via policyA, policyB, etc). We will have something on the other side (for a policy, what's the last snapshot taken); I think the desire was for something in the other direction.

I see. In that case I think adding this information to the cluster state (and then, as a result, to the snapshot metadata we store in the repository) may be an option. In the end, the repository is the only place we can store that metadata if we want to be able to use it with the snapshot list.
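
For example, the snapshot listing could then carry the originating policy with each entry, roughly like this (the metadata shown here is hypothetical):

GET /_snapshot/my-s3-repository/_all
{
  "snapshots": [
    {
      "snapshot": "production-snap-2019.03.29",
      "state": "SUCCESS",
      // Hypothetical metadata recording what created the snapshot
      "metadata": {
        "policy": "snapshot-every-day"
      }
    }
  ]
}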

@dakrone
Member Author

dakrone commented Jul 30, 2019

Going to close this as SLM has been merged to master and 7.x and will be in the 7.4 release.

Further work on retention can be found at #43663

@dakrone dakrone closed this as completed Jul 30, 2019
dakrone added a commit that referenced this issue Sep 9, 2019
This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.

An example policy would look like:

```
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
```

SLM Retention is run on a schedule configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable amount of time, bounded by the `slm.retention_duration` setting, which defaults to 1 hour.
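
For example, both are dynamic settings and can be adjusted through the cluster settings API (the values here are illustrative):

```json
PUT /_cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 30 1 * * ?",
    "slm.retention_duration": "30m"
  }
}
```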

Included in this work is a new SLM stats API endpoint available through

``` json
GET /_slm/stats
```

That returns statistics about snapshots taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.

* Add base framework for snapshot retention (#43605)

* Add base framework for snapshot retention

This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.

Relates to #38461

* Remove extraneous 'public'

* Use a local var instead of reading class var repeatedly

* Add SnapshotRetentionConfiguration for retention configuration (#43777)

* Add SnapshotRetentionConfiguration for retention configuration

This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.

Relates to #43663

* Fix REST tests

* Fix more documentation

* Use Objects.equals to avoid NPE

* Put `randomSnapshotLifecyclePolicy` in only one place

* Occasionally return retention with no configuration

* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)

* Implement SnapshotRetentionTask's snapshot filtering and deletion

This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.

Relates to #43663

* Fix deletes running on the wrong thread

* Handle missing or null policy in snap metadata differently

* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>

* Use the `OriginSettingClient` to work with security, enhance logging

* Prevent NPE in test by mocking Client

* Allow empty/missing SLM retention configuration (#45018)

Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.

Relates to #43663

* Add min_count and max_count as SLM retention predicates (#44926)

This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets these criteria
to SLM's retention feature.

These options are optional and one, two, or all three can be specified
in an SLM policy.

Relates to #43663

* Time-bound deletion of snapshots in retention delete function (#45065)

* Time-bound deletion of snapshots in retention delete function

With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deleting snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.

Relates to #43663

* Wow snapshots suuuure can take a long time.

* Use a LongSupplier instead of actually sleeping

* Remove TestLogging annotation

* Remove rate limiting

* Add SLM metrics gathering and endpoint (#45362)

* Add SLM metrics gathering and endpoint

This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and persisted in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.

This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:

```
GET /_slm/stats
{
  "retention_runs" : 13,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "1.4s",
  "retention_deletion_time_millis" : 1404,
  "policy_metrics" : {
    "daily-snapshots2" : {
      "snapshots_taken" : 7,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 6,
      "snapshot_deletion_failures" : 0
    },
    "daily-snapshots" : {
      "snapshots_taken" : 12,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 12,
      "snapshot_deletion_failures" : 6
    }
  },
  "total_snapshots_taken" : 19,
  "total_snapshots_failed" : 0,
  "total_snapshots_deleted" : 18,
  "total_snapshot_deletion_failures" : 6
}
```

This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.

Relates to #43663

* Version qualify serialization

* Initialize counters outside constructor

* Use computeIfAbsent instead of being too verbose

* Move part of XContent generation into subclass

* Fix REST action for master merge

* Unused import

*  Record history of SLM retention actions (#45513)

This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.

* Retry SLM retention after currently running snapshot completes (#45802)

* Retry SLM retention after currently running snapshot completes

This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete; however, the waiting time is not factored into
the limit on actual deletions.

Relates to #43663

* Increase timeout waiting for snapshot completion

* Apply patch

From https://github.com/original-brownbear/elasticsearch/commit/2374316f0d1912c9e1498bece195546a1dc60bce.patch

* Rename test variables

* [TEST] Be less strict for stats checking

* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)

This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.

Relates to #43663

* Check all actions preventing snapshot delete during retention (#45992)

* Check all actions preventing snapshot delete during retention run

Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:

- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running

This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.

Relates to #43663

* Add unit test for okayToDeleteSnapshots

* Fix bug where SLM retention task would be scheduled on every node

* Enhance test logging

* Ignore if snapshot is already deleted

* Missing import

* Fix SnapshotRetentionServiceTests

* Expose SLM policy stats in get SLM policy API (#45989)

This also adds support for the SLM stats endpoint to the high level rest client.

Retrieving a policy now looks like:

```json
{
  "daily-snapshots" : {
    "version": 1,
    "modified_date": "2019-04-23T01:30:00.000Z",
    "modified_date_millis": 1556048137314,
    "policy" : {
      "schedule": "0 30 1 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "my_repository",
      "config": {
        "indices": ["data-*", "important"],
        "ignore_unavailable": false,
        "include_global_state": false
      },
      "retention": {}
    },
    "stats": {
      "snapshots_taken": 0,
      "snapshots_failed": 0,
      "snapshots_deleted": 0,
      "snapshot_deletion_failures": 0
    },
    "next_execution": "2019-04-24T01:30:00.000Z",
    "next_execution_millis": 1556048160000
  }
}
```

Relates to #43663

* Rewrite SnapshotLifecycleIT as an ESIntegTestCase (#46356)

* Rewrite SnapshotLifecycleIT as an ESIntegTestCase

This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.

Relates to #43663
Resolves #46205

* Add error logging when exceptions are thrown
dakrone added a commit that referenced this issue Sep 10, 2019
* Add retention to Snapshot Lifecycle Management (#46407)

* Update serialization versions

* Fix type inference

* Use non-Cancellable HLRC return value

* Fix Client mocking in test

* Fix SLMSnapshotBlockingIntegTests for 7.x branch

* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval

* Add serialization guards for SnapshotLifecyclePolicy
dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 17, 2019
This adds the `xpack.slm.enabled` setting to allow disabling of SLM
functionality as well as its HTTP API endpoints.

Relates to elastic#38461
dakrone added a commit that referenced this issue Sep 17, 2019
This adds the `xpack.slm.enabled` setting to allow disabling of SLM
functionality as well as its HTTP API endpoints.

Relates to #38461
dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 24, 2019
This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.

For example, previously a snapshot would be created and then registered
as a "success", however, the snapshot may have been aborted, or it may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now stores the action as a failure instead of a success.

Relates to elastic#38461 and elastic#43663
dakrone added a commit that referenced this issue Sep 25, 2019
* Wait for snapshot completion in SLM snapshot invocation

This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.

For example, previously a snapshot would be created and then registered
as a "success", however, the snapshot may have been aborted, or it may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now stores the action as a failure instead of a success.

Relates to #38461 and #43663