Add support for pause/unpause operation in snapshot repository #48493

bizybot · 2019-10-24T23:03:40Z

Currently, we do not have a way to pause/unpause a snapshot repository for some time. This operation would help users in scenarios where we want to explicitly disable snapshot repository for some time and prevent users from initiating operations like snapshot/restore/delete/repository-cleanup.

We are working on adding client-side encrypted snapshots in #41910. As part of the key rotation step, the master key will be updated and then for all the snapshots for that repository we need to decrypt the old metadata (related to encrypted blobs) with the old key, re-encrypt with the new key and update the store.

When this process is in progress, we anticipate an impact on operations like a snapshot, restore, delete snapshots or repository cleanup. We might be able to continue with the snapshot operations by using the newly updated master key but the restore/delete/cleanup operations still can be impacted.

This enhancement if introduced can have an impact on the working of SLM and discussion/extra handling might be required to be considered.

I am opening this issue to discuss possibilities/options. Thank you.

elasticmachine · 2019-10-24T23:03:41Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

elasticmachine · 2019-10-24T23:03:42Z

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

original-brownbear · 2019-10-25T05:22:31Z

I would hope that we can implement this in a way that doesn't require blocking all snapshot operations while the key rotation is happening. We are currently looking into allowing parallel operations on the repository (i.e. allowing delete, create to run at the same time) and I'd hope we can somehow fold this into that effort.
If we don't allow the key-rotation to run in parallel, we'll introduce a large window during which the user can't snapshot (since rotating the key effectively means re-uploading the whole repo) which might not be desirable?

I think with the recent work in #46250 and upcoming work building on top of it, we should be able to implement key-rotation in parallel to other operations by simply rewriting the data shard by shard and only locking the repo for brief period of time when updating the shard-level metadata.
Concretely I'd imagine key rotation to work like this:

Start key rotation which puts a RepositoryKeyRotationInProgress entry into the CS that keeps track of all shards in the repository in the same way ShardGenerations already does.
Rotate keys shard by shard on the data nodes (those could be holding the blobs/files to be re-encrypted locally already! Saving downloading the blobs if we're a little smart about the allocation of the rotation tasks here) and update the status of the rotation in the CS.

While a RepositoryKeyRotationInProgress is in progress:

We do all new snapshotting/deleting using the new key already
We only have to lock the step of updating the per-shard metadata on a per-shard basis instead of the whole repository

I think we will have to be smart about this anyway in that we probably want to distribute the key-rotation step (just running this on a single (master-)node means a lot of encrypting and data transfer in and out of that single node) so adding the parallelism to this operation when we're looking into parallelising other operations already anyway seems like the way to go here.

bizybot · 2019-10-25T06:56:32Z

Thank you @original-brownbear for your comments.

I would hope that we can implement this in a way that doesn't require blocking all snapshot operations while the key rotation is happening.

Yes, if we can do it then that would be the preferred way.
But considering there other operations like deletion/repo-cleanup I guess it could be problematic doing key rotation and deletion in parallel and may even incur unnecessary work when eventually we are going to delete those files.
Snapshot operation might be fine as it can use the new key but restore will fail (if the key rotation is not yet complete and you use the new key to decrypt). This also introduces too many ways to fail if not done properly.

If we don't allow the key-rotation to run in parallel, we'll introduce a large window during which the user can't snapshot (since rotating the key effectively means re-uploading the whole repo) which might not be desirable?

I think it is desirable to do the key rotation in parallel and if we can do it then awesome.
The key rotation process involves downloading only the metadata files containing encrypted DEK and re-encrypting it with a new KEK. Note that we do not need to do this for the whole encrypted blobs but only for the metadata files containing the data encryption key encrypted using the old master key. So if we do this rotate operation in parallel for all metadata files for all snapshots then it could be faster assuming we can selectively work with the metadata files from the repository.

original-brownbear · 2019-10-25T08:39:28Z

The key rotation process involves downloading only the metadata files containing encrypted DEK and re-encrypting it with a new KEK.

right ... my bad :) Actually now I remember my original idea for this as well from the Google Doc (thanks for refreshing my memory :))
I think we can simply do key ratation without any blocking/locking whatsoever this way:

Rotate key via a CS event, so there is a clear point at which new writes to the repo use the new key.
Keep a reference to the fact that old keys are still used in the CS. As long as this reference in the CS is around the logic should try using the old key for decryption frst and then if it doesn't exist use the new key (this order is important so we don't accidentally read a non-existent new key metablob when it doesn't exist and mess up first-read-after-write consistency on S3).
Simply list out all the meta blobs with the old keys in them (we can do this efficiently by changing the prefix of the meta-blobs when we change keys) and then write updated versions for them.
Remove reference to old keys still being in use from the CS and only use the new-key prefix meta blobs going forward.

But considering there other operations like deletion/repo-cleanup I guess it could be problematic doing key rotation and deletion in parallel and may even incur unnecessary work when eventually we are going to delete those files.

True, but I'd say that's something we can optimize for later. Key rotation is a relatively rare event and how efficient it is in terms of API call counts might not matter too much? I think it's likely the right trade-off to live with some redundant metadata updates but keep the repository fully functional during key-ratation compared to blocking the repository for potentially hours to save a trivial amount of work (logically you could also argue that if you pause deletes to prevent concurrent deletes you'll waste the effort for updating meta-blobs for blobs that you'll delete right after the pause anyway ...).

-> I think the steps in the Google doc and in this comment are still a valid approach here

bizybot · 2019-10-30T23:15:29Z

Thank you @original-brownbear for your inputs.
I see your point and makes sense for us not to pause the whole process as things can be achieved in parallel operations.

I am closing this issue as I do not see a need for this API anymore. Thank you.

bizybot added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs :Data Management/ILM+SLM Index and Snapshot lifecycle management team-discuss labels Oct 24, 2019

bizybot removed the team-discuss label Oct 30, 2019

bizybot closed this as completed Oct 30, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for pause/unpause operation in snapshot repository #48493

Add support for pause/unpause operation in snapshot repository #48493

bizybot commented Oct 24, 2019

elasticmachine commented Oct 24, 2019

elasticmachine commented Oct 24, 2019

original-brownbear commented Oct 25, 2019

bizybot commented Oct 25, 2019

original-brownbear commented Oct 25, 2019

bizybot commented Oct 30, 2019 •

edited

Loading

Add support for pause/unpause operation in snapshot repository #48493

Add support for pause/unpause operation in snapshot repository #48493

Comments

bizybot commented Oct 24, 2019

elasticmachine commented Oct 24, 2019

elasticmachine commented Oct 24, 2019

original-brownbear commented Oct 25, 2019

bizybot commented Oct 25, 2019

original-brownbear commented Oct 25, 2019

bizybot commented Oct 30, 2019 • edited Loading

bizybot commented Oct 30, 2019 •

edited

Loading