Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for pause/unpause operation in snapshot repository #48493

Closed
bizybot opened this issue Oct 24, 2019 · 6 comments
Closed

Add support for pause/unpause operation in snapshot repository #48493

bizybot opened this issue Oct 24, 2019 · 6 comments
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement

Comments

@bizybot
Copy link
Contributor

bizybot commented Oct 24, 2019

Currently, we do not have a way to pause/unpause a snapshot repository for some time. This operation would help users in scenarios where we want to explicitly disable snapshot repository for some time and prevent users from initiating operations like snapshot/restore/delete/repository-cleanup.

We are working on adding client-side encrypted snapshots in #41910. As part of the key rotation step, the master key will be updated and then for all the snapshots for that repository we need to decrypt the old metadata (related to encrypted blobs) with the old key, re-encrypt with the new key and update the store.

When this process is in progress, we anticipate an impact on operations like a snapshot, restore, delete snapshots or repository cleanup. We might be able to continue with the snapshot operations by using the newly updated master key but the restore/delete/cleanup operations still can be impacted.

This enhancement if introduced can have an impact on the working of SLM and discussion/extra handling might be required to be considered.

I am opening this issue to discuss possibilities/options. Thank you.

@bizybot bizybot added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs :Data Management/ILM+SLM Index and Snapshot lifecycle management team-discuss labels Oct 24, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@original-brownbear
Copy link
Member

I would hope that we can implement this in a way that doesn't require blocking all snapshot operations while the key rotation is happening. We are currently looking into allowing parallel operations on the repository (i.e. allowing delete, create to run at the same time) and I'd hope we can somehow fold this into that effort.
If we don't allow the key-rotation to run in parallel, we'll introduce a large window during which the user can't snapshot (since rotating the key effectively means re-uploading the whole repo) which might not be desirable?

I think with the recent work in #46250 and upcoming work building on top of it, we should be able to implement key-rotation in parallel to other operations by simply rewriting the data shard by shard and only locking the repo for brief period of time when updating the shard-level metadata.
Concretely I'd imagine key rotation to work like this:

  1. Start key rotation which puts a RepositoryKeyRotationInProgress entry into the CS that keeps track of all shards in the repository in the same way ShardGenerations already does.
  2. Rotate keys shard by shard on the data nodes (those could be holding the blobs/files to be re-encrypted locally already! Saving downloading the blobs if we're a little smart about the allocation of the rotation tasks here) and update the status of the rotation in the CS.

While a RepositoryKeyRotationInProgress is in progress:

  • We do all new snapshotting/deleting using the new key already
  • We only have to lock the step of updating the per-shard metadata on a per-shard basis instead of the whole repository

I think we will have to be smart about this anyway in that we probably want to distribute the key-rotation step (just running this on a single (master-)node means a lot of encrypting and data transfer in and out of that single node) so adding the parallelism to this operation when we're looking into parallelising other operations already anyway seems like the way to go here.

@bizybot
Copy link
Contributor Author

bizybot commented Oct 25, 2019

Thank you @original-brownbear for your comments.

I would hope that we can implement this in a way that doesn't require blocking all snapshot operations while the key rotation is happening.

Yes, if we can do it then that would be the preferred way.
But considering there other operations like deletion/repo-cleanup I guess it could be problematic doing key rotation and deletion in parallel and may even incur unnecessary work when eventually we are going to delete those files.
Snapshot operation might be fine as it can use the new key but restore will fail (if the key rotation is not yet complete and you use the new key to decrypt). This also introduces too many ways to fail if not done properly.

If we don't allow the key-rotation to run in parallel, we'll introduce a large window during which the user can't snapshot (since rotating the key effectively means re-uploading the whole repo) which might not be desirable?

I think it is desirable to do the key rotation in parallel and if we can do it then awesome.
The key rotation process involves downloading only the metadata files containing encrypted DEK and re-encrypting it with a new KEK. Note that we do not need to do this for the whole encrypted blobs but only for the metadata files containing the data encryption key encrypted using the old master key. So if we do this rotate operation in parallel for all metadata files for all snapshots then it could be faster assuming we can selectively work with the metadata files from the repository.

@original-brownbear
Copy link
Member

The key rotation process involves downloading only the metadata files containing encrypted DEK and re-encrypting it with a new KEK.

right ... my bad :) Actually now I remember my original idea for this as well from the Google Doc (thanks for refreshing my memory :))
I think we can simply do key ratation without any blocking/locking whatsoever this way:

  1. Rotate key via a CS event, so there is a clear point at which new writes to the repo use the new key.
  2. Keep a reference to the fact that old keys are still used in the CS. As long as this reference in the CS is around the logic should try using the old key for decryption frst and then if it doesn't exist use the new key (this order is important so we don't accidentally read a non-existent new key metablob when it doesn't exist and mess up first-read-after-write consistency on S3).
  3. Simply list out all the meta blobs with the old keys in them (we can do this efficiently by changing the prefix of the meta-blobs when we change keys) and then write updated versions for them.
  4. Remove reference to old keys still being in use from the CS and only use the new-key prefix meta blobs going forward.

But considering there other operations like deletion/repo-cleanup I guess it could be problematic doing key rotation and deletion in parallel and may even incur unnecessary work when eventually we are going to delete those files.

True, but I'd say that's something we can optimize for later. Key rotation is a relatively rare event and how efficient it is in terms of API call counts might not matter too much? I think it's likely the right trade-off to live with some redundant metadata updates but keep the repository fully functional during key-ratation compared to blocking the repository for potentially hours to save a trivial amount of work (logically you could also argue that if you pause deletes to prevent concurrent deletes you'll waste the effort for updating meta-blobs for blobs that you'll delete right after the pause anyway ...).

-> I think the steps in the Google doc and in this comment are still a valid approach here

@bizybot
Copy link
Contributor Author

bizybot commented Oct 30, 2019

Thank you @original-brownbear for your inputs.
I see your point and makes sense for us not to pause the whole process as things can be achieved in parallel operations.

I am closing this issue as I do not see a need for this API anymore. Thank you.

@bizybot bizybot closed this as completed Oct 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement
Projects
None yet
Development

No branches or pull requests

3 participants