
Make snapshot deletion faster #61513

Closed
piyushdaftary opened this issue Aug 25, 2020 · 4 comments · Fixed by #100316
Assignees
original-brownbear

Labels
`>bug`, `:Distributed Coordination/Snapshot/Restore` (anything directly related to the `_snapshot/*` APIs), `Team:Distributed (Obsolete)` (meta label for the distributed team, obsolete; replaced by Distributed Indexing/Coordination)

Comments

@piyushdaftary
Contributor

piyushdaftary commented Aug 25, 2020

Elasticsearch version (bin/elasticsearch --version): 7.6 onwards

JVM version (java -version): Java 14

OS version (uname -a if on a Unix-like system): CentOS

In Elasticsearch, snapshot deletion is a multithreaded, synchronous master-node operation. The sequence of the delete operation is as follows:

1. The master node receives the snapshot deletion request and registers a listener for snapshot deletion.
2. If the snapshot for which the delete request was received is in progress, stop the shard snapshots.
3. Update the cluster state with a snapshot deletion entry (SnapshotDeletionsInProgress).
4. Fetch the snapshot entry from the S3 repository.
5. Get the list of all child folders under the repository's indices folder path.
6. Update the shard state metadata in the repository for all shards of the snapshot being deleted, and compute the shards to be deleted from the repository.
7. Remove the snapshot from the list of existing snapshots in the repository.
8. Update the new generations:
    1. Update the index shard generations of all updated shard folders to the next generation in the repository.
    2. Write the new generations to index-N in the repository and make index.latest point to this new index-N.
    3. Update the new generations in the cluster state.
9. Run the cleanup operation:
    1. Delete the repository root-level snap-UUID and meta-UUID files of the snapshot being deleted.
    2. Delete stale indices folders from the repository that are not referenced by any snapshot.
    3. List the shard-level files to be deleted from the repository.
    4. Remove the unreferenced shard-level files from the repository (using bulk deletes).
10. Update the cluster state by removing the snapshot entry and call back the listeners.
11. The listeners respond back to the user, indicating that snapshot deletion is complete.

The current implementation of step 9.2 (delete stale indices), cleanupStaleIndices(), is very slow: the snapshot deletion code deletes each stale index from the repository one after another, synchronously. A simplified sketch of this sequential behaviour is shown below.
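
The following is a minimal, self-contained sketch of that sequential behaviour, not the actual Elasticsearch source; the `StaleIndexFolder` interface and its `delete()` method are hypothetical stand-ins for the repository's blob containers:

```java
import java.util.List;

// Simplified model of the current behaviour: each stale index folder is
// deleted on the calling thread, one after another.
public class SequentialStaleIndexCleanup {

    /** Hypothetical stand-in for a blob-store folder holding one stale index. */
    interface StaleIndexFolder {
        long delete(); // blocking delete; returns the number of blobs removed
    }

    static long cleanupStaleIndices(List<StaleIndexFolder> staleIndices) {
        long blobsDeleted = 0;
        for (StaleIndexFolder folder : staleIndices) {
            // Each delete is a blocking round trip to the repository (e.g. an S3 bulk delete),
            // so the total time grows linearly with the number of stale indices.
            blobsDeleted += folder.delete();
        }
        return blobsDeleted;
    }
}
```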

With the current implementation, we measured the time taken to delete a snapshot on a cluster with 3 master nodes of type r5.12xlarge and 50 data nodes of type i3.4xlarge, holding 1601 indices, 8001 shards, and 4.8 TB of data. It takes approximately 31 minutes to delete such a snapshot.

| | Shards # | Indices # | Snapshot Creation Time (avg) | Snapshot Deletion Time (avg) |
|---|---|---|---|---|
| Current implementation | 8001 | 1601 | 8.4 min | 31.1 min |

Current flow diagram of cleanupStaleIndices:

[Flow diagram: Existing_Flow_Master_Delete1]

This stale-index cleanup step can be sped up with either of the following approaches:

Suggested Optimizations

Approach 1:

Instead of making the deletion of stale indices a single-threaded operation, make it multithreaded and delete multiple stale indices in parallel using the SNAPSHOT thread pool's workers. When deletion of all the stale indices is complete, return the DeleteResult as the response of cleanupStaleIndices(). A sketch of this approach is shown below.
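
A minimal sketch of Approach 1, assuming the same hypothetical `StaleIndexFolder` stand-in as above and a plain `ExecutorService` in place of the SNAPSHOT threadpool:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicLong;

// Fan the per-index deletes out over an executor standing in for the SNAPSHOT
// threadpool, then wait until every delete has finished before returning the
// aggregated result to the caller.
public class ParallelStaleIndexCleanup {

    /** Hypothetical stand-in for a blob-store folder holding one stale index. */
    interface StaleIndexFolder {
        long delete(); // blocking delete; returns the number of blobs removed
    }

    static long cleanupStaleIndices(List<StaleIndexFolder> staleIndices, ExecutorService snapshotPool)
            throws InterruptedException {
        AtomicLong blobsDeleted = new AtomicLong();
        CountDownLatch done = new CountDownLatch(staleIndices.size());
        for (StaleIndexFolder folder : staleIndices) {
            snapshotPool.execute(() -> {
                try {
                    blobsDeleted.addAndGet(folder.delete()); // deletes now run concurrently
                } finally {
                    done.countDown(); // count down even if a delete throws
                }
            });
        }
        done.await(); // cleanupStaleIndices() still returns only once all deletes are finished
        return blobsDeleted.get();
    }
}
```

In the real code the completion would more likely be signalled through listeners rather than by blocking, but the essential change is the same: the per-index deletes run on the SNAPSHOT pool's workers instead of a single thread.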

With Approach 1, the time taken to delete a snapshot on a similar cluster (3 masters of type r5.12xlarge, 50 data nodes of type i3.4xlarge, 1601 indices, 8001 shards, 4.8 TB of data) drops to approximately 9.8 minutes.

I monitored system resource utilization (CPU, system memory) on the master node; there was no major change in resource utilization with Approach 1 compared to the current implementation.

Approach 1 optimization vs. current implementation comparison:

| | Shards # | Indices # | Snapshot Creation Time (avg) | Snapshot Deletion Time (avg) |
|---|---|---|---|---|
| Approach 1 optimization | 8001 | 1601 | 6.9 min | 9.86 min |
| Current implementation | 8001 | 1601 | 6.8 min | 31.1 min |

Approach 1 optimization flow diagram of cleanupStaleIndices:

[Flow diagram: Existing_Flow_Master_Delete-Approach-1 (1)]

Approach 2:

Instead of deleting stale indices synchronously, make the method cleanupStaleIndices() fully asynchronous: when the method is invoked to delete the list of stale indices, send back the response immediately and perform the deletion of stale indices in the background (using SNAPSHOT thread pool workers). A sketch of this asynchronous variant is shown after the comparison below.

With Approach 2, the time taken to delete a snapshot on a similar cluster (3 masters of type r5.12xlarge, 50 data nodes of type i3.4xlarge, 1601 indices, 8001 shards, 4.8 TB of data) drops to approximately 8 seconds.

| | Shards # | Indices # | Snapshot Creation Time (avg) | Snapshot Deletion Time (avg) |
|---|---|---|---|---|
| Approach 2 optimization | 8001 | 1601 | 6.8 min | 8 sec |
| Current implementation | 8001 | 1601 | 8.4 min | 31.1 min |

In Approach 2, if the deletion of stale indices fails because of a master node failure, these stale indices will be deleted in the next snapshot deletion, since they are stale and not referenced by any snapshot.

To track the progress of the background stale-index cleanup, a new status could be added to the cluster state (I am open to suggestions on how to track this progress).
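
A minimal sketch of Approach 2, again using the hypothetical `StaleIndexFolder` stand-in and a plain `ExecutorService` in place of the SNAPSHOT threadpool:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;

// Hand the per-index deletes to the SNAPSHOT threadpool and return immediately,
// so the snapshot-delete request can be acknowledged while the stale-index
// cleanup continues in the background.
public class AsyncStaleIndexCleanup {

    /** Hypothetical stand-in for a blob-store folder holding one stale index. */
    interface StaleIndexFolder {
        String indexId();
        void delete(); // blocking delete of the folder's blobs
    }

    static void cleanupStaleIndicesAsync(List<StaleIndexFolder> staleIndices, ExecutorService snapshotPool) {
        for (StaleIndexFolder folder : staleIndices) {
            snapshotPool.execute(() -> {
                try {
                    folder.delete();
                } catch (RuntimeException e) {
                    // Best effort: a folder that fails to delete here (e.g. on master failover)
                    // is still stale and will be picked up by the next snapshot deletion.
                }
            });
        }
        // No waiting here: the caller acknowledges the snapshot deletion right away.
    }
}
```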

**Approach 2 optimization flow diagram of cleanupStaleIndices:**

[Flow diagram: Existing_Flow_Master_Delete-Approach-2 (2)]

I would like feedback from the community on the above two snapshot deletion optimization approaches before raising a PR.

@piyushdaftary piyushdaftary added >bug needs:triage Requires assignment of a team area label labels Aug 25, 2020
@original-brownbear original-brownbear self-assigned this Aug 25, 2020
@original-brownbear original-brownbear added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Aug 25, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 25, 2020
@original-brownbear
Member

Thanks for raising this @piyushdaftary. The fact that we only execute this single-threaded really isn't all that optimal, I agree, and it can lead to some very suboptimal situations, as in your example.
I would also agree that parallelising this across indices on the SNAPSHOT pool is a reasonable fix. We should go with option 1 here and not push the execution into the background (I in fact suggested doing that in the past, and we eventually rejected the idea for various user-experience reasons). If you want, feel free to give option 1 a go :) If not, I'm happy to implement this quickly next week as well; it should be a fairly simple change, I think.

@original-brownbear original-brownbear removed the needs:triage Requires assignment of a team area label label Aug 25, 2020
@piyushdaftary
Contributor Author

Thanks @original-brownbear. I will raise a PR implementing Approach 1.

piyushdaftary added a commit to piyushdaftary/elasticsearch that referenced this issue Nov 3, 2020
AmiStrn added a commit to AmiStrn/elasticsearch that referenced this issue Feb 18, 2021
The delete snapshot task takes longer than expected. A major reason for this is that the (often many) stale indices are deleted iteratively. In this commit we change the deletion to be concurrent using the SNAPSHOT threadpool. Notice that, in order to avoid putting too many delete tasks on the threadpool queue, a methodology similar to `executeOneFileSnapshot()` was used, so that the threadpool can still serve other tasks without too much of a delay.

fixes issue elastic#61513 from Elasticsearch project
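
The throttling idea mentioned in the commit message above can be illustrated with a small, hypothetical sketch (the names and the helper interface are not from the Elasticsearch source): rather than enqueueing one task per stale index, only a bounded number of worker tasks are submitted to the pool, and each worker drains a shared queue.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;

// Submit only a bounded number of worker tasks; each worker keeps pulling the
// next stale index from a shared queue, leaving room on the pool for other
// snapshot work.
public class ThrottledStaleIndexCleanup {

    /** Hypothetical stand-in for a blob-store folder holding one stale index. */
    interface StaleIndexFolder {
        void delete();
    }

    static void cleanupStaleIndices(ConcurrentLinkedQueue<StaleIndexFolder> staleIndices,
                                    ExecutorService snapshotPool,
                                    int maxConcurrentDeletes) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(maxConcurrentDeletes);
        for (int i = 0; i < maxConcurrentDeletes; i++) {
            snapshotPool.execute(() -> {
                try {
                    StaleIndexFolder folder;
                    while ((folder = staleIndices.poll()) != null) { // each worker drains the queue
                        folder.delete();
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        done.await(); // all workers have finished draining the queue
    }
}
```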
@ls-ivan-kiselev

Ugh, I suffer from this slowness so much right now, thanks for raising it!

I have a snapshot repo to clean up with 3 years of snapshots taken every 2 hours, and so far the cleanup goes at a rate of about 2 snapshots a day.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 5, 2023
After deleting a snapshot today we clean up all the now-dangling indices
sequentially, which can be rather slow. With this commit we parallelize
the work across the whole `SNAPSHOT` pool on the master node.

Closes elastic#61513

Co-authored-by: Piyush Daftary <[email protected]>