Making snapshot deletes distributed across data nodes. #39656
Conversation
Unlike the snapshot creation operation, the snapshot deletion operation is executed on the master. This is acceptable for small to medium sized clusters, where the delete operation completes in less than a minute. But for bigger clusters with a few thousand primary shards, the deletion operation takes a considerable amount of time because there is only one thread deleting the snapshot data index by index, shard by shard. To reduce the time spent on deletion, this change parallelizes (distributes) the snapshot deletes across all data nodes, similar to snapshot creation, but with no tight coupling between a data node and the snapshot shards being deleted. With this change, for a cluster with 3 dedicated masters, 20 data nodes and 20,000 primary shards, the deletion took around 1 minute, whereas without this change it took around 70 minutes.
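Below is a minimal, self-contained sketch (not the actual PR code) contrasting the single-threaded delete loop with a version that fans the per-shard deletes out over a pool of workers. `deleteShardSnapshot`, the index list and the shard count are hypothetical placeholders; in the PR the fan-out target is the data nodes rather than a local thread pool.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SnapshotDeleteSketch {

    // Hypothetical placeholder for removing the blobs of one shard's snapshot data.
    static void deleteShardSnapshot(String index, int shardId) {
        // ... issue repository delete calls for this shard's blobs ...
    }

    // Status quo: a single thread walks index by index, shard by shard.
    static void deleteSequentially(List<String> indices, int shardsPerIndex) {
        for (String index : indices) {
            for (int shard = 0; shard < shardsPerIndex; shard++) {
                deleteShardSnapshot(index, shard);
            }
        }
    }

    // Direction of the proposal: fan the per-shard deletes out over many workers
    // (in the PR across data nodes; here simply across a local thread pool).
    static void deleteInParallel(List<String> indices, int shardsPerIndex, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<CompletableFuture<Void>> futures = indices.stream()
                .flatMap(index -> IntStream.range(0, shardsPerIndex)
                    .mapToObj(shard -> CompletableFuture.runAsync(
                        () -> deleteShardSnapshot(index, shard), pool)))
                .collect(Collectors.toList());
            futures.forEach(CompletableFuture::join);
        } finally {
            pool.shutdown();
        }
    }
}
```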
The problem is that deletes are processed sequentially on the master. I don't think we should add a whole new distributed layer for this, but just look into doing some of these calls concurrently on the master.
@backslasht before working on such a large item, it's best to open an issue first to discuss the solution.
Pinging @elastic/es-distributed
@ywelsch - Like you mentioned, I initially started with making the calls concurrent on the master. It does help a little, but the time taken is still quite high. Let us take the following setup as an example.
So, the proposed change makes the snapshot delete operation time a function of the number of data nodes and the number of primary shards, similar to snapshot creation. The deletion time will remain roughly the same as data nodes increase (with a corresponding increase in shards), since the work is spread across more nodes.
That's a bit of an odd comparison. It all depends on how many concurrent calls you globally do. Whether the calls are done from a single node or multiple nodes does not matter given that the requests are very small and latency seems to be the determining factor. We had a team discussion on this implementation and think it's too complex, which is why I'm closing this PR. We still think this is a problem worth tackling, though, but are looking at other ways of achieving this without building a full distributed task queue. Two options we're looking at are 1) parallelizing deletes on the master, possibly using async APIs, and 2) using repository-specific options (e.g. bulk deletions on S3).
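To illustrate option 2, here is a hedged sketch of what a repository-specific bulk delete could look like with the AWS SDK for Java v1; the bucket name and blob keys are made up for illustration and do not reflect the real repository layout.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;

public class S3BulkDeleteSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // One DeleteObjects call removes up to 1,000 keys, replacing up to
        // 1,000 individual DELETE round trips with a single request.
        String[] keys = {
            "indices/my-index/0/snap-example.dat",  // made-up keys for illustration
            "indices/my-index/1/snap-example.dat"
        };
        s3.deleteObjects(new DeleteObjectsRequest("my-snapshot-bucket").withKeys(keys));
    }
}
```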
Motivated by slow snapshot deletes reported in e.g. #39656 and the fact that these likely are a contributing factor to repositories accumulating stale files over time when deletes fail to finish in time and are interrupted before they can complete.
* Makes snapshot deletion async and parallelizes some steps of the delete process that can safely be run concurrently via the snapshot thread pool
* I did not take the biggest potential speedup step here and parallelize the shard file deletion, because that's probably better handled by moving to bulk deletes where possible (and it can still be parallelized via the snapshot pool where it isn't). Also, I wanted to keep the size of the PR manageable.
* See #39656 (comment)
* Also, as a side effect this gives the `SnapshotResiliencyTests` a little more coverage for master failover scenarios (since parallel access to a blob store repository during deletes is now possible, as a delete isn't a single task anymore).
* By adding a `ThreadPool` reference to the repository, this also lays the groundwork for parallelizing shard snapshot uploads to improve the situation reported in #39657
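A rough sketch of the "async via the snapshot thread pool" idea, assuming Elasticsearch's `ThreadPool` and `ActionListener` APIs; `deleteSnapshotBlobs` is a hypothetical placeholder for the actual delete work, not a method from the codebase.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.threadpool.ThreadPool;

public class AsyncSnapshotDeleteSketch {

    // Hand the delete work off to the SNAPSHOT thread pool instead of blocking
    // the calling thread, reporting completion or failure through the listener.
    static void deleteSnapshotAsync(ThreadPool threadPool, String snapshotName, ActionListener<Void> listener) {
        threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
            try {
                deleteSnapshotBlobs(snapshotName); // hypothetical placeholder for the blob deletes
                listener.onResponse(null);
            } catch (Exception e) {
                listener.onFailure(e);
            }
        });
    }

    // Hypothetical: walk and delete the snapshot's blobs in the repository.
    static void deleteSnapshotBlobs(String snapshotName) {
        // ...
    }
}
```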