Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore from Individual Shard Snapshot Files in Parallel #48110

Merged
merged 49 commits into from
Oct 30, 2019

Conversation

original-brownbear
Copy link
Member

The code in this PR is rather to illustrate the amount of change necessary to allow for faster restores and demonstrate required code changes than for review as it does not limit concurrency in any way.

In #42791 we fixed the order in which files are uploaded to snapshots, making snapshots upload the individual file for each shard in parallel and working shard-by-shard in terms of ordering the uploads for various shards in the snapshot.

For restores from snapshots however we currently run all shards in parallel using only a single thread per shard for downloading files. This is needlessly inefficient and significantly slows down restores from Cloud repositories.

I think we should move to the same ordering for restores. Parallelize by files and order by shards.
This should significantly speed up restores for shards (especially those with many files) as well as speed up the restore process end-to-end since if we order by shards we restore the first primaries more quickly and thus the replica recovery can run in parallel to the restore in a more efficient manner.

@original-brownbear original-brownbear added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs team-discuss labels Oct 16, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear original-brownbear changed the title Restore from Snapshots in Parallel Restore from Individual Shard Snapshot Files in Parallel Oct 16, 2019
The code here was needlessly complicated when it
enqueued all file uploads up-front. Instead, we can
go with a cleaner worker + queue pattern here by taking
the max-parallelism from the threadpool info.

Also, I slightly simplified the rethrow and
listener (step listener is pointless when you add the callback in the next line)
handling it since I noticed that we were needlessly rethrowing in the same
code and that wasn't worth a separate PR.
@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/2 (unrelated ML failure)

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Armin.

@original-brownbear
Copy link
Member Author

Thanks all!

@original-brownbear original-brownbear merged commit e58fc03 into elastic:master Oct 30, 2019
@original-brownbear original-brownbear deleted the async-restore branch October 30, 2019 11:40
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 30, 2019
Make restoring shard snapshots run in parallel on the `SNAPSHOT` thread-pool.
original-brownbear added a commit that referenced this pull request Oct 30, 2019
…8686)

Make restoring shard snapshots run in parallel on the `SNAPSHOT` thread-pool.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 1, 2019
Follow up to elastic#48110 cleaning up the redundant future
uses that were left over from that change.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 1, 2019
With the changes in elastic#48110 there is no more need
to block a generic thread when waiting for the multi file transfer
in `CcrRepository`.
original-brownbear added a commit that referenced this pull request Nov 1, 2019
With the changes in #48110 there is no more need
to block a generic thread when waiting for the multi file transfer
in `CcrRepository`.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 1, 2019
With the changes in elastic#48110 there is no more need
to block a generic thread when waiting for the multi file transfer
in `CcrRepository`.
original-brownbear added a commit that referenced this pull request Nov 1, 2019
With the changes in #48110 there is no more need
to block a generic thread when waiting for the multi file transfer
in `CcrRepository`.
original-brownbear added a commit that referenced this pull request Nov 2, 2019
Follow up to #48110 cleaning up the redundant future
uses that were left over from that change.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 2, 2019
Follow up to elastic#48110 cleaning up the redundant future
uses that were left over from that change.
original-brownbear added a commit that referenced this pull request Nov 2, 2019
Follow up to #48110 cleaning up the redundant future
uses that were left over from that change.
@original-brownbear original-brownbear restored the async-restore branch August 6, 2020 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants