Track Shard-Snapshot Index Generation at Repository Root #46250
Merged
original-brownbear
merged 182 commits into
elastic:master
from
original-brownbear:smarter-incrementality-snapshots
Oct 22, 2019
Conversation
original-brownbear
added
>enhancement
WIP
:Distributed Coordination/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
labels
Sep 3, 2019
Pinging @elastic/es-distributed
original-brownbear
changed the title
Track Shard-Snapshot Index Generationat Repository Root
Track Shard-Snapshot Index Generation at Repository Root
Sep 3, 2019
original-brownbear
added a commit
to original-brownbear/elasticsearch
that referenced
this pull request
Oct 23, 2019
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question). This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see elastic#45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see elastic#38941 ... now only the root `index-${N}` is determined by listing). Also, and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of the shard-level index-N blobs to uuid suffixes for these blobs. This change moves the deletion of the previous shard-level `index-${uuid}` blob to the master node instead of the data node, allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob. __No deletes are executed on the data nodes at all for any operation with this change.__ Note also: previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details). This change changes the naming of the shard-level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is no longer atomic in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root-level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but, due to a node/network/cluster/... crash, the root-level `RepositoryData` has not been updated, then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes. Previously, stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented from doing so (except on S3) by us not allowing overwrites for that blob, and thus no corruption could occur. Were we to continue using incrementing names, we could not do this. The stuck-node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless). With uuid naming and moving all deletes to `master`, this becomes a non-issue. Data nodes write an updated shard generation `index-${uuid}`, and `master` makes those `index-${uuid}` blobs part of the `RepositoryData` that it deems correct and cleans up all those `index-` blobs that are unused. Co-authored-by: Yannick Welsch <[email protected]> Co-authored-by: Tanguy Leroux <[email protected]>
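As a rough illustration of the `"shards"` field described in the message above, here is a minimal sketch of the mapping from `IndexId` to per-shard generations. The class and method names are hypothetical simplifications for illustration; the real `RepositoryData` class in Elasticsearch is considerably more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: maps an index's id to per-shard generations, so the
// current shard-level index-${N} blob name can be derived without a LIST.
public class ShardGenerations {
    // indexId -> String[] where the element at position shardId is the
    // generation of that shard's currently valid index blob
    private final Map<String, String[]> shards = new HashMap<>();

    public void put(String indexId, String[] generations) {
        shards.put(indexId, generations);
    }

    // Derives the blob name of the shard's current index blob from the mapping
    public String indexBlobName(String indexId, int shardId) {
        String gen = shards.get(indexId)[shardId];
        return "index-" + gen;
    }

    public static void main(String[] args) {
        ShardGenerations repoData = new ShardGenerations();
        repoData.put("my-index-uuid", new String[] {"2", "5"});
        // Shard 1 of the index is at generation 5, so its current blob is index-5
        System.out.println(repoData.indexBlobName("my-index-uuid", 1));
    }
}
```

With this mapping loaded from the root-level `RepositoryData`, a data node can locate the current shard-level index blob by a pure lookup, which is what removes the per-shard LIST call.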
original-brownbear
added a commit
to original-brownbear/elasticsearch
that referenced
this pull request
Oct 25, 2019
This relates to the effort towards elastic#46250. We added tracking of the shard generation for successful snapshots to `8.0`. This assertion isn't correct though: while an `8.0` master won't create an entry with success state and a `null` shard generation, it may still (e.g. on master failover) send a success entry created by a 7.x master with a `null` generation over the wire. Closes elastic#47406
original-brownbear
added a commit
that referenced
this pull request
Oct 25, 2019
This relates to the effort towards #46250. We added tracking of the shard generation for successful snapshots to `8.0`. This assertion isn't correct though: while an `8.0` master won't create an entry with success state and a `null` shard generation, it may still (e.g. on master failover) send a success entry created by a 7.x master with a `null` generation over the wire. Closes #47406
This was referenced Feb 3, 2020
original-brownbear
added a commit
that referenced
this pull request
Jun 5, 2020
This PR introduces two new fields into `RepositoryData` (index-N) to track the blob name of `IndexMetaData` blobs and their content via setting generations and uuids. This is used to deduplicate the `IndexMetaData` blobs (`meta-{uuid}.dat` in the index folders under `/indices`) so that new metadata for an index is only written to the repository during a snapshot if that same metadata can't be found in another snapshot. This saves one write per index in the common case of unchanged metadata, thus saving cost and making snapshot finalization drastically faster if many indices are being snapshotted at the same time. The implementation is mostly analogous to that for shard generations in #46250 and piggybacks on the BwC mechanism introduced in that PR (which means this PR needs adjustments if it doesn't go into `7.6`). Relates to #45736 as it improves the efficiency of snapshotting unchanged indices. Relates to #49800 as it has the potential of loading the index metadata for multiple snapshots of the same index concurrently much more efficiently, speeding up future concurrent snapshot deletes.
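The deduplication idea in the commit message above can be sketched as follows. The types and the `ensureMetaBlob` helper are hypothetical, not the actual Elasticsearch API; the point is only that a metadata blob is written once per distinct content identity and reused by later snapshots of an unchanged index.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of IndexMetaData deduplication: identical index metadata is written
// to the repository only once; later snapshots of the same unchanged index
// reuse the existing meta-${uuid}.dat blob instead of writing a new one.
public class MetaDedup {
    // content identity (e.g. a hash of the metadata) -> uuid of blob already in repo
    private final Map<String, String> knownMetaBlobs = new HashMap<>();
    private int writes = 0;

    // Returns the uuid of the blob holding this metadata, writing only if new
    String ensureMetaBlob(String contentIdentity) {
        return knownMetaBlobs.computeIfAbsent(contentIdentity, id -> {
            writes++; // stand-in for actually writing a fresh blob
            return "uuid-" + writes;
        });
    }

    public static void main(String[] args) {
        MetaDedup repo = new MetaDedup();
        String first = repo.ensureMetaBlob("hash-of-index-meta");
        String second = repo.ensureMetaBlob("hash-of-index-meta"); // unchanged index
        // The second snapshot reuses the existing blob: one write total
        System.out.println(first.equals(second) + " " + repo.writes);
    }
}
```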
This was referenced Jun 22, 2020
mkleen
added a commit
to crate/crate
that referenced
this pull request
Jun 29, 2020
mkleen
added a commit
to crate/crate
that referenced
this pull request
Jun 29, 2020
mkleen
added a commit
to crate/crate
that referenced
this pull request
Jun 29, 2020
mkleen
added a commit
to crate/crate
that referenced
this pull request
Jun 30, 2020
mkleen
added a commit
to crate/crate
that referenced
this pull request
Jun 30, 2020
mkleen
added a commit
to crate/crate
that referenced
this pull request
Jun 30, 2020
original-brownbear
added a commit
to original-brownbear/elasticsearch
that referenced
this pull request
Jul 14, 2020
This PR introduces two new fields into `RepositoryData` (index-N) to track the blob name of `IndexMetaData` blobs and their content via setting generations and uuids. This is used to deduplicate the `IndexMetaData` blobs (`meta-{uuid}.dat` in the index folders under `/indices`) so that new metadata for an index is only written to the repository during a snapshot if that same metadata can't be found in another snapshot. This saves one write per index in the common case of unchanged metadata, thus saving cost and making snapshot finalization drastically faster if many indices are being snapshotted at the same time. The implementation is mostly analogous to that for shard generations in elastic#46250 and piggybacks on the BwC mechanism introduced in that PR (which means this PR needs adjustments if it doesn't go into `7.6`). Relates to elastic#45736 as it improves the efficiency of snapshotting unchanged indices. Relates to elastic#49800 as it has the potential of loading the index metadata for multiple snapshots of the same index concurrently much more efficiently, speeding up future concurrent snapshot deletes.
original-brownbear
added a commit
that referenced
this pull request
Jul 14, 2020
This PR introduces two new fields into `RepositoryData` (index-N) to track the blob name of `IndexMetaData` blobs and their content via setting generations and uuids. This is used to deduplicate the `IndexMetaData` blobs (`meta-{uuid}.dat` in the index folders under `/indices`) so that new metadata for an index is only written to the repository during a snapshot if that same metadata can't be found in another snapshot. This saves one write per index in the common case of unchanged metadata, thus saving cost and making snapshot finalization drastically faster if many indices are being snapshotted at the same time. The implementation is mostly analogous to that for shard generations in #46250 and piggybacks on the BwC mechanism introduced in that PR (which means this PR needs adjustments if it doesn't go into `7.6`). Relates to #45736 as it improves the efficiency of snapshotting unchanged indices. Relates to #49800 as it has the potential of loading the index metadata for multiple snapshots of the same index concurrently much more efficiently, speeding up future concurrent snapshot deletes.
Labels
:Distributed Coordination/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
>enhancement
v7.6.0
v8.0.0-alpha1
Changes to Root-Level index-N (RepositoryData)
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question).
Benefits
This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).
Also, and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of the shard-level index-N blobs to uuid suffixes for these blobs.
Only Master Deletes Blobs
This change moves the deletion of the previous shard-level `index-${uuid}` blob to the master node instead of the data node, allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob. No deletes are executed on the data nodes at all for any operation with this change.
Note also: previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).
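The ordering described in this section can be sketched in a few lines. All names here are hypothetical stand-ins (a `Map` for the blob store, a `String` for the root-level reference), not the actual repository implementation; the point is that the `RepositoryData` update happens before the delete, and the delete is issued only by master.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the master-side ordering: the new shard generation blob only
// becomes effective once RepositoryData references it; only after that is
// the previous, now-unreferenced shard-level blob deleted, and only by master.
public class MasterFinalization {
    // Stand-in for the blob store: blob name -> blob present
    static final Map<String, Boolean> blobStore = new HashMap<>();
    // Stand-in for the shard generation referenced by root-level RepositoryData
    static String referencedShardBlob;

    static void finalizeShardUpdate(String oldBlob, String newBlob) {
        // Step 1: update RepositoryData to reference the new blob first...
        referencedShardBlob = newBlob;
        // Step 2: ...then delete the now-unreferenced old blob (master only;
        // data nodes never issue deletes under this scheme).
        blobStore.remove(oldBlob);
    }

    public static void main(String[] args) {
        String oldBlob = "index-old-uuid"; // previous generation
        String newBlob = "index-new-uuid"; // generation written by the data node
        blobStore.put(oldBlob, true);
        blobStore.put(newBlob, true);
        referencedShardBlob = oldBlob;

        finalizeShardUpdate(oldBlob, newBlob);
        System.out.println(referencedShardBlob + " " + blobStore.containsKey(oldBlob));
    }
}
```

If the process crashes between the two steps, the worst case is an unreferenced leftover blob, never a referenced-but-deleted one, which is why the update must precede the delete.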
Why Move from index-${N} to index-${uuid} at the Shard Level
This change changes the naming of the shard-level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is no longer atomic in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root-level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but, due to a node/network/cluster/... crash, the root-level `RepositoryData` has not been updated, then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously, stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented from doing so (except on S3) by us not allowing overwrites for that blob, and thus no corruption could occur. Were we to continue using incrementing names, we could not do this. The stuck-node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With uuid naming and moving all deletes to `master`, this becomes a non-issue. Data nodes write an updated shard generation `index-${uuid}`, and `master` makes those `index-${uuid}` blobs part of the `RepositoryData` that it deems correct and cleans up all those `index-` blobs that are unused.
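The naming argument above can be made concrete with a small sketch: two writers that compute the next generation from the same `RepositoryData` collide on `index-${N+1}` under incrementing names, while uuid suffixes give each write a fresh blob name. The helper names are hypothetical; only the naming scheme itself is from the change.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Sketch: incrementing names let a stuck writer collide with a later writer
// on index-${N+1}; uuid suffixes make every write target a fresh blob name,
// leaving master free to pick which one becomes part of RepositoryData.
public class BlobNaming {
    static String nextIncrementalName(int currentGen) {
        return "index-" + (currentGen + 1);
    }

    static String nextUuidName() {
        return "index-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // Both writers read the same generation from RepositoryData...
        int currentGen = 7;
        String a = nextIncrementalName(currentGen);
        String b = nextIncrementalName(currentGen);
        // ...so they target the same blob path: index-8
        System.out.println("incremental collision: " + a.equals(b));

        // With uuid naming, the two writes go to distinct blob names
        Set<String> names = new HashSet<>();
        names.add(nextUuidName());
        names.add(nextUuidName());
        System.out.println("uuid collision: " + (names.size() == 1));
    }
}
```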