Track Shard-Snapshot Index Generation at Repository Root #46250

original-brownbear · 2019-09-03T05:52:22Z

Changes to Root-Level index-N (RepositoryData)

This change adds a new field "shards" to RepositoryData that maps each IndexId to a String[]. The array is indexed by shard id, and each entry holds the generation of that shard's folder (i.e. the N in the name of the currently valid /indices/${indexId}/${shardId}/index-${N} blob for the shard in question).
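As a rough illustration of the lookup this enables, here is a minimal Python sketch. It is not the Elasticsearch implementation: the `ShardGenerations` alias, the function names, and the sample index ids are all assumptions made for the example; only the path layout comes from the description above.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified stand-in for the "shards" field of RepositoryData:
# index id -> list of shard generations, indexed by shard id.
ShardGenerations = Dict[str, List[Optional[str]]]


def shard_generation(shards: ShardGenerations, index_id: str, shard_id: int) -> Optional[str]:
    """Return the tracked generation for a shard, or None if nothing is tracked."""
    generations = shards.get(index_id)
    if generations is None or shard_id < 0 or shard_id >= len(generations):
        return None
    return generations[shard_id]


def shard_index_blob_path(index_id: str, shard_id: int, generation: str) -> str:
    """Build the path of the currently valid shard-level index blob."""
    return f"/indices/{index_id}/{shard_id}/index-{generation}"


# Illustrative data only; real index ids and generations look different.
shards: ShardGenerations = {"idx-a": ["3", "7"], "idx-b": ["0"]}
generation = shard_generation(shards, "idx-a", 1)
path = shard_index_blob_path("idx-a", 1, generation)
```

The point of the sketch is that the blob path is derived from data already held at the repository root, so no listing of the shard folder is needed to find it.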

Benefits

This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one of the two remaining potential issues with eventually consistent blob stores (see #38941; now only the root index-${N} is determined by listing).

Equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the master node and by moving from incremental naming of the shard level index-N to uuid suffixes for these blobs.

Only Master Deletes Blobs

This change moves the deletion of the previous shard level index-${uuid} blob from the data node to the master node. This allows for a safe and consistent update of the shard's generation in the RepositoryData: the master first updates RepositoryData and only then deletes the previous, now unreferenced index-${uuid} blob.
No deletes are executed on the data nodes at all for any operation with this change.

Note also: Previous issues with hanging data nodes interfering with master nodes are no longer possible, even on S3 (see next section for details).
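The ordering described in this section (update RepositoryData first, delete afterwards, and only on the master) can be sketched as follows. This is a simplified model under stated assumptions: `InMemoryBlobStore`, `finalize_shard_update`, and the flat `"index/shard"` key are hypothetical names for illustration and are not Elasticsearch APIs.

```python
from typing import Dict, Set


class InMemoryBlobStore:
    """Hypothetical in-memory stand-in for a blob store; not an Elasticsearch API."""

    def __init__(self) -> None:
        self.blobs: Set[str] = set()

    def write(self, path: str) -> None:
        self.blobs.add(path)

    def delete(self, path: str) -> None:
        self.blobs.discard(path)


def finalize_shard_update(store: InMemoryBlobStore, repository_data: Dict[str, str],
                          shard_key: str, new_uuid: str) -> None:
    """Master-side step: reference the new blob first, then delete the old one."""
    old_uuid = repository_data.get(shard_key)
    # Step 1: point the root-level metadata at the new shard-level blob.
    repository_data[shard_key] = new_uuid
    # Step 2: only now remove the previous blob, which nothing references anymore.
    if old_uuid is not None and old_uuid != new_uuid:
        store.delete(f"/indices/{shard_key}/index-{old_uuid}")


store = InMemoryBlobStore()
store.write("/indices/idx-a/0/index-aaa")
repository_data = {"idx-a/0": "aaa"}

# The data node only ever writes; it never deletes.
store.write("/indices/idx-a/0/index-bbb")
finalize_shard_update(store, repository_data, "idx-a/0", "bbb")
```

If the process fails between the two steps, the worst case in this model is a leftover unreferenced blob, never a reference to a blob that has already been deleted.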

Why Move from index-${N} to index-${uuid} at the Shard Level

This change renames the shard level index-${N} blobs to use a uuid suffix, index-${UUID}. The reason is that writing a new shard-level index- generation blob is no longer atomic in its effect: not only does the blob have to be written, it must also be referenced by the root level index-N (RepositoryData) to become an effective part of the snapshot repository.
This leads to a problem if we were to keep using incrementing names like we did before. If a blob index-${N+1} is written but, because of a node/network/cluster/... failure, the root level RepositoryData is not updated, then a future operation will determine the shard's generation to be N and try to write a new index-${N+1} to the already existing path. Updates like that are problematic on S3 for consistency reasons, and they also create numerous issues when thinking about stuck data nodes.
Previously, a stuck data node that was tasked to write index-${N+1} but only tried to do so after some other node had already written index-${N+1} was prevented from doing so (except on S3) because we did not allow overwrites for that blob, and thus no corruption could occur.
Were we to continue using incrementing names, we could no longer forbid overwrites, because a legitimate retry would have to write to the already existing index-${N+1} path. The stuck node scenario would therefore either allow overwriting the N+1 generation or force us to keep using a LIST operation to figure out the next N (which would make this change pointless).
With uuid naming and all deletes moved to master this becomes a non-issue. Data nodes write the updated shard generation as index-${uuid}, and master makes those index-${uuid} blobs part of the RepositoryData that it deems correct and cleans up all index- blobs that are unused.
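To make the stuck-node argument concrete, here is a small Python sketch of uuid-based naming plus master-side cleanup. The function names and folder contents are hypothetical, and how the master learns the contents of the shard folder is deliberately left out; this is a sketch of the idea, not the actual implementation.

```python
import uuid
from typing import Iterable, List


def new_shard_index_blob_name() -> str:
    """Data-node side: choose a fresh name, so no existing blob is ever overwritten."""
    return f"index-{uuid.uuid4().hex}"


def unreferenced_index_blobs(blob_names: Iterable[str], referenced_generation: str) -> List[str]:
    """Master side: every shard-level index- blob other than the referenced one is stale."""
    keep = f"index-{referenced_generation}"
    return sorted(name for name in blob_names if name.startswith("index-") and name != keep)


# Two data nodes (one of them delayed) each write a shard-level blob for the same shard.
first = new_shard_index_blob_name()
late = new_shard_index_blob_name()
shard_folder = {first, late, "__data-blob"}  # hypothetical folder contents

# The master references only the blob it deems correct and cleans up the rest.
referenced = first[len("index-"):]
stale = unreferenced_index_blobs(shard_folder, referenced)
```

Because each write gets its own name, the late write can neither collide with nor overwrite the blob that the master has already referenced; it simply shows up as a stale blob to clean up.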


elasticmachine · 2019-09-03T05:52:24Z

Pinging @elastic/es-distributed


Co-authored-by: Yannick Welsch, Tanguy Leroux

This relates to the effort towards elastic#46250. We added tracking of the shard generation for successful snapshots to `8.0`. This assertion isn't correct though. While an `8.0` master won't create an entry with success state and a `null` shard generation, it may still (e.g. on master failover) send a success entry created by a 7.x master with a `null` generation over the wire. Closes elastic#47406


This PR introduces two new fields into `RepositoryData` (index-N) to track the blob name of `IndexMetaData` blobs and their content via setting generations and uuids. This is used to deduplicate the `IndexMetaData` blobs (`meta-{uuid}.dat` in the indices folders under `/indices`) so that new metadata for an index is only written to the repository during a snapshot if that same metadata can't be found in another snapshot. This saves one write per index in the common case of unchanged metadata, thus saving cost and making snapshot finalization drastically faster if many indices are being snapshotted at the same time. The implementation is mostly analogous to that for shard generations in #46250 and piggybacks on the BwC mechanism introduced in that PR (which means this PR needs adjustments if it doesn't go into `7.6`). Relates to #45736 as it improves the efficiency of snapshotting unchanged indices. Relates to #49800 as it has the potential to make loading the index metadata for multiple snapshots of the same index concurrently much more efficient, speeding up future concurrent snapshot deletes.
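The deduplication idea can be sketched roughly as below. The identity string, the `known` mapping, and the function name are assumptions for illustration only; the description above does not spell out how the identity is derived from setting generations and uuids, so the example treats it as an opaque key.

```python
from typing import Dict, List


def metadata_blob_for_snapshot(known: Dict[str, str], identity: str,
                               new_uuid: str, writes: List[str]) -> str:
    """Reuse an existing metadata blob when the same identity was already written."""
    blob_uuid = known.get(identity)
    if blob_uuid is None:
        # Unseen metadata: record it and write exactly one new blob.
        blob_uuid = new_uuid
        known[identity] = blob_uuid
        writes.append(f"meta-{blob_uuid}.dat")
    return blob_uuid


known: Dict[str, str] = {}   # hypothetical identity -> blob uuid mapping
writes: List[str] = []       # blobs that would actually be written

a = metadata_blob_for_snapshot(known, "idx-uuid/settings-v1", "u1", writes)
b = metadata_blob_for_snapshot(known, "idx-uuid/settings-v1", "u2", writes)  # unchanged metadata
c = metadata_blob_for_snapshot(known, "idx-uuid/settings-v2", "u3", writes)  # changed metadata
```

In this model the second snapshot of unchanged metadata reuses the first blob, which corresponds to the one write per index that the description says is saved.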

Based on elastic/elasticsearch#46250

Based on elastic/elasticsearch#46250 elastic/elasticsearch@be397b7 and elastic/elasticsearch@4849c3e


original-brownbear added 6 commits September 2, 2019 12:58

bck

e978473

sorta works

7afa152

better

14ab8ca

pass

1c9fc7f

bck

32e514f

Merge remote-tracking branch 'elastic/master' into smarter-incrementa…

21253d8

…lity-snapshots

original-brownbear added >enhancement WIP :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Sep 3, 2019

original-brownbear added 2 commits September 3, 2019 09:04

moar asserts

8e896de

fix CCR repo

3b9a791

original-brownbear changed the title ~~Track Shard-Snapshot Index Generationat Repository Root~~ Track Shard-Snapshot Index Generation at Repository Root Sep 3, 2019

original-brownbear added 17 commits September 3, 2019 10:07

fix tests

ac223a7

fix some inconsistencies

9e87c46

remove risky delete

b7e9aab

nicer

9149f74

extract fallback logic

7305618

comment

95f179a

ensure not leaking shard index-N

ac3180f

Merge remote-tracking branch 'elastic/master' into smarter-incrementa…

a11ef41

…lity-snapshots

some bwc

deba084

Merge remote-tracking branch 'elastic/master' into smarter-incrementa…

4e8316b

…lity-snapshots

half fix bwc

cd9fe7a

Merge remote-tracking branch 'elastic/master' into smarter-incrementa…

0d86637

…lity-snapshots

Merge remote-tracking branch 'elastic/master' into smarter-incrementa…

93e3f62

…lity-snapshots

assertion and nicer order

51e5b1a

Merge remote-tracking branch 'elastic/master' into smarter-incrementa…

addcf45

…lity-snapshots

add version param

8e110c9

Merge remote-tracking branch 'elastic/master' into smarter-incrementa…

eb51be5

…lity-snapshots

original-brownbear mentioned this pull request Oct 23, 2019

Track Shard-Snapshot Index Generation at Repository Root #48371

Merged

original-brownbear removed the backport pending label Oct 23, 2019

original-brownbear mentioned this pull request Oct 23, 2019

Cleanup Concurrent RepositoryData Loading #48329

Merged

codebrain mentioned this pull request Oct 25, 2019

7.4.1 meta ticket elastic/elasticsearch-net#4174

Closed

39 tasks

original-brownbear mentioned this pull request Oct 25, 2019

Add support for pause/unpause operation in snapshot repository #48493

Closed

original-brownbear mentioned this pull request Oct 25, 2019

Remove Incorrect Assertion from SnapshotsInProgress (#47458) #48514

Merged

original-brownbear mentioned this pull request Dec 18, 2019

Deduplicate Index Metadata in BlobStore #50278

Merged

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

mkleen mentioned this pull request May 25, 2020

Introduce ShardGenerations crate/crate#9990

Closed

5 tasks

This was referenced Jun 22, 2020

Remove obsolete IncompatibleSnapshots Logic crate/crate#10107

Merged

Track Shard-Snapshot Index Generation at Repository Root crate/crate#10128

Merged

mkleen added a commit to crate/crate that referenced this pull request Jun 29, 2020

Track Shard-Snapshot Index Generation at Repository Root

2ab4e7f

Based on elastic/elasticsearch#46250

mkleen added a commit to crate/crate that referenced this pull request Jun 29, 2020

Track Shard-Snapshot Index Generation at Repository Root

e24c7ab

Based on elastic/elasticsearch#46250

mkleen added a commit to crate/crate that referenced this pull request Jun 29, 2020

Track Shard-Snapshot Index Generation at Repository Root

803c0ef

Based on elastic/elasticsearch#46250

mkleen added a commit to crate/crate that referenced this pull request Jun 30, 2020

Track Shard-Snapshot Index Generation at Repository Root

c160f4b

Based on elastic/elasticsearch#46250

mkleen added a commit to crate/crate that referenced this pull request Jun 30, 2020

Track Shard-Snapshot Index Generation at Repository Root

6fae5a6

Based on elastic/elasticsearch#46250

mkleen added a commit to crate/crate that referenced this pull request Jun 30, 2020

Track Shard-Snapshot Index Generation at Repository Root

d1f8663

Based on elastic/elasticsearch#46250 elastic/elasticsearch@be397b7 and elastic/elasticsearch@4849c3e

mergify bot pushed a commit to crate/crate that referenced this pull request Jun 30, 2020

Track Shard-Snapshot Index Generation at Repository Root

f5505cb

Based on elastic/elasticsearch#46250 elastic/elasticsearch@be397b7 and elastic/elasticsearch@4849c3e

mkleen mentioned this pull request Jul 9, 2020

Add recursive deletes for BlobContainers and fix Snapshot Deletion crate/crate#10194

Merged

5 tasks

original-brownbear mentioned this pull request Jul 14, 2020

Deduplicate Index Metadata in BlobStore (#50278) #59514

Merged

original-brownbear mentioned this pull request Dec 2, 2020

Snapshot creation with wait_for_completion: response time longer than snapshot duration #65661

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
