
S3 Snapshot Repository Erroneously Assumes Consistent List Operation #38941

Closed
original-brownbear opened this issue Feb 15, 2019 · 3 comments
Assignees
original-brownbear
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@original-brownbear
Member

original-brownbear commented Feb 15, 2019

We do the following when snapshotting a shard in https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L1177:

        public void snapshot(final IndexCommit snapshotIndexCommit) {
            logger.debug("[{}] [{}] snapshot to [{}] ...", shardId, snapshotId, metadata.name());

            final Map<String, BlobMetaData> blobs;
            try {
                blobs = blobContainer.listBlobs();

to find the latest index-{N} file at the root of each shard folder.
There is no guarantee that this file will actually be listed if two snapshots happen in rapid succession, and an inconsistent listing becomes ever more likely the more snapshots one has.
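To make the failure mode concrete, here is a minimal sketch, with invented names rather than the actual `BlobStoreRepository` helpers, of what deriving the generation from a listing implies: if the eventually consistent listing omits the newest `index-{N}` blob, the caller silently works from a stale generation.

```java
import java.util.Set;

// Illustrative sketch only (hypothetical helper, not the real Elasticsearch code):
// derive the shard's latest generation N by parsing "index-{N}" blob names.
public class LatestGenerationSketch {

    static long latestGeneration(Set<String> blobNames) {
        long latest = -1L;
        for (String name : blobNames) {
            if (name.startsWith("index-") == false) {
                continue;
            }
            try {
                latest = Math.max(latest, Long.parseLong(name.substring("index-".length())));
            } catch (NumberFormatException e) {
                // ignore blobs whose suffix is not a number
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // If the eventually consistent listing omits the newest blob "index-2",
        // the caller ends up working from the stale generation 1.
        System.out.println(latestGeneration(Set.of("index-0", "index-1", "snap-abc.dat"))); // prints 1
    }
}
```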
If we hit the wrong N here, the subsequent logic:

                    if (existingFileInfo == null) {
                        indexIncrementalFileCount++;
                        indexIncrementalSize += md.length();
                        // create a new FileInfo
                        BlobStoreIndexShardSnapshot.FileInfo snapshotFileInfo =
                            new BlobStoreIndexShardSnapshot.FileInfo(fileNameFromGeneration(++generation), md, chunkSize());
                        indexCommitPointFiles.add(snapshotFileInfo);
                        filesToSnapshot.add(snapshotFileInfo);
                    } else {
                        indexCommitPointFiles.add(existingFileInfo);
                    }
                }

                snapshotStatus.moveToStarted(startTime, indexIncrementalFileCount,
                    indexTotalNumberOfFiles, indexIncrementalSize, indexTotalFileCount);

                for (BlobStoreIndexShardSnapshot.FileInfo snapshotFileInfo : filesToSnapshot) {
                    try {
                        snapshotFile(snapshotFileInfo);

could produce incorrect values for generation, causing shard data file names to collide.
Add to that the fact that the S3 repository has no failIfAlreadyExists logic in place, unlike the other snapshot repositories (see #36927 for details), and one could overwrite shard data files in this scenario and corrupt the repository, as far as I can see.
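To illustrate the scenario, here is a hypothetical sketch (the naming helper is a simplified stand-in; only the double-underscore data-blob prefix is taken from this issue and the follow-up commits): two snapshot operations that derived the same stale generation pick identical data blob names, and without a failIfAlreadyExists check the second write silently replaces the first.

```java
// Hypothetical stand-in for incremental data-blob naming; the exact encoding of
// the generation does not matter for the collision argument.
public class BlobNameCollisionSketch {

    static String fileNameFromGeneration(long generation) {
        return "__" + generation;
    }

    public static void main(String[] args) {
        long staleGeneration = 41; // both snapshots listed the same outdated index-{N}
        String firstSnapshotBlob = fileNameFromGeneration(staleGeneration + 1);
        String secondSnapshotBlob = fileNameFromGeneration(staleGeneration + 1);
        // Prints true: both operations target the same key, and on S3, without a
        // failIfAlreadyExists precondition, the second put overwrites the first blob.
        System.out.println(firstSnapshotBlob.equals(secondSnapshotBlob));
    }
}
```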

@ywelsch @tlrx maybe I'm missing some hidden step here that prevents getting the wrong N and potentially overwriting shard data files?

original-brownbear added the :Distributed Coordination/Snapshot/Restore label Feb 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Member Author

original-brownbear commented Feb 17, 2019

@ywelsch I'm marking this as a bug now that our talk about this confirmed the problem. I agree that this is vastly more likely in the delete-then-snapshot scenario than for consecutive snapshots, but regardless we must find a way to remove our reliance on the list operation here to stabilize things.
I'll come up with a plan shortly.

original-brownbear self-assigned this Feb 17, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 25, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 29, 2019
original-brownbear added a commit that referenced this issue Apr 2, 2019
* Name Snapshot Data Blobs by UUID

* There is no functional reason why we need incremental naming for these files, but
  * As explained in #38941 it is a possible source of repository corruption
  * It wastes API calls on the list operation
  * It is just needless complication
* Since we store the exact names of the data blobs in all the metadata anyway, we can make this change without any BwC considerations
  * Even in the worst-case scenario of a downgrade, the functionality keeps working, since the incremental names cannot conflict with the UUIDs and the number parsing used to find the next incremental name suppresses the exception when encountering a non-numeric value after the double-underscore prefix
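A minimal sketch of the UUID-based naming described above, using `java.util.UUID` as a stand-in for Elasticsearch's own UUID helper:

```java
import java.util.UUID;

public class UuidBlobNameSketch {

    // No listing is needed to pick the next name, concurrent writers cannot collide,
    // and the legacy numeric parser skips these names because the suffix after the
    // "__" prefix is not a number.
    static String newDataBlobName() {
        return "__" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(newDataBlobName()); // e.g. __3f1c9a6e-...
    }
}
```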
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 25, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 25, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 25, 2019
original-brownbear added a commit that referenced this issue Apr 25, 2019
original-brownbear added a commit that referenced this issue Apr 25, 2019
original-brownbear added a commit that referenced this issue Apr 25, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
original-brownbear added a commit that referenced this issue Jul 18, 2019
…yTests (#40893)

* Add eventually consistent mock repository for reproducing and testing AWS S3 blob store behavior
* Relates #38941
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 18, 2019
…yTests (elastic#40893)

original-brownbear added a commit that referenced this issue Jul 18, 2019
…yTests (#40893) (#44570)
original-brownbear added a commit that referenced this issue Oct 22, 2019
### Changes to Root-Level index-N (RepositoryData)

This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of that shard's folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` blob for the shard in question).
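A rough sketch of that mapping (types and names simplified; not the actual `RepositoryData` implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Per index id, an array indexed by shard id holding the shard's current
// index-${N} (later: index-${uuid}) generation.
public class ShardGenerationsSketch {

    private final Map<String, String[]> shards = new HashMap<>();

    void setGeneration(String indexId, int shardCount, int shardId, String generation) {
        shards.computeIfAbsent(indexId, k -> new String[shardCount])[shardId] = generation;
    }

    String getGeneration(String indexId, int shardId) {
        return shards.get(indexId)[shardId];
    }

    public static void main(String[] args) {
        ShardGenerationsSketch generations = new ShardGenerationsSketch();
        generations.setGeneration("my-index-uuid", 2, 0, "7");
        // Snapshotting shard 0 can read generation "7" directly, no LIST required.
        System.out.println(generations.getGeneration("my-index-uuid", 0));
    }
}
```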

### Benefits

This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost of updating an empty shard (see #45736) and removes one of the two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).

Also, and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of the shard-level index-N blobs to UUID suffixes for these blobs.

### Only Master Deletes Blobs

This change moves the deletion of the previous shard-level `index-${uuid}` blob from the data node to the master node, allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced previous `index-${uuid}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__

Note also: the previous issues with hanging data nodes interfering with master nodes are no longer possible, even on S3 (see the next section for details).
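A hypothetical, in-memory outline of the ordering described in this section (all names are invented; the map and set stand in for the repository contents): the data node only ever writes a fresh, uniquely named shard index blob, the master first commits the new generation into `RepositoryData`, and only then deletes the previous, now-unreferenced blob.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class MasterOnlyDeleteSketch {

    final Set<String> blobStore = new HashSet<>();              // stand-in for the repository bucket
    final Map<String, String> repositoryData = new HashMap<>(); // shard key -> referenced generation

    String dataNodeWritesNewShardIndex(String shardKey) {
        String newGen = UUID.randomUUID().toString();
        blobStore.add(shardKey + "/index-" + newGen);           // never overwrites an existing blob
        return newGen;
    }

    void masterFinalizes(String shardKey, String previousGen, String newGen) {
        repositoryData.put(shardKey, newGen);                   // 1. the new generation becomes authoritative
        if (previousGen != null) {
            blobStore.remove(shardKey + "/index-" + previousGen); // 2. the old blob is unreferenced -> master deletes it
        }
    }

    public static void main(String[] args) {
        MasterOnlyDeleteSketch repo = new MasterOnlyDeleteSketch();
        String gen1 = repo.dataNodeWritesNewShardIndex("indices/idx/0");
        repo.masterFinalizes("indices/idx/0", null, gen1);
        String gen2 = repo.dataNodeWritesNewShardIndex("indices/idx/0");
        repo.masterFinalizes("indices/idx/0", gen1, gen2);
        System.out.println(repo.blobStore); // only the index-${gen2} blob remains
    }
}
```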

### Why Move from index-${N} to index-${uuid} at the Shard Level

This change switches the naming of the shard-level `index-${N}` blobs to a UUID suffix, `index-${UUID}`. The reason is that writing a new shard-level `index-` generation blob is no longer atomic in its effect: not only does the blob have to be written, it must also be referenced by the root-level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to keep using incrementing names as before. If a blob `index-${N+1}` is written, but the root-level `RepositoryData` has not been updated because the node/network/cluster/... crashed, then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Such overwrites are problematic on S3 for consistency reasons, and they also create numerous issues when thinking about stuck data nodes.
Previously, stuck data nodes that were tasked to write `index-${N+1}` but only tried to do so after some other node had already written `index-${N+1}` were prevented from doing so (except on S3) by us not allowing overwrites for that blob, and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this: the stuck-node scenario would either allow overwriting the `N+1` generation or force us to keep using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With UUID naming and all deletes moved to `master`, this becomes a non-issue. Data nodes write updated shard generations as `index-${uuid}` blobs, and `master` makes the `index-${uuid}` blobs it deems correct part of the `RepositoryData` and cleans up all `index-` blobs that are unused.

Co-authored-by: Yannick Welsch <[email protected]>
Co-authored-by: Tanguy Leroux <[email protected]>
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 23, 2019
original-brownbear added a commit that referenced this issue Oct 23, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 11, 2019
original-brownbear added a commit that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intervening master failover
from causing inconsistent (outdated due to an inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible, motivated by the recently increased chance of #38941
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048
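A hedged sketch of the kind of safeguard this describes (names and mechanics are assumptions, not the actual implementation): remember the highest repository generation ever written or observed, and refuse to act on `RepositoryData` loaded from an older index-N, which would indicate an inconsistent listing.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LatestKnownGenerationSketch {

    private final AtomicLong latestKnownRepoGen = new AtomicLong(-1L);

    void onRepositoryDataWritten(long newGeneration) {
        latestKnownRepoGen.accumulateAndGet(newGeneration, Math::max);
    }

    void ensureNotStale(long loadedGeneration) {
        if (loadedGeneration < latestKnownRepoGen.get()) {
            throw new IllegalStateException("Loaded repository generation [" + loadedGeneration
                + "] is older than the latest known generation [" + latestKnownRepoGen.get() + "]");
        }
    }

    public static void main(String[] args) {
        LatestKnownGenerationSketch tracker = new LatestKnownGenerationSketch();
        tracker.onRepositoryDataWritten(7);
        tracker.ensureNotStale(7); // fine
        tracker.ensureNotStale(6); // throws: the listing returned an outdated index-N
    }
}
```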
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 14, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 14, 2019
original-brownbear added a commit that referenced this issue Nov 15, 2019
original-brownbear added a commit that referenced this issue Nov 15, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 5, 2019
This moves the blob store repository to only use the information
available in the cluster state for loading `RepositoryData`, without
falling back to listing to determine a repository's generation.

Relates elastic#49729
Closes elastic#38941
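A hedged sketch of the end state (invented names; the actual source of the generation would be the repositories metadata in the cluster state): with the current repository generation known up front, loading `RepositoryData` is a direct read of a known blob name rather than a LIST to discover N.

```java
public class ClusterStateGenerationSketch {

    // Placeholder: in the real system this would come from the repositories
    // metadata in the cluster state rather than a hard-coded value.
    long currentGenerationFromClusterState() {
        return 42L;
    }

    String repositoryDataBlobName() {
        return "index-" + currentGenerationFromClusterState(); // no listing involved
    }

    public static void main(String[] args) {
        System.out.println(new ClusterStateGenerationSketch().repositoryDataBlobName()); // index-42
    }
}
```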
@original-brownbear
Member Author

This is already closed by #49729; the outstanding work in #49060 is just an optimization. By tracking the latest generation in the cluster state (even if we do fall back to listing until #49060 is merged) we can no longer read an outdated index-N :) => closed at last.
