
S3 Snapshot Repository Erroneously Assumes Consistent List Operation #38941

Closed
original-brownbear opened this issue Feb 15, 2019 · 3 comments
Assignees
original-brownbear
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@original-brownbear
Member

original-brownbear commented Feb 15, 2019

We do the following when snapshotting a shard in https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L1177:

        public void snapshot(final IndexCommit snapshotIndexCommit) {
            logger.debug("[{}] [{}] snapshot to [{}] ...", shardId, snapshotId, metadata.name());

            final Map<String, BlobMetaData> blobs;
            try {
                blobs = blobContainer.listBlobs();

to find the latest index-{N} file at the root of each shard folder.
There is no guarantee that this file will actually be listed if two snapshots happen in rapid succession, and an inconsistent listing becomes ever more likely the more snapshots one has.
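To make the failure mode concrete, here is a minimal sketch, with invented names rather than the actual `BlobStoreRepository` helpers, of what deriving the generation from a listing implies: if the eventually consistent listing omits the newest `index-{N}` blob, the caller silently works from a stale generation.

```java
import java.util.Set;

// Illustrative sketch only (hypothetical helper, not the real Elasticsearch code):
// derive the shard's latest generation N by parsing "index-{N}" blob names.
public class LatestGenerationSketch {

    static long latestGeneration(Set<String> blobNames) {
        long latest = -1L;
        for (String name : blobNames) {
            if (name.startsWith("index-") == false) {
                continue;
            }
            try {
                latest = Math.max(latest, Long.parseLong(name.substring("index-".length())));
            } catch (NumberFormatException e) {
                // ignore blobs whose suffix is not a number
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // If the eventually consistent listing omits the newest blob "index-2",
        // the caller ends up working from the stale generation 1.
        System.out.println(latestGeneration(Set.of("index-0", "index-1", "snap-abc.dat"))); // prints 1
    }
}
```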
If we hit the wrong N here, the subsequent logic:

                    if (existingFileInfo == null) {
                        indexIncrementalFileCount++;
                        indexIncrementalSize += md.length();
                        // create a new FileInfo
                        BlobStoreIndexShardSnapshot.FileInfo snapshotFileInfo =
                            new BlobStoreIndexShardSnapshot.FileInfo(fileNameFromGeneration(++generation), md, chunkSize());
                        indexCommitPointFiles.add(snapshotFileInfo);
                        filesToSnapshot.add(snapshotFileInfo);
                    } else {
                        indexCommitPointFiles.add(existingFileInfo);
                    }
                }

                snapshotStatus.moveToStarted(startTime, indexIncrementalFileCount,
                    indexTotalNumberOfFiles, indexIncrementalSize, indexTotalFileCount);

                for (BlobStoreIndexShardSnapshot.FileInfo snapshotFileInfo : filesToSnapshot) {
                    try {
                        snapshotFile(snapshotFileInfo);

could produce incorrect values for generation, causing shard data file names to collide.
Add to that the fact that the S3 repository has no failIfAlreadyExists logic in place, unlike the other snapshot repositories (see #36927 for details), and one could overwrite shard data files in this scenario and corrupt the repository, as far as I can see.
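To illustrate the scenario, here is a hypothetical sketch (the naming helper is a simplified stand-in; only the double-underscore data-blob prefix is taken from this issue and the follow-up commits): two snapshot operations that derived the same stale generation pick identical data blob names, and without a failIfAlreadyExists check the second write silently replaces the first.

```java
// Hypothetical stand-in for incremental data-blob naming; the exact encoding of
// the generation does not matter for the collision argument.
public class BlobNameCollisionSketch {

    static String fileNameFromGeneration(long generation) {
        return "__" + generation;
    }

    public static void main(String[] args) {
        long staleGeneration = 41; // both snapshots listed the same outdated index-{N}
        String firstSnapshotBlob = fileNameFromGeneration(staleGeneration + 1);
        String secondSnapshotBlob = fileNameFromGeneration(staleGeneration + 1);
        // Prints true: both operations target the same key, and on S3, without a
        // failIfAlreadyExists precondition, the second put overwrites the first blob.
        System.out.println(firstSnapshotBlob.equals(secondSnapshotBlob));
    }
}
```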

@ywelsch @tlrx maybe I'm missing some hidden step here that prevents getting the wrong N and potentially overwriting shard data files?

original-brownbear added the :Distributed Coordination/Snapshot/Restore label Feb 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Member Author

original-brownbear commented Feb 17, 2019

@ywelsch I'm marking this as a bug now that our talk about this confirmed the problem. I agree that this is vastly more likely in the delete-then-snapshot scenario than for consecutive snapshots, but regardless we must find a way to remove our reliance on the list operation here to stabilize things.
I'll come up with a plan shortly.

original-brownbear self-assigned this Feb 17, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 25, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 29, 2019
original-brownbear added a commit that referenced this issue Apr 2, 2019
* Name Snapshot Data Blobs by UUID

* There is no functional reason why we need incremental naming for these files, but
  * As explained in #38941 it is a possible source of repository corruption
  * It wastes API calls on the list operation
  * It is just needless complication
* Since we store the exact names of the data blobs in all the metadata anyway, we can make this change without any BwC considerations
  * Even in the worst-case scenario of a downgrade, the functionality keeps working, since the incremental names cannot conflict with the UUIDs and the number parsing used to find the next incremental name suppresses the exception when encountering a non-numeric value after the double-underscore prefix
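A minimal sketch of the UUID-based naming described above, using `java.util.UUID` as a stand-in for Elasticsearch's own UUID helper:

```java
import java.util.UUID;

public class UuidBlobNameSketch {

    // No listing is needed to pick the next name, concurrent writers cannot collide,
    // and the legacy numeric parser skips these names because the suffix after the
    // "__" prefix is not a number.
    static String newDataBlobName() {
        return "__" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(newDataBlobName()); // e.g. __3f1c9a6e-...
    }
}
```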
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 25, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 25, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 25, 2019
original-brownbear added a commit that referenced this issue Apr 25, 2019
original-brownbear added a commit that referenced this issue Apr 25, 2019
original-brownbear added a commit that referenced this issue Apr 25, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
original-brownbear added a commit that referenced this issue Jul 18, 2019
…yTests (#40893)

* Add eventually consistent mock repository for reproducing and testing AWS S3 blob store behavior
* Relates #38941
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 18, 2019
…yTests (elastic#40893)

original-brownbear added a commit that referenced this issue Jul 18, 2019
…yTests (#40893) (#44570)
original-brownbear added a commit that referenced this issue Oct 22, 2019
### Changes to Root-Level index-N (RepositoryData)

This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of that shard's folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` blob for the shard in question).
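A rough sketch of that mapping (types and names simplified; not the actual `RepositoryData` implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Per index id, an array indexed by shard id holding the shard's current
// index-${N} (later: index-${uuid}) generation.
public class ShardGenerationsSketch {

    private final Map<String, String[]> shards = new HashMap<>();

    void setGeneration(String indexId, int shardCount, int shardId, String generation) {
        shards.computeIfAbsent(indexId, k -> new String[shardCount])[shardId] = generation;
    }

    String getGeneration(String indexId, int shardId) {
        return shards.get(indexId)[shardId];
    }

    public static void main(String[] args) {
        ShardGenerationsSketch generations = new ShardGenerationsSketch();
        generations.setGeneration("my-index-uuid", 2, 0, "7");
        // Snapshotting shard 0 can read generation "7" directly, no LIST required.
        System.out.println(generations.getGeneration("my-index-uuid", 0));
    }
}
```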

### Benefits

This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost of updating an empty shard (see #45736) and removes one of the two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).

Also, and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of the shard-level index-N blobs to UUID suffixes for these blobs.

### Only Master Deletes Blobs

This change moves the deletion of the previous shard-level `index-${uuid}` blob from the data node to the master node, allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced previous `index-${uuid}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__

Note also: the previous issues with hanging data nodes interfering with master nodes are no longer possible, even on S3 (see the next section for details).
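A hypothetical, in-memory outline of the ordering described in this section (all names are invented; the map and set stand in for the repository contents): the data node only ever writes a fresh, uniquely named shard index blob, the master first commits the new generation into `RepositoryData`, and only then deletes the previous, now-unreferenced blob.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class MasterOnlyDeleteSketch {

    final Set<String> blobStore = new HashSet<>();              // stand-in for the repository bucket
    final Map<String, String> repositoryData = new HashMap<>(); // shard key -> referenced generation

    String dataNodeWritesNewShardIndex(String shardKey) {
        String newGen = UUID.randomUUID().toString();
        blobStore.add(shardKey + "/index-" + newGen);           // never overwrites an existing blob
        return newGen;
    }

    void masterFinalizes(String shardKey, String previousGen, String newGen) {
        repositoryData.put(shardKey, newGen);                   // 1. the new generation becomes authoritative
        if (previousGen != null) {
            blobStore.remove(shardKey + "/index-" + previousGen); // 2. the old blob is unreferenced -> master deletes it
        }
    }

    public static void main(String[] args) {
        MasterOnlyDeleteSketch repo = new MasterOnlyDeleteSketch();
        String gen1 = repo.dataNodeWritesNewShardIndex("indices/idx/0");
        repo.masterFinalizes("indices/idx/0", null, gen1);
        String gen2 = repo.dataNodeWritesNewShardIndex("indices/idx/0");
        repo.masterFinalizes("indices/idx/0", gen1, gen2);
        System.out.println(repo.blobStore); // only the index-${gen2} blob remains
    }
}
```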

### Why Move from index-${N} to index-${uuid} at the Shard Level

This change switches the naming of the shard-level `index-${N}` blobs to a UUID suffix, `index-${UUID}`. The reason is that writing a new shard-level `index-` generation blob is no longer atomic in its effect: not only does the blob have to be written, it must also be referenced by the root-level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to keep using incrementing names as before. If a blob `index-${N+1}` is written, but the root-level `RepositoryData` has not been updated because the node/network/cluster/... crashed, then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Such overwrites are problematic on S3 for consistency reasons, and they also create numerous issues when thinking about stuck data nodes.
Previously, stuck data nodes that were tasked to write `index-${N+1}` but only tried to do so after some other node had already written `index-${N+1}` were prevented from doing so (except on S3) by us not allowing overwrites for that blob, and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this: the stuck-node scenario would either allow overwriting the `N+1` generation or force us to keep using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With UUID naming and all deletes moved to `master`, this becomes a non-issue. Data nodes write updated shard generations as `index-${uuid}` blobs, and `master` makes the `index-${uuid}` blobs it deems correct part of the `RepositoryData` and cleans up all `index-` blobs that are unused.

Co-authored-by: Yannick Welsch <[email protected]>
Co-authored-by: Tanguy Leroux <[email protected]>
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 23, 2019
original-brownbear added a commit that referenced this issue Oct 23, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 11, 2019
original-brownbear added a commit that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intervening master failover
from causing inconsistent (outdated due to an inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible, motivated by the recently increased chance of #38941
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048
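A hedged sketch of the kind of safeguard this describes (names and mechanics are assumptions, not the actual implementation): remember the highest repository generation ever written or observed, and refuse to act on `RepositoryData` loaded from an older index-N, which would indicate an inconsistent listing.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LatestKnownGenerationSketch {

    private final AtomicLong latestKnownRepoGen = new AtomicLong(-1L);

    void onRepositoryDataWritten(long newGeneration) {
        latestKnownRepoGen.accumulateAndGet(newGeneration, Math::max);
    }

    void ensureNotStale(long loadedGeneration) {
        if (loadedGeneration < latestKnownRepoGen.get()) {
            throw new IllegalStateException("Loaded repository generation [" + loadedGeneration
                + "] is older than the latest known generation [" + latestKnownRepoGen.get() + "]");
        }
    }

    public static void main(String[] args) {
        LatestKnownGenerationSketch tracker = new LatestKnownGenerationSketch();
        tracker.onRepositoryDataWritten(7);
        tracker.ensureNotStale(7); // fine
        tracker.ensureNotStale(6); // throws: the listing returned an outdated index-N
    }
}
```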
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 14, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 14, 2019
original-brownbear added a commit that referenced this issue Nov 15, 2019
original-brownbear added a commit that referenced this issue Nov 15, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 5, 2019
This moves the blob store repository to only use the information
available in the cluster state for loading `RepositoryData`, without
falling back to listing to determine a repository's generation.

Relates elastic#49729
Closes elastic#38941
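A hedged sketch of the end state (invented names; the actual source of the generation would be the repositories metadata in the cluster state): with the current repository generation known up front, loading `RepositoryData` is a direct read of a known blob name rather than a LIST to discover N.

```java
public class ClusterStateGenerationSketch {

    // Placeholder: in the real system this would come from the repositories
    // metadata in the cluster state rather than a hard-coded value.
    long currentGenerationFromClusterState() {
        return 42L;
    }

    String repositoryDataBlobName() {
        return "index-" + currentGenerationFromClusterState(); // no listing involved
    }

    public static void main(String[] args) {
        System.out.println(new ClusterStateGenerationSketch().repositoryDataBlobName()); // index-42
    }
}
```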
@original-brownbear
Member Author

This is already closed by #49729; the outstanding work in #49060 is just an optimization. By tracking the latest generation in the cluster state (even if we do fall back to listing until #49060 is merged) we can no longer read an outdated index-N :) => closed at last.
