Fail snapshot operations early on repository corruption #30140

tlrx · 2018-04-25T16:44:20Z

Some users have reported a bug (#24477, #29649) where a NullPointerException is thrown when trying to create or to delete a snapshot in a repository:

The NullPointerException is thrown when writing a new version of the snapshots index file:

failed to finalize snapshot java.lang.NullPointerException: null
at org.elasticsearch.repositories.RepositoryData.snapshotsToXContent(RepositoryData.java:341) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeIndexGen(BlobStoreRepository.java:674) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.finalizeSnapshot(BlobStoreRepository.java:470) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]

It happens in the RepositoryData.snapshotsToXContent() method, when it writes the list of snapshots that an index belongs to:

for (final SnapshotId snapshotId : snapshotIds) {
    builder.value(snapshotId.getUUID());
}

And in this case snapshotId is null.

It took me some time to reproduce it and I finally found a scenario where this NPE can happen. Both issues where reported when moving data from a cluster in version 5.2/5.3/5.4 to a more recent version (after 5.5.0). The snapshots index file format changed in #24477 and if the repository is misused the NPE is thrown.

Let's say that the repository was created on 5.3.1 with two snapshots. The index-1 file looks like:

{
    "indices": {
        "docs": {
            "snapshots": [
                {
                    "uuid": "Asib-5XpRPqBGZNJhVV1Tw",
                    "name": "snap-1"
                },
                {
                    "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
                    "name": "snap-2"
                }
            ],
            "id": "fjfzV59tS2q-H2mtb8pdDA"
        }
    },
    "snapshots": [
        {
            "uuid": "Asib-5XpRPqBGZNJhVV1Tw",
            "name": "snap-1"
        },
        {
            "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
            "name": "snap-2"
        }
    ]
}

As written in the docs, this repository must only be used by a unique cluster in 5.3.1 version for creating or deleting snapshots. Other clusters can potentially access to this repository but they must register the repository in read-only mode.

If it's not the case, a cluster in a different recent version will write a new snapshots index file in a format that is unknown from the 5.3.1 cluster.

For example, a cluster in 5.6.9 register the same repository and creates a new snapshot. The snapshots index file is now index-2 and contains:

{
    "indices": {
        "docs": {
            "snapshots": [
                "Asib-5XpRPqBGZNJhVV1Tw",
                "kIIR10OrQsyqWsMdRRu8Bg",
                "I4SHGNFGRzqgu-iIaJMgzg"
            ],
            "id": "fjfzV59tS2q-H2mtb8pdDA"
        }
    },
    "snapshots": [
        {
            "state": 1,
            "uuid": "I4SHGNFGRzqgu-iIaJMgzg",
            "name": "snap-3"
        },
        {
            "uuid": "Asib-5XpRPqBGZNJhVV1Tw",
            "name": "snap-1"
        },
        {
            "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
            "name": "snap-2"
        }
    ]
}

The pull request #24477 changed the way each index references the snapshots it belongs to in the file. It now only contains the snapshot UUID (and not the full name/UUID).

Now, if a snapshot is deleted (or a new one is created) using the 5.3.1 cluster the index file is broken:

{
    "indices": {
        "docs": {
            "snapshots": [
                {
                    "uuid": "Asib-5XpRPqBGZNJhVV1Tw",
                    "name": "Asib-5XpRPqBGZNJhVV1Tw"
                },
                {
                    "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
                    "name": "kIIR10OrQsyqWsMdRRu8Bg"
                },
                {
                    "uuid": "I4SHGNFGRzqgu-iIaJMgzg",
                    "name": "I4SHGNFGRzqgu-iIaJMgzg"
                }
            ],
            "id": "fjfzV59tS2q-H2mtb8pdDA"
        }
    },
    "snapshots": [
        {
            "uuid": "I4SHGNFGRzqgu-iIaJMgzg",
            "name": "snap-3"
        },
        {
            "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
            "name": "snap-2"
        }
    ]
}

Here is what happened on the 5.3.1 cluster when the snap-1 has been deleted:

When parsing the index file generated by 5.6.9, the 5.3.1 cluster parsed each item of the docs.snapshots array using the SnapshotId.fromXContent() method.

This method considers that when the array item is not an object it is a snapshot name. In 5.3.1, the list of snapshots associated to the docs index became in memory 3 snapshots, and each snapshot has been initialized with a name equal to the UUID.

Then the repository data are modified in memory to remove the snapshot. The snapshot to delete is correctly removed from the list of snapshots (ie, the snapshots array at root level) but it is not removed from the snapshots array at the index level (ie, the docs.snapshots array) because the snapshot is not found: snapshot snap-1/Asib-5XpRPqBGZNJhVV1Tw is not equal to Asib-5XpRPqBGZNJhVV1Tw/Asib-5XpRPqBGZNJhVV1Tw. Then the snapshots index file index-3 is uploaded to the repository.

This index-3 file will not thrown any error when the 5.6.9 cluster reads it. When parsing the file, it correctly associates the indexdocs with the snapshots snap-2 and snap3 (because the UUID are present in the list of snapshots) but it also adds a null reference because the UUID Asib-5XpRPqBGZNJhVV1Tw is unknown.

Now, everytime a snapshot is created or deleted using the 5.6.9 cluster it will fail at the end of the process, when writing back the snapshots index file. This null reference causes a NullPointerException.

This pull request changes the parsing code of the index file so that it now throws an exception if an index references an unknown snapshot. This way it is not possible to create or delete a snapshot when the index file is broken. It also adds a test that would have failed before the change.

closes #24477

ywelsch

I've left a few comments / questions. Regarding PR title, I wonder if it should say "Fail snapshot operations early on repository corruption". This is I think what the goal of this PR is.

ywelsch · 2018-04-26T08:29:20Z

docs/reference/modules/snapshots.asciidoc

 clusters on different major versions trying to write the same repository,
-new snapshots written by one version will not be visible to the other. While
+snapshots written by one version may not be visible to the other. While


I wonder if we need to be harsher here and say that writing to the repository from an older version once it's been written to by a new version can lead to repository corruption.

ywelsch · 2018-04-26T08:51:13Z

docs/CHANGELOG.asciidoc

@@ -24,6 +24,8 @@

 === Bug Fixes

+Fix NullPointerException when creating or deleting a snapshot ({pull}30140[#30140])


can you extend this a little? e.g. Fix NullPointerException when creating or deleting a snapshot on a repository that has been written to by an older Elasticsearch after writing to it with a newer Elasticsearch version.

ywelsch · 2018-04-26T08:52:52Z

server/src/main/java/org/elasticsearch/repositories/RepositoryData.java

@@ -428,12 +423,23 @@ public static RepositoryData snapshotsFromXContent(final XContentParser parser,
                                        // since we already have the name/uuid combo in the snapshots array
                                        uuid = parser.text();
                                    }
-                                    snapshotIds.add(snapshots.get(uuid));
+                                    if (uuid != null) {


should this always be non-null? Should we barf if not?

It should be non null, but I removed the null check so that it will throw the same exception.

ywelsch · 2018-04-26T08:54:43Z

server/src/main/java/org/elasticsearch/repositories/RepositoryData.java

+                                            // A snapshotted index references a snapshot which does not exist in
+                                            // the list of snapshots. This can happen when multiple clusters in
+                                            // different versions create or delete snapshot in the same repository.
+                                            throw new ElasticsearchParseException("Unknown snapshot uuid [" + uuid


I wonder if we should extend the message to say that the repository has become corrupted. Otherwise it's not clear to the user what the problem is.

tlrx · 2018-04-26T11:10:43Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

@@ -592,10 +592,9 @@ private SnapshotInfo inProgressSnapshot(SnapshotsInProgress.Entry entry) {
     * @return map of shard id to snapshot status
     */
    public Map<ShardId, IndexShardSnapshotStatus> snapshotShards(final String repositoryName,
+                                                                 final RepositoryData repositoryData,


Note: this is a small optimization in the case of the Snapshot Status API where the repository index was loaded two times.

tlrx · 2018-04-26T11:37:39Z

Thanks @ywelsch, I updated the code.

ywelsch

LGTM. Thanks for the great analysis on this @tlrx

ywelsch · 2018-04-26T12:22:56Z

docs/CHANGELOG.asciidoc

@@ -24,6 +24,9 @@

 === Bug Fixes

+Fix NullPointerException when creating or deleting a snapshot on a repository that has been


I'm not sure that this qualifies as a bugfix, it rather enhances the failure message. I wonder if it's necessary to add this to the changelog.

I reworded this to "Fail snapshot operations early when creating or deleting a snapshot on a repository...". I still think it's valuable to have it in the CHANGELOG as the behavior is changed so I prefer to keep it.

A NullPointerException is thrown when trying to create or delete a snapshot in a repository that has been used by a pre-5.5 cluster and a post-5.5 cluster. The way snapshots are formatted in the repository snapshots index file changed in elastic#24477 and it can cause a NPE if a very specific set of operations is written in the repository.

tlrx · 2018-04-27T14:33:14Z

Thanks a lot @ywelsch

A NullPointerException is thrown when trying to create or delete a snapshot in a repository that has been written to by an older Elasticsearch after writing to it with a newer Elasticsearch version. This is because the way snapshots are formatted in the repository snapshots index file changed in #24477. This commit changes the parsing of the repository index file so that it now detects a corrupted index file and fails early the snapshot operation. closes #29052

* master: (7173 commits) Bump changelog version to 6.4 (elastic#30217) [DOCS] Adds native realm security settings (elastic#30186) Test: Switch painless test to 1 shard CCS: Drop http address from remote cluster info (elastic#29568) Reindex: Fold "from old" tests into reindex module (elastic#30142) Convert FieldCapabilitiesResponse to a ToXContentObject. (elastic#30182) [DOCS] Added 'on a single shard' to description of max_thread_count. Closes 28518 (elastic#29686) [TEST] Redirect links to new locations (elastic#30179) Move repository-s3 fixture tests to QA test project (elastic#29372) Fail snapshot operations early on repository corruption (elastic#30140) Docs: Document `failures` on reindex and friends Build global ordinals terms bucket from matching ordinals (elastic#30166) Watcher: Ensure mail message ids are unique per watch action (elastic#30112) REST: Remove GET support for clear cache indices (elastic#29525) SQL: Correct error message (elastic#30138) Require acknowledgement to start_trial license (elastic#30135) Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (elastic#30181) SQL: Add BinaryMathProcessor to named writeables list (elastic#30127) Tests: Use buildDir as base for generated-resources (elastic#30191) Fix SliceBuilderTests#testRandom failures ...

tlrx added >enhancement review :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v7.0.0 v6.4.0 labels Apr 25, 2018

tlrx mentioned this pull request Apr 25, 2018

Creating a snapshot fails at the very last stage with NullPointerException #29649

Closed

tlrx requested a review from ywelsch April 25, 2018 16:47

tlrx force-pushed the fix-npe-snapshots branch from c128074 to 07e645a Compare April 25, 2018 18:43

ywelsch reviewed Apr 26, 2018

View reviewed changes

tlrx commented Apr 26, 2018

View reviewed changes

tlrx changed the title ~~Fix NullPointerException when creating or deleting a snapshot~~ Fail snapshot operations early on repository corruption Apr 26, 2018

ywelsch approved these changes Apr 26, 2018

View reviewed changes

tlrx force-pushed the fix-npe-snapshots branch from 53c2e15 to 5301f33 Compare April 26, 2018 15:59

tlrx added 4 commits April 27, 2018 09:36

Update PR number

ff72347

Apply feedback

46d4748

Update Changelog

2177da2

tlrx force-pushed the fix-npe-snapshots branch from 5301f33 to 2177da2 Compare April 27, 2018 07:37

tlrx merged commit 63148dd into elastic:master Apr 27, 2018

tlrx deleted the fix-npe-snapshots branch April 27, 2018 14:33

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fail snapshot operations early on repository corruption #30140

Fail snapshot operations early on repository corruption #30140

tlrx commented Apr 25, 2018 •

edited

Loading

ywelsch left a comment

ywelsch Apr 26, 2018

tlrx Apr 26, 2018

ywelsch Apr 26, 2018

tlrx Apr 26, 2018

ywelsch Apr 26, 2018

tlrx Apr 26, 2018

ywelsch Apr 26, 2018

tlrx Apr 26, 2018

tlrx commented Apr 26, 2018

ywelsch left a comment

ywelsch Apr 26, 2018

tlrx Apr 27, 2018

tlrx commented Apr 27, 2018

		@@ -24,6 +24,8 @@

		=== Bug Fixes

		Fix NullPointerException when creating or deleting a snapshot ({pull}30140[#30140])

		@@ -24,6 +24,9 @@

		=== Bug Fixes

		Fix NullPointerException when creating or deleting a snapshot on a repository that has been

Fail snapshot operations early on repository corruption #30140

Fail snapshot operations early on repository corruption #30140

Conversation

tlrx commented Apr 25, 2018 • edited Loading

ywelsch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlrx commented Apr 26, 2018

ywelsch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlrx commented Apr 27, 2018

tlrx commented Apr 25, 2018 •

edited

Loading