Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail snapshot operations early on repository corruption #30140

Merged
merged 4 commits into from
Apr 27, 2018

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Apr 25, 2018

Some users have reported a bug (#24477, #29649) where a NullPointerException is thrown when trying to create or to delete a snapshot in a repository:

The NullPointerException is thrown when writing a new version of the snapshots index file:

failed to finalize snapshot java.lang.NullPointerException: null
at org.elasticsearch.repositories.RepositoryData.snapshotsToXContent(RepositoryData.java:341) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeIndexGen(BlobStoreRepository.java:674) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.finalizeSnapshot(BlobStoreRepository.java:470) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]

It happens in the RepositoryData.snapshotsToXContent() method, when it writes the list of snapshots that an index belongs to:

for (final SnapshotId snapshotId : snapshotIds) {
    builder.value(snapshotId.getUUID());
}

And in this case snapshotId is null.

It took me some time to reproduce it and I finally found a scenario where this NPE can happen. Both issues where reported when moving data from a cluster in version 5.2/5.3/5.4 to a more recent version (after 5.5.0). The snapshots index file format changed in #24477 and if the repository is misused the NPE is thrown.

Let's say that the repository was created on 5.3.1 with two snapshots. The index-1 file looks like:

{
    "indices": {
        "docs": {
            "snapshots": [
                {
                    "uuid": "Asib-5XpRPqBGZNJhVV1Tw",
                    "name": "snap-1"
                },
                {
                    "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
                    "name": "snap-2"
                }
            ],
            "id": "fjfzV59tS2q-H2mtb8pdDA"
        }
    },
    "snapshots": [
        {
            "uuid": "Asib-5XpRPqBGZNJhVV1Tw",
            "name": "snap-1"
        },
        {
            "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
            "name": "snap-2"
        }
    ]
}

As written in the docs, this repository must only be used by a unique cluster in 5.3.1 version for creating or deleting snapshots. Other clusters can potentially access to this repository but they must register the repository in read-only mode.

If it's not the case, a cluster in a different recent version will write a new snapshots index file in a format that is unknown from the 5.3.1 cluster.

For example, a cluster in 5.6.9 register the same repository and creates a new snapshot. The snapshots index file is now index-2 and contains:

{
    "indices": {
        "docs": {
            "snapshots": [
                "Asib-5XpRPqBGZNJhVV1Tw",
                "kIIR10OrQsyqWsMdRRu8Bg",
                "I4SHGNFGRzqgu-iIaJMgzg"
            ],
            "id": "fjfzV59tS2q-H2mtb8pdDA"
        }
    },
    "snapshots": [
        {
            "state": 1,
            "uuid": "I4SHGNFGRzqgu-iIaJMgzg",
            "name": "snap-3"
        },
        {
            "uuid": "Asib-5XpRPqBGZNJhVV1Tw",
            "name": "snap-1"
        },
        {
            "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
            "name": "snap-2"
        }
    ]
}

The pull request #24477 changed the way each index references the snapshots it belongs to in the file. It now only contains the snapshot UUID (and not the full name/UUID).

Now, if a snapshot is deleted (or a new one is created) using the 5.3.1 cluster the index file is broken:

{
    "indices": {
        "docs": {
            "snapshots": [
                {
                    "uuid": "Asib-5XpRPqBGZNJhVV1Tw",
                    "name": "Asib-5XpRPqBGZNJhVV1Tw"
                },
                {
                    "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
                    "name": "kIIR10OrQsyqWsMdRRu8Bg"
                },
                {
                    "uuid": "I4SHGNFGRzqgu-iIaJMgzg",
                    "name": "I4SHGNFGRzqgu-iIaJMgzg"
                }
            ],
            "id": "fjfzV59tS2q-H2mtb8pdDA"
        }
    },
    "snapshots": [
        {
            "uuid": "I4SHGNFGRzqgu-iIaJMgzg",
            "name": "snap-3"
        },
        {
            "uuid": "kIIR10OrQsyqWsMdRRu8Bg",
            "name": "snap-2"
        }
    ]
}

Here is what happened on the 5.3.1 cluster when the snap-1 has been deleted:

When parsing the index file generated by 5.6.9, the 5.3.1 cluster parsed each item of the docs.snapshots array using the SnapshotId.fromXContent() method.

This method considers that when the array item is not an object it is a snapshot name. In 5.3.1, the list of snapshots associated to the docs index became in memory 3 snapshots, and each snapshot has been initialized with a name equal to the UUID.

Then the repository data are modified in memory to remove the snapshot. The snapshot to delete is correctly removed from the list of snapshots (ie, the snapshots array at root level) but it is not removed from the snapshots array at the index level (ie, the docs.snapshots array) because the snapshot is not found: snapshot snap-1/Asib-5XpRPqBGZNJhVV1Tw is not equal to Asib-5XpRPqBGZNJhVV1Tw/Asib-5XpRPqBGZNJhVV1Tw. Then the snapshots index file index-3 is uploaded to the repository.

This index-3 file will not thrown any error when the 5.6.9 cluster reads it. When parsing the file, it correctly associates the indexdocs with the snapshots snap-2 and snap3 (because the UUID are present in the list of snapshots) but it also adds a null reference because the UUID Asib-5XpRPqBGZNJhVV1Tw is unknown.

Now, everytime a snapshot is created or deleted using the 5.6.9 cluster it will fail at the end of the process, when writing back the snapshots index file. This null reference causes a NullPointerException.

This pull request changes the parsing code of the index file so that it now throws an exception if an index references an unknown snapshot. This way it is not possible to create or delete a snapshot when the index file is broken. It also adds a test that would have failed before the change.

closes #24477

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments / questions. Regarding PR title, I wonder if it should say "Fail snapshot operations early on repository corruption". This is I think what the goal of this PR is.

clusters on different major versions trying to write the same repository,
new snapshots written by one version will not be visible to the other. While
snapshots written by one version may not be visible to the other. While
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to be harsher here and say that writing to the repository from an older version once it's been written to by a new version can lead to repository corruption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

@@ -24,6 +24,8 @@

=== Bug Fixes

Fix NullPointerException when creating or deleting a snapshot ({pull}30140[#30140])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you extend this a little? e.g. Fix NullPointerException when creating or deleting a snapshot on a repository that has been written to by an older Elasticsearch after writing to it with a newer Elasticsearch version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -428,12 +423,23 @@ public static RepositoryData snapshotsFromXContent(final XContentParser parser,
// since we already have the name/uuid combo in the snapshots array
uuid = parser.text();
}
snapshotIds.add(snapshots.get(uuid));
if (uuid != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this always be non-null? Should we barf if not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be non null, but I removed the null check so that it will throw the same exception.

// A snapshotted index references a snapshot which does not exist in
// the list of snapshots. This can happen when multiple clusters in
// different versions create or delete snapshot in the same repository.
throw new ElasticsearchParseException("Unknown snapshot uuid [" + uuid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should extend the message to say that the repository has become corrupted. Otherwise it's not clear to the user what the problem is.

@@ -592,10 +592,9 @@ private SnapshotInfo inProgressSnapshot(SnapshotsInProgress.Entry entry) {
* @return map of shard id to snapshot status
*/
public Map<ShardId, IndexShardSnapshotStatus> snapshotShards(final String repositoryName,
final RepositoryData repositoryData,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is a small optimization in the case of the Snapshot Status API where the repository index was loaded two times.

@tlrx
Copy link
Member Author

tlrx commented Apr 26, 2018

Thanks @ywelsch, I updated the code.

@tlrx tlrx changed the title Fix NullPointerException when creating or deleting a snapshot Fail snapshot operations early on repository corruption Apr 26, 2018
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the great analysis on this @tlrx

@@ -24,6 +24,9 @@

=== Bug Fixes

Fix NullPointerException when creating or deleting a snapshot on a repository that has been
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this qualifies as a bugfix, it rather enhances the failure message. I wonder if it's necessary to add this to the changelog.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded this to "Fail snapshot operations early when creating or deleting a snapshot on a repository...". I still think it's valuable to have it in the CHANGELOG as the behavior is changed so I prefer to keep it.

A NullPointerException is thrown when trying to create or delete
a snapshot in a repository that has been used by a pre-5.5 cluster
and a post-5.5 cluster. The way snapshots are formatted in the
repository snapshots index file changed in elastic#24477 and it can cause
a NPE if a very specific set of operations is written in the repository.
@tlrx tlrx merged commit 63148dd into elastic:master Apr 27, 2018
@tlrx tlrx deleted the fix-npe-snapshots branch April 27, 2018 14:33
@tlrx
Copy link
Member Author

tlrx commented Apr 27, 2018

Thanks a lot @ywelsch

tlrx added a commit that referenced this pull request Apr 27, 2018
A NullPointerException is thrown when trying to create or delete
a snapshot in a repository that has been written to by an older
Elasticsearch after writing to it with a newer Elasticsearch version.

This is because the way snapshots are formatted in the repository
snapshots index file changed in #24477.

This commit changes the parsing of the repository index file so that
it now detects a corrupted index file and fails early the snapshot
operation.

closes #29052
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 27, 2018
* master: (7173 commits)
  Bump changelog version to 6.4 (elastic#30217)
  [DOCS] Adds native realm security settings (elastic#30186)
  Test: Switch painless test to 1 shard
  CCS: Drop http address from remote cluster info (elastic#29568)
  Reindex: Fold "from old" tests into reindex module (elastic#30142)
  Convert FieldCapabilitiesResponse to a ToXContentObject. (elastic#30182)
  [DOCS] Added 'on a single shard' to description of max_thread_count. Closes 28518 (elastic#29686)
  [TEST] Redirect links to new locations (elastic#30179)
  Move repository-s3 fixture tests to QA test project (elastic#29372)
  Fail snapshot operations early on repository corruption (elastic#30140)
  Docs: Document `failures` on reindex and friends
  Build global ordinals terms bucket from matching ordinals (elastic#30166)
  Watcher: Ensure mail message ids are unique per watch action (elastic#30112)
  REST: Remove GET support for clear cache indices (elastic#29525)
  SQL: Correct error message (elastic#30138)
  Require acknowledgement to start_trial license (elastic#30135)
  Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (elastic#30181)
  SQL: Add BinaryMathProcessor to named writeables list (elastic#30127)
  Tests: Use buildDir as base for generated-resources (elastic#30191)
  Fix SliceBuilderTests#testRandom failures
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants