-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail snapshot operations early on repository corruption #30140
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few comments / questions. Regarding PR title, I wonder if it should say "Fail snapshot operations early on repository corruption". This is I think what the goal of this PR is.
clusters on different major versions trying to write the same repository, | ||
new snapshots written by one version will not be visible to the other. While | ||
snapshots written by one version may not be visible to the other. While |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need to be harsher here and say that writing to the repository from an older version once it's been written to by a new version can lead to repository corruption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
docs/CHANGELOG.asciidoc
Outdated
@@ -24,6 +24,8 @@ | |||
|
|||
=== Bug Fixes | |||
|
|||
Fix NullPointerException when creating or deleting a snapshot ({pull}30140[#30140]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you extend this a little? e.g. Fix NullPointerException when creating or deleting a snapshot on a repository that has been written to by an older Elasticsearch after writing to it with a newer Elasticsearch version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@@ -428,12 +423,23 @@ public static RepositoryData snapshotsFromXContent(final XContentParser parser, | |||
// since we already have the name/uuid combo in the snapshots array | |||
uuid = parser.text(); | |||
} | |||
snapshotIds.add(snapshots.get(uuid)); | |||
if (uuid != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this always be non-null? Should we barf if not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be non null, but I removed the null check so that it will throw the same exception.
// A snapshotted index references a snapshot which does not exist in | ||
// the list of snapshots. This can happen when multiple clusters in | ||
// different versions create or delete snapshot in the same repository. | ||
throw new ElasticsearchParseException("Unknown snapshot uuid [" + uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should extend the message to say that the repository has become corrupted. Otherwise it's not clear to the user what the problem is.
@@ -592,10 +592,9 @@ private SnapshotInfo inProgressSnapshot(SnapshotsInProgress.Entry entry) { | |||
* @return map of shard id to snapshot status | |||
*/ | |||
public Map<ShardId, IndexShardSnapshotStatus> snapshotShards(final String repositoryName, | |||
final RepositoryData repositoryData, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is a small optimization in the case of the Snapshot Status API where the repository index was loaded two times.
Thanks @ywelsch, I updated the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the great analysis on this @tlrx
docs/CHANGELOG.asciidoc
Outdated
@@ -24,6 +24,9 @@ | |||
|
|||
=== Bug Fixes | |||
|
|||
Fix NullPointerException when creating or deleting a snapshot on a repository that has been |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this qualifies as a bugfix, it rather enhances the failure message. I wonder if it's necessary to add this to the changelog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworded this to "Fail snapshot operations early when creating or deleting a snapshot on a repository...". I still think it's valuable to have it in the CHANGELOG as the behavior is changed so I prefer to keep it.
A NullPointerException is thrown when trying to create or delete a snapshot in a repository that has been used by a pre-5.5 cluster and a post-5.5 cluster. The way snapshots are formatted in the repository snapshots index file changed in elastic#24477 and it can cause a NPE if a very specific set of operations is written in the repository.
Thanks a lot @ywelsch |
A NullPointerException is thrown when trying to create or delete a snapshot in a repository that has been written to by an older Elasticsearch after writing to it with a newer Elasticsearch version. This is because the way snapshots are formatted in the repository snapshots index file changed in #24477. This commit changes the parsing of the repository index file so that it now detects a corrupted index file and fails early the snapshot operation. closes #29052
* master: (7173 commits) Bump changelog version to 6.4 (elastic#30217) [DOCS] Adds native realm security settings (elastic#30186) Test: Switch painless test to 1 shard CCS: Drop http address from remote cluster info (elastic#29568) Reindex: Fold "from old" tests into reindex module (elastic#30142) Convert FieldCapabilitiesResponse to a ToXContentObject. (elastic#30182) [DOCS] Added 'on a single shard' to description of max_thread_count. Closes 28518 (elastic#29686) [TEST] Redirect links to new locations (elastic#30179) Move repository-s3 fixture tests to QA test project (elastic#29372) Fail snapshot operations early on repository corruption (elastic#30140) Docs: Document `failures` on reindex and friends Build global ordinals terms bucket from matching ordinals (elastic#30166) Watcher: Ensure mail message ids are unique per watch action (elastic#30112) REST: Remove GET support for clear cache indices (elastic#29525) SQL: Correct error message (elastic#30138) Require acknowledgement to start_trial license (elastic#30135) Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (elastic#30181) SQL: Add BinaryMathProcessor to named writeables list (elastic#30127) Tests: Use buildDir as base for generated-resources (elastic#30191) Fix SliceBuilderTests#testRandom failures ...
Some users have reported a bug (#24477, #29649) where a NullPointerException is thrown when trying to create or to delete a snapshot in a repository:
The
NullPointerException
is thrown when writing a new version of the snapshots index file:It happens in the
RepositoryData.snapshotsToXContent()
method, when it writes the list of snapshots that an index belongs to:And in this case
snapshotId
is null.It took me some time to reproduce it and I finally found a scenario where this NPE can happen. Both issues where reported when moving data from a cluster in version 5.2/5.3/5.4 to a more recent version (after 5.5.0). The snapshots index file format changed in #24477 and if the repository is misused the NPE is thrown.
Let's say that the repository was created on 5.3.1 with two snapshots. The index-1 file looks like:
As written in the docs, this repository must only be used by a unique cluster in 5.3.1 version for creating or deleting snapshots. Other clusters can potentially access to this repository but they must register the repository in read-only mode.
If it's not the case, a cluster in a different recent version will write a new snapshots index file in a format that is unknown from the 5.3.1 cluster.
For example, a cluster in 5.6.9 register the same repository and creates a new snapshot. The snapshots index file is now
index-2
and contains:The pull request #24477 changed the way each index references the snapshots it belongs to in the file. It now only contains the snapshot UUID (and not the full name/UUID).
Now, if a snapshot is deleted (or a new one is created) using the 5.3.1 cluster the index file is broken:
Here is what happened on the 5.3.1 cluster when the snap-1 has been deleted:
When parsing the index file generated by 5.6.9, the 5.3.1 cluster parsed each item of the
docs.snapshots
array using the SnapshotId.fromXContent() method.This method considers that when the array item is not an object it is a snapshot name. In 5.3.1, the list of snapshots associated to the
docs
index became in memory 3 snapshots, and each snapshot has been initialized with a name equal to the UUID.Then the repository data are modified in memory to remove the snapshot. The snapshot to delete is correctly removed from the list of snapshots (ie, the
snapshots
array at root level) but it is not removed from the snapshots array at the index level (ie, thedocs.snapshots
array) because the snapshot is not found:snapshot snap-1/Asib-5XpRPqBGZNJhVV1Tw
is not equal toAsib-5XpRPqBGZNJhVV1Tw/Asib-5XpRPqBGZNJhVV1Tw
. Then the snapshots index fileindex-3
is uploaded to the repository.This
index-3
file will not thrown any error when the 5.6.9 cluster reads it. When parsing the file, it correctly associates the indexdocs
with the snapshotssnap-2
andsnap3
(because the UUID are present in the list of snapshots) but it also adds anull
reference because the UUIDAsib-5XpRPqBGZNJhVV1Tw
is unknown.Now, everytime a snapshot is created or deleted using the 5.6.9 cluster it will fail at the end of the process, when writing back the snapshots index file. This
null
reference causes aNullPointerException
.This pull request changes the parsing code of the index file so that it now throws an exception if an index references an unknown snapshot. This way it is not possible to create or delete a snapshot when the index file is broken. It also adds a test that would have failed before the change.
closes #24477