
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS #25941

Merged

2 commits merged into elastic:master on Jul 28, 2017

Conversation

@imotov (Contributor) commented Jul 27, 2017

The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved.

Closes #25878
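
For context, a minimal sketch of the backward-compatible read path this change implies. It is illustrative only, not the PR's actual diff: REASON_IN_STREAM_VERSION is a placeholder for whichever release constant the stream is gated on, and the ShardSnapshotStatus construction is simplified.

    public static ShardSnapshotStatus readFrom(StreamInput in) throws IOException {
        String nodeId = in.readOptionalString();
        State shardState = State.fromValue(in.readByte());
        String reason;
        if (in.getVersion().onOrAfter(REASON_IN_STREAM_VERSION)) {
            // Newer nodes serialize the reason explicitly.
            reason = in.readOptionalString();
        } else {
            // Older streams never carried a reason; a failed shard must still
            // expose a non-null reason downstream, so fall back to "".
            reason = shardState.failed() ? "" : null;
        }
        return new ShardSnapshotStatus(nodeId, shardState, reason);
    }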

@imotov added labels: :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >bug, v5.6.0, v6.0.0-beta1 (Jul 27, 2017)
@imotov requested a review from @ywelsch (Jul 27, 2017, 20:24)
@ywelsch (Contributor) left a comment:

I've left some minor suggestions and a question around JSON deserialization. Thanks @imotov

} else {
    String nodeId = in.readOptionalString();
    State shardState = State.fromValue(in.readByte());
    String reason = shardState.failed() ? "" : null;
@ywelsch (Contributor) commented:

can you add a comment here saying why we set reason to ""?

// Workaround for https://github.com/elastic/elasticsearch/issues/25878
// Some old snapshots might still have null in shard failure reasons
if (snapshotShardFailure.reason == null) {
    snapshotShardFailure.reason = "";
}
@ywelsch (Contributor) commented:

What I don't quite understand: why will it happily parse the reason field if it is null? Currently we parse it using text(); shouldn't that fail, and should we use textOrNull() instead?

@imotov (Contributor, Author) replied:

You are right, it should be textOrNull().
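
To make the text()/textOrNull() distinction concrete, here is a small self-contained sketch. It uses plain Jackson rather than Elasticsearch's XContentParser, so the textOrNull helper below is a hypothetical stand-in for the convention being discussed, not the library method itself.

    import java.io.IOException;
    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;

    public class TextOrNullSketch {
        // Hypothetical stand-in for the textOrNull() convention: return null
        // for a JSON null instead of treating it as a string value.
        static String textOrNull(JsonParser parser) throws IOException {
            return parser.currentToken() == JsonToken.VALUE_NULL ? null : parser.getText();
        }

        public static void main(String[] args) throws IOException {
            try (JsonParser parser = new JsonFactory().createParser("{\"reason\": null}")) {
                parser.nextToken(); // START_OBJECT
                parser.nextToken(); // FIELD_NAME "reason"
                parser.nextToken(); // VALUE_NULL
                // Old snapshot files may carry a null reason; tolerate it and
                // normalize to "" the way the workaround above does.
                String reason = textOrNull(parser);
                System.out.println(reason == null ? "(null, normalized to \"\")" : reason);
            }
        }
    }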

@@ -1128,7 +1128,7 @@ public ClusterState execute(ClusterState currentState) throws Exception {
             for (ObjectObjectCursor<ShardId, ShardSnapshotStatus> shardEntry : snapshotEntry.shards()) {
                 ShardSnapshotStatus status = shardEntry.value;
                 if (!status.state().completed()) {
-                    shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED));
+                    shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED, "aborted"));
@ywelsch (Contributor) commented:

maybe extend the message to "aborted by snapshot deletion"

@@ -135,6 +135,13 @@ public static String blockMasterFromFinalizingSnapshot(final String repositoryName) {
         return masterName;
     }
 
+    public static String blockMasterFromCreatingSnapshot(final String repositoryName) {
@ywelsch (Contributor) commented:

The method name made me think that it would prevent the master from creating a snapshot at all. Maybe we can call it something along the lines of "blockMasterFromFinalizingSnapshot"?

@ywelsch (Contributor) left a comment:

LGTM

@imotov merged commit fe46ef3 into elastic:master on Jul 28, 2017
@imotov (Contributor, Author) commented Jul 28, 2017

@ywelsch thanks a lot for the review and the help! I will let it "cook" in CI on the master branch over the weekend, and if everything goes well I will backport it to 5.6 and update the version in writeTo and readFrom.
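
As a sketch of what that version update implies on the write side (illustrative only; REASON_IN_STREAM_VERSION is the same placeholder constant used in the read-side sketch above, and the field layout is simplified):

    public void writeTo(StreamOutput out) throws IOException {
        out.writeOptionalString(nodeId);
        out.writeByte(shardState.value());
        if (out.getVersion().onOrAfter(REASON_IN_STREAM_VERSION)) {
            // Older nodes do not expect the extra field; a newer reader that
            // receives an old stream relies on the read-side fallback instead.
            out.writeOptionalString(reason);
        }
    }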

imotov added a commit that referenced this pull request on Aug 3, 2017:
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS (#25941). Closes #25878

imotov added a commit that referenced this pull request on Aug 3, 2017:
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS (#25941). Closes #25878
imotov added a commit that referenced this pull request on Aug 3, 2017:
Updating the version in SnapshotsInProgress serialization method to reflect that #25941 was backported to 6.0.0-beta1.

Relates to #25878
imotov added a commit that referenced this pull request on Aug 14, 2017:
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS (#26127)

The failure reasons for snapshot shard failures might not be propagated properly if the master node changes after errors were reported by other data nodes, which causes them to be stored as null in snapshot files. This commit adds a workaround for reading such snapshot files where this information might not have been preserved and makes sure that the reason is not null when the cluster state is received from another master. This is a partial backport of #25941 to 5.6.

Closes #25878
@imotov deleted the issue-25878-null-in-failure-reason branch on May 1, 2020