
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS #25941

Merged

2 commits merged into elastic:master on Jul 28, 2017

Conversation

@imotov (Contributor) commented Jul 27, 2017

The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved.

Closes #25878
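
For context, a minimal sketch of the backward-compatible read path this change implies. It is illustrative only, not the PR's actual diff: REASON_IN_STREAM_VERSION is a placeholder for whichever release constant the stream is gated on, and the ShardSnapshotStatus construction is simplified.

    public static ShardSnapshotStatus readFrom(StreamInput in) throws IOException {
        String nodeId = in.readOptionalString();
        State shardState = State.fromValue(in.readByte());
        String reason;
        if (in.getVersion().onOrAfter(REASON_IN_STREAM_VERSION)) {
            // Newer nodes serialize the reason explicitly.
            reason = in.readOptionalString();
        } else {
            // Older streams never carried a reason; a failed shard must still
            // expose a non-null reason downstream, so fall back to "".
            reason = shardState.failed() ? "" : null;
        }
        return new ShardSnapshotStatus(nodeId, shardState, reason);
    }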

@imotov added labels: :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >bug, v5.6.0, v6.0.0-beta1 (Jul 27, 2017)
@imotov requested a review from @ywelsch (Jul 27, 2017, 20:24)
@ywelsch (Contributor) left a comment:

I've left some minor suggestions and a question around JSON deserialization. Thanks @imotov

} else {
    String nodeId = in.readOptionalString();
    State shardState = State.fromValue(in.readByte());
    String reason = shardState.failed() ? "" : null;
@ywelsch (Contributor) commented:

can you add a comment here saying why we set reason to ""?

// Workaround for https://github.com/elastic/elasticsearch/issues/25878
// Some old snapshots might still have null in shard failure reasons
if (snapshotShardFailure.reason == null) {
    snapshotShardFailure.reason = "";
}
@ywelsch (Contributor) commented:

What I don't quite understand: why will it happily parse the reason field if it is null? Currently we parse it using text(); shouldn't that fail, and should we use textOrNull() instead?

@imotov (Contributor, Author) replied:

You are right, it should be textOrNull().
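
To make the text()/textOrNull() distinction concrete, here is a small self-contained sketch. It uses plain Jackson rather than Elasticsearch's XContentParser, so the textOrNull helper below is a hypothetical stand-in for the convention being discussed, not the library method itself.

    import java.io.IOException;
    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;

    public class TextOrNullSketch {
        // Hypothetical stand-in for the textOrNull() convention: return null
        // for a JSON null instead of treating it as a string value.
        static String textOrNull(JsonParser parser) throws IOException {
            return parser.currentToken() == JsonToken.VALUE_NULL ? null : parser.getText();
        }

        public static void main(String[] args) throws IOException {
            try (JsonParser parser = new JsonFactory().createParser("{\"reason\": null}")) {
                parser.nextToken(); // START_OBJECT
                parser.nextToken(); // FIELD_NAME "reason"
                parser.nextToken(); // VALUE_NULL
                // Old snapshot files may carry a null reason; tolerate it and
                // normalize to "" the way the workaround above does.
                String reason = textOrNull(parser);
                System.out.println(reason == null ? "(null, normalized to \"\")" : reason);
            }
        }
    }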

@@ -1128,7 +1128,7 @@ public ClusterState execute(ClusterState currentState) throws Exception {
             for (ObjectObjectCursor<ShardId, ShardSnapshotStatus> shardEntry : snapshotEntry.shards()) {
                 ShardSnapshotStatus status = shardEntry.value;
                 if (!status.state().completed()) {
-                    shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED));
+                    shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED, "aborted"));
@ywelsch (Contributor) commented:

maybe extend the message to "aborted by snapshot deletion"

@@ -135,6 +135,13 @@ public static String blockMasterFromFinalizingSnapshot(final String repositoryName) {
         return masterName;
     }
 
+    public static String blockMasterFromCreatingSnapshot(final String repositoryName) {
@ywelsch (Contributor) commented:

The method name made me think that it would prevent the master from creating a snapshot at all. Maybe we can call it something along the lines of "blockMasterFromFinalizingSnapshot"?

@ywelsch (Contributor) left a comment:

LGTM

@imotov merged commit fe46ef3 into elastic:master on Jul 28, 2017
@imotov (Contributor, Author) commented Jul 28, 2017

@ywelsch thanks a lot for the review and the help! I will let it "cook" in CI on the master branch over the weekend, and if everything goes well I will backport it to 5.6 and update the version in writeTo and readFrom.
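
As a sketch of what that version update implies on the write side (illustrative only; REASON_IN_STREAM_VERSION is the same placeholder constant used in the read-side sketch above, and the field layout is simplified):

    public void writeTo(StreamOutput out) throws IOException {
        out.writeOptionalString(nodeId);
        out.writeByte(shardState.value());
        if (out.getVersion().onOrAfter(REASON_IN_STREAM_VERSION)) {
            // Older nodes do not expect the extra field; a newer reader that
            // receives an old stream relies on the read-side fallback instead.
            out.writeOptionalString(reason);
        }
    }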

imotov added a commit that referenced this pull request on Aug 3, 2017:
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS (#25941). Closes #25878

imotov added a commit that referenced this pull request on Aug 3, 2017:
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS (#25941). Closes #25878
imotov added a commit that referenced this pull request on Aug 3, 2017:
Updating the version in SnapshotsInProgress serialization method to reflect that #25941 was backported to 6.0.0-beta1.

Relates to #25878
imotov added a commit that referenced this pull request on Aug 14, 2017:
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS (#26127)

The failure reasons for snapshot shard failures might not be propagated properly if the master node changes after errors were reported by other data nodes, which causes them to be stored as null in snapshot files. This commit adds a workaround for reading such snapshot files where this information might not have been preserved and makes sure that the reason is not null when the cluster state is received from another master. This is a partial backport of #25941 to 5.6.

Closes #25878
@imotov deleted the issue-25878-null-in-failure-reason branch on May 1, 2020