SNAPSHOT: Keep SnapshotsInProgress State in Sync with Routing Table #35710

Conversation

@original-brownbear (Member) commented Nov 19, 2018

  • Keep SnapshotsInProgress state in sync with routing table
  • Handle being unable to get shard from indicesService gracefully
  • Add assertions about the consistency of SnapshotsInProgress with the routing table

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@original-brownbear changed the title from "[WIP] Snapshot Resilience Improvements" to "Snapshot Resilience Improvements" on Nov 20, 2018
@original-brownbear (Member Author) commented:

@ywelsch

I think this is good for a review if you have some time.
I organized the signature of updateWithRoutingTable and the assertion on the state a little differently than in your initial approach. I like this better because you get an assertion on the full SnapshotsInProgress state instead of failing right on the first shard entry that is off => you get a little more information out of the assert message. Also, I found updateWithRoutingTable a little easier to digest this way, but let me know what you think :)
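For illustration, here is a minimal sketch of what such a whole-state assertion could look like (a sketch only; the body and iteration details are my assumption rather than the PR's final code, and imports are omitted):

// Validate every in-progress shard snapshot against the routing table and collect all
// mismatches, so the assert message describes the whole inconsistent state instead of
// failing on the first bad shard entry.
private static boolean assertConsistency(ClusterState state) {
    final SnapshotsInProgress snapshots = state.custom(SnapshotsInProgress.TYPE);
    if (snapshots == null) {
        return true;
    }
    final List<String> problems = new ArrayList<>();
    for (SnapshotsInProgress.Entry entry : snapshots.entries()) {
        for (ObjectObjectCursor<ShardId, SnapshotsInProgress.ShardSnapshotStatus> shard : entry.shards()) {
            if (shard.value.state().completed() == false
                && state.routingTable().shardRoutingTableOrNull(shard.key) == null) {
                problems.add("no routing table entry for in-progress shard " + shard.key);
            }
        }
    }
    assert problems.isEmpty() : "SnapshotsInProgress out of sync with routing table: " + problems;
    return true;
}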

@ywelsch self-requested a review on November 20, 2018 10:09
@original-brownbear (Member Author) commented:

@ywelsch looks like it's green now :)

I had to relax the overall state assertion a little, though, and not run it when the master was just elected (see https://github.com/elastic/elasticsearch/pull/35710/files#diff-a0853be4492c052f24917b5c1464003dR662).

This is a result of the cleanup action at https://github.com/elastic/elasticsearch/pull/35710/files#diff-a0853be4492c052f24917b5c1464003dR654 not happening in that case, as far as I can tell. You refactored that logic in your branch too, and it's fine there, but I guess we can leave that for the next round?

(PS: I also ran :server:test and :server:integTest in a loop locally for a few hours today without any issues with this code :))

@ywelsch (Contributor) left a comment

I've left mostly smaller comments on the PR. Can you also rename the PR to better reflect what's been done here, namely to keep SnapshotsInProgress state in sync with routing table?

@@ -657,12 +659,28 @@ public void applyClusterState(ClusterChangedEvent event) {
}
removeFinishedSnapshotFromClusterState(event);
finalizeSnapshotDeletionFromPreviousMaster(event);
// TODO org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testDeleteOrphanSnapshot fails right after election here
assert event.previousState().nodes().isLocalNodeElectedMaster() != false || assertConsistency(event.state());
Contributor:

X != false <==> X
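In other words, the assertion simplifies to:

assert event.previousState().nodes().isLocalNodeElectedMaster() || assertConsistency(event.state());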

Contributor:

Note that we'll have a similar problem when transitioning from a master that did not keep the routing table in sync to a master that now does. We will probably have to schedule a clean-up task that brings these two in sync again. There's no guarantee that that task will run before any other tasks, so the assertion might not only be violated while event.previousState().nodes().isLocalNodeElectedMaster() == false, but also on some follow-up cluster state changes. I'm wondering if we need to introduce some state to SnapshotsService to say whether we've properly cleaned up as a master and then make that assertion dependent on that boolean variable. Something we can look into in a follow-up.
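A rough sketch of that idea (field name, placement, and the clean-up hook are hypothetical, not code from this PR):

// Hypothetical flag in SnapshotsService: true once the newly elected master has run the
// clean-up that brings SnapshotsInProgress back in sync with the routing table.
private volatile boolean snapshotStateSyncedAsMaster;

// inside applyClusterState(ClusterChangedEvent event):
if (event.localNodeMaster() && event.previousState().nodes().isLocalNodeElectedMaster() == false) {
    snapshotStateSyncedAsMaster = false; // state may have drifted under the previous master
    // submit a clean-up cluster state update task here; set the flag to true once it has run
}
// only enforce the consistency assertion once we know the clean-up has happened
assert snapshotStateSyncedAsMaster == false || assertConsistency(event.state());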

ShardSnapshotStatus currentStatus = shards.get(shardId);
if (currentStatus != null && currentStatus.state().completed() == false) {
final ShardSnapshotStatus newStatus = Optional
.ofNullable(newRoutingTable.shardRoutingTableOrNull(shardId))
Contributor:

I wonder if we can always expect the corresponding shard routing table to exist?

Member Author:

I don't know but can't a ShardId that we got from the callback org.elasticsearch.cluster.routing.RoutingChangesObserver.AbstractChangedShardObserver#shardFailed lead to a null entry here? (I could be way off here ... not sure about the order of these things yet)

@ywelsch (Contributor) commented Nov 22, 2018

We are never adding or removing keys in the shard routing table during the AllocationService.reroute() phase. Can you add an assertion that the routing table is not null, but still keep the extra safety logic around?
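Something along these lines, for example (a sketch only; everything beyond the identifiers already shown in the diff is a placeholder):

final IndexShardRoutingTable shardRoutingTable = newRoutingTable.shardRoutingTableOrNull(shardId);
// AllocationService.reroute() does not add or remove keys in the routing table, so a missing
// entry would indicate a bug; surface it in tests via the assertion ...
assert shardRoutingTable != null : "no shard routing table entry for " + shardId;
// ... while production code still degrades gracefully if the entry is ever absent
final ShardRouting primaryShardRouting = shardRoutingTable == null ? null : shardRoutingTable.primaryShard();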

newStatus = currentStatus;
break;
}
} else if (currentState == State.INIT || currentStatus.state() == State.ABORTED) {
Contributor:

Can you add a comment that we are in the relocating state here?

Member Author:

Added the assertion for that below.

Contributor:

Perhaps move the assertion one level up, i.e.,

} else {
  assert primaryShardRouting.relocating();
  if (currentState == State.INIT || currentStatus.state() == State.ABORTED) {
    ..
  } else {
    ..
  }
}

@original-brownbear changed the title from "Snapshot Resilience Improvements" to "SNAPSHOT: Keep SnapshotsInProgress State in Sync with Routing Table" on Nov 21, 2018
@original-brownbear (Member Author) commented:

@ywelsch all points addressed I think :)

@original-brownbear (Member Author) commented:

@ywelsch ping (if you have a second) :)

@ywelsch (Contributor) left a comment

2 nits, looks good otherwise.

@original-brownbear (Member Author) commented:

@ywelsch thanks! Nits handled => merging :)

@original-brownbear merged commit 2efffab into elastic:feature/snapshot-resilience on Nov 22, 2018
@original-brownbear deleted the repro-32265 branch on November 22, 2018 19:30
Labels: >bug, :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)