-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNAPSHOT: Keep SnapshotsInProgress State in Sync with Routing Table #35710
SNAPSHOT: Keep SnapshotsInProgress State in Sync with Routing Table #35710
Conversation
Pinging @elastic/es-distributed |
I think this is good for a review if you have some time. |
@ywelsch looks like it's green now :) I had to relax the overall state assertion a little though and not run it when the master was just elected (here https://github.com/elastic/elasticsearch/pull/35710/files#diff-a0853be4492c052f24917b5c1464003dR662) This is a result of the cleanup action here https://github.com/elastic/elasticsearch/pull/35710/files#diff-a0853be4492c052f24917b5c1464003dR654 not happening in this case as far as I can tell. You refactored that logic too in your branch and it's fine there but I guess we can leave that for the next round? (PS: Also ran |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left mostly smaller comments on the PR. Can you also rename the PR to better reflect what's been done here, namely to keep SnapshotsInProgress state in sync with routing table?
server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/routing/allocation/RoutingAllocation.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/snapshots/SnapshotShardsService.java
Outdated
Show resolved
Hide resolved
@@ -657,12 +659,28 @@ public void applyClusterState(ClusterChangedEvent event) { | |||
} | |||
removeFinishedSnapshotFromClusterState(event); | |||
finalizeSnapshotDeletionFromPreviousMaster(event); | |||
// TODO org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testDeleteOrphanSnapshot fails right after election here | |||
assert event.previousState().nodes().isLocalNodeElectedMaster() != false || assertConsistency(event.state()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X != false <==> X
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we'll have a similar problem when transitioning from a master that did not keep the routing table in sync to a master that now does. We will probably have to schedule a clean-up task that brings these two in sync again. There's no guarantee that that task will run before any other tasks, so the assertion might not only be violated while event.previousState().nodes().isLocalNodeElectedMaster() == false
, but also on some follow-up cluster state changes. I'm wondering if we need to introduce some state to SnapshotsService to say whether we've properly cleaned up as a master and then make that assertion dependent on that boolean variable. Something we can look into in a follow-up.
server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
Outdated
Show resolved
Hide resolved
ShardSnapshotStatus currentStatus = shards.get(shardId); | ||
if (currentStatus != null && currentStatus.state().completed() == false) { | ||
final ShardSnapshotStatus newStatus = Optional | ||
.ofNullable(newRoutingTable.shardRoutingTableOrNull(shardId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can expect there to always exist the corresponding shard routing table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know but can't a ShardId
that we got from the callback org.elasticsearch.cluster.routing.RoutingChangesObserver.AbstractChangedShardObserver#shardFailed
lead to a null
entry here? (I could be way off here ... not sure about the order of these things yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are never adding or removing keys in the shard routing table during the AllocationService.reroute()
phase. Can you add an assertion that the routing table is not null, but still keep the extra safety logic around?
server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
Outdated
Show resolved
Hide resolved
newStatus = currentStatus; | ||
break; | ||
} | ||
} else if (currentState == State.INIT || currentStatus.state() == State.ABORTED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment that we are in the relocating state here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the assertion for that below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps move the assertion one level up, i.e.,
} else {
assert priamryShardRouting.relocating();
if (currentState == State.INIT || currentStatus.state() == State.ABORTED) {
..
} else {
..
}
}
@ywelsch all points addressed I think :) |
@ywelsch ping (if you have a second) :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 nits, looks good otherwise.
@ywelsch thanks!, nits handled => merging :) |
SnapshotsInProgress
state in sync with routing tableSnapShotsInProgress
with routing table