SNAPSHOT: Keep SnapshotsInProgress State in Sync with Routing Table #35710

Conversation

@original-brownbear (Member) commented Nov 19, 2018

  • Keep SnapshotsInProgress state in sync with routing table
  • Handle being unable to get shard from indicesService gracefully
  • Add assertions about the consistency of SnapshotsInProgress with the routing table

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@original-brownbear changed the title from "[WIP] Snapshot Resilience Improvements" to "Snapshot Resilience Improvements" on Nov 20, 2018
@original-brownbear (Member Author) commented:

@ywelsch

I think this is good for a review if you have some time.
I organized the signature of updateWithRoutingTable and the assertion on the state a little differently than in your initial approach. I like this better because you get an assertion on the full SnapshotsInProgress state instead of failing right on the first shard entry that is off => you get a little more information out of the assert message. Also, I found updateWithRoutingTable a little easier to digest this way, but let me know what you think :)
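For illustration, here is a minimal sketch of what such a whole-state assertion could look like (a sketch only; the body and iteration details are my assumption rather than the PR's final code, and imports are omitted):

// Validate every in-progress shard snapshot against the routing table and collect all
// mismatches, so the assert message describes the whole inconsistent state instead of
// failing on the first bad shard entry.
private static boolean assertConsistency(ClusterState state) {
    final SnapshotsInProgress snapshots = state.custom(SnapshotsInProgress.TYPE);
    if (snapshots == null) {
        return true;
    }
    final List<String> problems = new ArrayList<>();
    for (SnapshotsInProgress.Entry entry : snapshots.entries()) {
        for (ObjectObjectCursor<ShardId, SnapshotsInProgress.ShardSnapshotStatus> shard : entry.shards()) {
            if (shard.value.state().completed() == false
                && state.routingTable().shardRoutingTableOrNull(shard.key) == null) {
                problems.add("no routing table entry for in-progress shard " + shard.key);
            }
        }
    }
    assert problems.isEmpty() : "SnapshotsInProgress out of sync with routing table: " + problems;
    return true;
}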

@ywelsch self-requested a review on November 20, 2018 10:09
@original-brownbear (Member Author) commented:

@ywelsch looks like it's green now :)

I had to relax the overall state assertion a little, though, and not run it when the master was just elected (see https://github.com/elastic/elasticsearch/pull/35710/files#diff-a0853be4492c052f24917b5c1464003dR662).

This is a result of the cleanup action at https://github.com/elastic/elasticsearch/pull/35710/files#diff-a0853be4492c052f24917b5c1464003dR654 not happening in that case, as far as I can tell. You refactored that logic in your branch too, and it's fine there, but I guess we can leave that for the next round?

(PS: I also ran :server:test and :server:integTest in a loop locally for a few hours today without any issues with this code :))

@ywelsch (Contributor) left a comment

I've left mostly smaller comments on the PR. Can you also rename the PR to better reflect what's been done here, namely to keep SnapshotsInProgress state in sync with routing table?

@@ -657,12 +659,28 @@ public void applyClusterState(ClusterChangedEvent event) {
}
removeFinishedSnapshotFromClusterState(event);
finalizeSnapshotDeletionFromPreviousMaster(event);
// TODO org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testDeleteOrphanSnapshot fails right after election here
assert event.previousState().nodes().isLocalNodeElectedMaster() != false || assertConsistency(event.state());
Contributor:

X != false <==> X
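In other words, the assertion simplifies to:

assert event.previousState().nodes().isLocalNodeElectedMaster() || assertConsistency(event.state());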

Contributor:

Note that we'll have a similar problem when transitioning from a master that did not keep the routing table in sync to a master that now does. We will probably have to schedule a clean-up task that brings these two in sync again. There's no guarantee that that task will run before any other tasks, so the assertion might not only be violated while event.previousState().nodes().isLocalNodeElectedMaster() == false, but also on some follow-up cluster state changes. I'm wondering if we need to introduce some state to SnapshotsService to say whether we've properly cleaned up as a master and then make that assertion dependent on that boolean variable. Something we can look into in a follow-up.
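A rough sketch of that idea (field name, placement, and the clean-up hook are hypothetical, not code from this PR):

// Hypothetical flag in SnapshotsService: true once the newly elected master has run the
// clean-up that brings SnapshotsInProgress back in sync with the routing table.
private volatile boolean snapshotStateSyncedAsMaster;

// inside applyClusterState(ClusterChangedEvent event):
if (event.localNodeMaster() && event.previousState().nodes().isLocalNodeElectedMaster() == false) {
    snapshotStateSyncedAsMaster = false; // state may have drifted under the previous master
    // submit a clean-up cluster state update task here; set the flag to true once it has run
}
// only enforce the consistency assertion once we know the clean-up has happened
assert snapshotStateSyncedAsMaster == false || assertConsistency(event.state());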

ShardSnapshotStatus currentStatus = shards.get(shardId);
if (currentStatus != null && currentStatus.state().completed() == false) {
final ShardSnapshotStatus newStatus = Optional
.ofNullable(newRoutingTable.shardRoutingTableOrNull(shardId))
Contributor:

I wonder if we can always expect the corresponding shard routing table to exist?

Member Author:

I don't know but can't a ShardId that we got from the callback org.elasticsearch.cluster.routing.RoutingChangesObserver.AbstractChangedShardObserver#shardFailed lead to a null entry here? (I could be way off here ... not sure about the order of these things yet)

@ywelsch (Contributor) commented Nov 22, 2018

We are never adding or removing keys in the shard routing table during the AllocationService.reroute() phase. Can you add an assertion that the routing table is not null, but still keep the extra safety logic around?
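Something along these lines, for example (a sketch only; everything beyond the identifiers already shown in the diff is a placeholder):

final IndexShardRoutingTable shardRoutingTable = newRoutingTable.shardRoutingTableOrNull(shardId);
// AllocationService.reroute() does not add or remove keys in the routing table, so a missing
// entry would indicate a bug; surface it in tests via the assertion ...
assert shardRoutingTable != null : "no shard routing table entry for " + shardId;
// ... while production code still degrades gracefully if the entry is ever absent
final ShardRouting primaryShardRouting = shardRoutingTable == null ? null : shardRoutingTable.primaryShard();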

newStatus = currentStatus;
break;
}
} else if (currentState == State.INIT || currentStatus.state() == State.ABORTED) {
Contributor:

Can you add a comment that we are in the relocating state here?

Member Author:

Added the assertion for that below.

Contributor:

Perhaps move the assertion one level up, i.e.,

} else {
  assert primaryShardRouting.relocating();
  if (currentState == State.INIT || currentStatus.state() == State.ABORTED) {
    ..
  } else {
    ..
  }
}

@original-brownbear changed the title from "Snapshot Resilience Improvements" to "SNAPSHOT: Keep SnapshotsInProgress State in Sync with Routing Table" on Nov 21, 2018
@original-brownbear (Member Author) commented:

@ywelsch all points addressed I think :)

@original-brownbear (Member Author) commented:

@ywelsch ping (if you have a second) :)

@ywelsch (Contributor) left a comment

2 nits, looks good otherwise.

@original-brownbear (Member Author) commented:

@ywelsch thanks! Nits handled => merging :)

@original-brownbear merged commit 2efffab into elastic:feature/snapshot-resilience on Nov 22, 2018
@original-brownbear deleted the repro-32265 branch on November 22, 2018 19:30
Labels: >bug, :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)