[Zen2] Implement state recovery #36013

andrershov · 2018-11-28T17:37:48Z

This PR implements proper metadata recovery for Zen2.

GatewayService is responsible for recovery. In Zen1, GatewayService
creates an instance of Gateway, which is used to reach out to the other
cluster nodes, get their state and calculate the most up-to-date state
based on versions. Gateway then passes this restored state to
GatewayService.GatewayRecoveryListener, which merges the current state
with the restored state, removes the state-not-recovered block, creates
the routing table and performs re-routing.

For Zen2 most of these steps can be omitted because currentState is
what should be used. However, Zen2 still needs to remove the
state-not-recovered block, create the routing table and perform
re-routing. This PR abstracts the work to be done for recovery as a
Runnable created based on the discovery type. It also adds a
GatewayService.RecoverStateUpdateTask class, which is submitted in the
Zen2 case. In the Zen1 case, the submitted task extends
RecoverStateUpdateTask.
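
To illustrate the split between the shared task and the Zen1-specific one, here is a minimal sketch. It does not use the real Elasticsearch classes: `State`, `taskFor`, `Zen1RecoverStateUpdateTask` and the string block name are hypothetical stand-ins, and "merging" is reduced to adopting the higher version.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public final class RecoverySketch {

    // Hypothetical name for the block that marks an unrecovered cluster state.
    public static final String STATE_NOT_RECOVERED = "state-not-recovered";

    /** Hypothetical stand-in for ClusterState: a version plus a set of block names. */
    public static final class State {
        public final long version;
        public final Set<String> blocks;

        public State(long version, Set<String> blocks) {
            this.version = version;
            this.blocks = Collections.unmodifiableSet(new HashSet<>(blocks));
        }
    }

    /** Steps needed for both discovery types. */
    public static class RecoverStateUpdateTask {
        public State execute(State current) {
            Set<String> blocks = new HashSet<>(current.blocks);
            blocks.remove(STATE_NOT_RECOVERED);
            // The real task would also create the routing table and reroute here.
            return new State(current.version, blocks);
        }
    }

    /** Zen1 variant: adopt the newer state fetched from other nodes, then run the shared steps. */
    public static final class Zen1RecoverStateUpdateTask extends RecoverStateUpdateTask {
        private final State restored;

        public Zen1RecoverStateUpdateTask(State restored) {
            this.restored = restored;
        }

        @Override
        public State execute(State current) {
            State base = restored.version > current.version
                ? new State(restored.version, current.blocks)
                : current;
            return super.execute(base);
        }
    }

    /** Picks the task by discovery type; for Zen2 the current state is used as-is. */
    public static RecoverStateUpdateTask taskFor(boolean zen2, State restoredFromPeers) {
        return zen2 ? new RecoverStateUpdateTask() : new Zen1RecoverStateUpdateTask(restoredFromPeers);
    }

    public static void main(String[] args) {
        State current = new State(3, new HashSet<>(Arrays.asList(STATE_NOT_RECOVERED)));
        State restored = new State(7, Collections.<String>emptySet());
        State zen2 = taskFor(true, restored).execute(current);
        State zen1 = taskFor(false, restored).execute(current);
        System.out.println("zen2: version=" + zen2.version + " blocks=" + zen2.blocks);
        System.out.println("zen1: version=" + zen1.version + " blocks=" + zen1.blocks);
    }
}
```

In both cases the block is removed; only the Zen1 task looks at state gathered from other nodes.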

This PR also switches all tests that are already using Zen2 from
InMemoryPersistedState to GatewayMetaState.

elasticmachine · 2018-11-28T17:40:47Z

Pinging @elastic/es-distributed

ywelsch

I've left some comments. Gateway does a little more that we have to take into account as well, e.g. upgrading and archiving unknown or invalid settings, and importing indices as closed when their index metadata has issues.

server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java

server/src/main/java/org/elasticsearch/gateway/GatewayService.java

server/src/test/java/org/elasticsearch/cluster/coordination/PersistedStateIT.java

test/framework/src/main/java/org/elasticsearch/test/discovery/TestZenDiscovery.java

andrershov · 2018-11-30T13:34:14Z

@ywelsch I've significantly reworked my PR to address the things that were missing for Zen2 (upgradeAndArchiveUnknownOrInvalidSettings and closing bad indices). There is now a set of ClusterState updaters (essentially functions from ClusterState to ClusterState), which can be found in the GatewayService.Updaters class. upgradeAndArchiveUnknownOrInvalidSettings and closeBadIndices are among them (so Gateway is no longer responsible for this work). There is also a recoverClusterBlock updater, which was previously inlined in GatewayRecoveryListener for Zen1 and in GatewayMetaStateService for Zen2. I think the resulting code is much cleaner.
As a side effect, since all of these updaters are static functions, each of them can easily be unit-tested; you can find the tests in GatewayServiceTests (two of them, namely testUpgradePersistentSettings and testUpgradeTransientSettings, have migrated from GatewayTests).
I've also made variables final wherever possible in the parts of the code that I've touched.
Could you please make a second pass?
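
The "function from ClusterState to ClusterState" idea can be sketched roughly as follows. This is not the code in GatewayService.Updaters: `State`, `KNOWN_SETTINGS`, `archiveUnknownSettings`, `removeNotRecoveredBlock` and `applyAll` are hypothetical names, and archiving is assumed here to mean prefixing an unknown key with `archived.`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

public final class UpdatersSketch {

    // Hypothetical set of setting keys that this node recognises.
    public static final Set<String> KNOWN_SETTINGS =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("cluster.name")));

    /** Hypothetical stand-in for ClusterState: settings plus a single block flag. */
    public static final class State {
        public final Map<String, String> settings;
        public final boolean notRecoveredBlock;

        public State(Map<String, String> settings, boolean notRecoveredBlock) {
            this.settings = Collections.unmodifiableMap(new LinkedHashMap<>(settings));
            this.notRecoveredBlock = notRecoveredBlock;
        }
    }

    /** Updater: move unknown settings under an "archived." prefix instead of dropping them. */
    public static State archiveUnknownSettings(State state) {
        Map<String, String> updated = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : state.settings.entrySet()) {
            String key = entry.getKey();
            boolean keep = KNOWN_SETTINGS.contains(key) || key.startsWith("archived.");
            updated.put(keep ? key : "archived." + key, entry.getValue());
        }
        return new State(updated, state.notRecoveredBlock);
    }

    /** Updater: clear the block that marks the state as not yet recovered. */
    public static State removeNotRecoveredBlock(State state) {
        return new State(state.settings, false);
    }

    /** Applies the updaters in order, feeding each one the result of the previous one. */
    public static State applyAll(State initial, List<UnaryOperator<State>> updaters) {
        State result = initial;
        for (UnaryOperator<State> updater : updaters) {
            result = updater.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("cluster.name", "test");
        settings.put("foo.bar", "1");
        List<UnaryOperator<State>> updaters = new ArrayList<>();
        updaters.add(UpdatersSketch::archiveUnknownSettings);
        updaters.add(UpdatersSketch::removeNotRecoveredBlock);
        State recovered = applyAll(new State(settings, true), updaters);
        System.out.println(recovered.settings + " notRecoveredBlock=" + recovered.notRecoveredBlock);
    }
}
```

Because every updater is a static, side-effect-free function, a unit test only needs to build an input state and inspect the returned one.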

andrershov · 2018-11-30T13:37:06Z

run gradle build tests 1

ywelsch

Thanks for the iteration @andrershov.

server/src/main/java/org/elasticsearch/gateway/GatewayService.java

ywelsch · 2018-11-30T16:20:37Z

server/src/test/java/org/elasticsearch/cluster/coordination/PersistedStateIT.java

-import static org.hamcrest.Matchers.equalTo;
-
-@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0)
-public class PersistedStateIT extends ESIntegTestCase {


As this class is now gone, can you activate some of the tests that do full cluster restarts? For example, PersistentTasksExecutorFullRestartIT. Find other test classes that reference TestZenDiscovery.USE_ZEN2, see if they have a comment about "no state persistence yet" or something similar, enable them and check whether they pass reliably.

andrershov · 2018-12-03

There are a lot of tests that need to be enabled, and many of them fail and require code changes. I've spent most of the day analyzing the failures and suggest addressing them in a follow-up PR:

Tests that pass w/o code changes:

CreateIndexIT

ClusterStatsIT

FlushIT

CorruptedFileIT

QuorumGatewayIT

GatewayIndexIT (except testDanglingIndices)

Tests that require (known) code changes to pass:

IngestRestartIT - IngestService.innerUpdatePipelines should be updated.

RecoverAfterNodesIT - this test sets autoMinMasterNodes to false, in which case automatic cluster bootstrapping is not performed. Changes to the test code are required.

PersistentTasksExecutorFullRestartIT - PersistentTasksNodeService.clusterChanged should be updated.

Tests that fail and require further investigation:

EnableAssignmentDeciderIT - changes to PersistentTasksNodeService.clusterChanged are not enough; for some reason persistent task assignment is performed before ClusterSettings reflects that assignment is disabled.

PrimaryAllocationIT

GatewayIndexIT.testDanglingIndices

RemoveCorruptedShardCommandIT

andrershov · 2018-12-03T20:39:30Z

@ywelsch Thank you, I've made the required changes. I've also commented on enabling the Zen2 tests; I strongly believe this should be addressed in a follow-up PR.

ywelsch

Thanks for looking at the tests. I've asked for two tiny changes, looks good otherwise.

ywelsch · 2018-12-04T08:46:10Z

server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java

-                .nodes(DiscoveryNodes.builder().add(localNode).localNodeId(localNode.getId()).build())
-                .build();
+    public void applyClusterStateUpdaters() {
+        assert clusterStateUpdatersApplied == false : "applyClusterStateUpdaters must only be called once";


I prefer the previousClusterState.nodes().getLocalNode() == null check to adding another mutable field to this class.
For extra safety, you can assert that transportService.getLocalNode() != null when this is called.
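
A rough sketch of what this suggestion amounts to, not the actual GatewayMetaState code: the "already applied" condition is derived from the state itself rather than from an extra boolean. `State` and the string node id are hypothetical stand-ins, and explicit exceptions replace `assert` so that the checks fire even without `-ea`.

```java
public final class LocalNodeGuardSketch {

    /** Hypothetical stand-in for ClusterState; localNodeId is null until the node is known. */
    public static final class State {
        public final String localNodeId;

        public State(String localNodeId) {
            this.localNodeId = localNodeId;
        }
    }

    // The only mutable field: the state itself. No separate "updaters applied" flag.
    private State previousState = new State(null);

    public void applyClusterStateUpdaters(String localNodeId) {
        if (localNodeId == null) {
            throw new IllegalArgumentException("local node must be known before applying updaters");
        }
        if (previousState.localNodeId != null) {
            throw new IllegalStateException("applyClusterStateUpdaters must only be called once");
        }
        previousState = new State(localNodeId);
    }

    public State getLastAcceptedState() {
        if (previousState.localNodeId == null) {
            throw new IllegalStateException("cluster state is not fully built yet");
        }
        return previousState;
    }

    public static void main(String[] args) {
        LocalNodeGuardSketch metaState = new LocalNodeGuardSketch();
        metaState.applyClusterStateUpdaters("node-1");
        System.out.println("local node: " + metaState.getLastAcceptedState().localNodeId);
    }
}
```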

ywelsch · 2018-12-04T08:46:54Z

server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java

@@ -196,7 +222,7 @@ public long getCurrentTerm() {

    @Override
    public ClusterState getLastAcceptedState() {
-        assert previousClusterState.nodes().getLocalNode() != null : "Call setLocalNode before calling this method";
+        assert clusterStateUpdatersApplied : "Call applyClusterStateUpdaters before calling this method";


Maybe say here that the cluster state is not fully recovered / built yet.

Andrey Ershov added 2 commits November 28, 2018 18:25

Implement proper recovery for Zen2

a6f5231

Gateway is only needed for Zen1

61727a5

andrershov self-assigned this Nov 28, 2018

andrershov added the >enhancement and :Distributed Coordination/Cluster Coordination labels Nov 28, 2018

andrershov requested a review from ywelsch November 28, 2018 17:40

ywelsch mentioned this pull request Nov 28, 2018

A new cluster coordination layer #32006

Closed

61 tasks

ywelsch suggested changes Nov 29, 2018

ywelsch changed the title ~~[Zen2] Implement proper recovery~~ [Zen2] Implement state recovery Nov 29, 2018

Andrey Ershov added 6 commits November 29, 2018 20:11

X || X -> X

bb800a3

Add IndexMetaData blocks to cluster state

b92fee9

updated -> update

dc49d23

Remove PersistedStateIT

016b7e6

Remove InMemoryPersistedState from TestZenDiscovery

7450042

Move methods from Gateway/GatewayMetaState to GatewayService, tests

cfa18ed

Correct Mockito import and formatting

bacd8f6

ywelsch suggested changes Nov 30, 2018

Andrey Ershov added 4 commits December 3, 2018 12:50

ClusterSettings instead of ClusterService

ba9a3c9

GatewayMetaState.applyClusterStateUpdaters

b1e0aa8

logger.error -> logger.info

86597fc

Add license text to newly create test file

5804a6b

ywelsch approved these changes Dec 4, 2018

Andrey Ershov added 2 commits December 4, 2018 12:21

Remove clusterStateUpdatersApplied field

ed0caf9

Fix logger class

9566224

andrershov merged commit 35e3d77 into elastic:zen2 Dec 4, 2018

DaveCTurner mentioned this pull request Dec 31, 2018

Merge master election with state recovery in the case of a full cluster restart #14016

Closed
