[Zen2] Implement state recovery #36013
Pinging @elastic/es-distributed
I've left some comments. Gateway does a little more which we have to take into account as well, e.g. upgrading and archiving unknown or invalid settings, and import indices as closed for which the index metadata has some issues.
server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
server/src/main/java/org/elasticsearch/gateway/GatewayService.java
server/src/test/java/org/elasticsearch/cluster/coordination/PersistedStateIT.java
test/framework/src/main/java/org/elasticsearch/test/discovery/TestZenDiscovery.java
@ywelsch I've sufficiently reworked my PR to address things that were missing for Zen2 (
run gradle build tests 1
Thanks for the iteration @andrershov.
server/src/main/java/org/elasticsearch/gateway/GatewayService.java
import static org.hamcrest.Matchers.equalTo;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0)
public class PersistedStateIT extends ESIntegTestCase {
As this class is now gone, can you activate some of the tests that do full cluster restarts? For example, `PersistentTasksExecutorFullRestartIT`. Find other test classes that reference `TestZenDiscovery.USE_ZEN2`, check whether they have a comment about "no state persistence yet" or something similar, enable them, and see if they reliably pass.
There are a lot of tests that need to be enabled, and many of them fail and require code changes. I've spent most of the day analyzing the failures; I suggest addressing them in a follow-up PR:
Tests that pass w/o code changes:
- CreateIndexIT
- ClusterStatsIT
- FlushIT
- CorruptedFileIT
- QuorumGatewayIT
- GatewayIndexIT (except `testDanglingIndices`)
Tests that require (known) code changes to pass:
- IngestRestartIT - `IngestService.innerUpdatePipelines` should be updated.
- RecoverAfterNodesIT - this test sets autoMinMasterNodes to false, and in this case automatic cluster bootstrapping is not performed. Changes to the test code are required.
- PersistentTaskExecutorFullRestartIT - `PersistentTaskNodeService.clusterChanged` should be updated.
Tests that fail and require further investigation:
- EnableAssignmentDecidedIT - changes to `PersistentTaskNodeService.clusterChanged` are not enough; for some reason persistent task assignment is performed before `ClusterSettings` reflect that assignment is disabled.
- PrimaryAllocationIT
- GatewayIndexIT.testDanglingIndices
- RemoveCorruptedShardCommandIT
@ywelsch Thank you, I've made the required changes. I've also commented on enabling the Zen2 tests; I strongly believe this should be addressed in a follow-up PR.
Thanks for looking at the tests. I've asked for two tiny changes, looks good otherwise.
.nodes(DiscoveryNodes.builder().add(localNode).localNodeId(localNode.getId()).build())
.build();
public void applyClusterStateUpdaters() {
assert clusterStateUpdatersApplied == false : "applyClusterStateUpdaters must only be called once";
I prefer the `previousClusterState.nodes().getLocalNode() == null` check to adding another mutable field to this class. For extra safety, you can assert that `transportService.getLocalNode() != null` when this is called.
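A minimal sketch of that suggestion, for illustration only. `PersistedStateSketch`, `localNodeId`, and `getLastAcceptedVersion` are hypothetical stand-ins rather than the real classes in this PR, and the sketch throws explicit exceptions instead of using `assert` so that its behaviour does not depend on `-ea`. The point is that the "already applied" condition is read off state the class holds anyway (whether the local node is set) instead of being tracked in a second mutable flag.

```java
// Hypothetical stand-in for the persisted-state class under review; these are
// not the real Elasticsearch types.
final class PersistedStateSketch {
    private final long acceptedVersion;
    // Stand-in for previousClusterState.nodes().getLocalNode(): stays null
    // until the cluster state updaters have been applied.
    private String localNodeId;

    PersistedStateSketch(long acceptedVersion) {
        this.acceptedVersion = acceptedVersion;
    }

    void applyClusterStateUpdaters(String transportLocalNodeId) {
        // Extra safety: the transport service must already know the local node.
        if (transportLocalNodeId == null) {
            throw new IllegalStateException("transport service has no local node yet");
        }
        // The "called once" guard is derived from existing state, not a new flag.
        if (localNodeId != null) {
            throw new IllegalStateException("applyClusterStateUpdaters must only be called once");
        }
        localNodeId = transportLocalNodeId;
    }

    long getLastAcceptedVersion() {
        if (localNodeId == null) {
            throw new IllegalStateException("cluster state is not fully recovered / built yet");
        }
        return acceptedVersion;
    }
}
```

With this shape there is no way for a flag and the actual state to disagree, which is the main argument for the null check.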
@@ -196,7 +222,7 @@ public long getCurrentTerm() {

@Override
public ClusterState getLastAcceptedState() {
assert previousClusterState.nodes().getLocalNode() != null : "Call setLocalNode before calling this method";
assert clusterStateUpdatersApplied : "Call applyClusterStateUpdaters before calling this method";
maybe say here that cluster state is not fully recovered / built yet.
This PR implements proper metadata recovery for Zen2.
`GatewayService` is responsible for recovery. In Zen1, `GatewayService` creates an instance of `Gateway`, which is used to reach out to the other cluster nodes, get their state, and calculate the most up-to-date state based on versions. After that, `Gateway` passes this restored state to `GatewayService.GatewayRecoveryListener`, which merges the current state with the restored state, removes the state-not-recovered block, creates the routing table, and performs re-routing.
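The "most up-to-date state based on versions" step can be sketched roughly as follows. This is an assumption-laden illustration, not the actual `Gateway` code: `NodeMetaStateSketch` and `Zen1ElectionSketch` are hypothetical names, each node's reply is reduced to a node id plus a version number, and everything else the real gateway does during recovery is left out.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified view of the state one node reports back.
final class NodeMetaStateSketch {
    final String nodeId;
    final long version;

    NodeMetaStateSketch(String nodeId, long version) {
        this.nodeId = nodeId;
        this.version = version;
    }
}

final class Zen1ElectionSketch {
    // Picks the reported state with the highest version; empty if no node replied.
    static Optional<NodeMetaStateSketch> mostUpToDate(List<NodeMetaStateSketch> states) {
        return states.stream()
            .max(Comparator.comparingLong((NodeMetaStateSketch s) -> s.version));
    }
}
```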
For Zen2, most of these steps can be omitted because `currentState` is what should be used. However, Zen2 still needs to remove the state-not-recovered block, create the routing table, and perform re-routing.
This PR abstracts the work to be done for recovery as a Runnable created based on the discovery type. It also introduces the `GatewayService.RecoverStateUpdateTask` class, which is submitted in the Zen2 case. In the Zen1 case, the submitted task extends `RecoverStateUpdateTask`.
This PR also switches all tests that already use Zen2 from `InMemoryPersistedState` to `GatewayMetaState`.
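To make the shape of that abstraction concrete, here is a rough sketch under stated assumptions. All names (`StateSketch`, `RecoverTaskSketch`, `Zen1RecoverTaskSketch`, `RecoverySketch`) are hypothetical, the cluster state is reduced to a metadata string, a set of block names, and a routing-table flag, re-routing is omitted, and the Zen1 variant simply swaps in the recovered metadata rather than reproducing the real merge logic. It only shows a base task holding the shared steps, a Zen1 subclass that adds a step in front of them, and a Runnable picked by discovery type.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, heavily simplified stand-in for a cluster state.
final class StateSketch {
    static final String NOT_RECOVERED_BLOCK = "state-not-recovered";

    final String metadata;
    final Set<String> blocks;
    final boolean routingTableBuilt;

    StateSketch(String metadata, Set<String> blocks, boolean routingTableBuilt) {
        this.metadata = metadata;
        this.blocks = Collections.unmodifiableSet(new HashSet<>(blocks));
        this.routingTableBuilt = routingTableBuilt;
    }
}

// Shared steps (the Zen2 case): drop the not-recovered block and mark the
// routing table as built. Re-routing is left out of this sketch.
class RecoverTaskSketch {
    StateSketch execute(StateSketch current) {
        Set<String> blocks = new HashSet<>(current.blocks);
        blocks.remove(StateSketch.NOT_RECOVERED_BLOCK);
        return new StateSketch(current.metadata, blocks, true);
    }
}

// Zen1 case: first bring in the metadata recovered from other nodes, then
// run the shared steps from the base task.
final class Zen1RecoverTaskSketch extends RecoverTaskSketch {
    private final String recoveredMetadata;

    Zen1RecoverTaskSketch(String recoveredMetadata) {
        this.recoveredMetadata = recoveredMetadata;
    }

    @Override
    StateSketch execute(StateSketch current) {
        StateSketch merged = new StateSketch(recoveredMetadata, current.blocks, current.routingTableBuilt);
        return super.execute(merged);
    }
}

final class RecoverySketch {
    // Builds the recovery Runnable for the given discovery type.
    static Runnable recoveryRunnable(boolean zen2,
                                     AtomicReference<StateSketch> clusterState,
                                     String recoveredMetadata) {
        RecoverTaskSketch task = zen2
            ? new RecoverTaskSketch()
            : new Zen1RecoverTaskSketch(recoveredMetadata);
        return () -> clusterState.updateAndGet(task::execute);
    }
}
```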