A new cluster coordination layer #32006

Closed
52 of 61 tasks
ywelsch opened this issue Jul 12, 2018 · 2 comments
Labels: :Distributed Coordination/Cluster Coordination (cluster formation and cluster state publication, including cluster membership and fault detection), >feature, Meta, resiliency, v7.0.0

Comments

ywelsch (Contributor) commented Jul 12, 2018

The cluster state contains important metadata about the cluster, including what the mappings look like, what settings the indices have, and which shards are allocated to which nodes. Inconsistencies in the cluster state can have severe consequences, including inconsistent search results and data loss, and the job of the cluster state coordination subsystem is to prevent any such inconsistencies. Ideally, this subsystem should also be easy to configure correctly and should perform well in a variety of situations.

The goal of this project is to rebuild the cluster state coordination subsystem, making it more reliable, performant and user-friendly. Better reliability will be achieved by basing the core algorithm on strong theoretical underpinnings and by testing it extensively. Misconfiguration of the minimum_master_nodes setting, one of the most common causes of cluster state inconsistencies, will be addressed by having this property fully managed by the system itself.
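To illustrate why a hand-maintained minimum_master_nodes is easy to get wrong, here is a minimal sketch of the majority arithmetic behind it. The function names are hypothetical and not part of Elasticsearch; the only assumption is the usual majority rule, i.e. the setting should be a strict majority of the master-eligible nodes.

```python
def required_quorum(master_eligible_nodes: int) -> int:
    """Smallest strict majority of the master-eligible nodes (hypothetical helper)."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def split_brain_possible(master_eligible_nodes: int, minimum_master_nodes: int) -> bool:
    """True if two disjoint groups of nodes could each satisfy the setting.

    If two groups of size minimum_master_nodes fit into the cluster without
    overlapping, each side of a network partition could elect its own master.
    """
    return 2 * minimum_master_nodes <= master_eligible_nodes


# A 3-node cluster needs a quorum of 2; growing it to 4 nodes without
# raising the setting to 3 silently reintroduces the split-brain risk.
print(required_quorum(3), split_brain_possible(3, 2))
print(required_quorum(4), split_brain_possible(4, 2))
```

The second case is the typical failure mode: the cluster grows, the setting is not updated, and the configured value is no longer a majority.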

We've built a prototype to validate the approach and, based on our experience with this, present the following development roadmap for this new cluster coordination and consensus layer, targeting ES 7.0:

After 7.0 FF:

Post 7.0:

  • Smoother master failovers by not exposing them to the ClusterApplierService, i.e., by delaying putting up a NO_MASTER_BLOCK.
  • Abdicate on leader shutdown (appoint a new leader)
  • Add "has_voting_exclusions" flag to cluster health output (#38568)
  • Make the enqueueing of cluster state updates behave as well as possible in an overloaded cluster.
  • Verify that a master which cannot write its cluster state stands down (or maybe actively abdicates)
  • Deal appropriately with duplicate nodes (see, e.g., "NotMasterException with duplicate node ids and minimum_master_nodes not met", #32904)
  • High-level REST client integration for new APIs
  • Avoid bootstrapping if any discovered peer has a nonzero term
  • Work with support to enhance the cluster diagnostics analysis tool.
ywelsch added the >feature, resiliency, Meta, and :Distributed Coordination/Cluster Coordination labels Jul 12, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

ywelsch added a commit that referenced this issue Aug 7, 2018
Implements the state machine on the master to publish a cluster state.

Relates to #32006
ywelsch added a commit that referenced this issue Nov 19, 2018
Zen2 is now feature-complete enough to run most ESIntegTestCase tests. The changes in this PR
are as follows:
- ClusterSettingsIT is adapted to not be Zen1 specific anymore (it was using Zen1 settings).
- Some of the integration tests require persistent storage of the cluster state, which is not fully
implemented yet (see #33958). These tests keep running with Zen1 for now but will be switched
over as soon as that is fully implemented.
- Some very few integration tests are not running yet with Zen2 for other reasons, depending on
some of the other open points in #32006.
DaveCTurner added a commit that referenced this issue Dec 20, 2018
This commit overhauls the documentation of discovery and cluster coordination,
removing mention of the Zen Discovery module and replacing it with docs for the
new cluster coordination mechanism introduced in 7.0.

Relates #32006
andrershov self-assigned this Feb 7, 2019
ywelsch added the v7.0.0 label Feb 24, 2019
ywelsch added a commit that referenced this issue Feb 26, 2019
Checks that the core coordination algorithm implemented as part of Zen2 (#32006) supports
linearizable semantics. This commit adds a linearizability checker based on the Wing and Gong
graph search algorithm with support for compositional checking and activates these checks for all
CoordinatorTests.
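
As a rough illustration of the kind of search such a checker performs, the sketch below tests a history of operations on a single register by repeatedly picking a "minimal" operation (one that no other pending operation finished before), applying it to a sequential model, and backtracking on failure. This is a simplified sketch under stated assumptions, not the checker added in the commit: the `Op` type and `is_linearizable` function are hypothetical, timestamps are assumed distinct, and compositional checking is omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Op:
    invoke: int            # time the operation was invoked
    respond: int           # time the operation returned (respond > invoke)
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned


def is_linearizable(history: Tuple[Op, ...], state: Optional[int] = None) -> bool:
    """Backtracking search for a sequential order consistent with real time."""
    if not history:
        return True
    earliest_response = min(op.respond for op in history)
    for i, op in enumerate(history):
        # Skip ops invoked after some other remaining op had already responded:
        # real-time order forces that other op to be linearized first.
        if op.invoke > earliest_response:
            continue
        if op.kind == "write":
            ok, next_state = True, op.value
        else:
            ok, next_state = (op.value == state), state
        if ok and is_linearizable(history[:i] + history[i + 1:], next_state):
            return True
    return False


# A read that starts after a write completed must observe that write.
sequential_ok = (Op(0, 1, "write", 1), Op(2, 3, "read", 1))
stale_read = (Op(0, 1, "write", 1), Op(2, 3, "read", None))
print(is_linearizable(sequential_ok), is_linearizable(stale_read))
```

The search is exponential in the worst case, which is one reason practical checkers partition histories (for example per key) before checking them.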
ywelsch added v7.0.0 and removed v7.2.0 labels Apr 24, 2019
ywelsch (Contributor, Author) commented Apr 24, 2019

Closing this one as shipped in 7.0. Possible follow-ups will be tracked separately.

ywelsch closed this as completed Apr 24, 2019
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 24, 2019
The changes in elastic#32006 mean that the discovery process can no longer use
master-ineligible nodes as a stepping-stone between master-eligible nodes.
This was normally an indication of a strange and possibly-fragile configuration
and was not recommended, but this commit adds a note to the breaking changes
docs to note that this kind of configuration is more obviously broken in recent
versions.
DaveCTurner added a commit that referenced this issue Sep 12, 2019
The changes in #32006 mean that the discovery process can no longer use
master-ineligible nodes as a stepping-stone between master-eligible nodes.
This was normally an indication of a strange and possibly-fragile configuration
and was not recommended, but this commit adds a note to the breaking changes
docs to note that this kind of configuration is more obviously broken in recent
versions.
DaveCTurner added a commit that referenced this issue Sep 12, 2019
The changes in #32006 mean that the discovery process can no longer use
master-ineligible nodes as a stepping-stone between master-eligible nodes.
This was normally an indication of a strange and possibly-fragile configuration
and was not recommended. This commit clarifies that only master-eligible nodes
are now involved with discovery.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 31, 2020
This resolves a longstanding TODO in the cluster coordination subsystem.

Relates elastic#32006
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 31, 2020
This commit removes a handful of TODO comments in the cluster coordination
layer that no longer apply.

Relates elastic#32006
DaveCTurner added a commit that referenced this issue Mar 31, 2020
This resolves a longstanding TODO in the cluster coordination subsystem.

Relates #32006
DaveCTurner added a commit that referenced this issue Apr 1, 2020
This resolves a longstanding TODO in the cluster coordination subsystem.

Relates #32006
DaveCTurner added a commit that referenced this issue Apr 1, 2020
This commit removes a handful of TODO comments in the cluster coordination
layer that no longer apply.

Relates #32006