[Zen2] Calculate optimal cluster configuration #33924
Conversation
We wish to commit a cluster state update after having received a response from more than half of the master-eligible nodes in the cluster. This is optimal: requiring either more or fewer votes than half harms resilience. For instance, if we have three master nodes then we want to be able to commit a cluster state after receiving responses from any two nodes; requiring responses from all three is clearly not resilient to the failure of any node, and if we could commit an update after a response from just one node then that node would be required for every commit, which is also not resilient. However, this means we must adjust the configuration (the set of voting nodes in the cluster) whenever a master-eligible node joins or leaves. The calculation of the best configuration for the cluster is the job of the Reconfigurator, introduced here.
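For concreteness, here is a minimal sketch of the majority rule described above; the class and method names are illustrative and not part of the actual Zen2 code.

```java
// Minimal sketch of the quorum rule: a cluster state update commits once responses
// have been received from more than half of the voting configuration.
final class QuorumSketch {
    static boolean isQuorum(int responsesReceived, int votingConfigurationSize) {
        return 2 * responsesReceived > votingConfigurationSize;
    }

    public static void main(String[] args) {
        // With a three-node voting configuration, any two responses suffice...
        System.out.println(isQuorum(2, 3)); // true
        // ...but a single response does not.
        System.out.println(isQuorum(1, 3)); // false
        // Waiting for all three would also work, but harms resilience as described above.
        System.out.println(isQuorum(3, 3)); // true
    }
}
```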
Pinging @elastic/es-distributed
```java
 * nodes required to process a cluster state update.
 */
public static final Setting<Integer> MINIMUM_VOTING_MASTER_NODES_SETTING =
    Setting.intSetting("cluster.minimum_voting_master_nodes", 1, 1, Property.NodeScope, Property.Dynamic);
```
I've marked this as team-discuss to contemplate this setting. In the PoC the equivalent setting had a different meaning (it was `2 * [cluster.minimum_voting_master_nodes] - 1`) and only really made sense as an odd number. In writing docs like the above Javadoc I've found it easier to describe this:

> the size of the smallest set of master nodes required to process a cluster state update.

I also feel that the most sensible name for this setting would be something like `cluster.minimum_master_nodes`, if only we could ignore what that name means today. 🙈
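To make the two meanings concrete, a hedged worked example (my numbers, not the PR author's): a value of 2 under the new meaning corresponds to a PoC value of 2 * 2 - 1 = 3.

```java
// Illustrative arithmetic only.
int minimumVotingMasterNodes = 2;                           // new meaning: smallest committing quorum
int pocEquivalentValue = 2 * minimumVotingMasterNodes - 1;  // PoC meaning: 3, necessarily odd
int toleratedMasterFailures = minimumVotingMasterNodes - 1; // any 2 of those 3 voting nodes can commit
```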
I think it makes sense to define it like that (so that it represents the level of fault-tolerance instead of something that projects majorities). Regarding the name, I want to take the weekend to think more about it
Maybe we should call it `cluster.global_safety_factor` and use non-numeric values, e.g.,

- CARELESS (maps to 1)
- HEALTHY (maps to 2)
- PARANOID (maps to 3)

😸
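A rough sketch of how the suggested names might map to integers; this enum is purely hypothetical and does not exist in the codebase.

```java
// Hypothetical mapping of the proposed names to numeric safety levels.
public enum GlobalSafetyFactor {
    CARELESS(1),
    HEALTHY(2),
    PARANOID(3);

    private final int minimumVotingMasterNodes;

    GlobalSafetyFactor(int minimumVotingMasterNodes) {
        this.minimumVotingMasterNodes = minimumVotingMasterNodes;
    }

    public int minimumVotingMasterNodes() {
        return minimumVotingMasterNodes;
    }
}
```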
Naming the constants is a good idea; I'd be surprised if anyone ever wants a value >3 here. I'd also like to explore the idea of the default being 2 and not 1: at first I thought that this'd break clusters with fewer than 3 nodes but on reflection I can't find any obvious problems.
In-place upgrades (rolling or otherwise) seem unaffected by this setting. The trickiest case I can think of is a rolling migration of a 1- or 2-node cluster, which is not something that I've heard much about. A migration of a 1-node cluster already requires special handling (i.e. explicit retirement of the old node) so I think changing this setting is no big deal here. Migrating a 2-node cluster one node at a time will work with `cluster.global_safety_factor: 2`, but the resulting configuration will be three nodes (the two new nodes as well as one of the old nodes), whereas with `cluster.global_safety_factor: 1` it would just be a single new node, which is slightly more resilient. Not that I think we should particularly care about resilience in 2-node clusters.
One situation where the difference might matter is with a one-node cluster to which you (accidentally or deliberately) join a second node and then remove it again. With `cluster.global_safety_factor: 2` the resulting configuration is both nodes, so shutting the second node down will lose a quorum. With `cluster.global_safety_factor: 1` no reconfiguration will take place and the cluster will carry on working.
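A hedged illustration of that scenario, reusing the shape of the targetSize formula that appears later in this diff; the class, helper, and variable names are made up.

```java
// Hypothetical helper mirroring the shape of the targetSize calculation in this PR.
final class SafetyFactorSketch {
    // Round down to the nearest odd number (2 -> 1, 3 -> 3, 4 -> 3).
    static int roundDownToOdd(int i) {
        return i - (i % 2 == 0 ? 1 : 0);
    }

    static int targetConfigurationSize(int liveMasterNodes, int safetyFactor) {
        return Math.max(roundDownToOdd(liveMasterNodes), 2 * safetyFactor - 1);
    }

    public static void main(String[] args) {
        // Two live master-eligible nodes: the original node plus the one that joined.
        System.out.println(targetConfigurationSize(2, 2)); // 3: both nodes end up voting, so
                                                           // stopping the second loses the quorum
        System.out.println(targetConfigurationSize(2, 1)); // 1: the configuration stays as the
                                                           // single original node; no reconfiguration
    }
}
```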
```java
nonRetiredLiveNotInConfigIds.removeAll(retiredNodeIds);

final int targetSize = Math.max(roundDownToOdd(nonRetiredLiveInConfigIds.size() + nonRetiredLiveNotInConfigIds.size()),
    2 * minVotingMasterNodes - 1);
```
there is a lot going on in this formula. Can you maybe factor some of the things out into dedicated local variables to make it clearer what's going on?
Yep, I added more variables and explanation.
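One possible shape of that refactoring, as a sketch only; the local variable names are mine and not necessarily those in the final commit.

```java
// Break the formula into named pieces: the number of live, non-retired master-eligible
// nodes, and the smallest configuration that still satisfies the configured safety level.
final int nonRetiredLiveNodes = nonRetiredLiveInConfigIds.size() + nonRetiredLiveNotInConfigIds.size();
final int safeConfigurationSize = 2 * minVotingMasterNodes - 1;
final int targetSize = Math.max(roundDownToOdd(nonRetiredLiveNodes), safeConfigurationSize);
```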
Significant changes made since this review
@ywelsch I think this encapsulates all the changes we agreed on, please take another look.
@ywelsch this is good for another look
I've left a comment about the default and some smaller comments about the tests
```java
public static final Setting<Integer> CLUSTER_MASTER_NODES_FAILURE_TOLERANCE =
    Setting.intSetting("cluster.master_nodes_failure_tolerance", 0, 0, Property.NodeScope, Property.Dynamic);
```
should the default be 1 here?
I don't think it's important because we decided to set this at bootstrapping time.
if it's not important, maybe a safer default is nicer :) I'm fine leaving as is for now. We can revisit after the bootstrapping.
```java
public static final Setting<Integer> CLUSTER_MASTER_NODES_FAILURE_TOLERANCE =
    Setting.intSetting("cluster.master_nodes_failure_tolerance", 0, 0, Property.NodeScope, Property.Dynamic);

private int masterNodesFailureTolerance;
```
I think this needs to be made volatile
Ah yes.
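A sketch of why `volatile` matters here, assuming the setting is registered with a settings-update consumer in the usual way; the constructor shape shown is an assumption, not a quote from the PR.

```java
// Abbreviated, illustrative view of the relevant parts of Reconfigurator (not the full class).
private volatile int masterNodesFailureTolerance;

public Reconfigurator(Settings settings, ClusterSettings clusterSettings) {
    masterNodesFailureTolerance = CLUSTER_MASTER_NODES_FAILURE_TOLERANCE.get(settings);
    // The settings-update consumer runs on the thread that applies the cluster settings,
    // while reconfiguration reads the field elsewhere; volatile makes the write visible.
    clusterSettings.addSettingsUpdateConsumer(CLUSTER_MASTER_NODES_FAILURE_TOLERANCE,
        this::setMasterNodesFailureTolerance);
}

private void setMasterNodesFailureTolerance(int masterNodesFailureTolerance) {
    this.masterNodesFailureTolerance = masterNodesFailureTolerance;
}
```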
check(nodes("a", "b", "c", "d"), conf("a", "b", "c"), masterNodesFailureTolerance, conf("a", "b", "c")); | ||
check(nodes("a", "b", "c", "d", "e"), conf("a", "b", "c"), masterNodesFailureTolerance, conf("a", "b", "c", "d", "e")); | ||
check(nodes("a", "b"), conf("a", "b", "e"), masterNodesFailureTolerance, | ||
masterNodesFailureTolerance == 1 ? conf("a", "b", "e") : conf("a")); |
It's a bit weird to read these tests as we handle the `config.getNodeIds().size() < 2 * masterNodesFailureTolerance + 1` case within `check`. It's something that you have to constantly keep in mind while reading these tests. I would prefer to test that separately and only have `check`s that actually lead to the desired target configuration. This will result in a bit more duplication, but I think that's fine.
Ok, I reworked the tests to avoid the invalid cases better.
```java
// If the safety level was never reached then retirement can take place
check(nodes("a", "b"), retired("a"), conf("a"), 1, conf("b"));
check(nodes("a", "b"), retired("a"), conf("b"), 1, conf("b"));
```
These 2 tests don't make sense anymore? We never reconfigure that way. One more reason not to hide the `config.getNodeIds().size() < 2 * masterNodesFailureTolerance + 1` case within `check`.
LGTM
As master-eligible nodes join or leave the cluster we should give them votes or take them away, in order to maintain the optimal level of fault-tolerance in the system. #33924 introduced the `Reconfigurator` to calculate the optimal configuration of the cluster, and in this change we add the plumbing needed to actually perform the reconfigurations needed as the cluster grows or shrinks.