[Zen2] Calculate optimal cluster configuration #33924
Conversation
We wish to commit a cluster state update after having received a response from more than half of the master-eligible nodes in the cluster. This is optimal: requiring either more or fewer votes than half harms resilience. For instance, if we have three master nodes then we want to be able to commit a cluster state after receiving responses from any two nodes; requiring responses from all three is clearly not resilient to the failure of any node, and if we could commit an update after a response from just one node then that node would be required for every commit, which is also not resilient. However, this means we must adjust the configuration (the set of voting nodes in the cluster) whenever a master-eligible node joins or leaves. The calculation of the best configuration for the cluster is the job of the Reconfigurator, introduced here.
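For concreteness, here is a minimal sketch of the majority rule described above; the class and method names are illustrative and not part of the actual Zen2 code.

```java
// Minimal sketch of the quorum rule: a cluster state update commits once responses
// have been received from more than half of the voting configuration.
final class QuorumSketch {
    static boolean isQuorum(int responsesReceived, int votingConfigurationSize) {
        return 2 * responsesReceived > votingConfigurationSize;
    }

    public static void main(String[] args) {
        // With a three-node voting configuration, any two responses suffice...
        System.out.println(isQuorum(2, 3)); // true
        // ...but a single response does not.
        System.out.println(isQuorum(1, 3)); // false
        // Waiting for all three would also work, but harms resilience as described above.
        System.out.println(isQuorum(3, 3)); // true
    }
}
```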
Pinging @elastic/es-distributed
```java
 * nodes required to process a cluster state update.
 */
public static final Setting<Integer> MINIMUM_VOTING_MASTER_NODES_SETTING =
    Setting.intSetting("cluster.minimum_voting_master_nodes", 1, 1, Property.NodeScope, Property.Dynamic);
```
I've marked this as team-discuss to contemplate this setting. In the PoC the equivalent setting had a different meaning (it was `2 * [cluster.minimum_voting_master_nodes] - 1`) and only really made sense as an odd number. In writing docs like the above Javadoc I've found it easier to describe this:

> the size of the smallest set of master nodes required to process a cluster state update.

I also feel that the most sensible name for this setting would be something like `cluster.minimum_master_nodes`, if only we could ignore what that name means today. 🙈
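To make the two meanings concrete, a hedged worked example (my numbers, not the PR author's): a value of 2 under the new meaning corresponds to a PoC value of 2 * 2 - 1 = 3.

```java
// Illustrative arithmetic only.
int minimumVotingMasterNodes = 2;                           // new meaning: smallest committing quorum
int pocEquivalentValue = 2 * minimumVotingMasterNodes - 1;  // PoC meaning: 3, necessarily odd
int toleratedMasterFailures = minimumVotingMasterNodes - 1; // any 2 of those 3 voting nodes can commit
```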
I think it makes sense to define it like that (so that it represents the level of fault-tolerance instead of something that projects majorities). Regarding the name, I want to take the weekend to think more about it
Maybe we should call it `cluster.global_safety_factor` and use non-numeric values, e.g.,

- CARELESS (maps to 1)
- HEALTHY (maps to 2)
- PARANOID (maps to 3)

😸
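A rough sketch of how the suggested names might map to integers; this enum is purely hypothetical and does not exist in the codebase.

```java
// Hypothetical mapping of the proposed names to numeric safety levels.
public enum GlobalSafetyFactor {
    CARELESS(1),
    HEALTHY(2),
    PARANOID(3);

    private final int minimumVotingMasterNodes;

    GlobalSafetyFactor(int minimumVotingMasterNodes) {
        this.minimumVotingMasterNodes = minimumVotingMasterNodes;
    }

    public int minimumVotingMasterNodes() {
        return minimumVotingMasterNodes;
    }
}
```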
Naming the constants is a good idea; I'd be surprised if anyone ever wants a value >3 here. I'd also like to explore the idea of the default being 2 and not 1: at first I thought that this'd break clusters with fewer than 3 nodes but on reflection I can't find any obvious problems.
In-place upgrades (rolling or otherwise) seem unaffected by this setting. The trickiest case I can think of is a rolling migration of a 1- or 2-node cluster, which is not something that I've heard much about. A migration of a 1-node cluster already requires special handling (i.e. explicit retirement of the old node) so I think changing this setting is no big deal here. Migrating a 2-node cluster one node at a time will work with `cluster.global_safety_factor: 2`, but the resulting configuration will be three nodes (the two new nodes as well as one of the old nodes), whereas with `cluster.global_safety_factor: 1` it would just be a single new node, which is slightly more resilient. Not that I think we should particularly care about resilience in 2-node clusters.
One situation where the difference might matter is with a one-node cluster to which you (accidentally or deliberately) join a second node and then remove it again. With `cluster.global_safety_factor: 2` the resulting configuration is both nodes, so shutting the second node down will lose a quorum. With `cluster.global_safety_factor: 1` no reconfiguration will take place and the cluster will carry on working.
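A hedged illustration of that scenario, reusing the shape of the targetSize formula that appears later in this diff; the class, helper, and variable names are made up.

```java
// Hypothetical helper mirroring the shape of the targetSize calculation in this PR.
final class SafetyFactorSketch {
    // Round down to the nearest odd number (2 -> 1, 3 -> 3, 4 -> 3).
    static int roundDownToOdd(int i) {
        return i - (i % 2 == 0 ? 1 : 0);
    }

    static int targetConfigurationSize(int liveMasterNodes, int safetyFactor) {
        return Math.max(roundDownToOdd(liveMasterNodes), 2 * safetyFactor - 1);
    }

    public static void main(String[] args) {
        // Two live master-eligible nodes: the original node plus the one that joined.
        System.out.println(targetConfigurationSize(2, 2)); // 3: both nodes end up voting, so
                                                           // stopping the second loses the quorum
        System.out.println(targetConfigurationSize(2, 1)); // 1: the configuration stays as the
                                                           // single original node; no reconfiguration
    }
}
```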
```java
nonRetiredLiveNotInConfigIds.removeAll(retiredNodeIds);

final int targetSize = Math.max(roundDownToOdd(nonRetiredLiveInConfigIds.size() + nonRetiredLiveNotInConfigIds.size()),
    2 * minVotingMasterNodes - 1);
```
there is a lot going on in this formula. Can you maybe factor some of the things out into dedicated local variables to make it clearer what's going on?
Yep, I added more variables and explanation.
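One possible shape of that refactoring, as a sketch only; the local variable names are mine and not necessarily those in the final commit.

```java
// Break the formula into named pieces: the number of live, non-retired master-eligible
// nodes, and the smallest configuration that still satisfies the configured safety level.
final int nonRetiredLiveNodes = nonRetiredLiveInConfigIds.size() + nonRetiredLiveNotInConfigIds.size();
final int safeConfigurationSize = 2 * minVotingMasterNodes - 1;
final int targetSize = Math.max(roundDownToOdd(nonRetiredLiveNodes), safeConfigurationSize);
```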
Significant changes made since this review
@ywelsch I think this encapsulates all the changes we agreed on, please take another look.
@ywelsch this is good for another look
I've left a comment about the default and some smaller comments about the tests
```java
public static final Setting<Integer> CLUSTER_MASTER_NODES_FAILURE_TOLERANCE =
    Setting.intSetting("cluster.master_nodes_failure_tolerance", 0, 0, Property.NodeScope, Property.Dynamic);
```
should the default be 1 here?
I don't think it's important because we decided to set this at bootstrapping time.
if it's not important, maybe a safer default is nicer :) I'm fine leaving as is for now. We can revisit after the bootstrapping.
```java
public static final Setting<Integer> CLUSTER_MASTER_NODES_FAILURE_TOLERANCE =
    Setting.intSetting("cluster.master_nodes_failure_tolerance", 0, 0, Property.NodeScope, Property.Dynamic);

private int masterNodesFailureTolerance;
```
I think this needs to be made volatile
Ah yes.
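A sketch of why `volatile` matters here, assuming the setting is registered with a settings-update consumer in the usual way; the constructor shape shown is an assumption, not a quote from the PR.

```java
// Abbreviated, illustrative view of the relevant parts of Reconfigurator (not the full class).
private volatile int masterNodesFailureTolerance;

public Reconfigurator(Settings settings, ClusterSettings clusterSettings) {
    masterNodesFailureTolerance = CLUSTER_MASTER_NODES_FAILURE_TOLERANCE.get(settings);
    // The settings-update consumer runs on the thread that applies the cluster settings,
    // while reconfiguration reads the field elsewhere; volatile makes the write visible.
    clusterSettings.addSettingsUpdateConsumer(CLUSTER_MASTER_NODES_FAILURE_TOLERANCE,
        this::setMasterNodesFailureTolerance);
}

private void setMasterNodesFailureTolerance(int masterNodesFailureTolerance) {
    this.masterNodesFailureTolerance = masterNodesFailureTolerance;
}
```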
check(nodes("a", "b", "c", "d"), conf("a", "b", "c"), masterNodesFailureTolerance, conf("a", "b", "c")); | ||
check(nodes("a", "b", "c", "d", "e"), conf("a", "b", "c"), masterNodesFailureTolerance, conf("a", "b", "c", "d", "e")); | ||
check(nodes("a", "b"), conf("a", "b", "e"), masterNodesFailureTolerance, | ||
masterNodesFailureTolerance == 1 ? conf("a", "b", "e") : conf("a")); |
It's a bit weird to read these tests as we handle the `config.getNodeIds().size() < 2 * masterNodesFailureTolerance + 1` case within `check`. It's something that you have to constantly keep in mind while reading these tests. I would prefer to test that separately and only have `check`s that actually lead to the desired target configuration. This will result in a bit more duplication, but I think that's fine.
Ok, I reworked the tests to avoid the invalid cases better.
```java
// If the safety level was never reached then retirement can take place
check(nodes("a", "b"), retired("a"), conf("a"), 1, conf("b"));
check(nodes("a", "b"), retired("a"), conf("b"), 1, conf("b"));
```
These 2 tests don't make sense anymore? We never reconfigure that way. One more reason not to hide the `config.getNodeIds().size() < 2 * masterNodesFailureTolerance + 1` case within `check`.
LGTM
As master-eligible nodes join or leave the cluster we should give them votes or take them away, in order to maintain the optimal level of fault-tolerance in the system. #33924 introduced the `Reconfigurator` to calculate the optimal configuration of the cluster, and in this change we add the plumbing needed to actually perform the reconfigurations needed as the cluster grows or shrinks.