Zen2: Add leader-side join handling logic #33013

ywelsch · 2018-08-21T07:17:41Z

Adds the logic for handling joins by a prospective leader. Introduces the Coordinator class with the basic lifecycle modes (candidate, leader, follower) as well as a JoinHelper class that contains most of the plumbing for handling joins.

elasticmachine · 2018-08-21T07:17:43Z

Pinging @elastic/es-distributed

DaveCTurner

My main questions at this point are around the split between JoinHelper and Coordinator, and the use of a proper threadpool in the tests.

DaveCTurner · 2018-08-22T09:13:53Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinRequest.java

+
+    private final Optional<Join> optionalJoin;
+
+    public JoinRequest(DiscoveryNode sourceNode, Optional<Join> optionalJoin) {


If optionalJoin is present then the source node is duplicated. I think this is the neatest way to do this, but can we assert they're the same in these constructors?

fixed in 2757c07
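
For readers following the thread, a minimal sketch of the kind of check being requested. It is not the code from 2757c07: `JoinSketch` and `JoinRequestSketch` are hypothetical stand-ins, node identities are plain strings instead of `DiscoveryNode`, and an explicit exception replaces the `assert` so the behaviour does not depend on `-ea`.

```java
import java.util.Optional;

// Hypothetical stand-in for Join; node ids are plain strings rather than DiscoveryNode objects.
final class JoinSketch {
    final String sourceNode;
    final String targetNode;
    final long term;

    JoinSketch(String sourceNode, String targetNode, long term) {
        this.sourceNode = sourceNode;
        this.targetNode = targetNode;
        this.term = term;
    }
}

final class JoinRequestSketch {
    private final String sourceNode;
    private final Optional<JoinSketch> optionalJoin;

    JoinRequestSketch(String sourceNode, Optional<JoinSketch> optionalJoin) {
        // The source node is duplicated when a join is present, so check that the two copies agree.
        // An explicit check is used so the sketch behaves the same with or without -ea;
        // the review asks for an assert.
        if (optionalJoin.isPresent() && optionalJoin.get().sourceNode.equals(sourceNode) == false) {
            throw new IllegalArgumentException(
                "source node " + sourceNode + " does not match join from " + optionalJoin.get().sourceNode);
        }
        this.sourceNode = sourceNode;
        this.optionalJoin = optionalJoin;
    }

    String getSourceNode() {
        return sourceNode;
    }

    Optional<JoinSketch> getOptionalJoin() {
        return optionalJoin;
    }
}
```

A request without a join carries only the source node, so there is nothing to compare in that case.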

DaveCTurner · 2018-08-22T09:14:59Z

server/src/test/java/org/elasticsearch/cluster/coordination/NodeJoinTests.java

+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+@TestLogging("org.elasticsearch.cluster.service:TRACE,org.elasticsearch.cluster.coordination:TRACE")


Do we need verbose logs here as a matter of course?

It's convenient to have those when tests with concurrency are failing. As the tests here are very contained (only invoking a few classes), enabling this will not lead to log flooding. It's useful to have this enabled by default, see e.g. other test classes in distributed land, RelocationIT, MasterDisruptionIT, ClusterDisruptionIT, ... or NodeJoinControllerTests, the Zen1 counterpart to these tests.

DaveCTurner · 2018-08-22T09:32:45Z

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

@@ -972,7 +972,7 @@ private static String groupName(ThreadGroup threadGroup) {
     * Returns a random subset of values (including a potential empty list)
     */
    public static <T> List<T> randomSubsetOf(Collection<T> collection) {
-        return randomSubsetOf(randomInt(Math.max(collection.size() - 1, 0)), collection);
+        return randomSubsetOf(randomInt(collection.size()), collection);


Just to confirm, this is ok? We previously never returned the whole set (unless it was empty) and now we do. Nowhere expects this, right? I think the Javadoc should say it might return nothing or everything.

I think this was an actual bug, where someone assumed randomInt to have an inclusive bound instead of an exclusive one (randomIntBetween for example uses inclusive bounds). I can add more javadoc here, but what this fix now actually implements is the definition of a subset.

ignore my explanation about the bounds here. Still think it's a bug. I've enhanced javadocs in 1ce4fdc
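
To make the subset semantics concrete, here is a small sketch using only `java.util.Random` rather than the test framework's helpers. `SubsetSketch` is a hypothetical name and this is not the body of `ESTestCase.randomSubsetOf`; it only illustrates picking a size anywhere from 0 up to and including `collection.size()`, so that both the empty list and the whole collection can come back.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SubsetSketch {
    // Picks a subset size uniformly from 0..collection.size() inclusive, so both the
    // empty list and the full collection are possible results.
    static <T> List<T> randomSubsetOf(Random random, Collection<T> collection) {
        // Random.nextInt(bound) excludes the bound, hence the + 1 to make size() reachable.
        int size = random.nextInt(collection.size() + 1);
        List<T> shuffled = new ArrayList<>(collection);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, size));
    }
}
```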

DaveCTurner · 2018-08-22T09:40:26Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+    public static final String JOIN_ACTION_NAME = "internal:cluster/coordination/join";
+
+    private final MasterService masterService;
+    private final TransportService transportService;


This is only used in the constructor, doesn't need to be a field.

sure, not yet. The follow-up PR will add a send method to this class, so it will be of direct use then. If you feel strongly about this, I can revert.

No, that's fine.

DaveCTurner · 2018-08-22T09:46:52Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+import java.util.function.BiConsumer;
+import java.util.function.LongSupplier;
+
+public class JoinHelper extends AbstractComponent {


I'm not sure the separation of responsibilities between this and the Coordinator is in the right place. E.g. the Coordinator passes this::handleJoinRequest straight through to the constructor here solely for this node to register it as a handler for internal:cluster/coordination/join, but then handleJoinRequest does a bit of stuff and then calls back into JoinHelper. There's a lot of back-and-forth. Maybe it'd be better just to merge them - they're not tested separately, for instance?

The separation I had in mind here was that JoinHelper would be responsible for all join-related transport actions (i.e. later also have a sendJoin method + the startjoin stuff) and all MasterService-related join tasks, similar to NodeJoinController in Zen1. I would like to avoid cramming all of that into Coordinator. I'm sure that Coordinator will be bloated once we have implemented all the things we want to. The back and forth at the moment is only there for the handleJoinRequest method, which I find to be the better tradeoff if the other option is to put all of this into Coordinator.

I suggested a slightly different split in
#33013 (comment)

DaveCTurner · 2018-08-22T09:50:36Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+        };
+
+        transportService.registerRequestHandler(JOIN_ACTION_NAME, ThreadPool.Names.GENERIC, false, false, JoinRequest::new,
+            (request, channel, task) -> joinRequestHandler.accept(request, new JoinCallback() {


Could this JoinCallback implement toString() too please?

added in 4676116

DaveCTurner · 2018-08-22T09:51:24Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+        void onFailure(Exception e);
+    }
+
+    static class JoinTaskListener implements ClusterStateTaskListener {


Could this implement toString() please?

added in 4676116
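
As an illustration of why `toString()` helps on these callbacks and listeners, here is a sketch assuming a two-method callback similar in shape to the `JoinCallback` in the diff. `JoinCallbackSketch`, `CallbackFactory` and the `StringBuilder` used as a log are hypothetical, not the code added in 4676116.

```java
// Hypothetical callback with the same two outcomes as the JoinCallback in the diff.
interface JoinCallbackSketch {
    void onSuccess();

    void onFailure(Exception e);
}

final class CallbackFactory {
    // Builds a callback whose toString() names the request it belongs to, so that any
    // log line or task description that prints the callback is readable.
    static JoinCallbackSketch forRequest(String requestDescription, StringBuilder log) {
        return new JoinCallbackSketch() {
            @Override
            public void onSuccess() {
                log.append("success: ").append(this).append('\n');
            }

            @Override
            public void onFailure(Exception e) {
                log.append("failure: ").append(this).append(" (").append(e.getMessage()).append(")\n");
            }

            @Override
            public String toString() {
                return "JoinCallback{request=" + requestDescription + "}";
            }
        };
    }
}
```

Without the override, an anonymous class prints as something like `CallbackFactory$1@1b6d3586`, which says nothing about which join it was handling.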

DaveCTurner · 2018-08-22T10:05:51Z

server/src/test/java/org/elasticsearch/cluster/coordination/NodeJoinTests.java

+        assertFalse(isLocalNodeElectedMaster());
+    }
+
+    public void testConcurrentJoining() {


Is there a way we could test this without introducing full multithreading (and therefore possibly nondeterminism)?

I think we should also have more of these concurrent tests. I expect the non-concurrent functionality of this particular test to later be tested by "LegislatorTests".

DaveCTurner · 2018-08-22T10:09:19Z

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

+            }
+        }
+
+        if (prevElectionWon == false && coordState.electionWon()) {


This can't happen as a LEADER but can it happen as a FOLLOWER? I think it can't, because we must have bumped our term too. Therefore, can this be reorganised as a switch on mode?

great observation. The scenario I had in mind while writing these conditions was as follows: Assume you want to do a leader handoff. You send a startJoin to all nodes, telling them to join the new prospective leader. Assume the corresponding join from one node arrives on the prospective leader (that is still a follower) before that one has received the start join. Handling of this join with higher term will trigger ensureTermAtLeast, turn the node into a candidate, then handle the join, and then reach this point here. So yes, it sounds like we could fold this check into the CANDIDATE branch here.

I've simplified this in bd242ed
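
The handoff scenario above can be sketched roughly as follows. This is a simplified model, not the code from bd242ed: `ModeSwitchSketch`, the `quorum` parameter and the string node ids are assumptions, stale joins are ignored rather than rejected, and the term bump only loosely mirrors `ensureTermAtLeast`.

```java
import java.util.HashSet;
import java.util.Set;

final class ModeSwitchSketch {
    enum Mode { CANDIDATE, LEADER, FOLLOWER }

    private final Object mutex = new Object();
    private final int quorum;                        // hypothetical: votes needed to win
    private final Set<String> votes = new HashSet<>();
    private Mode mode;
    private long currentTerm;

    ModeSwitchSketch(int quorum, long initialTerm, Mode initialMode) {
        this.quorum = quorum;
        this.currentTerm = initialTerm;
        this.mode = initialMode;
    }

    Mode handleJoin(String sourceNode, long term) {
        synchronized (mutex) {
            if (term > currentTerm) {
                // Rough analogue of ensureTermAtLeast: a higher term makes this node a candidate.
                currentTerm = term;
                votes.clear();
                mode = Mode.CANDIDATE;
            }
            if (term == currentTerm) {
                votes.add(sourceNode);               // stale joins are simply ignored in this sketch
            }
            switch (mode) {
                case CANDIDATE:
                    // Only a candidate can newly win an election.
                    if (votes.size() >= quorum) {
                        mode = Mode.LEADER;
                    }
                    break;
                case LEADER:
                case FOLLOWER:
                    break;                           // a join never changes the mode here
            }
            return mode;
        }
    }
}
```

A follower that receives a join with a higher term passes through `CANDIDATE` before the switch runs, which is why the election check only needs to live in that branch.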

DaveCTurner · 2018-08-22T10:15:53Z

server/src/test/java/org/elasticsearch/cluster/coordination/NodeJoinTests.java

+    public void testJoinWithHigherTermElectsLeader() {
+        DiscoveryNode node0 = newNode(0, true);
+        DiscoveryNode node1 = newNode(1, true);
+        setupFakeMasterServiceAndCoordinator(1, initialState(false, node0, 1, 1,


Throughout the tests, could we pick more variable terms and versions? All these 1s and 2s are tricky to keep track of.

fixed in 7dcd656

DaveCTurner · 2018-08-22T13:40:11Z

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

+            }
+        }
+
+        switch (mode) {


Great. I now wonder if this whole block can move into the JoinHelper. If the code above returned its mode and called becomeLeader() then the JoinHelper should be able to work out the right things to do with the join.

I've tried, but that would make JoinHelper be aware of the mode, as well as need to call becomeLeader, so it would be calling back into Coordinator (or CoordinationState), which also feels wrong. Finally, these lines here need to happen under the mutex, so even if I move part of this over to JoinHelper, it cannot just selectively call into Coordinator.

I've pushed my attempt at this. I'm not sure if it's an improvement over the existing code though.

attempt is here: 6f58a8c

DaveCTurner · 2018-08-23T08:16:35Z

server/src/test/java/org/elasticsearch/cluster/coordination/NodeJoinTests.java

+        assertFalse(isLocalNodeElectedMaster());
+        joinNodeAndRun(new JoinRequest(node1, Optional.of(new Join(node1, node0, newTerm, initialTerm, initialVersion))));
+        assertTrue(isLocalNodeElectedMaster());
+        assertTrue(clusterStateHasNode(node1));


Not sure this test is strong enough - I think it should be showing that non-winning joins end up in the cluster state, but node1's join was a winning one. Perhaps assert that node0 is there, although node0 is the thing collecting the joins, so maybe we need a node2 as well?

I have added a second node in 06ab9e1

DaveCTurner · 2018-08-23T08:22:05Z

server/src/test/java/org/elasticsearch/cluster/coordination/NodeJoinTests.java

+            new VotingConfiguration(Collections.singleton(node0.getId()))));
+        long newTerm = initialTerm + randomLongBetween(1, 10);
+        coordinator.coordinationState.get().handleStartJoin(new StartJoinRequest(node1, newTerm));
+        synchronized (coordinator.mutex) {


This seems ugly, although necessary given the current infrastructure. //TODO fix this in future?

added comment in 5b390a8

DaveCTurner · 2018-08-23T08:24:17Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+
+            final String stateUpdateSource = "elected-as-master ([" + pendingAsTasks.size() + "] nodes joined)";
+
+            pendingAsTasks.put(JoinTaskExecutor.BECOME_MASTER_TASK, (source, e) -> {


Should really ask for toString()s on these handlers too, although this adds noise.

The handler is not logged anywhere, so adding toString adds little.

DaveCTurner · 2018-08-23T08:26:15Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+
+            if (justBecameLeader) {
+                joinAccumulator.submitPendingJoins();
+                joinAccumulator = new LeaderJoinAccumulator();


I see a risk here that we become leader, then become candidate, and then overwrite the accumulator here. It's ok if we becomeCandidate on each new election, but this is at least worthy of a comment.

Maybe I misunderstood you, but I think the scenario you've described is not possible. Both handleJoinRequest (where we would become leader) and this follow-up code are under the same mutex here in JoinHelper. If there was another concurrent becoming candidate in Coordinator (after it switched to leader), it would notify us, but that notification would acquire the same mutex and wait for our leader transition to complete.

The Coordinator becomes leader in joinHandler.test() not in handleJoinRequest, and that's outside this mutex, so it's technically possible that it could become a candidate again before this synchronised block.

oh, right. Should we extend the mutex?

I think this race here can also lead to another odd situation:
Assume you have two concurrent join requests, both of which are required to become leader. The first that enters handleJoin of Coordinator will return becomeLeader = false, the second one will return true. If the second one now gets to execute this section here first, it will be submitted to the masterservice whereas the first will only be submitted in a follow-up. This means it's possible that only the second node will be part of the cluster state that is published as becoming leader.
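
One way to picture the "extend the mutex" option is sketched below. It assumes that recording the join, deciding the election and submitting the pending joins all happen inside one synchronized block. `SingleMutexJoinSketch`, the `quorum` count and the list of batches standing in for the master service are hypothetical; this is not what the PR ended up with.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SingleMutexJoinSketch {
    private final Object mutex = new Object();
    private final int quorum;                                     // hypothetical: joins needed to win
    private final List<String> pending = new ArrayList<>();
    // Stands in for the master service: each entry is one submitted batch of joining nodes.
    private final List<List<String>> submittedBatches = new ArrayList<>();
    private boolean leader;

    SingleMutexJoinSketch(int quorum) {
        this.quorum = quorum;
    }

    void handleJoin(String sourceNode) {
        synchronized (mutex) {
            if (leader) {
                // Already leader: each join is submitted on its own.
                submittedBatches.add(Collections.singletonList(sourceNode));
                return;
            }
            pending.add(sourceNode);
            if (pending.size() >= quorum) {
                // The election decision and the submission of every accumulated join happen
                // under the same lock, so no winning join can be left for a follow-up batch.
                leader = true;
                submittedBatches.add(new ArrayList<>(pending));
                pending.clear();
            }
        }
    }

    List<List<String>> submittedBatches() {
        synchronized (mutex) {
            return new ArrayList<>(submittedBatches);
        }
    }
}
```

With this shape, whichever thread completes the quorum submits both joins together, regardless of the order in which the two requests reached the lock.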

DaveCTurner · 2018-08-23T08:27:50Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+    private final Predicate<Join> joinHandler;
+    private final JoinTaskExecutor joinTaskExecutor;
+    private final Object mutex = new Object();
+    private JoinAccumulator joinAccumulator = new CandidateJoinAccumulator();


I'm undecided about whether this is too much machinery, and a simple Mode variable would be enough.

we talked about this and think it's fine
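
For context on the trade-off being weighed, here is a sketch of the accumulator approach, assuming a candidate implementation that buffers joins and a leader implementation that forwards them immediately. All names other than `JoinAccumulator`, `CandidateJoinAccumulator` and `LeaderJoinAccumulator` are hypothetical, and the bodies are guesses at the shape, not the PR's code.

```java
import java.util.ArrayList;
import java.util.List;

final class AccumulatorSketch {
    interface JoinAccumulator {
        void handleJoin(String sourceNode);
    }

    // Stands in for tasks handed to the master service.
    private final List<String> submitted = new ArrayList<>();
    private JoinAccumulator accumulator = new CandidateJoinAccumulator();

    // While a candidate, joins are only remembered.
    final class CandidateJoinAccumulator implements JoinAccumulator {
        final List<String> pending = new ArrayList<>();

        @Override
        public void handleJoin(String sourceNode) {
            pending.add(sourceNode);
        }
    }

    // While leader, joins are submitted straight away.
    final class LeaderJoinAccumulator implements JoinAccumulator {
        @Override
        public void handleJoin(String sourceNode) {
            submitted.add(sourceNode);
        }
    }

    synchronized void handleJoin(String sourceNode) {
        accumulator.handleJoin(sourceNode);
    }

    synchronized void becomeLeader() {
        if (accumulator instanceof CandidateJoinAccumulator) {
            // Flush everything collected during the election before switching behaviour.
            submitted.addAll(((CandidateJoinAccumulator) accumulator).pending);
        }
        accumulator = new LeaderJoinAccumulator();
    }

    synchronized void becomeCandidate() {
        accumulator = new CandidateJoinAccumulator();
    }

    synchronized List<String> submitted() {
        return new ArrayList<>(submitted);
    }
}
```

The alternative raised in the review, a plain `Mode` field with a branch inside `handleJoin`, would replace the two inner classes at the cost of keeping the pending list as a field that is only meaningful in one mode.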

DaveCTurner · 2018-08-23T13:35:06Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+        }
+    }
+
+    interface JoinAccumulator {


This definitely feels like overkill now the JoinHelper is mode-aware and its mode is in sync with the coordinator.

DaveCTurner · 2018-08-23T13:37:51Z

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

+
+    public void invariant() {
+        synchronized (mutex) {
+            if (mode == Mode.LEADER) {


We can restore those assertions about the state of the join helper here - i.e. no accumulated joins when leader or follower.
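
A sketch of what such an invariant check might look like, assuming the join helper's state can be reduced to a count of accumulated joins. `InvariantSketch`, `accumulateJoin` and `transition` are hypothetical helpers for illustration, and an explicit `AssertionError` is thrown so the check also fires without `-ea`.

```java
final class InvariantSketch {
    enum Mode { CANDIDATE, LEADER, FOLLOWER }

    private final Object mutex = new Object();
    private Mode mode = Mode.CANDIDATE;
    private int accumulatedJoins;   // hypothetical counter standing in for the join helper's state

    void accumulateJoin() {
        synchronized (mutex) {
            accumulatedJoins++;
        }
    }

    void transition(Mode newMode) {
        synchronized (mutex) {
            mode = newMode;
            if (newMode != Mode.CANDIDATE) {
                accumulatedJoins = 0;   // joins are submitted or dropped when leaving candidate mode
            }
        }
    }

    // Intended to be called from tests after every step.
    void invariant() {
        synchronized (mutex) {
            if ((mode == Mode.LEADER || mode == Mode.FOLLOWER) && accumulatedJoins != 0) {
                throw new AssertionError(accumulatedJoins + " accumulated joins while in mode " + mode);
            }
        }
    }
}
```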

DaveCTurner

Cool, LGTM

ywelsch added 9 commits August 21, 2018 09:08

Node joining

ef976f8

start testing

348fb4e

request serialozatono

554263a

more testing

00463fe

factor out into separate class

391989b

Smaller stuff

7055a70

own joincallback

36b4b39

simplify logic in handleJoinRequestUnderLock

6b59200

minor stuff

5fb85fb

ywelsch added the >enhancement, v7.0.0 and :Distributed Coordination/Cluster Coordination labels Aug 21, 2018

ywelsch requested a review from DaveCTurner August 21, 2018 07:17

ywelsch added 2 commits August 21, 2018 09:43

licenses

544560a

moar licenses

85a272a

ywelsch mentioned this pull request Aug 21, 2018

A new cluster coordination layer #32006

Closed

61 tasks

DaveCTurner requested changes Aug 22, 2018


ywelsch added 5 commits August 22, 2018 13:08

Add assertion about source node to join request

2757c07

enhance javadoc of randomSubsetOf

1ce4fdc

add toString()

4676116

pick more variable term and version

7dcd656

simplify cases

bd242ed

DaveCTurner reviewed Aug 22, 2018


ywelsch and others added 6 commits August 22, 2018 16:30

try moving code

6f58a8c

More separation between JoinHelper and Coordinator

a4b63f5

Cleanup

a2e9fc3

toString()

5a0efb5

rename and clean-up

aa9ce25

remove handleJoinRequest method from Coordinator

64e3cd4

checkstyle T_T

d45cc11

DaveCTurner reviewed Aug 23, 2018


ywelsch added 5 commits August 23, 2018 12:04

explicitly catch exception and handle it

49623cb

strengthen test

06ab9e1

add TODO comment to coordinator

5b390a8

Move handleJoinRequest to Coordinator

1c522ce

only use close method

66825c7

DaveCTurner reviewed Aug 23, 2018


add assertions

6566efb

DaveCTurner approved these changes Aug 23, 2018


ywelsch merged commit a0d32f5 into elastic:zen2 Aug 23, 2018

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

