
Refactor GatewayService #99994

Merged: 19 commits into elastic:main on Oct 10, 2023

Conversation

ywangd (Member) commented Sep 28, 2023

This PR refactors GatewayService with the goal of making it easier to add new features.

Resolves: #89310

@ywangd ywangd added >non-issue :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.11.0 labels Sep 28, 2023
@ywangd ywangd requested a review from DaveCTurner September 28, 2023 06:46
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 28, 2023
elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed (Team:Distributed)

ywangd (Member Author) commented Sep 28, 2023

@DaveCTurner This is my first attempt to "clean up" the GatewayService class. I tried to work with your suggestion to use SubscribableListener and remove the boolean flags, but I am not sure whether this matches what you had planned, especially around

one-shot semantics and make things sensitive to the master term

I am open to any suggestions and comments. Once we are happy with the overall approach, I also plan to add a few more tests. Thanks!

}

// This node is ready to schedule state recovery
thisRecoveryPlanned.addListener(new ActionListener<>() {
ywangd (Member Author):

I wanted to use andThen, but it does not expose the exception to the consumer, and we need the exception here to handle the timeout.
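
For reference, a minimal sketch of the pattern being described (illustrative only, not the exact code in the PR): addListener with a full ActionListener exposes both callbacks, so the timeout can be handled on the failure path.

thisRecoveryPlanned.addListener(new ActionListener<>() {
    @Override
    public void onResponse(Void ignored) {
        // the planned recovery can go ahead
    }

    @Override
    public void onFailure(Exception e) {
        // the timeout arrives here as an exception, which andThen would not surface to us
    }
});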

DaveCTurner (Contributor) commented Sep 28, 2023

Certainly nicer already, yes. Just playing around with ideas here, but I'd like to try having an object which represents roughly "a pending state recovery in master term t". The master term always increases, and if we need to retry then that'll always be in a later term, so we'll get a new one of these pending-state-recovery objects. Does that make sense?


Edit to add: don't try too hard to preserve the exact semantics of the timeout as it stands today - it would probably fit the implementation I suggested better if we started a new timeout when a new master is elected. That's ok IMO (indeed I think I'd prefer it).
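
To make the shape of that idea concrete, here is a minimal plain-Java sketch (names and structure are illustrative only, not the PR's actual classes):

import java.util.concurrent.atomic.AtomicReference;

class PendingRecoveryTrackerSketch {
    // One pending-state-recovery object per master term.
    static final class PendingStateRecovery {
        final long term;

        PendingStateRecovery(long term) {
            this.term = term;
        }
    }

    private final AtomicReference<PendingStateRecovery> current = new AtomicReference<>();

    // The master term only ever increases, so a retry in a later term always gets a fresh object.
    PendingStateRecovery onClusterStateWithTerm(long currentTerm) {
        return current.updateAndGet(
            existing -> existing == null || existing.term < currentTerm ? new PendingStateRecovery(currentTerm) : existing
        );
    }
}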

Comment on lines 160 to 164
if (pendingStateRecovery.term < currentTerm) {
// Always start a new state recovery if the master term changes
// If there is a previous one still waiting, both will run but at most one of them will
// actually make changes to cluster state
pendingStateRecovery = new PendingStateRecovery(currentTerm);
ywangd (Member Author):

Here the code always schedules a new state recovery if it sees a new term (and the cluster state does need recovery). I am not sure if this is what you meant by

I suggested better if we started a new timeout when a new master is elected

This code does not try to cancel the previous pending state recovery, since cancellation can only be best-effort: the task may have already been submitted by the time we try to cancel it. Please let me know if you disagree.

DaveCTurner (Contributor):

Yep, that's the direction I was thinking of indeed. Looking better.

I think I'd like the RecoverStateUpdateTask to remember the corresponding term with which it was registered, and verify that the term hasn't changed before it does anything. That way we don't need to worry about spurious recoveries from older terms that are no longer correct.

ywangd (Member Author):

That's a good idea. I added the term check.
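
In sketch form, the term check amounts to something like this (simplified, with illustrative names, not the actual RecoverStateUpdateTask):

// Simplified sketch: the task remembers the term it was registered for and becomes a
// no-op if a newer master term is observed by the time it executes.
final class TermCheckedRecoveryTaskSketch {
    private final long expectedTerm;

    TermCheckedRecoveryTaskSketch(long expectedTerm) {
        this.expectedTerm = expectedTerm;
    }

    boolean shouldRecover(long currentTerm) {
        // a stale recovery registered under an older term must not modify the cluster state
        return currentTerm == expectedTerm;
    }
}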

ywangd (Member Author) commented Sep 28, 2023

I'd like to try having an object which represents roughly "a pending state recovery in master term t". The master term always increases, and if we need to retry then that'll always be in a later term so we'll get a new one of these pending-state-recovery objects.

I pushed 571db2d as an attempt to implement this idea. Please let me know whether it makes sense. Thanks!

DaveCTurner (Contributor) left a comment:

I like the direction. Left a few more suggestions.


if (state.nodes().isLocalNodeElectedMaster() == false) {
if (nodes.getMasterNodeId() == null) {
DaveCTurner (Contributor):

I think this is already covered by the check on nodes.isLocalNodeElectedMaster().

ywangd (Member Author):

Yep, I kept it because it has a separate logging message. I have now moved it inside the isLocalNodeElectedMaster check to retain the logging message.

if (state.nodes().getMasterNodeId() == null) {
logger.debug("not recovering from gateway, no master elected yet");
} else if (recoverAfterDataNodes != -1 && nodes.getDataNodes().size() < recoverAfterDataNodes) {
if (recoverAfterDataNodes != -1 && nodes.getDataNodes().size() < recoverAfterDataNodes) {
DaveCTurner (Contributor):

Maybe move this condition into the per-term check too?

ywangd (Member Author):

I moved the check inside the new PendingStateRecovery class, before scheduling the recovery. I assume this is what you mean by "per-term", i.e. it is not about checking it inside ClusterStateUpdateTask#execute. The former allows us to keep the same semantics as today, i.e. no action at all until the required number of data nodes is met.
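
In sketch form, the per-term gating described above looks roughly like this (illustrative, assuming -1 means "no requirement" as in the existing recoverAfterDataNodes check):

// Sketch: take no action at all until enough data nodes have joined, preserving
// today's semantics; a later cluster state update will retry.
void maybeStart(int currentDataNodeCount, int recoverAfterDataNodes) {
    if (recoverAfterDataNodes != -1 && currentDataNodeCount < recoverAfterDataNodes) {
        return; // below the threshold: do not schedule or submit anything yet
    }
    // ... schedule or immediately run this term's state recovery ...
}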


}

void maybeStart(int dataNodeSize) {
final SubscribableListener<Void> thisRecoveryPlanned;
DaveCTurner (Contributor):

On reflection, doing this with a SubscribableListener actually seems more awkward than just sticking with the original threadPool.schedule.

ywangd (Member Author):

Sure, I changed it to just use threadPool.schedule. SubscribableListener might be more suitable for this use case if it did not have to use an Exception to signal the timeout.
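
As a plain-Java analogue of the schedule-based approach (illustrative only; the PR uses the Elasticsearch ThreadPool rather than a ScheduledExecutorService):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class ScheduledRecoverySketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile ScheduledFuture<?> scheduledRecovery;

    // The timeout is just a deferred runnable; no exception plumbing is needed to signal it.
    void scheduleRecovery(Runnable recovery, long delayMillis) {
        scheduledRecovery = scheduler.schedule(recovery, delayMillis, TimeUnit.MILLISECONDS);
    }

    void cancelScheduledRecovery() {
        final ScheduledFuture<?> scheduled = scheduledRecovery;
        if (scheduled != null) {
            scheduled.cancel(false); // best effort: the task may already be running
        }
    }
}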

@ywangd ywangd requested a review from DaveCTurner October 3, 2023 03:42
ywangd (Member Author) commented Oct 3, 2023

I made a rather large change based on the last round of feedback. The code no longer needs synchronization; it now uses an AtomicReference to hold the current pending recovery. The task still cleans itself up when it finishes, either successfully or exceptionally. I personally think the code looks simpler with the latest change.

DaveCTurner (Contributor):

Right, yes, this is looking even better. But we are still doing some kind of complex state management within each term via resetState(). I think it'd be yet simpler to keep hold of the same PendingStateRecovery instance for the entire master term, and track whether we've enqueued this term's state update or not as a field in there.

DaveCTurner (Contributor):

I pushed a sketch of an idea to main...DaveCTurner:elasticsearch:2023/10/03/GatewayService-ideas (not tested or even necessarily fully-formed, but maybe it is useful)

Comment on lines 217 to 230
runRecoveryImmediately();
} else if (recoverAfterTime == null) {
logger.debug("performing state recovery of term [{}], no delay time is configured", expectedTerm);
runRecoveryImmediately();
} else {
if (scheduledRecovery == null) {
logger.info(
"delaying initial state recovery for [{}] of term [{}]. expecting [{}] data nodes, but only have [{}]",
recoverAfterTime,
expectedTerm,
expectedDataNodes,
currentDataNodeSize
);
scheduledRecovery = threadPool.schedule(getScheduleTask(), recoverAfterTime, threadPool.generic());
ywangd (Member Author):

@DaveCTurner There are still some complexities with this block of code and other related areas.

  1. This class has no state to remember that it has attempted to recover immediately. Therefore, if the cluster can recover immediately, it can potentially submit multiple cluster state update tasks. Is that a problem?
  2. For the scheduled case, we can mostly avoid multiple schedules by checking whether scheduledRecovery is null. There can still be edge cases where we schedule more than once due to a race between checking scheduledRecovery and resetting it back to null. If submitting multiple update tasks isn't an issue, we could also choose not to check it at all and just always schedule?
  3. Because we need to reset scheduledRecovery back to null in the scheduled runnable, it needs to be made volatile as well.

Do we need to address the 1st point? If so, it seems we need another state variable for it. I forgot to mention it during the sync, but this was one of the original complexities. Also, because the ClusterStateUpdateTask may not run the recovery inside execute due to the data node count dropping again, the state needs to be reset from within the task, which brings back the need to pass a "runAfter" into the task. To simplify things, I think we don't want to check the data node count again inside the task. It's an edge case anyway, and dropping it makes things simpler. But we will still need some other state management if we want to address the 1st point. What do you think?

DaveCTurner (Contributor):

  1. Good point. I think we should only ever submit one cluster state update task per term, so we ought to track this with a flag within the per-term state.

  2. I would not expect any races here, or rather I think if we keep track of whether we've submitted the cluster state update task then that solves those races.

  3. Good point, although since we only do that when actually submitting the task again I think the solution is to make this submission a once-only thing.

ywangd (Member Author):

Pushed 25bbeba to add a new state variable (an AtomicBoolean), which solved multiple issues that I had. Thanks!

Please let me know if the main code looks good to you. I'll proceed to add some more tests if you are happy with the main code changes. Please also let me know if you have any ideas for what kind of tests we might need. Thanks!
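
For illustration, the once-only submission guard can be as small as this (a sketch, not the exact code in 25bbeba):

import java.util.concurrent.atomic.AtomicBoolean;

final class OncePerTermSubmissionSketch {
    // Set to true the first time this term's cluster state update task is submitted.
    private final AtomicBoolean taskSubmitted = new AtomicBoolean();

    void maybeSubmit(Runnable submitTask) {
        if (taskSubmitted.compareAndSet(false, true)) {
            submitTask.run(); // only the first caller submits; later calls in the same term are no-ops
        }
    }
}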

ywangd (Member Author):

I have now added multiple tests to cover different scenarios in bf8f21a. The whole thing now looks ready to me.

ywangd (Member Author) commented Oct 4, 2023

@elasticmachine run elasticsearch-ci/part-1

@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
DaveCTurner (Contributor) left a comment:

The implementation looks good now, but I do not like the mocks-heavy tests you've added. Tests like this are very fragile and tend to obstruct other refactorings in the same area. There's no need for mocks here: we can create realistic versions of all the dependencies in ways that let us verify, through their stable APIs, that this component really does run its tasks at the right time and not otherwise.

I pushed a rough commit which shows how to do that. It doesn't cover everything and probably deserves some more abstraction to split it into multiple test cases, but hopefully it's a useful start.

This pattern of creating a ClusterService that uses a deterministic threadpool with a fake clock is used in a few other places too:

  • org.elasticsearch.cluster.InternalClusterInfoServiceSchedulingTests#testScheduling
  • org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocatorTests#testAllocate
  • org.elasticsearch.snapshots.SnapshotResiliencyTests.TestClusterNodes.TestClusterNode#TestClusterNode

They're not all quite the same but it feels like this should be something we can move into ClusterServiceUtils.
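
Conceptually, the fake-clock pattern boils down to something like the following plain-Java sketch (illustrative only; the real tests use the Elasticsearch DeterministicTaskQueue and the ClusterService test utilities):

import java.util.Comparator;
import java.util.PriorityQueue;

final class FakeClockTaskQueueSketch {
    private record Deferred(long executeAtMillis, Runnable task) {}

    private long currentTimeMillis;
    private final PriorityQueue<Deferred> deferred =
        new PriorityQueue<>(Comparator.comparingLong(Deferred::executeAtMillis));

    void scheduleAt(long timeMillis, Runnable task) {
        deferred.add(new Deferred(timeMillis, task));
    }

    // Advance the fake clock to the next deferred task and run everything that is now due,
    // so the test is fully deterministic and never sleeps on a real clock.
    void advanceTimeAndRunDueTasks() {
        if (deferred.isEmpty() == false) {
            currentTimeMillis = deferred.peek().executeAtMillis();
        }
        while (deferred.isEmpty() == false && deferred.peek().executeAtMillis() <= currentTimeMillis) {
            deferred.poll().task().run();
        }
    }
}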

Comment on lines +41 to +42
* @return The resulting cluster state after executing all the tasks. If {@code batchExecutionContext.initialState()} is returned then
* no update is published.
ywangd (Member Author):

The real change here is just adding the missing @ sign. The other change is due to cascading line wrap.

Comment on lines 372 to 385
deterministicTaskQueue.scheduleAt(
initialTimeInMillis + elapsed.millis(),
() -> setDataNodeCountTaskQueue.submitTask(randomAlphaOfLength(5), new SetDataNodeCountTask(recoverAfterNodes - 1), null)
);
deterministicTaskQueue.advanceTime();
deterministicTaskQueue.runAllRunnableTasks();

// The 2nd scheduled recovery when data nodes are above recoverAfterDataNodes again
deterministicTaskQueue.scheduleAt(
initialTimeInMillis + elapsed.millis() * 2,
() -> setDataNodeCountTaskQueue.submitTask(randomAlphaOfLength(5), new SetDataNodeCountTask(recoverAfterNodes), null)
);
deterministicTaskQueue.advanceTime();
deterministicTaskQueue.runAllRunnableTasks();
ywangd (Member Author) commented Oct 10, 2023:

These advanceTime() then runAllRunnableTasks() calls could be simplified if we had a method like runToTime(long) that runs all tasks up to the specified time. I can open a follow-up PR to add this method if you think it is useful.

DaveCTurner (Contributor):

Yep I can see value in that, please do.
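
A possible shape for that helper, sketched only from the methods already used in this test (runToTime does not exist yet; hasDeferredTasks and getCurrentTimeMillis are assumed to be available on DeterministicTaskQueue):

// Hypothetical helper, not an existing API: keep advancing the fake clock and running
// whatever becomes runnable until the target time is reached. Note that advanceTime()
// jumps to the next deferred task's execution time, so a real implementation would
// probably clamp the clock at timeInMillis rather than overshooting it.
static void runToTime(DeterministicTaskQueue deterministicTaskQueue, long timeInMillis) {
    deterministicTaskQueue.runAllRunnableTasks();
    while (deterministicTaskQueue.hasDeferredTasks() && deterministicTaskQueue.getCurrentTimeMillis() < timeInMillis) {
        deterministicTaskQueue.advanceTime();
        deterministicTaskQueue.runAllRunnableTasks();
    }
}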

DaveCTurner (Contributor) left a comment:

Thanks Yang, much nicer IMO. I left one small request about the tests; it's not essential, but it would be nice if you can find a way to do it without masses of duplication.

Comment on lines 276 to 278
// The 1st scheduled recovery may or may not run depending on what happens to the cluster next
final int caseNo = randomIntBetween(0, 2);
switch (caseNo) {
DaveCTurner (Contributor):

Could we test these cases as separate tests (so they all run every time)?

ywangd (Member Author):

Yeah that's fair. I pushed 19f89ba

ywangd (Member Author) commented Oct 10, 2023

No kidding, this issue is a great source for learning. I spent quite some time reading through the tests you proposed and the related production code. It felt productive. Thanks!

The PR is now updated accordingly. I re-organized the tests for the different test purposes and chose to advance the time manually in some cases to test cancellation. But otherwise the tests are mostly what you suggested.

@ywangd ywangd requested a review from DaveCTurner October 10, 2023 08:18
DaveCTurner (Contributor) left a comment:

One more small observation about the tests (I could be persuaded that it's not a blocker)

final var settings = settingsBuilder.build();
final var clusterSettings = createBuiltInClusterSettings(settings);

clusterService = new ClusterService(
DaveCTurner (Contributor):

Hmm it seems a bit odd to set this field here. Could we avoid storing it in a field and instead create the ClusterService in its own method, and then create the GatewayService from that?

ywangd (Member Author):

Sure, I pushed b5da2d1.

An easier alternative is to access the ClusterService from the GatewayService. But that requires making the GatewayService.clusterService field package-private (or adding a package-private accessor method). Though this pattern is used in many places for testing, I am not sure whether you'd like it, so I did not take this approach. Please let me know whether it would work for you, since it might be useful in the future.

DaveCTurner (Contributor) left a comment:

LGTM

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 10, 2023
@elasticsearchmachine elasticsearchmachine merged commit e351c68 into elastic:main Oct 10, 2023
@ywangd ywangd deleted the es-89310-gateway-service branch October 10, 2023 11:27
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Oct 11, 2023
Labels
  • auto-merge-without-approval: Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
  • :Distributed Coordination/Cluster Coordination: Cluster formation and cluster state publication, including cluster membership and fault detection.
  • >non-issue
  • Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
  • v8.12.0
Development

Successfully merging this pull request may close these issues.

Clean up GatewayService
4 participants