Optimize Task Manager parent bans #51157

imotov · 2020-01-17T13:49:32Z

Reduces network traffic when cancelling parents. Instead of
broadcasting a parent ban request to all nodes, we now keep track of
nodes with child tasks and send ban requests only to these nodes.

Relates to #50990
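
For illustration only, the bookkeeping described above might look roughly like the following minimal sketch. It is not the code from this PR: the class name `ChildNodeTracker` is hypothetical, nodes are represented by plain string IDs rather than `DiscoveryNode`, and the choice to make `startBan` idempotent and to reject new registrations after a ban are assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the Elasticsearch TaskManager.
// Node IDs are plain strings here instead of DiscoveryNode (assumption).
final class ChildNodeTracker {
    private final Set<String> childNodes = new HashSet<>();
    private boolean banChildren = false;

    // Record that a child request is about to be sent to the given node.
    // Assumption: once the parent is banned, no new child may be registered.
    synchronized void registerChildNode(String nodeId) {
        if (banChildren) {
            throw new IllegalStateException("parent task was cancelled; cannot start child tasks");
        }
        childNodes.add(nodeId);
    }

    // Mark the parent as cancelled and return only the nodes that were
    // registered, so ban requests go to those nodes instead of the whole cluster.
    // Assumption: calling this twice is harmless and returns the same set.
    synchronized Set<String> startBan() {
        banChildren = true;
        return Collections.unmodifiableSet(new HashSet<>(childNodes));
    }
}
```

In this sketch a caller would invoke `registerChildNode` before sending each child request and, on cancellation, send ban requests only to the set returned by `startBan`.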

elasticmachine · 2020-01-17T13:49:34Z

Pinging @elastic/es-distributed (:Distributed/Task Management)

ywelsch · 2020-01-17T17:43:57Z

@tbrooks8 Can you have a look here as well? The idea is to make the banning more targeted (currently it's a broadcast to all nodes), which will help in the future where we not only want to cancel and ban immediate child tasks, but also the grand-children etc.

ywelsch

I left a few nits, and am also wondering if we should make this even more targeted (i.e. unregister nodes for which there are no pending child requests anymore).

ywelsch · 2020-01-20T15:57:54Z

...ava/org/elasticsearch/action/admin/cluster/node/tasks/cancel/TransportCancelTasksAction.java

            canceled = taskManager.cancel(cancellableTask, request.getReason(), banLock::onTaskFinished);
            if (canceled) {
                // /In case the task has some child tasks, we need to wait for until ban is set on all nodes
-                logger.trace("cancelling task {} on child nodes", cancellableTask.getId());
-                AtomicInteger responses = new AtomicInteger(childNodes.getSize());
+                logger.info("cancelling task {} on child nodes", cancellableTask.getId());


revert logging?

ywelsch · 2020-01-20T16:01:55Z

...ava/org/elasticsearch/action/admin/cluster/node/tasks/cancel/TransportCancelTasksAction.java

@@ -158,22 +161,36 @@ private void processResponse() {
        }
    }

-    private void setBanOnNodes(String reason, CancellableTask task, DiscoveryNodes nodes, ActionListener<Void> listener) {
+    private static Set<DiscoveryNode> withLocalNode(DiscoveryNode localNode, Set<DiscoveryNode> nodes) {


I don't understand this localNode business. As far as I'm aware, the cluster state's DiscoveryNodes object always contains the local node.

This also means that you can revert the other methods here to use DiscoveryNodes instead of Set<DiscoveryNode>

ywelsch · 2020-01-20T16:02:08Z

...ava/org/elasticsearch/action/admin/cluster/node/tasks/cancel/TransportCancelTasksAction.java

@@ -145,11 +147,12 @@ private void processResponse() {
                    }
                });
            }
+//            }


ywelsch · 2020-01-23T14:33:55Z

server/src/main/java/org/elasticsearch/tasks/TaskManager.java

+        public Set<DiscoveryNode> startBan() {
+            synchronized (this) {
+                if (banChildren) {
+                    throw new TaskCancelledException("The parent task was cancelled, shouldn't start any children tasks");


I wonder why we are throwing the exception here. Should this not just ignore if the flag is already set?

ywelsch · 2020-01-23T14:36:56Z

server/src/main/java/org/elasticsearch/tasks/TaskManager.java

@@ -431,6 +451,25 @@ public void waitForTaskCompletion(Task task, long untilInNanos) {
            this.task = task;
        }

+        public void registerChildNode(DiscoveryNode node) {


I wonder if we also need a method to unregister nodes after calls are completed (this would need a counter per node), so that a request that successively reaches out to a lot of nodes does not need to send ban requests to the many nodes where the requests have already completed.
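
The per-node counter idea raised in this comment could be sketched as follows. This is a hedged illustration, not part of the PR: `PendingChildRequests`, `unregisterChildNode`, and `nodesWithPendingRequests` are hypothetical names, and node IDs are plain strings rather than `DiscoveryNode`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-node reference counting for child requests.
final class PendingChildRequests {
    // Number of in-flight child requests per node ID.
    private final Map<String, Integer> pendingByNode = new HashMap<>();

    // Called before a child request is sent to the node.
    synchronized void registerChildNode(String nodeId) {
        pendingByNode.merge(nodeId, 1, Integer::sum);
    }

    // Called when a child request to the node has completed.
    // The entry is dropped once its count reaches zero; unknown nodes are ignored.
    synchronized void unregisterChildNode(String nodeId) {
        pendingByNode.computeIfPresent(nodeId, (id, count) -> count > 1 ? count - 1 : null);
    }

    // Nodes that still have at least one outstanding child request,
    // i.e. the only nodes that would need a ban request.
    synchronized Set<String> nodesWithPendingRequests() {
        return new HashSet<>(pendingByNode.keySet());
    }
}
```

With counters like these, a parent that has contacted many nodes in succession would send ban requests only to nodes whose count is still above zero at cancellation time.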

imotov · 2020-01-27T14:03:52Z

Sorry for the delay. I implemented throwing an exception in startBan, but now I am getting very infrequent test failures caused by a stuck task cancellation task. I cannot reproduce them without running the full build a few times, and the failures disappear when logging messages are added. I have a suspicion that they might be caused by persistent tasks cancellation, but I am not 100% sure. I am still digging.

imotov · 2020-02-03T14:43:59Z

I pushed the requested changes last week, except the unregistering part. I think the unregistering part is going to complicate things quite a bit, and I want to make sure we don't break things. If possible, I would like to implement unregistering in another iteration, after making sure that this iteration didn't break anything.

dnhatn · 2020-03-26T23:41:37Z

Superseded by #54312.

…4312)

Today when canceling a task we broadcast ban/unban requests to all nodes in the cluster. This strategy does not scale well for hierarchical cancellation. With this change, we will track outstanding child requests and broadcast the cancellation to only nodes that have outstanding child tasks. This change also prevents a parent task from sending child requests once it got canceled.

Relates #50990
Supersedes #51157

Co-authored-by: Igor Motov <[email protected]>
Co-authored-by: Yannick Welsch <[email protected]>

Optimize Task Manager parent bans

0556bc4

imotov added >enhancement, :Distributed Coordination/Task Management, v8.0.0, v7.7.0 labels Jan 17, 2020

ywelsch requested review from ywelsch and Tim-Brooks January 17, 2020 14:45

ywelsch reviewed Jan 23, 2020

imotov added 2 commits January 29, 2020 12:04

Merge remote-tracking branch 'elastic/master' into optimize-ban-nodes

425a1ce

Address review comments

d519156

bpintea added v7.8.0 and removed v7.7.0 labels Mar 25, 2020

dnhatn mentioned this pull request Mar 26, 2020

Broadcast cancellation to only nodes have outstanding child tasks #54312

Merged

dnhatn closed this Mar 26, 2020

dnhatn removed v7.8.0 v8.0.0 labels Mar 26, 2020

imotov deleted the optimize-ban-nodes branch May 1, 2020 22:26
