-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Task Manager parent bans #51157
Conversation
Reduces network traffic when cancelling parents. Instead of broadcasting parent ban request to all nodes, we now keep track of nodes with child tasks and only send ban requests to these nodes. Relates to elastic#50990
Pinging @elastic/es-distributed (:Distributed/Task Management) |
@tbrooks8 Can you have a look here as well? The idea is to make the banning more targeted (currently it's a broadcast to all nodes), which will help in the future where we not only want to cancel and ban immediate child tasks, but also the grand-children etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few nits, and am also wondering if we should make this even more targeted (i.e. unregister nodes for which there are no pending child requests anymore).
canceled = taskManager.cancel(cancellableTask, request.getReason(), banLock::onTaskFinished); | ||
if (canceled) { | ||
// /In case the task has some child tasks, we need to wait for until ban is set on all nodes | ||
logger.trace("cancelling task {} on child nodes", cancellableTask.getId()); | ||
AtomicInteger responses = new AtomicInteger(childNodes.getSize()); | ||
logger.info("cancelling task {} on child nodes", cancellableTask.getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert logging?
@@ -158,22 +161,36 @@ private void processResponse() { | |||
} | |||
} | |||
|
|||
private void setBanOnNodes(String reason, CancellableTask task, DiscoveryNodes nodes, ActionListener<Void> listener) { | |||
private static Set<DiscoveryNode> withLocalNode(DiscoveryNode localNode, Set<DiscoveryNode> nodes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this localNode
business. As far as I'm aware, the cluster state's DiscoveryNodes object always contains the local node.
This also means that you can revert the other methods here to use DiscoveryNodes
instead of Set<DiscoveryNode>
@@ -145,11 +147,12 @@ private void processResponse() { | |||
} | |||
}); | |||
} | |||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
public Set<DiscoveryNode> startBan() { | ||
synchronized (this) { | ||
if (banChildren) { | ||
throw new TaskCancelledException("The parent task was cancelled, shouldn't start any children tasks"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why we are throwing the exception here. Should this not just ignore if the flag is already set?
@@ -431,6 +451,25 @@ public void waitForTaskCompletion(Task task, long untilInNanos) { | |||
this.task = task; | |||
} | |||
|
|||
public void registerChildNode(DiscoveryNode node) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we also need a method to unregister nodes after calls are completed (would need a counter per node), so that a request that successively reaches out to a lot of nodes does not need to send ban requests to many of these nodes where the requests have been completed).
Sorry for the delay. I implemented throwing an exception in |
I pushed requested changes last week except the unregistering part. I think unregistering part is going to complicate things quite a bit and I want to make sure we don't break things. If possible I would like to implement unregistring in another iteration after making sure that this iteration didn't break things. |
Superseded by #54312. |
…4312) Today when canceling a task we broadcast ban/unban requests to all nodes in the cluster. This strategy does not scale well for hierarchical cancellation. With this change, we will track outstanding child requests and broadcast the cancellation to only nodes that have outstanding child tasks. This change also prevents a parent task from sending child requests once it got canceled. Relates #50990 Supersedes #51157 Co-authored-by: Igor Motov <[email protected]> Co-authored-by: Yannick Welsch <[email protected]>
…astic#54312) Today when canceling a task we broadcast ban/unban requests to all nodes in the cluster. This strategy does not scale well for hierarchical cancellation. With this change, we will track outstanding child requests and broadcast the cancellation to only nodes that have outstanding child tasks. This change also prevents a parent task from sending child requests once it got canceled. Relates elastic#50990 Supersedes elastic#51157 Co-authored-by: Igor Motov <[email protected]> Co-authored-by: Yannick Welsch <[email protected]>
…4312) Today when canceling a task we broadcast ban/unban requests to all nodes in the cluster. This strategy does not scale well for hierarchical cancellation. With this change, we will track outstanding child requests and broadcast the cancellation to only nodes that have outstanding child tasks. This change also prevents a parent task from sending child requests once it got canceled. Relates #50990 Supersedes #51157 Co-authored-by: Igor Motov <[email protected]> Co-authored-by: Yannick Welsch <[email protected]>
Reduces network traffic when cancelling parents. Instead of
broadcasting parent ban request to all nodes, we now keep track of
nodes with child tasks and only send ban requests to these nodes.
Relates to #50990