-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor tasks to improve APM support #87917
Refactor tasks to improve APM support #87917
Conversation
Part of elastic#84369. Split out from elastic#87696. Rework how some work is executed by creating child tasks for them, so that when traced by APM, it results in more meaningful parent and child tasks in the UI. It also improves how Elasticsearch is modelling the work.
Pinging @elastic/es-distributed (Team:Distributed) |
@elasticmachine run elasticsearch-ci/part-1 because the failure doesn't reproduce. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks mostly good. I hope we can fix the constructor order. Also, I would like to see a bit of shallow testing that we actually utilize the task manager/generate child tasks during cluster state publishing and recovery, just to retain the functionality for the future.
server/src/main/java/org/elasticsearch/cluster/service/ClusterService.java
Outdated
Show resolved
Hide resolved
@@ -74,6 +78,8 @@ public class MasterService extends AbstractLifecycleComponent { | |||
|
|||
static final String MASTER_UPDATE_THREAD_NAME = "masterService#updateTask"; | |||
|
|||
public static final String STATE_UPDATE_ACTION_NAME = "internal:cluster/coordination/update_state"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a transport action name. In other places where we register a task, we pick a simpler name with no "internal:" prefix, for instance for enrich, it is just policy_execution
. I'd prefer the same here to avoid the confusion. publish_cluster_state_update
for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there are also examples in the same form - for example, JoinHelper
has 3 action names that all start with internal:cluster/coordination
. The internal:
prefix will also make it easier to filter out cluster management tasks when capturing traces, because we can filter in and filter out by span name, which here equates to task name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those 3 are different, they are all 3 registered as transport actions and used on the wire. A common prefix for all artificial actions would be fine though, we just have not had it. We could start one now or leave this without one. I'd prefer not to reuse the space we have. I think filtering internal:
is not going to be useful anyway, so many actions under that anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that filtering on prefixes like internal:
is probably not helpful, and indeed end-users probably don't know much about the structure of these action names. It would have been good to have all action names following the same structure as the ones for transport actions, but I think we've already crossed that bridge unfortunately. So yes a bare publish_cluster_state_update
would be ok with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the task name, and had a go at adding a test in MasterServiceTests
. Please let me know what you think, I hacked it together from other examples.
I looked into adding something similar for recoveries, but I'm a bit lost. I looked at PeerRecoverySourceServiceTests
since it's already being changed to create a task to pass in, but the existing test is very narrow in scope and it's not obvious how to problem the task-executing behaviour without lots of new setup code or by opening up method visibility, neither of which is appealing. I'm hoping someone can point somewhere and say "oh, you just need something a bit like that."
...anyone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had something like IndexRecoveryIT.testOngoingRecoveryAndMasterFailOver
in mind, i.e., ensure a recovery, then block it through transport, then check that the task is registered correctly (not doing the master failover part of the test).
I'd be happy to hack something together.
Did not look at your test yet, might be sufficient.
@henningandersen CI is green, can you take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Node
construction part looks good to me. I left a couple more comments.
Also, I would like to see a bit of shallow testing that we actually utilize the task manager/generate child tasks during cluster state publishing and recovery, just to retain the functionality for the future.
I did not see this added, will you look into that please?
@@ -204,6 +225,28 @@ public TransportService( | |||
@Nullable ClusterSettings clusterSettings, | |||
Set<String> taskHeaders, | |||
ConnectionManager connectionManager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd get rid of both the two old constructors, but one of them is widely used in tests. This one however looks like it is only used in a couple of tests and I'd prefer to just use the one below, creating a task manager in the test.
Can we add a comment to the other constructor that accepts "task headers" that it is used in tests only?
server/src/main/java/org/elasticsearch/cluster/service/ClusterService.java
Outdated
Show resolved
Hide resolved
@@ -74,6 +78,8 @@ public class MasterService extends AbstractLifecycleComponent { | |||
|
|||
static final String MASTER_UPDATE_THREAD_NAME = "masterService#updateTask"; | |||
|
|||
public static final String STATE_UPDATE_ACTION_NAME = "internal:cluster/coordination/update_state"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those 3 are different, they are all 3 registered as transport actions and used on the wire. A common prefix for all artificial actions would be fine though, we just have not had it. We could start one now or leave this without one. I'd prefer not to reuse the space we have. I think filtering internal:
is not going to be useful anyway, so many actions under that anyway.
This class can demonstrate it works for recovery too: Click to show class
|
@henningandersen brilliant! Thank you very much for the test 🙏 Can I ask for a final approval now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the extra iterations.
Part of #84369. Split out from #87696. Rework how some work is executed
by creating child tasks for them, so that when traced by APM, it results
in more meaningful parent and child tasks in the UI. It also improves
how Elasticsearch is modelling the work.