
[ML] Consider cluster.remote.connect when assigning ML jobs #46025

Closed
droberts195 opened this issue Aug 27, 2019 · 11 comments
Labels
>enhancement :ml Machine learning

Comments

@droberts195
Contributor

Two different users have now run into a problem where ML datafeeds mysteriously fail when they are configured to use cross cluster search but they get assigned to nodes with cluster.remote.connect: false. This problem is very hard to debug.

We should take the value of the cluster.remote.connect setting into account during node assignment. A job whose associated datafeed uses cross cluster search should never be assigned to an ML node that has cluster.remote.connect: false. If this means no suitable ML node exists in the cluster, then an error message will be generated that explains the problem: either the ML nodes must have CCS enabled on them, or datafeeds cannot use CCS.
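
A minimal sketch of that assignment-time filter, using simplified placeholder types rather than the real Elasticsearch allocation classes (the Node record, the settings lookup, and the colon-prefixed remote index test are illustrative assumptions, not the actual implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CcsAwareNodeFilter {

    // Simplified stand-in for a discovery node and the settings it was started with.
    record Node(String name, boolean isMlNode, Map<String, String> settings) {
        boolean remoteConnectEnabled() {
            // cluster.remote.connect defaults to true when not set explicitly.
            return Boolean.parseBoolean(settings.getOrDefault("cluster.remote.connect", "true"));
        }
    }

    // An index pattern such as "other_cluster:logs-*" means the datafeed needs CCS.
    static boolean datafeedUsesCcs(List<String> indices) {
        return indices.stream().anyMatch(index -> index.contains(":"));
    }

    static List<Node> eligibleNodes(List<Node> allNodes, List<String> datafeedIndices) {
        boolean needsCcs = datafeedUsesCcs(datafeedIndices);
        return allNodes.stream()
            .filter(Node::isMlNode)
            // Drop nodes that cannot open remote connections if the datafeed needs them.
            .filter(node -> needsCcs == false || node.remoteConnectEnabled())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("ml-1", true, Map.of("cluster.remote.connect", "false")),
            new Node("ml-2", true, Map.of())
        );
        List<Node> eligible = eligibleNodes(nodes, List.of("other_cluster:metrics-*"));
        if (eligible.isEmpty()) {
            // This is where the clearer error message described above would be raised.
            System.out.println("No ML node is permitted to use cross cluster search");
        } else {
            eligible.forEach(node -> System.out.println("could assign to " + node.name()));
        }
    }
}
```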

@droberts195 droberts195 added >enhancement :ml Machine learning labels Aug 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@benwtrent
Member

@droberts195 what do you think about having this check within the Datafeed node assignment? It seems to fit more naturally there, and those errors still pop up in the UI.

@droberts195
Contributor Author

what do you think about having this check within the Datafeed node assignment

The datafeed node assignment is currently "assign the datafeed to the same node as its associated job". (This was originally done to avoid two network trips for the extracted data: currently, when the datafeed posts the extracted data to the job, even though it's calling a separate endpoint, the data is just transferred between threads in the same JVM.)

If we changed the datafeed node assignment such that it could fail after assignment of the associated job succeeded, then you could get a situation where the job is open and hogging resources while the datafeed fails to start. This is another indication that we probably should have made datafeeds and anomaly detection jobs a single entity in the first place.

It would be very frustrating for a user if they had 2 ML nodes, one of which was permitted to do CCS and the other not, and we assigned a job whose datafeed required CCS to the node that was not allowed to do CCS and then errored when trying to assign the datafeed to that same node.

So I think if we did the check at the point of assigning the datafeed then we would have to relax the requirement that datafeeds get assigned to the same nodes as their associated jobs. But this would probably open up whole classes of other problems.

I'm currently working on the principle that we should treat datafeeds and their associated anomaly detection jobs as one thing and not worry about binding them ever more tightly together.

Having said that, if there's a quick win in having datafeeds that use CCS check the cluster.remote.connect setting as the first thing they do, and fail fast if they find it's false on the node they've been allocated to, then that's still an improvement over where we are today. It would be frustrating in the case where another ML node could have successfully run the datafeed, but not as frustrating as silently failing to achieve anything, as happens today.
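
A minimal sketch of that fail-fast check, assuming a simplified settings lookup and a generic exception type rather than the real Elasticsearch APIs:

```java
import java.util.List;

public class DatafeedStartCheck {

    // "cluster_alias:index" patterns mean the datafeed needs a remote connection.
    static boolean usesCcs(List<String> indices) {
        return indices.stream().anyMatch(index -> index.contains(":"));
    }

    // Called as the first step of starting the datafeed on its assigned node.
    static void failFastIfNodeCannotDoCcs(List<String> indices, boolean remoteConnectEnabled) {
        if (usesCcs(indices) && remoteConnectEnabled == false) {
            throw new IllegalStateException(
                "Datafeed uses cross cluster search but cluster.remote.connect is false "
                    + "on this node; enable it on all ML nodes or remove remote indices "
                    + "from the datafeed");
        }
    }

    public static void main(String[] args) {
        // Simulate being allocated to a node that cannot do CCS: the datafeed fails
        // immediately with a clear message instead of silently retrieving no data.
        failFastIfNodeCannotDoCcs(List.of("other_cluster:logs-*"), false);
    }
}
```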

@benwtrent
Member

benwtrent commented Aug 27, 2019

The reason I bring this up is that there will be complications adding this check for the anomaly detection job assignment. During the node assignment phase all the checks need to be synchronous and rely on the current cluster state. This poses a problem because the datafeed task is not created yet (I think) and we would have to read the configuration via the index.

Having a "fail fast" path would definitely be a win, and I can write that up fairly quickly. At least that way we don't get mysterious errors when the DF attempts to start.

@droberts195
Contributor Author

Yes, true, the separation between jobs and datafeeds makes this very hard.

The pragmatic approach is to document that if you want to use CCS in datafeeds then every ML node must be permitted to do CCS. Then the error path that leaves the job open with no running datafeed when a user breaches this rule is not so bad.

@hendrikmuhs

FWIW #53924 made this significantly easier to implement; see the similar implementation in transforms: #54217

@jasontedor
Member

I think that this issue can be closed now?

@hendrikmuhs

I think that this issue can be closed now?

I do not think so, but @droberts195 should know better. The task assigner for ML still needs to query the newly introduced role and assign jobs/datafeeds according to it iff a remote connection is required.

(FWIW: for transforms the issue is solved, we migrated to the new node role)

@droberts195
Contributor Author

The state we are currently in is that if you have a cluster where some ML nodes can do cross cluster search and some ML nodes cannot then a datafeed that requires cross cluster search will eventually fail with this error:

" Please enable node.remote_cluster_client for all machine learning nodes.";

If we think that's a good enough solution to the problem then this issue can be closed.

Transforms does better because if you have a cluster where some transform nodes can do cross cluster search and some transform nodes cannot then the node assignment will put the transforms that require cross cluster search on the transform nodes that can do it.

We could make ML do better, but it's non-trivial due to the job/datafeed split as detailed in earlier comments. We would need to get the job's datafeed early in the job assignment process, check if it needed CCS, then pass an extra boolean to the job node selector to say whether CCS was required. Then the job node selector could take this requirement into account like the transform node assignment does.
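
A rough sketch of that flow, with simplified placeholder types standing in for the real datafeed configuration and job node selector, and assuming the node.remote_cluster_client role mentioned in the error message above:

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class CcsAwareJobNodeSelector {

    // Simplified placeholders for a node and a datafeed configuration.
    record Node(String name, Set<String> roles) {}
    record DatafeedConfig(String jobId, List<String> indices) {
        boolean requiresRemoteConnection() {
            return indices.stream().anyMatch(index -> index.contains(":"));
        }
    }

    // Step 1: look up the job's datafeed early and derive the extra boolean.
    static boolean jobRequiresCcs(String jobId, List<DatafeedConfig> datafeeds) {
        return datafeeds.stream()
            .filter(datafeed -> datafeed.jobId().equals(jobId))
            .anyMatch(DatafeedConfig::requiresRemoteConnection);
    }

    // Step 2: the node selector takes that requirement into account,
    // much like the transform node assignment does.
    static Optional<Node> selectNode(List<Node> nodes, boolean requiresCcs) {
        return nodes.stream()
            .filter(node -> node.roles().contains("ml"))
            .filter(node -> requiresCcs == false || node.roles().contains("remote_cluster_client"))
            .findFirst();
    }

    public static void main(String[] args) {
        List<DatafeedConfig> datafeeds = List.of(
            new DatafeedConfig("job-1", List.of("other_cluster:metrics-*")));
        List<Node> nodes = List.of(
            new Node("ml-1", Set.of("ml", "data")),
            new Node("ml-2", Set.of("ml", "remote_cluster_client")));

        boolean requiresCcs = jobRequiresCcs("job-1", datafeeds);
        // Prints "ml-2": the only ML node carrying the remote_cluster_client role.
        System.out.println(selectNode(nodes, requiresCcs).map(Node::name).orElse("unassignable"));
    }
}
```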

I've added the team-discuss label so we can discuss whether it's worth putting in this effort or whether the current error message is good enough.

@jasontedor
Member

Thanks for clarifying @hendrikmuhs and @droberts195.

@droberts195
Contributor Author

We discussed this and decided that it's too much of a complication to try to support ML jobs that require cross cluster search when only a subset of ML nodes can handle them. It would make assignment much harder. If we claimed to support it, people could legitimately ask why we assigned jobs that didn't require CCS to the subset of nodes that supported CCS, only to find ourselves later unable to assign the jobs that do require it. Therefore we will stick with the current approach of throwing an error stating that all ML nodes must allow CCS if any job uses CCS.
