
[ML] Consider cluster.remote.connect when assigning ML jobs #46025

Closed
droberts195 opened this issue Aug 27, 2019 · 11 comments
Labels
>enhancement :ml Machine learning

Comments

@droberts195
Contributor

Two different users have now run into a problem where ML datafeeds mysteriously fail when they are configured to use cross cluster search but they get assigned to nodes with cluster.remote.connect: false. This problem is very hard to debug.

We should take the value of the cluster.remote.connect setting into account during node assignment. A job whose associated datafeed uses cross cluster search should never be assigned to an ML node that has cluster.remote.connect: false. If this means no suitable ML node exists in the cluster, then an error message will be generated that explains the problem: either the ML nodes must have CCS enabled on them, or datafeeds cannot use CCS.
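
A minimal sketch of that assignment-time filter, using simplified placeholder types rather than the real Elasticsearch allocation classes (the Node record, the settings lookup, and the colon-prefixed remote index test are illustrative assumptions, not the actual implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CcsAwareNodeFilter {

    // Simplified stand-in for a discovery node and the settings it was started with.
    record Node(String name, boolean isMlNode, Map<String, String> settings) {
        boolean remoteConnectEnabled() {
            // cluster.remote.connect defaults to true when not set explicitly.
            return Boolean.parseBoolean(settings.getOrDefault("cluster.remote.connect", "true"));
        }
    }

    // An index pattern such as "other_cluster:logs-*" means the datafeed needs CCS.
    static boolean datafeedUsesCcs(List<String> indices) {
        return indices.stream().anyMatch(index -> index.contains(":"));
    }

    static List<Node> eligibleNodes(List<Node> allNodes, List<String> datafeedIndices) {
        boolean needsCcs = datafeedUsesCcs(datafeedIndices);
        return allNodes.stream()
            .filter(Node::isMlNode)
            // Drop nodes that cannot open remote connections if the datafeed needs them.
            .filter(node -> needsCcs == false || node.remoteConnectEnabled())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("ml-1", true, Map.of("cluster.remote.connect", "false")),
            new Node("ml-2", true, Map.of())
        );
        List<Node> eligible = eligibleNodes(nodes, List.of("other_cluster:metrics-*"));
        if (eligible.isEmpty()) {
            // This is where the clearer error message described above would be raised.
            System.out.println("No ML node is permitted to use cross cluster search");
        } else {
            eligible.forEach(node -> System.out.println("could assign to " + node.name()));
        }
    }
}
```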

@droberts195 droberts195 added >enhancement :ml Machine learning labels Aug 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@benwtrent
Member

@droberts195 what do you think about having this check within the Datafeed node assignment? It seems to fit more naturally there, and those errors still pop up in the UI.

@droberts195
Contributor Author

what do you think about having this check within the Datafeed node assignment

The datafeed node assignment is currently "assign the datafeed to the same node as its associated job". (This was originally done to avoid two network trips for the extracted data: currently, when the datafeed posts the extracted data to the job, even though it's calling a separate endpoint, the data is just transferred between threads in the same JVM.)

If we changed the datafeed node assignment such that it could fail after assignment of the associated job succeeded, then you could get a situation where the job is open and hogging resources while the datafeed fails to start. This is another indication that we probably should have made datafeeds and anomaly detection jobs a single entity in the first place.

It would be very frustrating for a user if they had 2 ML nodes, one of which was permitted to do CCS and the other not, and we assigned a job whose datafeed required CCS to the node that was not allowed to do CCS and then errored when trying to assign the datafeed to that same node.

So I think if we did the check at the point of assigning the datafeed then we would have to relax the requirement that datafeeds get assigned to the same nodes as their associated jobs. But this would probably open up whole classes of other problems.

I'm currently working on the principle that we should treat datafeeds and their associated anomaly detection jobs as one thing and not worry about binding them ever more tightly together.

Having said that, if there's a quick win in having datafeeds that use CCS check the cluster.remote.connect setting as the first thing they do, and fail fast if they find it's false on the node they've been allocated to, then that's still an improvement over where we are today. It would be frustrating in the case where another ML node could have successfully run the datafeed, but not as frustrating as silently failing to achieve anything, as happens today.
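
A minimal sketch of that fail-fast check, assuming a simplified settings lookup and a generic exception type rather than the real Elasticsearch APIs:

```java
import java.util.List;

public class DatafeedStartCheck {

    // "cluster_alias:index" patterns mean the datafeed needs a remote connection.
    static boolean usesCcs(List<String> indices) {
        return indices.stream().anyMatch(index -> index.contains(":"));
    }

    // Called as the first step of starting the datafeed on its assigned node.
    static void failFastIfNodeCannotDoCcs(List<String> indices, boolean remoteConnectEnabled) {
        if (usesCcs(indices) && remoteConnectEnabled == false) {
            throw new IllegalStateException(
                "Datafeed uses cross cluster search but cluster.remote.connect is false "
                    + "on this node; enable it on all ML nodes or remove remote indices "
                    + "from the datafeed");
        }
    }

    public static void main(String[] args) {
        // Simulate being allocated to a node that cannot do CCS: the datafeed fails
        // immediately with a clear message instead of silently retrieving no data.
        failFastIfNodeCannotDoCcs(List.of("other_cluster:logs-*"), false);
    }
}
```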

@benwtrent
Member

benwtrent commented Aug 27, 2019

The reason I bring this up is that there will be complications adding this check for the anomaly detection job assignment. During the node assignment phase all the checks need to be synchronous and rely on the current cluster state. This poses a problem because the datafeed task is not created yet (I think) and we would have to read the configuration via the index.

Having a "fail fast" path would definitely be a win, and I can write that up fairly quickly. At least that way we don't get mysterious errors when the DF attempts to start.

@droberts195
Contributor Author

Yes, true, the separation between jobs and datafeeds makes this very hard.

The pragmatic approach is to document that if you want to use CCS in datafeeds then every ML node must be permitted to do CCS. Then the error path that leaves the job open with no running datafeed when a user breaches this rule is not so bad.

@hendrikmuhs

FWIW #53924 made this significantly easier to implement; see the similar implementation in transforms: #54217

@jasontedor
Member

I think that this issue can be closed now?

@hendrikmuhs

I think that this issue can be closed now?

I do not think so, but @droberts195 should know better. The task assigner for ML still needs to query the newly introduced role and assign jobs/datafeeds according to it iff a remote connection is required.

(FWIW: for transforms the issue is solved, we migrated to the new node role)

@droberts195
Contributor Author

The state we are currently in is that if you have a cluster where some ML nodes can do cross cluster search and some ML nodes cannot then a datafeed that requires cross cluster search will eventually fail with this error:

" Please enable node.remote_cluster_client for all machine learning nodes.";

If we think that's a good enough solution to the problem then this issue can be closed.

Transforms does better because if you have a cluster where some transform nodes can do cross cluster search and some transform nodes cannot then the node assignment will put the transforms that require cross cluster search on the transform nodes that can do it.

We could make ML do better, but it's non-trivial due to the job/datafeed split as detailed in earlier comments. We would need to get the job's datafeed early in the job assignment process, check if it needed CCS, then pass an extra boolean to the job node selector to say whether CCS was required. Then the job node selector could take this requirement into account like the transform node assignment does.
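
A rough sketch of that flow, with simplified placeholder types standing in for the real datafeed configuration and job node selector, and assuming the node.remote_cluster_client role mentioned in the error message above:

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class CcsAwareJobNodeSelector {

    // Simplified placeholders for a node and a datafeed configuration.
    record Node(String name, Set<String> roles) {}
    record DatafeedConfig(String jobId, List<String> indices) {
        boolean requiresRemoteConnection() {
            return indices.stream().anyMatch(index -> index.contains(":"));
        }
    }

    // Step 1: look up the job's datafeed early and derive the extra boolean.
    static boolean jobRequiresCcs(String jobId, List<DatafeedConfig> datafeeds) {
        return datafeeds.stream()
            .filter(datafeed -> datafeed.jobId().equals(jobId))
            .anyMatch(DatafeedConfig::requiresRemoteConnection);
    }

    // Step 2: the node selector takes that requirement into account,
    // much like the transform node assignment does.
    static Optional<Node> selectNode(List<Node> nodes, boolean requiresCcs) {
        return nodes.stream()
            .filter(node -> node.roles().contains("ml"))
            .filter(node -> requiresCcs == false || node.roles().contains("remote_cluster_client"))
            .findFirst();
    }

    public static void main(String[] args) {
        List<DatafeedConfig> datafeeds = List.of(
            new DatafeedConfig("job-1", List.of("other_cluster:metrics-*")));
        List<Node> nodes = List.of(
            new Node("ml-1", Set.of("ml", "data")),
            new Node("ml-2", Set.of("ml", "remote_cluster_client")));

        boolean requiresCcs = jobRequiresCcs("job-1", datafeeds);
        // Prints "ml-2": the only ML node carrying the remote_cluster_client role.
        System.out.println(selectNode(nodes, requiresCcs).map(Node::name).orElse("unassignable"));
    }
}
```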

I've added the team-discuss label so we can discuss whether it's worth putting in this effort or whether the current error message is good enough.

@jasontedor
Member

Thanks for clarifying @hendrikmuhs and @droberts195.

@droberts195
Contributor Author

We discussed this and decided that it's too much of a complication to try to support ML jobs that require cross cluster search when only a subset of ML nodes can handle them. It would make assignment much harder. If we claimed to support it, people could legitimately ask why we assigned jobs that didn't require CCS to the subset of nodes that supported CCS, only to find ourselves later unable to assign the jobs that do require it. Therefore we will stick with the current approach of throwing an error stating that all ML nodes must allow CCS if any job uses CCS.
