[ML] Consider cluster.remote.connect when assigning ML jobs #46025
Comments
Pinging @elastic/ml-core
@droberts195 what do you think about having this check within the datafeed node assignment? It seems to fit more naturally there, and those errors still pop up in the UI.
The datafeed node assignment is currently "assign the datafeed to the same node as its associated job". (This was originally done to avoid two network hops for the extracted data: when the datafeed posts the extracted data to the job, even though it's calling a separate endpoint, the data just moves between threads in the same JVM.) If we changed the datafeed node assignment so that it could fail after assignment of the associated job had succeeded, then you could get a situation where the job is open and hogging resources while the datafeed failed to start. This is another indication that we probably should have made datafeeds and anomaly detection jobs a single entity in the first place.

It would be very frustrating for a user if they had two ML nodes, one permitted to do CCS and the other not, and we assigned a job whose datafeed required CCS to the node that was not allowed to do CCS, then errored when trying to assign the datafeed to that same node. So if we did the check at the point of assigning the datafeed, I think we would have to relax the requirement that datafeeds get assigned to the same nodes as their associated jobs, and that would probably open up whole classes of other problems. I'm currently working on the principle that we should treat datafeeds and their associated anomaly detection jobs as one thing and not worry about binding them ever more tightly together. Having said that, if there's a quick win in having datafeeds that use CCS check the […]
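To make the "datafeed follows its job" binding described above concrete, here is a minimal Java sketch; the class and method names are invented for illustration and do not match the ML plugin's actual datafeed node selector:

```java
import java.util.Map;
import java.util.Optional;

// Simplified model of the rule described above: a datafeed is always assigned to the
// node that already runs its associated anomaly detection job. Names are illustrative,
// not the real ML plugin code.
final class DatafeedFollowsJobSelector {

    /** jobId -> id of the node the job's persistent task is currently running on. */
    private final Map<String, String> jobAssignments;

    DatafeedFollowsJobSelector(Map<String, String> jobAssignments) {
        this.jobAssignments = jobAssignments;
    }

    /** Returns the node for the datafeed, or empty if the job has no assignment yet. */
    Optional<String> selectNode(String jobId) {
        return Optional.ofNullable(jobAssignments.get(jobId));
    }
}
```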
The reason I bring this up is that there will be complications adding this check for the anomaly detection job assignment. During the node assignment phase, all the checks need to be synchronous and rely on the current cluster state. This poses a problem because the datafeed task is not created yet (I think) and we would have to read the configuration via the index. Having a "fail fast" path would definitely be a win, and I can write that up fairly quickly. At least that way we don't get mysterious errors when the datafeed attempts to start.
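A "fail fast" path along these lines could look roughly like the following sketch; it assumes a remote index is addressed as `cluster_alias:index_name`, and the class and method names are hypothetical:

```java
import java.util.List;

// Illustrative "fail fast" validation along the lines discussed above: reject the
// start-datafeed request up front when the datafeed needs cross cluster search but
// the candidate ML node has cluster.remote.connect: false.
final class DatafeedStartValidator {

    /** A remote index is addressed as "cluster_alias:index_name". */
    static boolean requiresRemoteClusters(List<String> indices) {
        return indices.stream().anyMatch(index -> index.contains(":"));
    }

    static void validate(List<String> indices, boolean nodeAllowsRemoteConnections) {
        if (requiresRemoteClusters(indices) && nodeAllowsRemoteConnections == false) {
            throw new IllegalStateException(
                "Datafeed uses cross cluster search, but the selected ML node has "
                    + "cluster.remote.connect set to false");
        }
    }
}
```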
Yes, true, the separation between jobs and datafeeds makes this very hard. The pragmatic approach is to document that if you want to use CCS in datafeeds then every ML node must be permitted to do CCS. Then the error path that leaves the job open with no running datafeed when a user breaches this rule is not so bad.
I think that this issue can be closed now?
I do not think so, but @droberts195 will know better. The task assigner for ML still needs to query the newly introduced node role and assign jobs/datafeeds according to it if (and only if) a remote connection is required. (FWIW: for transforms the issue is solved; we migrated to the new node role.)
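For illustration, a role-based filter of the kind used by transforms might look like the sketch below; the surrounding types are simplified placeholders, though `remote_cluster_client` is the real node role name:

```java
import java.util.Set;

// Sketch of the role-based filtering referred to above (the approach already used
// for transforms): only nodes advertising the remote-cluster-client role are
// candidates for tasks that need a remote connection. NodeInfo is a simplified
// placeholder for the real cluster-state node objects.
final class RemoteClusterClientFilter {

    private static final String REMOTE_CLUSTER_CLIENT_ROLE = "remote_cluster_client";

    record NodeInfo(String id, Set<String> roles) {}

    static boolean isCandidate(NodeInfo node, boolean requiresRemoteConnection) {
        // Nodes without the role are only acceptable when no remote connection is needed.
        return requiresRemoteConnection == false
            || node.roles().contains(REMOTE_CLUSTER_CLIENT_ROLE);
    }
}
```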
The state we are currently in is this: if you have a cluster where some ML nodes can do cross cluster search and some cannot, then a datafeed that requires cross cluster search will eventually fail with this error (see line 54 in commit 1fc0432).
If we think that's a good enough solution to the problem then this issue can be closed. Transforms does better: if you have a cluster where some transform nodes can do cross cluster search and some cannot, the node assignment will put the transforms that require cross cluster search on the transform nodes that can do it. We could make ML do better too, but it's non-trivial due to the job/datafeed split, as detailed in earlier comments. We would need to get the job's datafeed early in the job assignment process, check whether it needed CCS, and then pass an extra boolean to the job node selector saying whether CCS was required. The job node selector could then take this requirement into account in the same way the transform node assignment does. I've added the […]
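A minimal sketch of that "extra boolean" approach, with hypothetical types standing in for the real job node selector (which also weighs memory and open-job limits):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of the approach outlined above: resolve the job's datafeed before node
// selection, derive a "requires CCS" flag, and pass it to the job node selector.
// MlNode, selectNode and canConnectToRemoteClusters are invented stand-ins.
final class JobNodeSelectorSketch {

    record MlNode(String id, boolean canConnectToRemoteClusters, long freeMemoryBytes) {}

    static Optional<MlNode> selectNode(List<MlNode> mlNodes, boolean datafeedRequiresCcs) {
        return mlNodes.stream()
            // Drop nodes that cannot reach remote clusters if the datafeed needs CCS.
            .filter(node -> datafeedRequiresCcs == false || node.canConnectToRemoteClusters())
            // Of the remaining candidates, prefer the node with the most free memory.
            .max(Comparator.comparingLong(MlNode::freeMemoryBytes));
    }
}
```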
Thanks for clarifying, @hendrikmuhs and @droberts195.
We discussed this and decided that it's too much of a complication to try to support ML jobs that require cross cluster search when only a subset of ML nodes can handle them. It would make assignment much harder. If we claimed to support it, people could legitimately ask why we assigned jobs that didn't require CCS to the subset of nodes that supported it, only to later find ourselves unable to assign jobs that did require CCS. Therefore we will stick with the current approach of throwing an error stating that all ML nodes must allow CCS if any job uses CCS.
Two different users have now run into a problem where ML datafeeds mysteriously fail when they are configured to use cross cluster search but get assigned to nodes with `cluster.remote.connect: false`. This problem is very hard to debug.

We should take the value of the `cluster.remote.connect` setting into account during node assignment. A job whose associated datafeed uses cross cluster search should never be assigned to an ML node that has `cluster.remote.connect: false`. If this means no suitable ML nodes exist in the cluster, then an error message will be generated that explains the problem: it will be clear that either the ML nodes must have CCS enabled on them or else datafeeds cannot use CCS.
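As an illustration of the requested behaviour, a node-assignment explanation might be built along these lines; the method name and message wording are assumptions, not the shipped implementation:

```java
import java.util.List;

// Hedged sketch of the behaviour the issue asks for: when a datafeed needs cross
// cluster search and no ML node is allowed to connect to remote clusters, build an
// assignment explanation that names the setting instead of failing mysteriously later.
final class CcsAssignmentExplanation {

    static String explain(List<String> mlNodeIds, boolean anyNodeAllowsRemoteConnections) {
        if (anyNodeAllowsRemoteConnections) {
            return null; // at least one suitable node exists, nothing to explain
        }
        return "Cannot assign the job: its datafeed uses cross cluster search, but none of "
            + "the ML nodes " + mlNodeIds + " allow remote connections "
            + "(cluster.remote.connect is false on all of them)";
    }
}
```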