-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Reallocate model deployments on node shutdown events. #85310
Conversation
Pinging @elastic/ml-core (Team:ML) |
Hi @davidkyle, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
boolean nodesShutdownChanged = event.changedCustomMetadataSet().contains(NodesShutdownMetadata.TYPE); | ||
if (event.nodesChanged() || nodesShutdownChanged) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another condition that should be checked here is if persistent tasks change and a model is not allocated to every node, because if, say, a huge DFA job completes then that would free up space for the model to be allocated.
However, since we want to get the node shutdown change in in time for 8.1.2 I think that should be left to a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #85321 for this. As I said, please don't spend time adding it to this PR because we need this one in 8.1.2 as a matter of urgency whereas the persistent tasks case is a less likely scenario.
💔 Backport failed
You can use sqren/backport to manually backport by running |
…5310) Trigger reallocation on Node Shutdown changes first then node change events if the Node Shutdown API has not been used.
When a cluster containing a trained model deployment is restarted and the Node Shutdown API is used to inform the cluster of a nodes pending removal there was a bug where the model deployment would get stuck in the
starting
state. The cause was that when the node returned it was still marked as shutdown so the allocation service would not deploy the model to that node. If the cluster contained multiple ML nodes the last node to be restarted would not have the model deployed, for single ML node clusters the deployment would not start.The fix here triggers reallocation on Node Shutdown changes first then node change events if the Node Shutdown API has not been used.
Note: the workaround for the bug was to simply stop and restart the trained model deployment.