Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Reallocate model deployments on node shutdown events. #85310

Merged
merged 2 commits into from
Mar 24, 2022

Conversation

davidkyle
Copy link
Member

When a cluster containing a trained model deployment is restarted and the Node Shutdown API is used to inform the cluster of a nodes pending removal there was a bug where the model deployment would get stuck in the starting state. The cause was that when the node returned it was still marked as shutdown so the allocation service would not deploy the model to that node. If the cluster contained multiple ML nodes the last node to be restarted would not have the model deployed, for single ML node clusters the deployment would not start.

The fix here triggers reallocation on Node Shutdown changes first then node change events if the Node Shutdown API has not been used.

Note: the workaround for the bug was to simply stop and restart the trained model deployment.

@davidkyle davidkyle added >bug :ml Machine learning auto-backport-and-merge cloud-deploy Publish cloud docker image for Cloud-First-Testing v8.2.0 v8.1.2 labels Mar 24, 2022
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 24, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +457 to +458
boolean nodesShutdownChanged = event.changedCustomMetadataSet().contains(NodesShutdownMetadata.TYPE);
if (event.nodesChanged() || nodesShutdownChanged) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another condition that should be checked here is if persistent tasks change and a model is not allocated to every node, because if, say, a huge DFA job completes then that would free up space for the model to be allocated.

However, since we want to get the node shutdown change in in time for 8.1.2 I think that should be left to a followup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #85321 for this. As I said, please don't spend time adding it to this PR because we need this one in 8.1.2 as a matter of urgency whereas the persistent tasks case is a less likely scenario.

@davidkyle davidkyle merged commit a236bae into elastic:master Mar 24, 2022
@davidkyle davidkyle deleted the node-shutdowns branch March 24, 2022 12:07
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.1 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 85310

davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Mar 24, 2022
…5310)

Trigger reallocation on Node Shutdown changes first then node change events
if the Node Shutdown API has not been used.
elasticsearchmachine pushed a commit that referenced this pull request Mar 24, 2022
…85329)

Trigger reallocation on Node Shutdown changes first then node change events
if the Node Shutdown API has not been used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning Team:ML Meta label for the ML team v8.1.2 v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants