[ML] Reallocate model deployments on node shutdown events. #85310

davidkyle · 2022-03-24T09:44:51Z

When a cluster containing a trained model deployment is restarted and the Node Shutdown API is used to inform the cluster of a node's pending removal, there was a bug where the model deployment would get stuck in the starting state. The cause was that when the node returned it was still marked as shutting down, so the allocation service would not deploy the model to that node. If the cluster contained multiple ML nodes, the last node to be restarted would not have the model deployed; for single ML node clusters the deployment would not start at all.

The fix here triggers reallocation on Node Shutdown changes first, then on node change events if the Node Shutdown API has not been used.

Note: the workaround for the bug was to simply stop and restart the trained model deployment.
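A minimal sketch of the scenario described above, for illustration only. It uses a simplified, hypothetical model rather than the real Elasticsearch classes: `ShutdownReallocationSketch`, `shouldReallocate`, `eligibleNodes`, and the `"node_shutdown"` key are all assumptions made for this example. It shows why reacting only to node membership changes misses the moment the shutdown marker is cleared.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified, hypothetical model of the behaviour described above.
// None of these names are Elasticsearch classes.
final class ShutdownReallocationSketch {

    // Placeholder key standing in for NodesShutdownMetadata.TYPE (value assumed).
    static final String NODES_SHUTDOWN_TYPE = "node_shutdown";

    // React to node membership changes or to a change in the
    // node shutdown metadata.
    static boolean shouldReallocate(boolean nodesChanged, Set<String> changedCustomMetadata) {
        boolean nodesShutdownChanged = changedCustomMetadata.contains(NODES_SHUTDOWN_TYPE);
        return nodesChanged || nodesShutdownChanged;
    }

    // Nodes still marked as shutting down are not candidates for the model.
    static List<String> eligibleNodes(List<String> mlNodes, Set<String> markedAsShutdown) {
        return mlNodes.stream()
            .filter(node -> markedAsShutdown.contains(node) == false)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> mlNodes = List.of("ml-node-1");

        // Event 1: the node rejoins but its shutdown marker is still present,
        // so a reallocation runs but finds no eligible node.
        System.out.println("reallocate on rejoin: " + shouldReallocate(true, Set.of()));
        System.out.println("eligible on rejoin: " + eligibleNodes(mlNodes, Set.of("ml-node-1")));

        // Event 2: the shutdown marker is removed; node membership is unchanged.
        // Without the shutdown-metadata check this event would be ignored.
        System.out.println("reallocate on marker removal: "
            + shouldReallocate(false, Set.of(NODES_SHUTDOWN_TYPE)));
        System.out.println("eligible after removal: " + eligibleNodes(mlNodes, Set.of()));
    }
}
```

In this sketch the second event is the one that lets a single-node deployment leave the starting state, which matches the symptom and the workaround described in the PR.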

elasticmachine · 2022-03-24T09:44:54Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2022-03-24T09:45:15Z

Hi @davidkyle, I've created a changelog YAML for you.

droberts195

LGTM

droberts195 · 2022-03-24T11:17:59Z

...va/org/elasticsearch/xpack/ml/inference/allocation/TrainedModelAllocationClusterService.java

+        boolean nodesShutdownChanged = event.changedCustomMetadataSet().contains(NodesShutdownMetadata.TYPE);
+        if (event.nodesChanged() || nodesShutdownChanged) {


Another condition that should be checked here is whether persistent tasks have changed while a model is not allocated to every node. If, say, a huge DFA job completes, that would free up space for the model to be allocated.

However, since we want to get the node shutdown change merged in time for 8.1.2, I think that should be left to a followup.

I opened #85321 for this. As I said, please don't spend time adding it to this PR because we need this one in 8.1.2 as a matter of urgency whereas the persistent tasks case is a less likely scenario.
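As a rough illustration of the followup condition discussed in this review comment, the check might be extended along these lines. This is a hypothetical sketch, not the code from #85321: the class `PersistentTaskRecheckSketch`, its method, and all parameter names are assumptions made for the example.

```java
// Hypothetical sketch of the followup idea; not Elasticsearch code.
final class PersistentTaskRecheckSketch {

    // Reallocate when nodes or shutdown metadata change, or when persistent
    // tasks change while the model is still missing from some eligible node
    // (for example, after a large job finishes and frees up capacity).
    static boolean shouldReallocate(
        boolean nodesChanged,
        boolean nodesShutdownChanged,
        boolean persistentTasksChanged,
        int allocatedNodeCount,
        int eligibleNodeCount
    ) {
        boolean underAllocated = allocatedNodeCount < eligibleNodeCount;
        return nodesChanged || nodesShutdownChanged || (persistentTasksChanged && underAllocated);
    }

    public static void main(String[] args) {
        // A job completes while the model runs on 1 of 2 eligible nodes.
        System.out.println(shouldReallocate(false, false, true, 1, 2));
        // A job completes but the model is already on every eligible node.
        System.out.println(shouldReallocate(false, false, true, 2, 2));
    }
}
```

Guarding the persistent-task branch with the under-allocation check keeps the common case cheap, since task changes are frequent and most of them cannot affect a fully allocated model.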

elasticsearchmachine · 2022-03-24T12:08:45Z

💔 Backport failed

Status	Branch	Result
❌	8.1	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 85310

…5310) Trigger reallocation on Node Shutdown changes first then node change events if the Node Shutdown API has not been used.

…85329) Trigger reallocation on Node Shutdown changes first then node change events if the Node Shutdown API has not been used.

Reallocate model deployments on node shutdown events.

22429c7

davidkyle added the >bug, :ml (Machine learning), auto-backport-and-merge, cloud-deploy (Publish cloud docker image for Cloud-First-Testing), v8.2.0, and v8.1.2 labels Mar 24, 2022

elasticmachine added the Team:ML (Meta label for the ML team) label Mar 24, 2022

Update docs/changelog/85310.yaml

9365032

droberts195 approved these changes Mar 24, 2022

droberts195 mentioned this pull request Mar 24, 2022

[ML] Recheck trained model allocations when persistent tasks complete #85321

Closed

davidkyle merged commit a236bae into elastic:master Mar 24, 2022

davidkyle deleted the node-shutdowns branch March 24, 2022 12:07

davidkyle mentioned this pull request Mar 24, 2022

[ML] Reallocate model deployments on node shutdown events. (#85310) #85329

Merged
