[ML] Rebalance model allocations when an ML job is stopped #88323
Conversation
Pinging @elastic/ml-core (Team:ML)
When ML jobs are stopped (e.g. anomaly detection, data frame analytics, etc.), memory may have been freed up, which means we may now be able to assign allocations for a model deployment. This commit extends `TrainedModelAssignmentClusterService` so that when a cluster state update is observed we also check whether a persistent task associated with an ML job has been stopped. If so, we trigger a rebalance.
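The idea roughly amounts to diffing the persistent tasks between the previous and new cluster states inside the cluster state listener. Below is a minimal standalone sketch of that check, not the actual PR diff: the real logic lives in `TrainedModelAssignmentClusterService` and differs in detail, and the `xpack/ml/` task-name prefix filter and the reason string here are purely illustrative.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.persistent.PersistentTasksCustomMetadata;
import org.elasticsearch.persistent.PersistentTasksCustomMetadata.PersistentTask;

class MlJobStopDetectionSketch {

    // Compare the persistent tasks between the previous and the new cluster state.
    // If an ML persistent task disappeared, its process has stopped and memory may
    // have been freed, so return a reason that can be used to trigger a rebalance.
    static Optional<String> detectReasonIfMlJobsStopped(ClusterChangedEvent event) {
        PersistentTasksCustomMetadata previousTasks = event.previousState()
            .metadata()
            .custom(PersistentTasksCustomMetadata.TYPE);
        PersistentTasksCustomMetadata currentTasks = event.state()
            .metadata()
            .custom(PersistentTasksCustomMetadata.TYPE);
        if (previousTasks == null || previousTasks.equals(currentTasks)) {
            return Optional.empty();
        }
        Set<String> currentTaskIds = currentTasks == null
            ? Set.of()
            : currentTasks.tasks().stream().map(PersistentTask::getId).collect(Collectors.toSet());
        // The "xpack/ml/" prefix check is a simplification; the real code looks at
        // the specific ML task types that own a native process.
        List<String> stoppedMlTaskIds = previousTasks.tasks()
            .stream()
            .filter(task -> task.getTaskName().startsWith("xpack/ml/"))
            .map(PersistentTask::getId)
            .filter(id -> currentTaskIds.contains(id) == false)
            .collect(Collectors.toList());
        if (stoppedMlTaskIds.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of("ML tasks " + stoppedMlTaskIds + " stopped; memory may have been freed");
    }
}
```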
LGTM
// If an ML persistent task with process stopped we should rebalance as we could have
// available memory that we did not have before.
Optional<String> reasonIfMlJobsStopped = detectReasonIfMlJobsStopped(event);
If `haveMlNodesChanged(..)` returns an `Optional`, you have a fluent API:
return detectReasonIfMlJobsStopped(event)
.or(() -> haveMlNodesChanged(event, newMetadata));
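For reference, `Optional.or` (Java 9+) only evaluates its supplier when the receiver is empty, so the node-change check would run only when no stopped ML job was detected. A tiny standalone illustration of that chaining (the method names below are made up for the example):

```java
import java.util.Optional;

class OptionalOrDemo {

    // Stands in for detectReasonIfMlJobsStopped: no stopped job detected.
    static Optional<String> reasonFromStoppedJobs() {
        return Optional.empty();
    }

    // Stands in for haveMlNodesChanged: a node change was detected.
    static Optional<String> reasonFromNodeChanges() {
        return Optional.of("nodes changed");
    }

    public static void main(String[] args) {
        // The second supplier is only invoked because the first Optional is empty.
        Optional<String> reason = reasonFromStoppedJobs().or(OptionalOrDemo::reasonFromNodeChanges);
        System.out.println(reason.orElse("no rebalance needed")); // prints: nodes changed
    }
}
```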
// No metadata in the new state means no allocations, so no updates |
I don't understand this comment in relation to the assertion below
Those comments were the result of copy-pasting from another test. I've removed them as they indeed do not make sense here.
@elasticmachine update branch