Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Rebalance model allocations when an ML job is stopped #88323

Conversation

dimitris-athanasiou
Copy link
Contributor

When ML jobs get stopped (ie. anomaly detection, data frame analytics, etc.)
it could be that memory has been freed up and that we can now assign allocations
for a model deployment.

This commit adds to the TrainedModelAssignmentClusterService so that when
a cluster state update is observed we also look whether a persistent task
associated to an ML job has been stopped. If so, we trigger rebalance.

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jul 6, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

When ML jobs get stopped (ie. anomaly detection, data frame analytics, etc.)
it could be that memory has been freed up and that we can now assign allocations
for a model deployment.

This commit adds to the `TrainedModelAssignmentClusterService` so that when
a cluster state update is observed we also look whether a persistent task
associated to an ML job has been stopped. If so, we trigger rebalance.
@dimitris-athanasiou dimitris-athanasiou force-pushed the rebalance-model-assignments-when-ml-job-is-stopped branch from 8f43483 to be50dd3 Compare July 6, 2022 18:04
@benwtrent benwtrent self-requested a review July 6, 2022 20:11
Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

// If an ML persistent task with process stopped we should rebalance as we could have
// available memory that we did not have before.
Optional<String> reasonIfMlJobsStopped = detectReasonIfMlJobsStopped(event);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if haveMlNodesChanged(..) returns an Optional you have a fluent API

return detectReasonIfMlJobsStopped(event)
           .or(() -> haveMlNodesChanged(event, newMetadata));

)
.build();

// No metadata in the new state means no allocations, so no updates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment in relation to the assertion below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those comments were a result of copy-pasta from another test. I've removed them as they do not make sense indeed.

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou dimitris-athanasiou merged commit 12730cf into elastic:master Jul 18, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the rebalance-model-assignments-when-ml-job-is-stopped branch July 18, 2022 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >non-issue Team:ML Meta label for the ML team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants