[ML] Rebalance model allocations when an ML job is stopped #88323
Conversation
Pinging @elastic/ml-core (Team:ML)
When ML jobs are stopped (e.g. anomaly detection, data frame analytics, etc.), memory may have been freed up, which means we may now be able to assign allocations for a model deployment. This commit extends `TrainedModelAssignmentClusterService` so that when a cluster state update is observed we also check whether a persistent task associated with an ML job has been stopped. If so, we trigger a rebalance.
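The idea roughly amounts to diffing the persistent tasks between the previous and new cluster states inside the cluster state listener. Below is a minimal standalone sketch of that check, not the actual PR diff: the real logic lives in `TrainedModelAssignmentClusterService` and differs in detail, and the `xpack/ml/` task-name prefix filter and the reason string here are purely illustrative.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.persistent.PersistentTasksCustomMetadata;
import org.elasticsearch.persistent.PersistentTasksCustomMetadata.PersistentTask;

class MlJobStopDetectionSketch {

    // Compare the persistent tasks between the previous and the new cluster state.
    // If an ML persistent task disappeared, its process has stopped and memory may
    // have been freed, so return a reason that can be used to trigger a rebalance.
    static Optional<String> detectReasonIfMlJobsStopped(ClusterChangedEvent event) {
        PersistentTasksCustomMetadata previousTasks = event.previousState()
            .metadata()
            .custom(PersistentTasksCustomMetadata.TYPE);
        PersistentTasksCustomMetadata currentTasks = event.state()
            .metadata()
            .custom(PersistentTasksCustomMetadata.TYPE);
        if (previousTasks == null || previousTasks.equals(currentTasks)) {
            return Optional.empty();
        }
        Set<String> currentTaskIds = currentTasks == null
            ? Set.of()
            : currentTasks.tasks().stream().map(PersistentTask::getId).collect(Collectors.toSet());
        // The "xpack/ml/" prefix check is a simplification; the real code looks at
        // the specific ML task types that own a native process.
        List<String> stoppedMlTaskIds = previousTasks.tasks()
            .stream()
            .filter(task -> task.getTaskName().startsWith("xpack/ml/"))
            .map(PersistentTask::getId)
            .filter(id -> currentTaskIds.contains(id) == false)
            .collect(Collectors.toList());
        if (stoppedMlTaskIds.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of("ML tasks " + stoppedMlTaskIds + " stopped; memory may have been freed");
    }
}
```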
LGTM
// If an ML persistent task with process stopped we should rebalance as we could have
// available memory that we did not have before.
Optional<String> reasonIfMlJobsStopped = detectReasonIfMlJobsStopped(event);
If `haveMlNodesChanged(..)` returns an `Optional`, you have a fluent API:
return detectReasonIfMlJobsStopped(event)
.or(() -> haveMlNodesChanged(event, newMetadata));
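For reference, `Optional.or` (Java 9+) only evaluates its supplier when the receiver is empty, so the node-change check would run only when no stopped ML job was detected. A tiny standalone illustration of that chaining (the method names below are made up for the example):

```java
import java.util.Optional;

class OptionalOrDemo {

    // Stands in for detectReasonIfMlJobsStopped: no stopped job detected.
    static Optional<String> reasonFromStoppedJobs() {
        return Optional.empty();
    }

    // Stands in for haveMlNodesChanged: a node change was detected.
    static Optional<String> reasonFromNodeChanges() {
        return Optional.of("nodes changed");
    }

    public static void main(String[] args) {
        // The second supplier is only invoked because the first Optional is empty.
        Optional<String> reason = reasonFromStoppedJobs().or(OptionalOrDemo::reasonFromNodeChanges);
        System.out.println(reason.orElse("no rebalance needed")); // prints: nodes changed
    }
}
```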
// No metadata in the new state means no allocations, so no updates |
I don't understand this comment in relation to the assertion below
Those comments were the result of copy-pasting from another test. I've removed them as they indeed do not make sense here.
@elasticmachine update branch