[ML] Recheck trained model allocations when persistent tasks complete #85321

Closed
droberts195 opened this issue Mar 24, 2022 · 2 comments
Labels
:ml Machine learning Team:ML Meta label for the ML team

Comments

@droberts195
Contributor

#85310 improved the conditions for allocating trained models to nodes, but one situation is still not covered.

A trained model may fail to be allocated because all the available ML native memory is being used by jobs. When those jobs are stopped, the memory they free up might allow the trained model to be allocated, so we should recheck allocation at that point.

Therefore, the trained model allocation cluster state listener should also check for changes to persistent tasks, in particular ML persistent tasks completing.
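
To make the proposed check concrete, here is a rough, language-neutral sketch in Python rather than the Java used by the real listener. Everything in it is an assumption for illustration: the `PersistentTask` type, the function names, and the `xpack/ml/` task-name prefix are hypothetical stand-ins, not the actual Elasticsearch classes or constants. The idea is to compare the persistent tasks in the previous and current cluster state and trigger a recheck when an ML task has disappeared.

```python
from dataclasses import dataclass

# Assumed prefix for ML persistent task names; illustrative only.
ML_TASK_PREFIX = "xpack/ml/"


@dataclass(frozen=True)
class PersistentTask:
    """Hypothetical stand-in for a persistent task entry in cluster state."""
    task_id: str
    task_name: str


def completed_ml_tasks(previous, current):
    """Return IDs of ML tasks present in `previous` but absent from `current`."""
    current_ids = {task.task_id for task in current}
    return {
        task.task_id
        for task in previous
        if task.task_name.startswith(ML_TASK_PREFIX)
        and task.task_id not in current_ids
    }


def should_recheck_allocations(previous, current):
    """True when at least one ML persistent task has completed (been removed)."""
    return bool(completed_ml_tasks(previous, current))
```

In this sketch a non-ML task finishing does not trigger a recheck, since it would not free any ML native memory. A real implementation would presumably also need to avoid rechecking when no trained model is waiting for allocation.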

@droberts195 droberts195 added the :ml Machine learning label Mar 24, 2022
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 24, 2022
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@dimitris-athanasiou
Contributor

This has been addressed by #88323.
