[ML] Investigate alternative methods for sharing job memory usage information #34084
Comments
Pinging @elastic/ml-core
I agree that since we can't access the job config document from within the allocation decision, there's no longer any point storing established model memory in it, so it should be removed (though tolerated by the job parser for BWC purposes until 8.0). I only put it in the job config in the first place because it needed to be available to the allocation decision and the job task status had a strict parser.

Core Elasticsearch has a similar problem when allocating shards to nodes: the master node needs an up-to-date view of how much disk space each data node currently has. This problem is solved by the … Once established model memory for each job is held in …
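The idea in the comment above can be sketched as a small master-local service that caches each active job's established model memory, analogous to the cached per-node disk usage view used for shard allocation. This is a hypothetical illustration, not Elasticsearch's real API; the class and method names are invented for the sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a service held on the master node that caches the
// established model memory of each active ML job, so the allocation
// decision never needs to read the job config document. All names here
// are illustrative, not the real Elasticsearch classes.
public class MlTaskMemoryService {

    private final Map<String, Long> establishedModelMemoryBytes = new ConcurrentHashMap<>();

    // Called when autodetect emits a model size stats document for a job.
    public void updateJobMemory(String jobId, long bytes) {
        establishedModelMemoryBytes.put(jobId, bytes);
    }

    // Called when a job closes, so stale entries do not accumulate.
    public void removeJob(String jobId) {
        establishedModelMemoryBytes.remove(jobId);
    }

    // Used by the allocation decision; falls back to a default when the
    // job has not yet established its model memory.
    public long memoryFor(String jobId, long defaultBytes) {
        return establishedModelMemoryBytes.getOrDefault(jobId, defaultBytes);
    }
}
```

Keeping this state out of the cluster state means a model size update is a cheap local map write rather than a cluster state update.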
I'll have a go at implementing the idea from #34084 (comment)
I had a closer look at … The process for making sure the ML task memory service has a reasonable value for each active ML task can be:
Then the …
Fixed by #36069
When there are multiple ML nodes in the cluster, the job allocation decision is made based on the number of open jobs on each node and how much memory they use. Job memory usage is stored in the job configuration and is updated periodically during the job's run whenever a model size stats doc is emitted by autodetect. This can lead to frequent job config updates (cluster state updates), particularly for historical look-back jobs.
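The memory-based allocation decision described above can be illustrated as choosing the node whose remaining ML memory best accommodates the new job. This is a simplified sketch with invented names, not the actual Elasticsearch allocation code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a memory-based job allocation decision: pick the
// node with the most free ML memory that can still fit the new job.
// Record fields and method names are illustrative.
public class JobAllocator {

    public record MlNode(String name, long maxMlMemoryBytes, long assignedMemoryBytes) {
        long freeMemoryBytes() {
            return maxMlMemoryBytes - assignedMemoryBytes;
        }
    }

    // Returns the node with the most free ML memory that can hold the job,
    // or empty if no node has enough capacity.
    public static Optional<MlNode> chooseNode(List<MlNode> nodes, long jobMemoryBytes) {
        return nodes.stream()
                .filter(n -> n.freeMemoryBytes() >= jobMemoryBytes)
                .max(Comparator.comparingLong(MlNode::freeMemoryBytes));
    }
}
```

The temporary workaround mentioned below reduces to the same shape with job counts substituted for memory: pick the node with the fewest open jobs.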
This is pertinent to the job config migration project #32905, where the job's memory usage is not available in the cluster state during the allocation decision. A temporary workaround was implemented in #33994, basing the decision on the job count rather than memory usage.