[ML] Investigate alternative methods for sharing job memory usage information #34084
Comments
Pinging @elastic/ml-core
I agree that since we can't access the job config document from within the allocation decision, there's no longer any point storing established model memory in it, so it should be removed (though tolerated by the job parser for BWC purposes until 8.0). I only put it in the job config in the first place because it needed to be available to the allocation decision and the job task status had a strict parser.

Core Elasticsearch has a similar problem when allocating shards to nodes: the master node needs an up-to-date view of how much disk space each data node currently has. This problem is solved by the … Once established model memory for each job is held in …
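The idea in the comment above can be sketched as a small master-local service that caches each active job's established model memory, analogous to the cached per-node disk usage view used for shard allocation. This is a hypothetical illustration, not Elasticsearch's real API; the class and method names are invented for the sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a service held on the master node that caches the
// established model memory of each active ML job, so the allocation
// decision never needs to read the job config document. All names here
// are illustrative, not the real Elasticsearch classes.
public class MlTaskMemoryService {

    private final Map<String, Long> establishedModelMemoryBytes = new ConcurrentHashMap<>();

    // Called when autodetect emits a model size stats document for a job.
    public void updateJobMemory(String jobId, long bytes) {
        establishedModelMemoryBytes.put(jobId, bytes);
    }

    // Called when a job closes, so stale entries do not accumulate.
    public void removeJob(String jobId) {
        establishedModelMemoryBytes.remove(jobId);
    }

    // Used by the allocation decision; falls back to a default when the
    // job has not yet established its model memory.
    public long memoryFor(String jobId, long defaultBytes) {
        return establishedModelMemoryBytes.getOrDefault(jobId, defaultBytes);
    }
}
```

Keeping this state out of the cluster state means a model size update is a cheap local map write rather than a cluster state update.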
I'll have a go at implementing the idea from #34084 (comment)
I had a closer look at … The process for making sure the ML task memory service has a reasonable value for each active ML task can be:
Then the …
Fixed by #36069
When there are multiple ML nodes in the cluster, the job allocation decision is made based on the number of open jobs on each node and how much memory they use. Job memory usage is stored in the job configuration and is updated periodically during the job's run whenever a model size stats doc is emitted by autodetect. This can lead to frequent job config updates (cluster state updates), particularly for historical look-back jobs.
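The memory-based allocation decision described above can be illustrated as choosing the node whose remaining ML memory best accommodates the new job. This is a simplified sketch with invented names, not the actual Elasticsearch allocation code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a memory-based job allocation decision: pick the
// node with the most free ML memory that can still fit the new job.
// Record fields and method names are illustrative.
public class JobAllocator {

    public record MlNode(String name, long maxMlMemoryBytes, long assignedMemoryBytes) {
        long freeMemoryBytes() {
            return maxMlMemoryBytes - assignedMemoryBytes;
        }
    }

    // Returns the node with the most free ML memory that can hold the job,
    // or empty if no node has enough capacity.
    public static Optional<MlNode> chooseNode(List<MlNode> nodes, long jobMemoryBytes) {
        return nodes.stream()
                .filter(n -> n.freeMemoryBytes() >= jobMemoryBytes)
                .max(Comparator.comparingLong(MlNode::freeMemoryBytes));
    }
}
```

The temporary workaround mentioned below reduces to the same shape with job counts substituted for memory: pick the node with the fewest open jobs.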
This is pertinent to the job config migration project #32905, where the job's memory usage is not available in the cluster state during the allocation decision. A temporary workaround was implemented in #33994, basing the decision on the job count rather than memory usage.