
[ML] clear job size estimate cache when feature is reset #74494

Conversation

benwtrent
Member

Since the feature reset API clears out the .ml-* indices, it follows that it also deletes the machine learning jobs.

But since the regular path of calling the delete job API is not followed, jobs that no longer exist could still have memory estimates cached on the master node. These would never be cleared out until the master node changed.

This commit causes feature reset to (sketched below):

  • wait for all in-flight refresh requests to finish (there should usually be none, since all assignments have been cancelled)
  • clear the cached hashmap of memory estimates held on the master node
  • once the cache is cleared, allow new refreshes again
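
Below is a minimal, self-contained sketch of that flow. The class and method names (MemoryEstimateTracker, refreshEstimate, awaitAndClear) are invented for illustration; the ML plugin's actual MlMemoryTracker is asynchronous and listener-based, so treat this only as a picture of the three steps above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the master-node memory tracker described above.
// All names here are illustrative, not the production code.
class MemoryEstimateTracker {

    private final Map<String, Long> memoryEstimateBytesByJobId = new ConcurrentHashMap<>();
    // Refreshes hold the read lock; a reset takes the write lock, which
    // waits for in-flight refreshes and blocks new ones while it is held.
    private final ReentrantReadWriteLock refreshLock = new ReentrantReadWriteLock();

    // Regular refresh path: record (or update) a job's memory estimate.
    void refreshEstimate(String jobId, long estimateBytes) {
        refreshLock.readLock().lock();
        try {
            memoryEstimateBytesByJobId.put(jobId, estimateBytes);
        } finally {
            refreshLock.readLock().unlock();
        }
    }

    // Feature-reset path: wait for outstanding refreshes (usually none,
    // since all job assignments have been cancelled), clear the cached
    // estimates, then release the lock so new refreshes can proceed.
    void awaitAndClear() {
        refreshLock.writeLock().lock();
        try {
            memoryEstimateBytesByJobId.clear();
        } finally {
            refreshLock.writeLock().unlock();
        }
    }
}
```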

@benwtrent benwtrent requested a review from droberts195 June 23, 2021 15:01
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

- // Call into the original listener to clean up the indices
- SystemIndexPlugin.super.cleanUpFeature(clusterService, client, unsetResetModeListener);
+ // Call into the original listener to clean up the indices and then clear ml memory cache
+ SystemIndexPlugin.super.cleanUpFeature(clusterService, client, cleanedUpIndicesListener);
Contributor

Since the memory tracker cleanup waits for refreshes to finish, I would do this index cleanup after clearing the memory tracker. It should avoid logging of spurious errors from refreshes that fail because the indices they're accessing get deleted.

Or was there a good reason for clearing the memory tracker last?

Member Author

@droberts195, I figured that if the jobs potentially still existed, it would be good to keep their estimates around. But since all jobs should be closed by this point, clearing the tracker earlier is probably ok.

Contributor

Yes, the jobs should be closed, and banned from reopening by the reset-in-progress cluster setting. I think flipping the order may avoid log spam.
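
For illustration, here is a hypothetical sketch of the flipped order. The real implementation chains org.elasticsearch.action.ActionListener instances; a plain callback interface is used here only to keep the sketch self-contained, and none of these names come from the plugin itself.

```java
// Hypothetical wiring, not the actual plugin code: clear the memory tracker
// first, then clean up the indices.
final class ResetOrderingSketch {

    // Minimal stand-in for an asynchronous completion callback.
    interface Listener {
        void onDone();
        void onFailure(Exception e);
    }

    static void cleanUpMlFeature(Runnable clearMemoryTracker, Runnable cleanUpIndices, Listener listener) {
        try {
            // Step 1: wait for in-flight refreshes and clear the cached estimates.
            clearMemoryTracker.run();
            // Step 2: only then delete the .ml-* indices, so no refresh can race
            // against index deletion and log a spurious failure.
            cleanUpIndices.run();
            listener.onDone();
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }
}
```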

@benwtrent benwtrent force-pushed the feature/ml-clear-model-size-cache-on-feature-reset branch from 31dcc63 to 3061baa on June 23, 2021 17:44
Contributor

@droberts195 droberts195 left a comment


LGTM if you can get to the bottom of the test failure

// We don't need to check anything as there are no tasks
// This is a quick path to downscale.
// simply return `0` for scale down if delay is satisfied
if (anomalyDetectionTasks.isEmpty() && dataframeAnalyticsTasks.isEmpty()) {
Member Author

@benwtrent benwtrent Jun 23, 2021


@droberts195 the test failure was a valid test failure :). The test was taking advantage of the fact that the ML memory tracker was "up to date" because of the previous test.

When the test runs on its own, this is fine: the master node has just booted, so the tracker is fresh.

When the test runs after another test, the cluster stays up and the tracker has already been reset, so it is no longer up to date.

So, to improve this behavior, I added this clause. It is a no-brainer really: autoscaling should not need to check anything when there are zero ML tasks, and should be unaffected by memory tracker staleness. A simplified sketch follows.
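
A simplified illustration of that quick path; the class, method, and parameter names are invented for this sketch, since the real decider works on Elasticsearch-specific task and autoscaling types.

```java
import java.util.Collection;
import java.util.Optional;

// Invented names for illustration only, not the ML autoscaling decider's API.
final class MlScaleDownSketch {

    // Returns the memory (in bytes) ML still requires, or empty if no safe
    // decision can be made.
    static Optional<Long> requiredMemoryBytes(
        Collection<?> anomalyDetectionTasks,
        Collection<?> dataframeAnalyticsTasks,
        boolean memoryTrackerIsRecentlyRefreshed
    ) {
        // Quick path: with zero ML tasks there is nothing to size, so the
        // answer is 0 regardless of whether the memory tracker is stale.
        if (anomalyDetectionTasks.isEmpty() && dataframeAnalyticsTasks.isEmpty()) {
            return Optional.of(0L);
        }
        // With running tasks, stale estimates cannot be trusted for a decision.
        if (memoryTrackerIsRecentlyRefreshed == false) {
            return Optional.empty();
        }
        // ... otherwise sum the per-task memory estimates (omitted here) ...
        return Optional.empty();
    }
}
```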

@benwtrent benwtrent requested a review from droberts195 June 23, 2021 20:50
@benwtrent
Member Author

@droberts195 requesting re-review as the change to fix the test was non-trivial.

Contributor

@droberts195 droberts195 left a comment


LGTM if you could just fix the typo Dave K pointed out before merging

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent merged commit c37184c into elastic:master Jun 24, 2021
@benwtrent benwtrent deleted the feature/ml-clear-model-size-cache-on-feature-reset branch June 24, 2021 13:30
benwtrent added a commit that referenced this pull request Jun 24, 2021
…4560)

Labels
>bug, :ml Machine learning, Team:ML Meta label for the ML team, v7.14.0, v8.0.0-alpha1