[ML] clear job size estimate cache when feature is reset #74494
Conversation
Pinging @elastic/ml-core (Team:ML)
// Call into the original listener to clean up the indices
SystemIndexPlugin.super.cleanUpFeature(clusterService, client, unsetResetModeListener);
// Call into the original listener to clean up the indices and then clear ml memory cache
SystemIndexPlugin.super.cleanUpFeature(clusterService, client, cleanedUpIndicesListener);
Since the memory tracker cleanup waits for refreshes to finish, I would do this index cleanup after clearing the memory tracker. It should avoid logging of spurious errors from refreshes that fail because the indices they're accessing get deleted.
Or was there a good reason for clearing the memory tracker last?
@droberts195, I figured that if the jobs potentially still existed, it would be good to keep their estimates around. But since all jobs should be closed by this point, clearing the tracker earlier is probably OK.
Yes, the jobs should be closed, and banned from reopening by the reset-in-progress cluster setting. I think flipping the order may avoid log spam.
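A minimal sketch of the ordering being discussed, using a simplified stand-in `Listener` type and hypothetical helper names rather than the real `ActionListener`, `MlMemoryTracker`, or `SystemIndexPlugin` APIs: the cached estimates are cleared first, and the index cleanup only runs once that has completed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins: Listener mimics a callback interface, and the map mimics the
// master node's cache of per-job memory estimates. All names here are hypothetical.
public class FeatureResetOrderingSketch {

    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    private final Map<String, Long> jobMemoryEstimates = new ConcurrentHashMap<>();

    // Clear the memory tracker first, then delete the .ml-* indices, so no refresh
    // is still running against indices that are about to disappear.
    void cleanUpFeature(Listener<Boolean> unsetResetModeListener) {
        clearMemoryTracker(new Listener<Void>() {
            @Override
            public void onResponse(Void ignored) {
                cleanUpIndices(unsetResetModeListener); // original index cleanup runs second
            }

            @Override
            public void onFailure(Exception e) {
                unsetResetModeListener.onFailure(e);
            }
        });
    }

    private void clearMemoryTracker(Listener<Void> listener) {
        jobMemoryEstimates.clear(); // drop cached per-job estimates
        listener.onResponse(null);
    }

    private void cleanUpIndices(Listener<Boolean> listener) {
        listener.onResponse(true);  // stand-in for deleting the .ml-* indices
    }
}
```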
Force-pushed from 31dcc63 to 3061baa
LGTM if you can get to the bottom of the test failure
// We don't need to check anything as there are no tasks.
// This is a quick path to downscale.
// Simply return `0` for scale down if the delay is satisfied.
if (anomalyDetectionTasks.isEmpty() && dataframeAnalyticsTasks.isEmpty()) {
@droberts195 the test failure was a VALID test failure :). The test was taking advantage of the fact that the ML memory tracker was "up-to-date" thanks to the previous test.
When running the test directly, this is fine, as the master node boots immediately and the memory estimates are fresh.
When running the test AFTER another test, the cluster stays up and the tracker has been reset.
So, to improve this behavior, I added this clause. It is a no-brainer really: autoscaling should not check anything if there are 0 ML tasks, and should be unaffected by memory tracker staleness.
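A minimal sketch of that quick path, using made-up types (`ScaleDecision` and the parameters below are placeholders, not the real decider's signatures): when there are zero ML tasks, the decision never consults the memory tracker, so staleness cannot affect it.

```java
import java.util.List;

// Hypothetical types standing in for the real autoscaling decider's inputs and outputs.
public class NoMlTasksQuickPathSketch {

    record ScaleDecision(long requiredMemoryBytes, String reason) {}

    static ScaleDecision decide(List<String> anomalyDetectionTasks,
                                List<String> dataframeAnalyticsTasks,
                                boolean downscaleDelaySatisfied,
                                long currentTierMemoryBytes) {
        if (anomalyDetectionTasks.isEmpty() && dataframeAnalyticsTasks.isEmpty()) {
            // Quick path: with no ML tasks there is nothing to size, so the memory
            // tracker (fresh or stale) is never consulted.
            return downscaleDelaySatisfied
                ? new ScaleDecision(0L, "no ML tasks and downscale delay satisfied")
                : new ScaleDecision(currentTierMemoryBytes, "no ML tasks; waiting out downscale delay");
        }
        // Only when tasks exist would the real decider need up-to-date memory estimates.
        return new ScaleDecision(currentTierMemoryBytes, "ML tasks present; full memory check needed");
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of(), List.of(), true, 4L * 1024 * 1024 * 1024));
    }
}
```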
@droberts195 requesting re-review as the change to fix the test was non-trivial.
...gin/ml/src/main/java/org/elasticsearch/xpack/ml/autoscaling/MlAutoscalingDeciderService.java
LGTM if you could just fix the typo Dave K pointed out before merging
…scaling/MlAutoscalingDeciderService.java Co-authored-by: David Kyle <[email protected]>
@elasticmachine update branch
…4560) Since the feature reset API clears out the `.ml-*` indices, it follows that it also deletes the machine learning jobs. But, since the regular path of calling the delete job API is not followed, jobs that no longer exist could still have memory estimates cached on the master node. These would never get cleared out until after a master node change.
This commit causes feature reset to:
- wait for all refresh requests to finish (of which there should usually be NONE, as all assignments have been cancelled)
- clear out the cached hashmap of memory estimates sitting on the master node
- then, once cleared, allow new refreshes again
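A minimal sketch of that refresh/clear handshake using a plain ReadWriteLock rather than the real MlMemoryTracker internals (the method and field names below are hypothetical): refreshes share the read lock, and the reset path takes the write lock, so the clear only proceeds once in-flight refreshes have finished and blocks new ones until it is done.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model of the master node's per-job memory estimate cache.
public class MemoryEstimateCacheSketch {

    private final Map<String, Long> memoryEstimateByJob = new ConcurrentHashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Refreshes (triggered by assignment changes) may run concurrently with each other.
    public void refreshJobMemory(String jobId, long estimateBytes) {
        lock.readLock().lock();
        try {
            memoryEstimateByJob.put(jobId, estimateBytes);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Feature reset path: the write lock is only granted once every in-flight refresh
    // has released its read lock, and it blocks new refreshes until the cache is empty.
    public void awaitAndClearUsedMemory() {
        lock.writeLock().lock();
        try {
            memoryEstimateByJob.clear();
        } finally {
            lock.writeLock().unlock(); // refreshes are allowed again from here on
        }
    }
}
```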