
[ML] clear job size estimate cache when feature is reset #74494

Conversation

benwtrent
Member

Since the feature reset API clears out the .ml-* indices, it follows that it also deletes the machine learning jobs.

But since the regular path of calling the delete job API is not followed, jobs that no longer exist could still have memory estimates cached on the master node. These would never be cleared out until the master node changed.

This commit causes feature reset to (sketched below):

  • wait for all in-flight refresh requests to finish (there should usually be none, since all assignments have been cancelled)
  • clear the cached hashmap of memory estimates held on the master node
  • once the cache is cleared, allow new refreshes again
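
Below is a minimal, self-contained sketch of that flow. The class and method names (MemoryEstimateTracker, refreshEstimate, awaitAndClear) are invented for illustration; the ML plugin's actual MlMemoryTracker is asynchronous and listener-based, so treat this only as a picture of the three steps above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the master-node memory tracker described above.
// All names here are illustrative, not the production code.
class MemoryEstimateTracker {

    private final Map<String, Long> memoryEstimateBytesByJobId = new ConcurrentHashMap<>();
    // Refreshes hold the read lock; a reset takes the write lock, which
    // waits for in-flight refreshes and blocks new ones while it is held.
    private final ReentrantReadWriteLock refreshLock = new ReentrantReadWriteLock();

    // Regular refresh path: record (or update) a job's memory estimate.
    void refreshEstimate(String jobId, long estimateBytes) {
        refreshLock.readLock().lock();
        try {
            memoryEstimateBytesByJobId.put(jobId, estimateBytes);
        } finally {
            refreshLock.readLock().unlock();
        }
    }

    // Feature-reset path: wait for outstanding refreshes (usually none,
    // since all job assignments have been cancelled), clear the cached
    // estimates, then release the lock so new refreshes can proceed.
    void awaitAndClear() {
        refreshLock.writeLock().lock();
        try {
            memoryEstimateBytesByJobId.clear();
        } finally {
            refreshLock.writeLock().unlock();
        }
    }
}
```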

@benwtrent benwtrent requested a review from droberts195 June 23, 2021 15:01
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

- // Call into the original listener to clean up the indices
- SystemIndexPlugin.super.cleanUpFeature(clusterService, client, unsetResetModeListener);
+ // Call into the original listener to clean up the indices and then clear ml memory cache
+ SystemIndexPlugin.super.cleanUpFeature(clusterService, client, cleanedUpIndicesListener);
Contributor

Since the memory tracker cleanup waits for refreshes to finish, I would do this index cleanup after clearing the memory tracker. It should avoid logging of spurious errors from refreshes that fail because the indices they're accessing get deleted.

Or was there a good reason for clearing the memory tracker last?

Member Author

@droberts195, I figured that if the jobs potentially still existed, it would be good to keep their estimates around. But since all jobs should be closed by this point, clearing the tracker earlier is probably ok.

Contributor

Yes, the jobs should be closed, and banned from reopening by the reset-in-progress cluster setting. I think flipping the order may avoid log spam.
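
For illustration, here is a hypothetical sketch of the flipped order. The real implementation chains org.elasticsearch.action.ActionListener instances; a plain callback interface is used here only to keep the sketch self-contained, and none of these names come from the plugin itself.

```java
// Hypothetical wiring, not the actual plugin code: clear the memory tracker
// first, then clean up the indices.
final class ResetOrderingSketch {

    // Minimal stand-in for an asynchronous completion callback.
    interface Listener {
        void onDone();
        void onFailure(Exception e);
    }

    static void cleanUpMlFeature(Runnable clearMemoryTracker, Runnable cleanUpIndices, Listener listener) {
        try {
            // Step 1: wait for in-flight refreshes and clear the cached estimates.
            clearMemoryTracker.run();
            // Step 2: only then delete the .ml-* indices, so no refresh can race
            // against index deletion and log a spurious failure.
            cleanUpIndices.run();
            listener.onDone();
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }
}
```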

@benwtrent benwtrent force-pushed the feature/ml-clear-model-size-cache-on-feature-reset branch from 31dcc63 to 3061baa on June 23, 2021 17:44
Contributor

@droberts195 droberts195 left a comment


LGTM if you can get to the bottom of the test failure

// We don't need to check anything as there are no tasks
// This is a quick path to downscale.
// simply return `0` for scale down if delay is satisfied
if (anomalyDetectionTasks.isEmpty() && dataframeAnalyticsTasks.isEmpty()) {
Member Author

@benwtrent benwtrent Jun 23, 2021


@droberts195 the test failure was a valid test failure :). The test was taking advantage of the fact that the ML memory tracker was "up to date" because of the previous test.

When the test runs on its own, this is fine: the master node has just booted, so the tracker is fresh.

When the test runs after another test, the cluster stays up and the tracker has already been reset, so it is no longer up to date.

So, to improve this behavior, I added this clause. It is a no-brainer really: autoscaling should not need to check anything when there are zero ML tasks, and should be unaffected by memory tracker staleness. A simplified sketch follows.
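
A simplified illustration of that quick path; the class, method, and parameter names are invented for this sketch, since the real decider works on Elasticsearch-specific task and autoscaling types.

```java
import java.util.Collection;
import java.util.Optional;

// Invented names for illustration only, not the ML autoscaling decider's API.
final class MlScaleDownSketch {

    // Returns the memory (in bytes) ML still requires, or empty if no safe
    // decision can be made.
    static Optional<Long> requiredMemoryBytes(
        Collection<?> anomalyDetectionTasks,
        Collection<?> dataframeAnalyticsTasks,
        boolean memoryTrackerIsRecentlyRefreshed
    ) {
        // Quick path: with zero ML tasks there is nothing to size, so the
        // answer is 0 regardless of whether the memory tracker is stale.
        if (anomalyDetectionTasks.isEmpty() && dataframeAnalyticsTasks.isEmpty()) {
            return Optional.of(0L);
        }
        // With running tasks, stale estimates cannot be trusted for a decision.
        if (memoryTrackerIsRecentlyRefreshed == false) {
            return Optional.empty();
        }
        // ... otherwise sum the per-task memory estimates (omitted here) ...
        return Optional.empty();
    }
}
```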

@benwtrent benwtrent requested a review from droberts195 June 23, 2021 20:50
@benwtrent
Member Author

@droberts195 requesting re-review as the change to fix the test was non-trivial.

Contributor

@droberts195 droberts195 left a comment


LGTM if you could just fix the typo Dave K pointed out before merging

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent merged commit c37184c into elastic:master Jun 24, 2021
@benwtrent benwtrent deleted the feature/ml-clear-model-size-cache-on-feature-reset branch June 24, 2021 13:30
benwtrent added a commit that referenced this pull request Jun 24, 2021
…4560)

Labels
>bug, :ml Machine learning, Team:ML Meta label for the ML team, v7.14.0, v8.0.0-alpha1