[ML] Throttle the delete-by-query of expired results #47177
Conversation
Due to elastic#47003 many clusters will have built up a large backlog of expired results. On upgrading to a version where that bug is fixed, users could find that the first ML daily maintenance task deletes a very large number of documents. This change introduces throttling to the delete-by-query that the ML daily maintenance uses to delete expired results:
- An average of 200 documents per second
- A maximum of 10 million documents per day

(There is no throttling for state/forecast documents as these are expected to be lower volume.)

Relates elastic#47103
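For illustration only, here is a minimal sketch of a throttled delete-by-query built with the Elasticsearch Java API. The index pattern, field name, and retention period are placeholder assumptions, and this is not the PR's actual code:

```java
import java.time.Duration;
import java.time.Instant;

import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.DeleteByQueryRequest;

public class ThrottledExpiredResultsDeleteSketch {

    public static DeleteByQueryRequest buildRequest() {
        // Placeholder retention period and field name; the real job-specific
        // cutoff logic lives in the ML expired-data removers.
        long cutoffEpochMs = Instant.now().minus(Duration.ofDays(30)).toEpochMilli();

        DeleteByQueryRequest request = new DeleteByQueryRequest(".ml-anomalies-*");
        request.setQuery(QueryBuilders.rangeQuery("timestamp").lt(cutoffEpochMs));
        request.setBatchSize(1000);           // scroll/batch size (DEFAULT_SCROLL_SIZE)
        request.setRequestsPerSecond(200.0f); // average 200 documents per second
        return request;
    }
}
```

With a batch size of 1000, the request pauses between batches so that deletions average out to the configured rate.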
Pinging @elastic/ml-core
LGTM
run elasticsearch-ci/default-distro
run elasticsearch-ci/packaging-sample
// Delete the documents gradually.
// With DEFAULT_SCROLL_SIZE = 1000 this implies we spread deletion of 1 million documents over 5000 seconds ~= 83 minutes.
// And we delete a maximum of 10000 batches per day (= 10 million documents per day if DEFAULT_SCROLL_SIZE = 1000).
// If more documents than this have expired then some will be deleted on a subsequent day.
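One hedged guess at how the 10-million-document daily cap in these comments could be expressed on the same request, continuing the sketch above; recent versions expose this as setMaxDocs (older versions used setSize), and the PR's actual mechanism may differ:

```java
// Hypothetical: cap a single run at 10,000 batches * 1,000 docs = 10 million documents.
request.setMaxDocs(10_000_000);
```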
This is executed per job, so the throttling applies to only one job at a time. What if you have thousands of jobs? Do we need more protection one level up (maybe based on time)?
Another observation (not sure if it makes a difference): the query does not disable scoring. Does it make sense to wrap it as constant_score? This might be more delete-friendly.
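A sketch of that suggestion, reusing the hypothetical range query and request object from the earlier sketch; names are placeholders, not the PR's code:

```java
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Wrap the expiry query in constant_score so no relevance scores are computed;
// every matching document gets the same constant score.
QueryBuilder expiredDocsQuery = QueryBuilders.rangeQuery("timestamp").lt(cutoffEpochMs);
request.setQuery(QueryBuilders.constantScoreQuery(expiredDocsQuery));
```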
We do not start the next job's deletion before the previous is complete.
> We do not start the next job's deletion before the previous is complete
I think the requests-per-second limit is still valuable, but with a separate scroll per job a 10-million-document limit is pretty useless. I will instead impose a time limit, say 8 hours, at the outermost level of the expired data removal.
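A minimal sketch of such an outer time limit, checked only between per-job deletions; the method and helper names here are hypothetical, not the PR's implementation:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical outer loop over all jobs with an 8-hour ceiling. The deadline
// is only checked between per-job deletions, so an in-flight delete-by-query
// is never interrupted; leftover jobs are picked up by the next daily run.
void removeExpiredData(List<String> jobIds) {
    Instant deadline = Instant.now().plus(Duration.ofHours(8));
    for (String jobId : jobIds) {
        if (Instant.now().isAfter(deadline)) {
            break;
        }
        deleteExpiredResultsForJob(jobId); // hypothetical per-job helper
    }
}
```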
> The query does not disable scoring, does it make sense to wrap it as constant_score?
I think sorting by _doc disables scoring. This is also the simplest order in which to retrieve the documents from Lucene. So I will add a sort by _doc to all our delete-by-query searches related to expired data, and it should give both benefits.
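As a sketch, adding that sort to the hypothetical request object from the earlier example could look like this (the embedded search source is created by default on a delete-by-query request):

```java
// Hypothetical: sort by _doc on the delete-by-query's embedded search so that
// scoring is disabled and documents are read in Lucene's cheapest order.
request.getSearchRequest().source().sort("_doc");
```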
Some more tests are still required, but these changes are a start.
Due to #47003 many clusters will have built up a large backlog of expired results. On upgrading to a version where that bug is fixed, users could find that the first ML daily maintenance task deletes a very large number of documents. This change introduces throttling to the delete-by-query that the ML daily maintenance uses to delete expired results, limiting it to deleting an average of 200 documents per second. (There is no throttling for state/forecast documents as these are expected to be lower volume.) Additionally, a rough time limit of 8 hours is applied to the whole delete expired data action. (This is only rough as it won't stop part way through a single operation; it only checks the timeout between operations.) Relates #47103