
[ML] Throttle the delete-by-query of expired results #47177

Merged

Conversation

@droberts195 (Contributor) commented Sep 26, 2019

Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed, users could find
that the first ML daily maintenance task deletes
a very large number of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results, limiting it to an
average of 200 documents per second. (There is no
throttling for state/forecast documents, as these
are expected to be lower volume.)

Additionally, a rough time limit of 8 hours is applied
to the whole delete-expired-data action. (This is only
rough because it won't stop partway through a single
operation; it only checks the timeout between
operations.)

Relates #47103
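
For reference, this throttle corresponds to the delete-by-query requests_per_second setting. A minimal sketch of what such a request looks like in the Java API (the index pattern and query below are illustrative placeholders, not the PR's actual code):

    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.index.reindex.DeleteByQueryRequest;

    class ThrottledDeleteSketch {
        // Build a delete-by-query throttled to an average of 200 documents
        // per second. Index pattern and query are illustrative placeholders.
        static DeleteByQueryRequest throttledRequest() {
            DeleteByQueryRequest request = new DeleteByQueryRequest(".ml-anomalies-*");
            request.setQuery(QueryBuilders.rangeQuery("timestamp").lt("now-30d"));
            request.setRequestsPerSecond(200.0f); // average docs deleted per second
            return request;
        }
    }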

The original commit message read:

Due to elastic#47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed, users could find
that the first ML daily maintenance task deletes
a very large number of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results:

- Average of 200 documents per second
- Maximum of 10 million documents per day

(There is no throttling for state/forecast documents,
as these are expected to be lower volume.)

Relates elastic#47103
@elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core

@dimitris-athanasiou (Contributor) left a review:

LGTM

@dimitris-athanasiou commented:

run elasticsearch-ci/default-distro

@dimitris-athanasiou commented:

run elasticsearch-ci/packaging-sample

A review thread was opened on this comment in the new code:

    // Delete the documents gradually.
    // With DEFAULT_SCROLL_SIZE = 1000 this implies we spread deletion of 1 million documents over 5000 seconds ~= 83 minutes.
    // And we delete a maximum of 10000 batches per day (= 10 million documents per day if DEFAULT_SCROLL_SIZE = 1000).
    // If more documents than this have expired then some will be deleted on a subsequent day.
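
To sanity-check the arithmetic in that comment: the throttle spaces out scroll batches, so each 1000-document batch takes 1000 / 200 = 5 seconds, and a million documents take 5000 seconds, about 83 minutes. In runnable form:

    class ThrottleArithmetic {
        public static void main(String[] args) {
            float docsPerSecond = 200.0f; // requests_per_second throttle
            int batchSize = 1000;         // DEFAULT_SCROLL_SIZE
            System.out.println(batchSize / docsPerSecond);      // 5.0 seconds per batch
            System.out.println(1_000_000 / docsPerSecond / 60); // ~83.3 minutes per million docs
            System.out.println(10_000L * batchSize);            // 10,000,000-docs-per-day cap
        }
    }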

A reviewer commented:

This is executed per job, so the throttling applies only to one job. What if you have thousands of jobs? Do we need more protection one level up (maybe based on time)?

A further review comment:

Another observation (not sure if it makes a difference): the query does not disable scoring. Does it make sense to wrap it as constant_score? That might be more delete-friendly.
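
For reference, the wrapping being suggested looks like this in the Java query DSL (a sketch; the inner query is an illustrative placeholder):

    import org.elasticsearch.index.query.QueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;

    class ConstantScoreSketch {
        // constant_score gives every match the same score instead of computing
        // relevance, which is wasted work when the hits are only being deleted.
        static QueryBuilder unscored() {
            QueryBuilder inner = QueryBuilders.termQuery("job_id", "my-job"); // placeholder
            return QueryBuilders.constantScoreQuery(inner);
        }
    }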

A contributor replied:

We do not start the next job's deletion before the previous is complete.

@droberts195 (Contributor, Author) replied:

> We do not start the next job's deletion before the previous is complete.

I think the requests-per-second limit is still valuable, but with a separate scroll per job a 10-million-document limit is pretty useless. I will instead impose a time limit, say 8 hours, at the outermost level of the expired data removal.
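
A minimal sketch of that outermost time limit, matching the "rough" behaviour described in the final PR description (all names here are illustrative): the deadline is checked only between per-job operations, so a long-running operation can overshoot it.

    import java.util.List;

    class ExpiredDataRemovalSketch {
        private static final long LIMIT_NANOS = 8L * 60 * 60 * 1_000_000_000L; // 8 hours

        // Run per-job removals until done or out of time. The check happens only
        // between operations, so one slow operation can exceed the deadline.
        static void removeExpiredData(List<Runnable> perJobRemovals) {
            long deadline = System.nanoTime() + LIMIT_NANOS;
            for (Runnable removal : perJobRemovals) {
                if (System.nanoTime() - deadline >= 0) {
                    break; // leftover jobs are handled by the next daily run
                }
                removal.run();
            }
        }
    }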

@droberts195 (Contributor, Author) replied:

> The query does not disable scoring; does it make sense to wrap it as constant_score?

I think sorting by _doc disables scoring. It is also the simplest order in which to retrieve the documents from Lucene, so I will add a sort by _doc to all our delete-by-query searches related to expired data; that should give both benefits.
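
As a sketch, adding that sort to a delete-by-query looks like the following (index and query are illustrative; sorting on _doc walks Lucene's internal document order and computes no scores):

    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.index.reindex.DeleteByQueryRequest;

    class DocOrderDeleteSketch {
        // Sort by _doc so the scroll behind the delete-by-query reads
        // documents in index order without scoring them.
        static DeleteByQueryRequest sortedByDoc(String index) {
            DeleteByQueryRequest request = new DeleteByQueryRequest(index);
            request.setQuery(QueryBuilders.rangeQuery("timestamp").lt("now-30d")); // illustrative
            request.getSearchRequest().source().sort("_doc");
            return request;
        }
    }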

@droberts195 merged commit 865fe4f into elastic:master on Oct 2, 2019
@droberts195 deleted the throttle_results_deletion_dbq branch on Oct 2, 2019
droberts195 added three further commits referencing this pull request on Oct 2, 2019, each carrying the same commit message as the pull request description above.