How to prune non-relevant top documents automatically #55603
Comments
Pinging @elastic/es-search (:Search/Search) |
Another idea is to have a dynamic |
Dropping terms is one way to improve precision but there are others too:
When looking at ways to trim long tails of garbage, I was considering how to tell when switching from a stricter search strategy to a weaker one breaks the meaning of the query. I think this may be detectable if you are lucky enough to have well-categorised data (many ecommerce vendors spend a lot of time on this). There can be a step change in the diversity of categories as you switch from a strong strategy to a weaker one: the count of categories can act as a measure of the number of different meanings a query clause has. Consider this analysis of high-scoring versus low-scoring results in an ecommerce query: If we can organise results into buckets based on query clause strictness (I used large boosts to separate two clauses in the above example), then we can use the count of categories in each bucket as a measure of the focus of each clause. Poorly focused clauses might be those with hundreds of categories, and those are the ones we might choose to drop. |
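To make the category-count idea concrete, here is a minimal sketch in Python. It assumes each hit has already been attributed to the clause that matched it (for example via the large boosts mentioned above) and carries a single category label. The function names and the `max_categories` threshold are hypothetical and are not part of any Elasticsearch API.

```python
def category_counts_by_clause(hits):
    """Count distinct categories among the hits attributed to each clause.

    `hits` is an iterable of (clause_name, category) pairs. How a hit is
    attributed to a clause is an assumption left to the caller.
    """
    categories = {}
    for clause, category in hits:
        categories.setdefault(clause, set()).add(category)
    return {clause: len(cats) for clause, cats in categories.items()}


def clauses_to_drop(hits, max_categories):
    """Return clauses whose hits span more than `max_categories` categories.

    `max_categories` is a hypothetical, hand-picked threshold.
    """
    counts = category_counts_by_clause(hits)
    return sorted(clause for clause, n in counts.items() if n > max_categories)


# Invented toy data: the strict clause stays in one category,
# the loose clause scatters across four.
hits = [
    ("strict", "guitars"), ("strict", "guitars"),
    ("loose", "guitars"), ("loose", "toys"),
    ("loose", "books"), ("loose", "garden"),
]
print(category_counts_by_clause(hits))
print(clauses_to_drop(hits, max_categories=3))
```

A fixed threshold is the simplest choice; comparing the counts of adjacent buckets to look for the step change described above would be a natural refinement.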
Hello, how can I really "cut off" a frequent term so that it is ignored in the query (so that I only get documents containing at least "jump"), without having to find all frequent words and define them as stopwords? |
I am removing the blocker label for now. We have not yet decided whether to restore the functionality or provide a replacement. |
Pinging @elastic/es-search (Team:Search) |
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
In 7.0 we added an optimization that allows pure disjunction queries (OR) to run without visiting all matches of the most frequent terms. Prior to this version, users had to ensure that they removed the most frequent terms (stop word removal) or switch to the `common` terms query to get acceptable performance.

We've decided to deprecate the `common` terms query for this reason. Users shouldn't rely on a `cutoff_frequency` to ensure fast disjunctions. The `cutoff_frequency` should change when documents are added or deleted, and the frequency of the same term can differ even between replicas (since deleted docs are part of the count), which makes it slightly dangerous to use. A small change in your index can make some queries much slower because a high-frequency term no longer reaches the current `cutoff_frequency`.

However, the `common` terms query is also sometimes used to improve the precision of search results. For instance, the query `the OR beatles` would return top documents containing only `the` if no document contains the term `beatles`. Using the `common` terms query can ensure (assuming that the `cutoff_frequency` considers `the` a frequent term) that no results are returned in this case. This looks like a valid use case for this query, so we're wondering whether we should un-deprecate it, since we don't have a direct replacement for this feature.

One thing that was raised during the initial discussion is that we should look at improving the detection of highly frequent terms without requiring users to provide a precise `cutoff_frequency`. We also think it's worth discussing all options, which is why I am opening this issue and marking it as a blocker for 8.0.

I am curious to hear thoughts from users of the `common` terms query, particularly how you deal with changing indices to update the `cutoff_frequency`.
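The precision use case can be illustrated with a toy model in Python. This is only a sketch over an in-memory list of strings, not Elasticsearch's or Lucene's implementation: the function names, the whitespace tokenizer, the strict `>` comparison against the cutoff, and the fallback when every term is frequent are all assumptions made for illustration.

```python
from collections import Counter


def doc_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def disjunction_search(query, docs):
    """Plain OR: a document matches if it contains any query term."""
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d.lower().split())]


def common_terms_search(query, docs, cutoff_frequency):
    """Toy frequency-based split: frequent terms cannot select documents.

    A term is treated as high-frequency when the fraction of documents
    containing it exceeds `cutoff_frequency` (assumed comparison).
    """
    if not docs:
        return []
    df = doc_frequencies(docs)
    terms = query.lower().split()
    high = {t for t in terms if df[t] / len(docs) > cutoff_frequency}
    low = {t for t in terms if t not in high}
    if not low:
        # Simplification for this sketch: if every term is frequent,
        # require all of them to be present.
        return [d for d in docs if high <= set(d.lower().split())]
    # Only low-frequency terms decide which documents match.
    return [d for d in docs if low & set(d.lower().split())]


docs = ["the rolling stones", "the who", "the kinks live"]
print(disjunction_search("the beatles", docs))        # every document contains "the"
print(common_terms_search("the beatles", docs, 0.5))  # no document contains "beatles"
```

In this sketch, `the` appears in every document and is therefore classified as frequent, so only `beatles` can select documents. The same corpus also shows the fragility described above: whether a term counts as frequent depends entirely on the current document counts.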