"Rare Terms" aggregation #20586
Comments
This is a really good idea.
@polyfractal really neat idea. I wonder if the bloom filter's accuracy will be highly sensitive to the data in question? (i.e. we may miss a lot of rare terms if the number of terms > max_doc_count is large enough that hash collisions in the bloom filter are pretty common). I'm not sure really, hence posing the question.
I was surprised to see this feature used more than I expected, so +1 to exploring how we can improve it.
Related to @abeyad's note about collisions, the initial sizing of the bloom filter might be tricky. For string fields we could use the cardinality of the field in the whole index as a basis, but for numerics we do not have such information.
I think we'd need a size parameter in addition to the max_doc_count option in order to keep the size of responses bounded. One thing that I wonder when reading your proposal is whether we could fold the same idea into the existing terms aggregation.
@abeyad Agreed that initial sizing could be tricky. Bloom filters are notorious for "saturating" if they are undersized, so that's a valid concern. I was imagining that bloom sizing would be configurable. If we could auto-size on string fields, that'd be great. Could we do something similar for numerics with field_stats-style data (e.g. min/max)?
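For concreteness, the standard bloom-filter sizing formulas give the bit count m and hash count k for n expected insertions at a target false-positive rate p. A minimal sketch (the helper name is hypothetical, not anything proposed in this thread):

```python
import math

def bloom_params(expected_insertions: int, fp_rate: float) -> tuple[int, int]:
    """Standard bloom-filter sizing: bits m and hash count k for n
    expected insertions at target false-positive rate p."""
    n = expected_insertions
    m = math.ceil(-n * math.log(fp_rate) / (math.log(2) ** 2))  # total bits
    k = max(1, round((m / n) * math.log(2)))                    # hash functions
    return m, k

# Sizing off the field's whole-index cardinality, as suggested above:
bits, hashes = bloom_params(expected_insertions=1_000_000, fp_rate=0.01)
print(f"{bits} bits (~{bits // 8 // 1024} KiB), {hashes} hashes")
```

Undersizing n here is exactly the "saturation" risk: the real false-positive rate climbs well above p as the filter fills.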
My only concern here is that it re-introduces a semi-unbounded error. Since the response size is limited, we can only state that you've reached the limit for docs < max_doc_count... and it's unclear how many more may have made the list. But that may be a perfectly reasonable trade-off, especially for the common case (max_doc_count: 1).
I like this, it would add a bit more predictability to how approximate the results of the terms agg are. Re: count-min, I think that may be tricky since count-min approximates poorly for the "long-tail" of a distribution, which is where these frequencies would be useful. Conservative updates + count-min-mean may help overcome it, but I think they are still the most accurate for the top-n of a zipfian distribution?
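For readers following along: count-min only ever overestimates, and the overestimate is proportionally worst for low-frequency (long-tail) items, which is precisely the range a rare-terms agg cares about. A toy sketch of the conservative-update variant mentioned above (hypothetical class, not from the proposal):

```python
import random

class CountMinSketch:
    """Toy count-min sketch with optional conservative updates."""
    def __init__(self, width: int, depth: int, conservative: bool = True):
        self.width, self.depth = width, depth
        self.conservative = conservative
        self.tables = [[0] * width for _ in range(depth)]
        self.seeds = [random.randrange(1 << 31) for _ in range(depth)]

    def _cells(self, item):
        return [(row, hash((self.seeds[row], item)) % self.width)
                for row in range(self.depth)]

    def add(self, item):
        cells = self._cells(item)
        if self.conservative:
            # Conservative update: only bump counters currently at the
            # minimum, which reduces overestimation of long-tail items.
            low = min(self.tables[r][c] for r, c in cells)
            for r, c in cells:
                if self.tables[r][c] == low:
                    self.tables[r][c] += 1
        else:
            for r, c in cells:
                self.tables[r][c] += 1

    def estimate(self, item) -> int:
        return min(self.tables[r][c] for r, c in self._cells(item))
```

Even with conservative updates, an item seen once can still share all of its cells with heavy hitters and look frequent, which is why count-min fits top-n better than lowest-n.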
We can have the min/max values if the fields are indexed, too. It might lead to significantly oversizing the bloom filter though, since not all values in the range might be used. If we decide to make it a separate aggregation, we could just decide that it only works on keyword fields (similar to how the stats agg only works on numeric fields).
Agreed. The benefit I was seeing is that it would allow returning a bounded error, which might be more useful than saying "this might be completely wrong", even if the error is highly overestimated.
Ah I see. Makes sense on both points :)
Just a quick update: I knocked together a really awful proof of concept. And it seems to work! I'll start cleaning it up and making it fit for a PR as time allows :)
This feature is very much needed for my project as well, so +1. When can we expect a release?
Great idea! I have quite a few cases where I could use it without external scripting. +1
@elastic/es-search-aggs
Have there been any status updates on this? I see there was POC code in December, then nothing more!
@Rudedog9d Still no ETA, but I've been working on it lately when I can find time. Branch is here if you want to see recent work: https://github.com/polyfractal/elasticsearch/tree/rare_terms2 The overall feature is mostly done, but we're refactoring how some of the reduction is done. And then I need to write tests and documentation. So we're probably 70% done, give or take.
Thanks for the update! Assuming this does get done soonish, do you think this would be a feature added to Elasticsearch 6, or rolled into a later release?
We don't generally make statements about specific release versions... since we're on the time-based release train, things just go in when they're done/ready. I can say that the RareTerms agg doesn't use/need any technical features that are only in 7.0+. So if it gets finished during the 6.x timeline, there's no reason it won't be part of the 6.x series :)
For anyone subscribed to this issue: thanks everyone for your patience; this issue has been open a very long time. Definitely looking for feedback on the new agg, as there are several knobs we can tweak and/or expose depending on how it works "in the wild".
I'd like to propose a `rare_terms` aggregation that collects the set of terms which have `n` or fewer occurrences. This is essentially the opposite of "top-n" queries.

The motivation is that today, the only way to accurately collect the "lowest-n" is to execute a terms aggregation with `size: INT_MAX` so that all terms are collected, then filter client-side or with a `bucket_selector`. This is obviously expensive: it requires the memory overhead of all the terms, plus executing leaf aggregations on all terms despite only caring about the "lowest-n". Sorting by count ascending on the terms agg without setting `size: INT_MAX` is trappy and we'd love to remove it :)
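To make the cost concrete, here is a hedged sketch (request bodies written as Python dicts; the field name and the `rare_terms` shape are illustrative, since the proposed agg doesn't exist yet) contrasting today's workaround with the proposal:

```python
# Today's workaround: collect *every* term, then filter the buckets.
# Memory and leaf-agg work scale with all terms, not just the rare ones.
workaround = {
    "aggs": {
        "all_terms": {
            "terms": {"field": "user_id", "size": 2147483647},  # INT_MAX
            "aggs": {
                "rare_only": {
                    "bucket_selector": {
                        "buckets_path": {"count": "_count"},
                        "script": "params.count <= 1",
                    }
                }
            },
        }
    }
}

# The proposal (shape illustrative, not final): collect only the terms
# seen max_doc_count times or fewer.
proposed = {
    "aggs": {
        "rare_users": {
            "rare_terms": {"field": "user_id", "max_doc_count": 1}
        }
    }
}
```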
Algorithm
The algorithm uses a map and a bloom filter. The map tracks exact counts for the candidate set of rare terms; the bloom filter tracks the set of terms whose counts have exceeded `max_doc_count`.
Pseudocode:
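A minimal runnable sketch of the collection step (Python standing in for pseudocode; the plain `set` is just a stand-in for the bloom filter):

```python
max_doc_count = 1
counts = {}    # exact counts for candidate rare terms
bloom = set()  # stand-in for the bloom filter of "not rare" terms

def collect(term):
    if term in bloom:        # already known to exceed max_doc_count
        return
    counts[term] = counts.get(term, 0) + 1
    if counts[term] > max_doc_count:
        del counts[term]     # no longer rare: evict the exact count...
        bloom.add(term)      # ...and remember it approximately

for t in ["a", "b", "a", "c", "a", "b"]:
    collect(t)
print(counts)  # {'c': 1} -- everything left in the map is rare
```

With a real bloom filter, a false positive on the `term in bloom` check would silently drop a genuinely rare term, which is the accuracy concern raised earlier in the thread.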
Note: this ignores all the technical bits about replaying cached docs, etc.
Some potential extensions:
- If `max_doc_count: 1`, we don't need to store the count and could instead just use a Set to further reduce the size
- Since `max_doc_count` is presumably small, counts could be bit-packed (e.g. `max_doc_count: 16` only needs 4 bits; see the sketch after this list), but we wouldn't be able to use a nice, convenient map :)
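A toy illustration of the bit-packing idea (hypothetical `PackedCounters`, two 4-bit counters per byte; it saturates at 15 for simplicity, whereas storing count minus one would cover counts up to 16 as noted above):

```python
class PackedCounters:
    """Toy 4-bit saturating counters, two per byte: a map-free way to
    track small per-term counts when max_doc_count is tiny."""
    def __init__(self, n_slots: int):
        self.data = bytearray((n_slots + 1) // 2)

    def get(self, slot: int) -> int:
        byte, half = divmod(slot, 2)
        return (self.data[byte] >> (half * 4)) & 0xF

    def increment(self, slot: int) -> int:
        byte, half = divmod(slot, 2)
        shift = half * 4
        val = (self.data[byte] >> shift) & 0xF
        if val < 0xF:  # saturate; a real impl would evict at max_doc_count
            self.data[byte] = (self.data[byte] & ~(0xF << shift) & 0xFF) | ((val + 1) << shift)
        return self.get(slot)

# Mapping term -> slot is elided; a real implementation would need term
# ordinals or hashing, plus a policy for collisions.
pc = PackedCounters(n_slots=8)
for _ in range(3):
    pc.increment(5)
print(pc.get(5))  # 3
```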
Properties
- Memory is bounded by the number of rare terms, instead of all terms as with `size: INT_MAX` today
- Leaf aggregations only execute on the rare terms, instead of all terms (similar to what you'd get with `collect_mode: breadth_first`)
- Counts are exact for terms at or below `max_doc_count` (e.g. no false positives)
- The `max_doc_count` threshold is configurable, although the tradeoff between this and the `terms` agg becomes less clear-cut as `max_doc_count` increases (e.g. `max_doc_count: 1000` will likely include a large portion of a zipfian distribution, so a `terms` agg coming from the other direction may be better)

Potentially this is sufficient that #17614 can be unblocked, and we can make the terms agg less trappy while providing better accuracy / space / time on rare-term aggregations.
I'm going to start poking at an implementation.
/cc @colings86 @jpountz