"Rare Terms" aggregation #20586
Comments
This is a really good idea.
@polyfractal really neat idea. I wonder if the bloom filter's accuracy will be highly sensitive to the data in question? (i.e. we may miss a lot of rare terms if the number of terms > max_doc_count is large enough that hash collisions in the bloom filter are pretty common). I'm not sure really, hence posing the question.
I was surprised to see this feature used more than I expected, so +1 to exploring how we can improve it.
Related to @abeyad's note about collisions, the initial sizing of the bloom filter might be tricky. For string fields we could use the cardinality of the field in the whole index as a basis, but for numerics we do not have such information.
I think we'd need a size parameter in addition to the max_doc_count option in order to keep the size of responses bounded. One thing that I wonder when reading your proposal is whether we could fold the same idea into the existing terms aggregation.
@abeyad Agreed that initial sizing could be tricky. Bloom filters are notorious for "saturating" if they are undersized, so that's a valid concern. I was imagining that bloom sizing would be configurable. If we could auto-size on string fields, that'd be great. Could we do something similar for numerics with field_stats-style data (e.g. min/max)?
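For concreteness, the standard bloom-filter sizing formulas give the bit count m and hash count k for n expected insertions at a target false-positive rate p. A minimal sketch (the helper name is hypothetical, not anything proposed in this thread):

```python
import math

def bloom_params(expected_insertions: int, fp_rate: float) -> tuple[int, int]:
    """Standard bloom-filter sizing: bits m and hash count k for n
    expected insertions at target false-positive rate p."""
    n = expected_insertions
    m = math.ceil(-n * math.log(fp_rate) / (math.log(2) ** 2))  # total bits
    k = max(1, round((m / n) * math.log(2)))                    # hash functions
    return m, k

# Sizing off the field's whole-index cardinality, as suggested above:
bits, hashes = bloom_params(expected_insertions=1_000_000, fp_rate=0.01)
print(f"{bits} bits (~{bits // 8 // 1024} KiB), {hashes} hashes")
```

Undersizing n here is exactly the "saturation" risk: the real false-positive rate climbs well above p as the filter fills.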
My only concern here is that it re-introduces a semi-unbounded error. Since the response size is limited, we can only state that you've reached the limit for docs < max_doc_count... and it's unclear how many more may have made the list. But that may be a perfectly reasonable trade-off, especially for the common case (max_doc_count: 1).
I like this, it would add a bit more predictability to how approximate the results of the terms agg are. Re: count-min, I think that may be tricky since count-min approximates poorly for the "long-tail" of a distribution, which is where these frequencies would be useful. Conservative updates + count-min-mean may help overcome it, but I think they are still the most accurate for the top-n of a zipfian distribution?
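For readers following along: count-min only ever overestimates, and the overestimate is proportionally worst for low-frequency (long-tail) items, which is precisely the range a rare-terms agg cares about. A toy sketch of the conservative-update variant mentioned above (hypothetical class, not from the proposal):

```python
import random

class CountMinSketch:
    """Toy count-min sketch with optional conservative updates."""
    def __init__(self, width: int, depth: int, conservative: bool = True):
        self.width, self.depth = width, depth
        self.conservative = conservative
        self.tables = [[0] * width for _ in range(depth)]
        self.seeds = [random.randrange(1 << 31) for _ in range(depth)]

    def _cells(self, item):
        return [(row, hash((self.seeds[row], item)) % self.width)
                for row in range(self.depth)]

    def add(self, item):
        cells = self._cells(item)
        if self.conservative:
            # Conservative update: only bump counters currently at the
            # minimum, which reduces overestimation of long-tail items.
            low = min(self.tables[r][c] for r, c in cells)
            for r, c in cells:
                if self.tables[r][c] == low:
                    self.tables[r][c] += 1
        else:
            for r, c in cells:
                self.tables[r][c] += 1

    def estimate(self, item) -> int:
        return min(self.tables[r][c] for r, c in self._cells(item))
```

Even with conservative updates, an item seen once can still share all of its cells with heavy hitters and look frequent, which is why count-min fits top-n better than lowest-n.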
We can have the min/max values if the fields are indexed, too. It might lead to significantly oversizing the bloom filter though, since not all values in the range might be used. If we decide to make it a separate aggregation, we could just decide that it only works on keyword fields (similar to how the stats agg only works on numeric fields).
Agreed. The benefit I was seeing is that it would allow returning a bounded error, which might be more useful than saying "this might be completely wrong", even if the error is highly overestimated.
Ah I see. Makes sense on both points :)
Just a quick update: I knocked together a really awful proof of concept. And it seems to work! I'll start cleaning it up and making it fit for a PR as time allows :)
This feature is very much needed for my project as well, so +1. When can we expect a release?
Great idea! I have quite a few cases where I could use it without external scripting. +1
@elastic/es-search-aggs
Have there been any status updates on this? I see there was POC code in December, then nothing more!
@Rudedog9d Still no ETA, but I've been working on it lately when I can find time. Branch is here if you want to see recent work: https://github.com/polyfractal/elasticsearch/tree/rare_terms2 The overall feature is mostly done, but we're refactoring how some of the reduction is done. And then I need to write tests and documentation. So we're probably 70% done, give or take.
Thanks for the update! Assuming this does get done soonish, do you think this would be a feature added to Elasticsearch 6, or rolled into a later release?
We don't generally make statements about specific release versions... since we're on the time-based release train, things just go in when they're done/ready. I can say that the RareTerms agg doesn't use/need any technical features that are only in 7.0+. So if it gets finished during the 6.x timeline, there's no reason it won't be part of the 6.x series :)
For anyone subscribed to this issue: thanks everyone for your patience; this issue has been open a very long time. Definitely looking for feedback on the new agg, as there are several knobs we can tweak and/or expose depending on how it works "in the wild".
I'd like to propose a `rare_terms` aggregation that collects the set of terms which have `n` or fewer occurrences. This is essentially the opposite of "top-n" queries.

The motivation is that today, the only way to accurately collect the "lowest-n" is to execute a terms aggregation with `size: INT_MAX` so that all terms are collected, then filter client-side or with a `bucket_selector`. This is obviously expensive: it requires the memory overhead of all the terms, plus executing leaf aggregations on all terms despite only caring about the "lowest-n". Sorting by count ascending on the terms agg without setting `size: INT_MAX` is trappy and we'd love to remove it :)
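To make the cost concrete, here is a hedged sketch (request bodies written as Python dicts; the field name and the `rare_terms` shape are illustrative, since the proposed agg doesn't exist yet) contrasting today's workaround with the proposal:

```python
# Today's workaround: collect *every* term, then filter the buckets.
# Memory and leaf-agg work scale with all terms, not just the rare ones.
workaround = {
    "aggs": {
        "all_terms": {
            "terms": {"field": "user_id", "size": 2147483647},  # INT_MAX
            "aggs": {
                "rare_only": {
                    "bucket_selector": {
                        "buckets_path": {"count": "_count"},
                        "script": "params.count <= 1",
                    }
                }
            },
        }
    }
}

# The proposal (shape illustrative, not final): collect only the terms
# seen max_doc_count times or fewer.
proposed = {
    "aggs": {
        "rare_users": {
            "rare_terms": {"field": "user_id", "max_doc_count": 1}
        }
    }
}
```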
Algorithm
The algorithm uses a map and a bloom filter. The map tracks exact counts for the candidate set of rare terms; the bloom filter tracks the set of terms whose counts have exceeded `max_doc_count`.
Pseudocode:
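A minimal runnable sketch of the collection step (Python standing in for pseudocode; the plain `set` is just a stand-in for the bloom filter):

```python
max_doc_count = 1
counts = {}    # exact counts for candidate rare terms
bloom = set()  # stand-in for the bloom filter of "not rare" terms

def collect(term):
    if term in bloom:        # already known to exceed max_doc_count
        return
    counts[term] = counts.get(term, 0) + 1
    if counts[term] > max_doc_count:
        del counts[term]     # no longer rare: evict the exact count...
        bloom.add(term)      # ...and remember it approximately

for t in ["a", "b", "a", "c", "a", "b"]:
    collect(t)
print(counts)  # {'c': 1} -- everything left in the map is rare
```

With a real bloom filter, a false positive on the `term in bloom` check would silently drop a genuinely rare term, which is the accuracy concern raised earlier in the thread.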
Note: this ignores all the technical bits about replaying cached docs, etc.
Some potential extensions:
- If `max_doc_count: 1`, we don't need to store the count and could instead just use a Set to further reduce the size
- Since `max_doc_count` is presumably small, counts could be bit-packed (e.g. `max_doc_count: 16` only needs 4 bits; see the sketch after this list), but we wouldn't be able to use a nice, convenient map :)
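A toy illustration of the bit-packing idea (hypothetical `PackedCounters`, two 4-bit counters per byte; it saturates at 15 for simplicity, whereas storing count minus one would cover counts up to 16 as noted above):

```python
class PackedCounters:
    """Toy 4-bit saturating counters, two per byte: a map-free way to
    track small per-term counts when max_doc_count is tiny."""
    def __init__(self, n_slots: int):
        self.data = bytearray((n_slots + 1) // 2)

    def get(self, slot: int) -> int:
        byte, half = divmod(slot, 2)
        return (self.data[byte] >> (half * 4)) & 0xF

    def increment(self, slot: int) -> int:
        byte, half = divmod(slot, 2)
        shift = half * 4
        val = (self.data[byte] >> shift) & 0xF
        if val < 0xF:  # saturate; a real impl would evict at max_doc_count
            self.data[byte] = (self.data[byte] & ~(0xF << shift) & 0xFF) | ((val + 1) << shift)
        return self.get(slot)

# Mapping term -> slot is elided; a real implementation would need term
# ordinals or hashing, plus a policy for collisions.
pc = PackedCounters(n_slots=8)
for _ in range(3):
    pc.increment(5)
print(pc.get(5))  # 3
```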
Properties
- Memory is bounded by the number of rare terms, instead of all terms as with `size: INT_MAX` today
- Leaf aggregations only execute on the rare terms, instead of all terms (similar to what you'd get with `collect_mode: breadth_first`)
- Counts are exact for terms at or below `max_doc_count` (e.g. no false positives)
- The `max_doc_count` threshold is configurable, although the tradeoff between this and the `terms` agg becomes less clear-cut as `max_doc_count` increases (e.g. `max_doc_count: 1000` will likely include a large portion of a zipfian distribution, so a `terms` agg coming from the other direction may be better)

Potentially this is sufficient that #17614 can be unblocked, and we can make the terms agg less trappy while providing better accuracy / space / time on rare-term aggregations.
I'm going to start poking at an implementation.
/cc @colings86 @jpountz