
"Rare Terms" aggregation #20586

Closed
polyfractal opened this issue Sep 20, 2016 · 15 comments · Fixed by #35718
Closed

"Rare Terms" aggregation #20586

polyfractal opened this issue Sep 20, 2016 · 15 comments · Fixed by #35718

Comments

@polyfractal
Copy link
Contributor

I'd like to propose a rare_terms aggregation that collects the set of terms which have n or fewer occurrences. This is essentially the opposite of "top-n" queries.

The motivation is that today, the only way to accurately collect the "lowest-n" is to execute a terms aggregation with size: INT_MAX so that all terms are collected, then filter client-side or with bucket_selector. This is obviously expensive: it requires the memory overhead of all the terms, plus executing leaf aggregations on all terms despite only caring about the "lowest-n".

Sorting by count ascending on the terms agg without setting size INT_MAX is trappy and we'd love to remove it :)

Algorithm

The algorithm uses a map and a bloom filter. The map tracks the total set of rare terms. The bloom filter tracks the set of terms that have been seen more than max_doc_count times.

Pseudocode:

Map<String, Integer> terms = new HashMap<>();
BloomFilter bloom = new BloomFilter();

function collect(term) {

  // Check to see if this term is already in our bloom filter
  if (bloom.get(term) == false) {
    Integer value = terms.get(term);
    if (value == null) {

      // We've never seen this term before, initialize its counter to 1
      terms.put(term, 1);
    } else {

      value += 1;

      // We've seen this term before; if it is still at or below the
      // threshold, just record the incremented counter
      if (value <= max_doc_count) {
        terms.put(term, value);
      } else {
        // Otherwise we've breached the threshold: remove the term
        // from the map and add it to the bloom filter
        terms.remove(term);
        bloom.add(term);
      }
    }
  }

  /*
    There is no else clause for the bloom-filter check, because inclusion
    in the bloom filter means the term has been seen more than max_doc_count
    times and so is no longer "rare".

    Note that due to the nature of bloom filters, there will be some
    false positives, which translate into occasionally not adding a term
    to the "rare" list when it should be there.
  */
}

Note: this ignores all the technical bits about replaying cached docs, etc.
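
For concreteness, here is a minimal runnable sketch of the same collect loop, using Guava's BloomFilter as a stand-in (the class name, the filter sizing, and the constructor below are illustrative assumptions, not the eventual implementation):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RareTermsSketch {
    private final int maxDocCount;
    private final Map<String, Integer> terms = new HashMap<>();
    // Sizing is a guess: room for ~1M "common" terms at a 3% false-positive rate
    private final BloomFilter<String> bloom =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.03);

    public RareTermsSketch(int maxDocCount) {
        this.maxDocCount = maxDocCount;
    }

    public void collect(String term) {
        if (bloom.mightContain(term)) {
            return; // seen more than maxDocCount times already (or a false positive)
        }
        int value = terms.merge(term, 1, Integer::sum); // initialize to 1, else increment
        if (value > maxDocCount) {
            terms.remove(term); // breached the threshold: no longer rare
            bloom.put(term);
        }
    }

    public Map<String, Integer> rareTerms() {
        return terms; // term -> count for everything still at or below maxDocCount
    }
}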

Some potential extensions:

  • If max_doc_count: 1, we don't need to store the count and could instead just use a Set to further reduce the size
  • We could potentially be clever and not use a full int for each counter and instead do some bit-packing (e.g. max_doc_count: 16 only needs 4 bits per counter), but then we wouldn't be able to use a nice, convenient map :) (see the sketch just below)
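
A rough illustration of that bit-packing idea (a hypothetical helper; with max_doc_count: 16, stored values 0–15 can represent observed counts 1–16, and the term-to-slot mapping is glossed over here):

// Packs sixteen 4-bit counters per long. A stored value of 0-15 represents
// an observed count of 1-16; a slot is freed once its term stops being rare.
public class PackedCounters {
    private final long[] words;

    public PackedCounters(int numCounters) {
        this.words = new long[(numCounters + 15) / 16];
    }

    public int get(int slot) {
        int shift = (slot % 16) * 4;
        return (int) ((words[slot / 16] >>> shift) & 0xF);
    }

    public void set(int slot, int value) {
        int shift = (slot % 16) * 4;
        long cleared = words[slot / 16] & ~(0xFL << shift); // zero this slot's 4 bits
        words[slot / 16] = cleared | ((long) (value & 0xF) << shift);
    }
}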

Properties

  • Size is linear in the number of "rare" terms, plus some constant overhead.
    • Antagonistic input (where the majority of terms are "rare") is still expensive, but better than a terms aggregation
    • The worst case, where all terms are rare, is basically equivalent to a terms agg with size: INT_MAX
    • Best we can tell, in all cases it is no worse than a terms aggregation in space or time
  • Must be executed as a deferred bucket agg (e.g. collect_mode: breadth_first)
    • As a benefit, leaf aggregations are only executed on the actually rare terms
  • Resulting rare terms are guaranteed to satisfy max_doc_count (i.e. no false positives)
  • Some terms may be rare but not represented in the list (i.e. there are false negatives)
  • The max_doc_count threshold is configurable, although the tradeoff between this and the terms agg becomes less clear-cut as max_doc_count increases (e.g. max_doc_count: 1000 will likely include a large portion of a zipfian distribution, so a terms agg coming from the other direction may be better)

Potentially this is sufficient to unblock #17614, and we can make the terms agg less trappy while providing better accuracy / space / time on rare term aggregations.

I'm going to start poking at an implementation.

/cc @colings86 @jpountz

@evanvolgas

This is a really good idea

@abeyad

abeyad commented Sep 21, 2016

@polyfractal really neat idea. I wonder if the bloom filter's accuracy will be highly sensitive to the data in question? (i.e. we may miss a lot of rare terms if the number of terms > max_doc_count is large enough that hash collisions in the bloom filter are pretty common). I'm not sure really, hence posing the question.
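
(For a rough sense of that sensitivity: the standard false-positive estimate for a bloom filter with m bits, k hash functions, and n inserted elements is p ≈ (1 − e^(−kn/m))^k, so once the number of common terms pushed into the filter grows well past what m was sized for, the rare-term miss rate climbs quickly.)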

@jpountz
Contributor

jpountz commented Sep 21, 2016

I was surprised to learn this feature is used more than I expected, so +1 to exploring how we can improve it.

BloomFilter bloom = new BloomFilter()

Related to @abeyad's note about collisions, the initial sizing of the bloom filter might be tricky. For string fields we could use the cardinality of the field in the whole index as a basis, but for numerics we do not have such information.

max_doc_count threshold is configurable, although the tradeoff between this and the terms agg becomes less clear-cut as max_doc_count increases (e.g. max_doc_count: 1000 will likely include a large portion of a zipfian distribution, so a terms agg coming from the other direction may be better)

I think we'd need both a size parameter in addition to the max_doc_count option in order to keep the size of responses bounded.

One thing I wonder while reading your proposal is whether we could fold the same idea into the existing terms aggregation. For instance, we could let the terms aggregation count all terms as today, except that when counts are sorted in ascending order, the InternalTerms impl would also include a bloom filter containing all terms except the shard_size ones with the lowest counts (one important benefit here is that the bloom filter would be easy to size, since we would know up-front how many elements we need to put in it). Then at reduce time we would proceed as today, but terms that occur in none of the bloom filters would be reported with a doc count error of 0, while other terms would be reported with an unbounded error. By using a count-min sketch rather than a bloom filter, we might even be able to better order the rare terms and to report a bounded error (at the expense of higher memory usage).
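
For readers unfamiliar with it, a count-min sketch keeps a small 2-D array of counters indexed by independent hashes and answers frequency queries with one-sided error (estimates never undercount). A minimal illustrative version, not from this thread and with deliberately simplistic hashing:

import java.util.Random;

// Minimal count-min sketch: the estimate for an item is the minimum of its
// counters across rows, so it can overcount (collisions) but never undercount.
public class CountMinSketch {
    private final int depth;   // number of hash rows
    private final int width;   // counters per row
    private final long[][] counts;
    private final int[] seeds;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new Random(42).ints(depth).toArray();
    }

    private int bucket(String item, int row) {
        int h = item.hashCode() * 31 + seeds[row];
        return Math.floorMod(h, width);
    }

    public void add(String item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }
}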

@polyfractal
Contributor Author

@abeyad Agreed that initial sizing could be tricky. Bloom filters are notorious for "saturating" if they are undersized, so that's a valid concern.

I was imagining that bloom sizing would be configurable (à la precision_threshold for cardinality), so users could adjust it to their liking, e.g. try several settings to see what makes the most sense given their hardware / accuracy requirements.

If we could auto-size on string fields that'd be great. Could we do similar for numerics with field_stats style data (e.g. min/max)?
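
As a point of reference for the sizing question: the textbook formulas size a bloom filter at m = -n·ln(p) / (ln 2)^2 bits with k = (m/n)·ln 2 hash functions, for n expected insertions at false-positive rate p. A small illustrative helper (hypothetical, not proposed API):

// Textbook bloom filter sizing: given expected insertions n and a target
// false-positive rate p, compute the bit count m and hash-function count k.
public final class BloomSizing {
    public static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    public static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;           // expected "common" terms
        double p = 0.01;              // target false-positive rate
        long m = optimalBits(n, p);   // ~9.6M bits, i.e. roughly 1.2 MB
        System.out.println(m + " bits, " + optimalHashes(n, m) + " hashes");
    }
}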

I think we'd need both a size parameter in addition to the max_doc_count option in order to keep the size of responses bounded.

My only concern here is that it re-introduces a semi-unbounded error. Since the response size is limited, we can only state that you've reached the limit for docs < max_doc_count... and it's unclear how many more may have made the list.

But that may be a perfectly reasonable trade-off, especially since the common case (max_doc_count: 1) hopefully won't hit the size limit.

Then at reduce time we would also do like today but terms that occur in none of the bloom filters would be reported to have a doc count error of 0 while other terms would be reported to have an unbounded error.

I like this; it would add a bit more predictability to how approximate the results of the terms agg are.

Re: count-min, I think that may be tricky, since count-min approximates poorly for the "long tail" of a distribution, which is exactly where these frequencies would be useful. Conservative updates + count-min-mean may help overcome that, but I think count-min sketches are still most accurate for the top-n of a zipfian distribution?

@jpountz
Contributor

jpountz commented Sep 22, 2016

Could we do similar for numerics with field_stats style data (e.g. min/max)?

We can have the min/max values if the fields are indexed too. It might lead to significantly oversizing the bloom filter though since not all values in the range might be used. If we decide to make it a different aggregation, we could just decide that it only works on keyword fields (similarly to the fact that the stats agg only works on numeric fields).

Re: count-min, I think that may be tricky since count-min approximates poorly for the "long-tail" of a distribution, which is where these frequencies would be useful.

Agreed. The benefit I was seeing is that it would allow us to return a bounded error, which might be more useful than saying "this might be completely wrong", even if the error is highly overestimated.

@polyfractal
Contributor Author

Ah I see. Makes sense on both points :)

@polyfractal
Contributor Author

Just a quick update: I knocked together a really awful proof of concept. And it seems to work! I'll start cleaning it up and making it fit for a PR as time allows :)

@4islam

4islam commented Apr 25, 2017

This feature is very much needed for my project as well, so +1. When can we expect a release?

@lagrianitis

Great idea! I have quite a few cases where I could use it without external scripting. +1

@colings86
Contributor

@elastic/es-search-aggs

@Rudedog9d

Have there been any status updates on this? I see there was POC code in December, then nothing more!

@polyfractal
Contributor Author

@Rudedog9d Still no ETA, but I've been working on it lately when I can find time. Branch is here if you want to see recent work: https://github.com/polyfractal/elasticsearch/tree/rare_terms2

The overall feature is mostly done, but we're refactoring how some of the reduction is done. And then I need to write tests and documentation. So we're probably 70% done, give or take.

@Rudedog9d

Thanks for the update! Assuming this does get done soonish, do you think it would be a feature added to Elasticsearch 6, or rolled into a later release?

@polyfractal
Contributor Author

We don't generally make statements about specific release versions...since we're on the time-based release train, things just go in when they're done/ready.

I can say that the RareTerms agg doesn't use/need any technical features that are only in 7.0+. So if it gets finished during the 6.x timeline, there's no reason it won't be part of the 6.x series :)

@polyfractal
Contributor Author

For anyone subscribed to this issue: rare_terms just merged into 7.x via #35718, which means it should be released as part of 7.3.0 🎉

Thanks everyone for your patience, this issue has been open a very long time. Definitely looking for feedback on the new agg, there are several knobs we can tweak and/or expose depending on how it works "in the wild".
