Add RareTerms aggregation #35718
Conversation
This adds a `rare_terms` aggregation. It is an aggregation designed to identify the long-tail of keywords, e.g. terms that are "rare" or have low doc counts.

This aggregation is designed to be more memory efficient than the alternative, which is setting a terms aggregation to `size: LONG_MAX` (or worse, ordering a terms agg by count ascending, which has unbounded error).

This aggregation works by maintaining a map of terms that have been seen. A counter associated with each value is incremented when we see the term again. If the counter surpasses a predefined threshold, the term is removed from the map and inserted into a bloom filter. If a future term is found in the bloom filter we assume it was previously removed from the map and is "common".

The map keys are the "rare" terms after collection is done.
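For illustration, a minimal Java sketch of that collection strategy (this is not the aggregator's actual code: the names, the plain `HashMap`, and the abstract filter interface are stand-ins for the doc-values machinery and the bloom/cuckoo filter used in the PR):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the real aggregator collects from doc values and uses
// BigArrays-backed maps plus an approximate set (bloom filter here, cuckoo filter later in the PR).
class RareTermsSketch {
    /** Stand-in for the approximate "seen before, definitely common" structure. */
    interface ApproxSet {
        void add(String term);
        boolean mightContain(String term);
    }

    private final int maxDocCount;                                // the max_doc_count threshold
    private final Map<String, Long> candidates = new HashMap<>(); // potentially rare terms -> doc count
    private final ApproxSet common;                               // terms that crossed the threshold

    RareTermsSketch(int maxDocCount, ApproxSet common) {
        this.maxDocCount = maxDocCount;
        this.common = common;
    }

    void collect(String term) {
        if (common.mightContain(term)) {
            return;                                            // already deemed common (rare false positives possible)
        }
        long count = candidates.merge(term, 1L, Long::sum);    // increment the counter for this term
        if (count > maxDocCount) {
            candidates.remove(term);                           // too frequent: evict from the exact map...
            common.add(term);                                  // ...and remember it approximately
        }
    }

    Map<String, Long> rareTerms() {
        return candidates;                                     // whatever is left is "rare"
    }
}
```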
```java
/**
 * A bloom filter. Inspired by Guava bloom filter implementation though with some optimizations.
 */
public class BloomFilter implements Writeable, Releasable {
```
This class was resurrected from the depths of git. This BloomFilter used to be used by ES elsewhere (doc IDs I think?), but I just realized none of the tests made it through the resurrection.
I'll start looking for those tests, or add my own.
This class had a lot of extra cruft that wasn't needed anymore (string configuration parsing, factories, multiple hashing versions, etc) so I tried to simplify it where possible.
/cc @tsg
@polyfractal I left some comments
WARNING: When aggregating on multiple indices the type of the aggregated field may not be the same in all indices.
Some types are compatible with each other (`integer` and `long` or `float` and `double`) but when the types are a mix
of decimal and non-decimal number the terms aggregation will promote the non-decimal numbers to decimal numbers.
This can result in a loss of precision in the bucket values.
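A quick illustration of the kind of precision loss being warned about (not from the docs; just plain Java showing that a `double` cannot represent every `long` above 2^53):

```java
public class PromotionPrecision {
    public static void main(String[] args) {
        // Promoting a long to a double loses exactness once values exceed 2^53:
        long exact = (1L << 53) + 1;                  // 9007199254740993
        double promoted = (double) exact;             // rounds to 9007199254740992.0
        System.out.println(exact == (long) promoted); // prints false: the value (and bucket key) changed
    }
}
```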
I think we should exclude float and double fields from this aggregation since the long-tail is likely to be far too long to practically use this aggregation.
++ Seems reasonable to me. Would cut down some of the complexity of the agg too, which is a nice perk :)
```java
 */
public class BloomFilter implements Writeable, Releasable {

    // Some numbers:
```
It's not really clear what these numbers are, could you add more explanation?
@jpountz do you happen to know, or know who would? This class was taken from the old BloomFilter that I think was used for UUID lookups on segments.
These numbers used to correlate to the string that was passed in the config, and I think they are in the format
`<expected insertions> = <false positive probability> : <bloom size>, <num hashes>`
I don't know for sure but would assume the same format indeed.
👍 thanks. I'll reformat and tidy up the comment so it makes a bit more sense in the current code
```java
    private final Hashing hashing = Hashing.V1;

    /**
     * Creates a bloom filter based on the with the expected number
```
I think there are some words missing here: "Creates a bloom filter based on the ???? with the expected number"
```diff
- * Creates a bloom filter based on the with the expected number
+ * Creates a bloom filter based on the expected number
```
Actually, based on the below constructor, maybe there are some extra words?
```java
    /*
     * TODO(user): Put a warning in the javadoc about tiny fpp values,
     * since the resulting size is proportional to -log(p), but there is not
     * much of a point after all, e.g. optimalM(1000, 0.0000000000000001) = 76680
```
```diff
- * much of a point after all, e.g. optimalM(1000, 0.0000000000000001) = 76680
+ * much of a point after all, e.g. optimalNumOfBits(1000, 0.0000000000000001) = 76680
```
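For reference, these look like the standard Guava-style sizing formulas, where n is the expected number of insertions and p the target false-positive probability: m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions. A small sketch (not necessarily this class's exact code) that reproduces the 76680 from the comment:

```java
final class BloomSizing {
    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;   // avoid ln(0)
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        System.out.println(optimalNumOfBits(1000, 0.0000000000000001));   // 76680, as in the TODO above
    }
}
```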
```java
            data[i] = in.readLong();
        }
        this.numHashFunctions = in.readVInt();
        this.bits = new BitArray(data);
```
Nit: can we swap this to the line above so everything reading and building the `BitArray` is together?
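For illustration, the suggested ordering would read roughly like this (the length-prefix read is an assumption about what precedes the quoted lines):

```java
// Sketch of the suggested ordering: read the long[] and build the BitArray together,
// then read the remaining fields.
long[] data = new long[in.readVInt()];   // assumption: a length prefix precedes the quoted loop
for (int i = 0; i < data.length; i++) {
    data[i] = in.readLong();
}
this.bits = new BitArray(data);
this.numHashFunctions = in.readVInt();
```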
```java
            newBucketOrd = newBucketOrds.add(oldKey);
        } else {
            // Make a note when one of the ords has been deleted
            hasDeletedEntry = true;
```
To make sure this GC is working correctly I wonder if it's worth having a counter here and then checking the counter value is the same as the `numDeleted` that we expect at the end of this for loop? Another option would be to initialise the variable to `numDeleted` and decrement it here, ensuring it reaches 0.
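A toy sketch of that second option (names and data layout are hypothetical; the point is just the decrement-to-zero check):

```java
import java.util.Arrays;

// Hypothetical illustration of the "initialise to numDeleted and decrement" sanity check.
class GcCheckSketch {
    static long[] gcDeletedEntries(long[] ords, boolean[] deleted, long numDeleted) {
        long remaining = numDeleted;            // expected number of deleted entries
        long[] kept = new long[ords.length];
        int out = 0;
        for (int i = 0; i < ords.length; i++) {
            if (deleted[i]) {
                remaining--;                    // account for every deleted ord we skip
            } else {
                kept[out++] = ords[i];          // re-add live ords to the new map
            }
        }
        assert remaining == 0 : "GC missed " + remaining + " deleted entries";
        return Arrays.copyOf(kept, out);
    }
}
```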
```java
        ExecutionMode execution = ExecutionMode.MAP; // TODO global ords not implemented yet, only supports "map"

        DocValueFormat format = config.format();
        if ((includeExclude != null) && (includeExclude.isRegexBased()) && format != DocValueFormat.RAW) {
```
I think the `DocValueFormat.RAW` check is being used to determine that the field used is a string field. But I see a few issues here (unless I'm misunderstanding what this is doing):

- Users can apply custom formats to non-string fields
- The `valuesSource` has already been checked above to be a `ValuesSource.Bytes`, so this can only be a string field here?
I shamefully c/p this from the Terms agg factory :)
Lemme see if we can fix this in the Terms agg itself (in a separate PR) and then I'll pull the change forward into this one.
```java
    private long numDeleted = 0;

    @Override
    public void collect(int docId, long bucket) throws IOException {
```
Same comments apply as from `LongRareTermsAggregator` above. Also, since this logic is almost the same in three places, does it make sense to extract it to something common so we can fix it in one place and apply it to all implementations?
I tried hard to refactor collect() and gcDeletedEntries() into one place... and it's just not possible. There are too many differences between longs and BytesRef. Map get/set, ordinals, hashing, doc values, etc are all different and there aren't any shared types that allow it to be resolved easily :(
I skimmed through the patch. The general idea of how this works makes sense to me, here are some questions:

- Do we need a `shard_size` parameter? There could be millions of values that have a doc_count of 1 on each shard? And maybe a `size` parameter as well in case hundreds of shards are queried? I usually don't like adding parameters but I'm afraid that this aggregation might be hard to use without those?
- Maybe we could try to be smarter with the bloom filter and start with a set that contains hashes that we only upgrade to a lossy bloom filter when it starts using more memory than the equivalent bloom filter (sketched after this list).
- We should somehow register memory that we allocate for the bloom filter and other data structures to the circuit breakers.
- Do we need to support sub aggregations? It adds quite some complexity. Also compared to terms aggs a lot of terms might be pruned on the coordinating node because they exist on other shards as well, which might require to increase the shard size which in turn makes sub aggregations even heavier.
- I'm not convinced sharing the hierarchy with terms aggregations helps? It might even make it harder to do changes to the terms aggregation in the future?
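A rough sketch of that "exact set first, lossy filter later" idea (class, field, and threshold names are hypothetical; the lossy structure is stubbed out rather than implemented):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: keep exact hashes in a set while it is cheap, and only switch to a
// lossy approximate filter (bloom/cuckoo) once the set would cost more memory than the filter.
class ScalingApproxSetSketch {
    /** Placeholder for the lossy structure; deliberately not a real implementation. */
    interface ApproxFilter {
        void add(long hash);
        boolean mightContain(long hash);
    }

    private final long threshold;            // e.g. a count or byte-size cutoff
    private final ApproxFilter filter;       // pre-built lossy filter to upgrade into
    private Set<Long> hashes = new HashSet<>();
    private boolean setMode = true;

    ScalingApproxSetSketch(long threshold, ApproxFilter filter) {
        this.threshold = threshold;
        this.filter = filter;
    }

    void add(long hash) {
        if (setMode == false) {
            filter.add(hash);
            return;
        }
        hashes.add(hash);
        if (hashes.size() > threshold) {     // upgrade point: the set is no longer the cheaper option
            for (long h : hashes) {
                filter.add(h);
            }
            hashes = null;                   // release the exact set
            setMode = false;
        }
    }

    boolean mightContain(long hash) {
        return setMode ? hashes.contains(hash) : filter.mightContain(hash);
    }
}
```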
```java
    }

    // Note: We use this instead of java.util.BitSet because we need access to the long[] data field
    static final class BitArray {
```
What about Lucene's LongBitSet?
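For context, a long[]-backed bit set of this kind boils down to word/bit indexing like the following generic sketch (not the class's actual code):

```java
// Generic sketch of a long[]-backed bit set; 64 bits are packed into each long word.
final class SimpleBitArray {
    private final long[] words;

    SimpleBitArray(long numBits) {
        this.words = new long[(int) ((numBits + 63) >>> 6)];
    }

    void set(long index) {
        words[(int) (index >>> 6)] |= 1L << (index & 63);   // word = index / 64, bit = index % 64
    }

    boolean get(long index) {
        return (words[(int) (index >>> 6)] & (1L << (index & 63))) != 0;
    }

    long[] data() {
        return words;   // raw backing array, which is what the serialization code needs
    }
}
```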
Thanks for the reviews @colings86 @jpountz. Will try to get to them this week.
I agree this is an issue... but doesn't adding a `size`/`shard_size` parameter reintroduce the same problems as the terms agg? Perhaps we make it an all-or-nothing agg, and spell out the ramifications in the docs clearly? E.g. track as we add to the map of potentially-rare terms, and if we ever breach the...
++ Will look into these. I think they make sense, and if we don't mind a bit of extra c/p, decoupling from the terms agg would simplify a few things elsewhere.
I'm not sure... I feel like users may want to run sub-aggs on their rare terms. But not positive. @clintongormley @tsg do you have any thoughts on this?
@elastic/es-analytics-geo
@colings86 Pushed some updates to the documentation and tidied up some tests/comments. I think the new algo changes are ok to review. 🤞
Did a really quick pass but I need to more thoroughly go through the CuckooFilters again
I had a look into `CuckooFilter` and `SetBackedScalingCuckooFilter`, very cool. I left some comments in the merging logic.
```java
        if (isSetMode && other.isSetMode) {
            // Both in sets, merge collections then see if we need to convert to cuckoo
            hashes.addAll(other.hashes);
            maybeConvert();
```
In this case, if this filter is just under the threshold and the other one as well, we will end up with hashes being almost twice over the threshold. Is that desired?
I wonder if we can compute the final size and decide if we want to convert already and then apply the values to the converted filter.
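For illustration, that suggestion might look roughly like this variation of the quoted branch (helper names such as `convert()` and `threshold` are hypothetical, not the actual SetBackedScalingCuckooFilter code):

```java
if (isSetMode && other.isSetMode) {
    // Hypothetical variation: estimate the post-merge size first (an upper bound, since the two
    // sets may overlap) and convert to the approximate filter before merging if it would overflow.
    if (hashes.size() + other.hashes.size() > threshold) {
        convert();                          // leave set mode first
        other.hashes.forEach(this::add);    // then feed the other set's hashes through the normal add path
    } else {
        hashes.addAll(other.hashes);        // both comfortably fit: stay in set mode
        maybeConvert();
    }
}
```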
Hmm, yeah, we can go about twice over the threshold. Tricky to estimate if we should convert to a filter first though. If both sets are duplicates of each other, the total size might not change (or change much). But we won't know that until we've merged them together.
I think it won't matter too much if we go twice over the threshold, since the threshold is set very low relative to the size of the filters. E.g. the current (hard coded) threshold is 10,000 hashes. So 20k longs would be ~ 160kb, compared to the initial filter size of ~1.7mb.
that works for me
@elasticmachine update branch
LGTM
@elasticmachine run elasticsearch-ci/bwc
@elasticmachine update branch
Holiday, deferred review to Ignacio :)
This adds a `rare_terms` aggregation. It is an aggregation designed to identify the long-tail of keywords, e.g. terms that are "rare" or have low doc counts. This aggregation is designed to be more memory efficient than the alternative, which is setting a terms aggregation to size: LONG_MAX (or worse, ordering a terms agg by count ascending, which has unbounded error). This aggregation works by maintaining a map of terms that have been seen. A counter associated with each value is incremented when we see the term again. If the counter surpasses a predefined threshold, the term is removed from the map and inserted into a cuckoo filter. If a future term is found in the cuckoo filter we assume it was previously removed from the map and is "common". The map keys are the "rare" terms after collection is done.
Docs for rare_terms were added in #35718, but neglected to link it from the bucket index page
Outstanding issues:

- What's the largest `max_doc_count` that we allow? Currently set to 10 but I think that's probably too low. It's mainly another safety mechanism, the max buckets limit will still trigger too. It might not make sense to even have a max here, since it's pretty data-dependent.
- `//TODO review` comments

Closes #20586 (finally!)
@andyb-elastic @not-napoleon tagged you both as reviewers in case you're interested, but no pressure if not, or too busy :)
Also /cc @clintongormley