Implement adaptive BloomFilter algorithm #251
Signed-off-by: Chen Dai <[email protected]>
By default, the Adaptive BloomFilter can handle at most 1024*1024 NDV? If the user already knows the range of NDV, they should use ClassicBloomFilter instead, right?
Yes. For the recommendation, I updated here: #251 (comment). I will redo the benchmark and post the test results and conclusion once all PRs are merged. Thanks!
Thx!
Description
This pull request (PR) introduces an enhanced BloomFilter implementation that extends the capabilities of the previously added classic BloomFilter. This new implementation builds the BloomFilter adaptively, determining optimal parameters for construction without relying on prior knowledge of cardinality.
Documentation: the user manual will be updated in the next PR, which adds SQL support.
PR Planned
Detailed Design
New Classes
BloomFilterFactory: creates or deserializes a BloomFilter.
AdaptiveBloomFilter: builds a BloomFilter adaptively.
Adaptive Algorithm
The Adaptive BloomFilter algorithm adjusts to varying cardinalities by initially creating 10 BloomFilter candidates, each sized for double the expected number of items (NDV, number of distinct values) of the previous one. Each time a previously unseen element is inserted, the cardinality counter is incremented. A BloomFilter's put result indicates whether the item is being seen for the first time, so the counter uses the put result of the largest candidate, which is the most accurate.
Finally, the algorithm selects the candidate whose NDV is just greater than the current cardinality as the best candidate. To reduce the overhead of maintaining multiple candidates, candidates with an NDV smaller than the best are skipped during put and merge operations. If the cardinality exceeds the largest candidate's NDV, the algorithm still designates the largest candidate as the best, even though its false positive rate (FPP) increases due to overflow.
In summary, this adaptive approach selects a BloomFilter of the right size even without prior knowledge of cardinality, which keeps both storage overhead and accuracy in check across diverse scenarios.
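The algorithm described above can be sketched in a few dozen lines. This is a minimal illustration, not the code in this PR: the class names `SimpleBloomFilter` and `AdaptiveBloomFilterSketch`, the starting NDV of 1024, the 3% FPP, the SHA-256 based hashing, and the choice to sum cardinalities on merge are all assumptions made for the example.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Toy Bloom filter sized for an expected NDV and a target FPP (illustrative only)."""

    def __init__(self, expected_ndv, fpp=0.03):
        self.expected_ndv = expected_ndv
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(-expected_ndv * math.log(fpp) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_ndv * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing derived from one SHA-256 digest (an assumption of this sketch).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def put(self, item):
        """Set the item's bits; return True if any bit changed (item not seen before)."""
        changed = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                self.bits[byte] |= mask
                changed = True
        return changed

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class AdaptiveBloomFilterSketch:
    """Keeps candidates with doubling NDV and tracks an approximate cardinality."""

    def __init__(self, num_candidates=10, first_ndv=1024, fpp=0.03):
        # first_ndv and fpp are hypothetical defaults chosen for this example.
        self.candidates = [SimpleBloomFilter(first_ndv << i, fpp) for i in range(num_candidates)]
        self.cardinality = 0

    def _best_index(self):
        # First candidate whose NDV exceeds the cardinality; else the largest one.
        for i, candidate in enumerate(self.candidates):
            if candidate.expected_ndv > self.cardinality:
                return i
        return len(self.candidates) - 1

    def best_candidate(self):
        return self.candidates[self._best_index()]

    def put(self, item):
        # Skip candidates smaller than the current best to save work.
        is_new = False
        for candidate in self.candidates[self._best_index():]:
            is_new = candidate.put(item)  # the last result comes from the largest candidate
        if is_new:
            self.cardinality += 1

    def merge(self, other):
        # Assumption: the summed cardinality is used as an upper bound after merging.
        self.cardinality += other.cardinality
        start = self._best_index()
        for mine, theirs in zip(self.candidates[start:], other.candidates[start:]):
            for i, byte in enumerate(theirs.bits):
                mine.bits[i] |= byte
```

Because the best index only moves toward larger candidates as the cardinality grows, every candidate at or above the current best has received every item inserted so far, so skipping the smaller ones never introduces false negatives in the candidate that is finally selected.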
Distributed BloomFilter Aggregation
To better understand how this works in Spark, here is the basic workflow when running a query such as
SELECT input_file_name(), bloom_filter_agg(clientip) FROM http_logs GROUP BY input_file_name()
:
BloomFilterAgg creates a BloomFilter instance and puts items into it
BloomFilterAgg serializes the BloomFilter and merges those within the same bucket together after shuffle
BloomFilterAgg serializes the BloomFilter again as the final output result
Benchmark Test
Below are the initial benchmark test results. We will conduct a comprehensive test after all PRs have been finalized and merged into the main branch. As of now, the conclusions drawn from the initial results are as follows:
With Prior Knowledge of Uniform Cardinality
Without Prior Knowledge or Large Variations in Cardinality:
but cardinality varies significantly across files
Issues Resolved
#206
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.