Implement adaptive BloomFilter algorithm #251

Merged

dai-chen
Collaborator

@dai-chen dai-chen commented Feb 9, 2024

Description

This pull request (PR) introduces an enhanced BloomFilter implementation that extends the capabilities of the previously added classic BloomFilter. This new implementation builds the BloomFilter adaptively, determining optimal parameters for construction without relying on prior knowledge of cardinality.

Documentation: the user manual will be updated in the next PR, which adds SQL support.

  • Adaptive BloomFilter parameters
    • num_candidates: number of candidate BloomFilters (the adaptive algorithm uses 10 by default)
    • fpp: false positive probability (passed to underlying BloomFilter algorithm)
  • Classic BloomFilter parameters
    • num_items: expected maximum number of distinct items (NDV)
    • fpp: same as above
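
To make the two classic parameters concrete, the sketch below shows how num_items and fpp typically translate into a filter size, using the standard textbook sizing formulas. Whether the underlying BloomFilter in this PR uses exactly these formulas is an assumption; the function names and sample values are hypothetical and chosen only for illustration.

```python
import math


def optimal_num_bits(num_items: int, fpp: float) -> int:
    """Textbook bit count for n items at target fpp: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-num_items * math.log(fpp) / (math.log(2) ** 2))


def optimal_num_hashes(num_items: int, num_bits: int) -> int:
    """Textbook hash function count: k = (m / n) * ln 2, at least 1."""
    return max(1, round(num_bits / num_items * math.log(2)))


if __name__ == "__main__":
    # Illustrative values only; fpp = 0.01 is not a default taken from the PR.
    for n in (1_000, 200_000, 1_000_000):
        bits = optimal_num_bits(n, 0.01)
        print(n, bits, optimal_num_hashes(n, bits))
```

The size grows linearly with num_items, which is why a classic filter sized for the maximum cardinality is wasteful for files that hold far fewer distinct values.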

PR Planned

Detailed Design

New Classes

  1. Added new abstraction BloomFilterFactory that creates or deserializes a BloomFilter.
  2. Added new AdaptiveBloomFilter that builds BloomFilter adaptively.

Screenshot 2024-03-11 at 9 36 06 AM

Adaptive Algorithm

The Adaptive BloomFilter algorithm adjusts to varying cardinalities by initially creating 10 candidate BloomFilters, each with double the expected number of items (NDV, number of distinct values) of the previous one. A cardinality counter is incremented whenever a previously unseen element is inserted. Because a BloomFilter's put result indicates whether the item is being seen for the first time, the counter relies on the put result of the largest candidate, which is the most accurate.

Finally, the algorithm selects as the best candidate the one whose NDV is just greater than the current cardinality. To reduce the overhead of maintaining multiple candidates, candidates with an NDV smaller than the best candidate's are ignored during put and merge operations. If the cardinality exceeds the largest candidate's NDV, the algorithm designates the largest candidate as the best, even though its actual false positive rate rises above the target FPP due to overflow.

In summary, this adaptive approach selects an appropriately sized BloomFilter even without prior knowledge of cardinality, balancing index size and accuracy across diverse scenarios.
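
The algorithm described above can be sketched roughly as follows. This is a toy model, not the PR's implementation: the class names, the SHA-256 based hashing, the base_items starting NDV, and the fpp default are all assumptions made for illustration. Only the 10-candidate default, the doubling of NDV, the use of the largest candidate's put result, and the skipping of smaller candidates come from the description.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Toy Bloom filter whose put() reports whether any bit changed."""

    def __init__(self, expected_items: int, fpp: float):
        self.expected_items = expected_items
        self.num_bits = math.ceil(-expected_items * math.log(fpp) / (math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing over one SHA-256 digest (an illustrative choice).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def put(self, item: str) -> bool:
        changed = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit
                changed = True
        return changed  # True means the item was definitely not seen before

    def might_contain(self, item: str) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


class AdaptiveBloomFilter:
    """Keeps candidates with doubling NDV and tracks an approximate cardinality."""

    def __init__(self, num_candidates: int = 10, fpp: float = 0.01, base_items: int = 1024):
        # base_items is a hypothetical starting NDV; each candidate doubles it.
        self.candidates = [
            SimpleBloomFilter(base_items << i, fpp) for i in range(num_candidates)
        ]
        self.cardinality = 0

    def _best_index(self) -> int:
        # First candidate whose NDV is just greater than the current cardinality;
        # fall back to the largest candidate on overflow.
        for i, candidate in enumerate(self.candidates):
            if candidate.expected_items > self.cardinality:
                return i
        return len(self.candidates) - 1

    def put(self, item: str) -> bool:
        changed = False
        # Candidates smaller than the current best are skipped to save work.
        for candidate in self.candidates[self._best_index():]:
            changed = candidate.put(item)  # the last result is the largest candidate's
        if changed:
            self.cardinality += 1
        return changed

    def best_candidate(self) -> SimpleBloomFilter:
        return self.candidates[self._best_index()]


if __name__ == "__main__":
    abf = AdaptiveBloomFilter()
    for i in range(3000):
        abf.put(f"client-{i}")
    print(abf.cardinality, abf.best_candidate().expected_items)
```

Because the cardinality counter never decreases, a candidate that has been skipped once is never needed again, so the best candidate has always received every inserted item.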

Screenshot 2024-03-11 at 12 59 17 PM

Distributed BloomFilter Aggregation

To better understand how this works in Spark, here is the basic workflow when running a query such as SELECT input_file_name(), bloom_filter_agg(clientip) FROM http_logs GROUP BY input_file_name():

  1. BloomFilterAgg creates a BloomFilter instance and puts items into it
  2. BloomFilterAgg serializes the BloomFilter and, after the shuffle, merges the BloomFilters that belong to the same bucket
  3. BloomFilterAgg serializes the merged BloomFilter again as the final output result
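
The three steps above can be illustrated with a small stand-alone sketch. It does not use Spark or the PR's classes: the fixed filter size, the hashing scheme, and all function names are hypothetical. It only shows why partial filters built on different tasks can be serialized, shuffled, and merged, since merging two same-sized Bloom filters is a bitwise OR. For the adaptive filter, the same merge would presumably be applied candidate by candidate, skipping candidates smaller than the best one.

```python
import hashlib
from functools import reduce

NUM_BITS = 1 << 13  # hypothetical fixed size so that all partial filters are mergeable
NUM_HASHES = 5      # hypothetical number of hash functions


def positions(item: str):
    """Bit positions for an item, via double hashing over a SHA-256 digest."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_partial(items) -> bytes:
    """Step 1: a task puts its rows into a filter and serializes it."""
    bits = bytearray(NUM_BITS // 8)
    for item in items:
        for pos in positions(item):
            bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)


def merge(left: bytes, right: bytes) -> bytes:
    """Step 2: after the shuffle, same-sized filters are merged with bitwise OR."""
    if len(left) != len(right):
        raise ValueError("only filters of the same size can be merged")
    return bytes(a | b for a, b in zip(left, right))


def might_contain(serialized: bytes, item: str) -> bool:
    """Membership test against the serialized final result (step 3 output)."""
    return all((serialized[p // 8] >> (p % 8)) & 1 for p in positions(item))


if __name__ == "__main__":
    # Two tasks saw rows from the same file, i.e. the same GROUP BY bucket.
    partials = [build_partial(["10.0.0.1", "10.0.0.2"]), build_partial(["10.0.0.3"])]
    final = reduce(merge, partials)
    print(might_contain(final, "10.0.0.3"))
```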

Screenshot 2024-03-11 at 10 46 50 AM

Benchmark Test

Below are the initial benchmark test results. We will conduct a comprehensive test after all PRs have been finalized and merged into the main branch. As of now, the conclusions drawn from the initial results are as follows:

  1. With Prior Knowledge of Uniform Cardinality
    • When users have prior knowledge of the cardinality, and the cardinality of each file is similar, it is recommended to employ the non-adaptive BloomFilter algorithm.
  2. Without Prior Knowledge, or With Large Variations in Cardinality
    • In scenarios where the cardinality is unknown or varies widely across files, it is advisable to use the default adaptive BloomFilter algorithm. [Test case T7]
    • Otherwise, the non-adaptive algorithm may still consume significant disk space. [Test cases T5 and T6]

| Test Case | Indexing Latency (sec) | Index Size | Files Scanned | Query Latency (sec) | Comment |
|---|---|---|---|---|---|
| T1: No Index | 0 | 0 | 1045 | 641 | |
| T2: ValueSet | 610 | 336k | 1033 | 651 | Only 12 files have a cardinality lower than 100 |
| T3: ValueSet(50k) | 663 | 193m | 535 | 455 | |
| T4: ValueSet(1m) | 856 | 1g | 1 | 13 | |
| T5: BloomFilter(200k) | 740 | 543m | 1 | 14 | User has prior knowledge of maximum cardinality, but cardinality varies significantly across files |
| T6: BloomFilter(1m) | 784 | 1.7g | 1 | 15 | |
| T7: BloomFilter(adaptive) | 783 | 241m | 10 | 21 | 9 false positives |

Issues Resolved

#206

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added the enhancement and 0.2 labels Feb 9, 2024
@dai-chen dai-chen self-assigned this Feb 9, 2024
@dai-chen dai-chen changed the title from "Implement adaptive bloom filter algorithm" to "Implement adaptive BloomFilter algorithm" Feb 9, 2024
@dai-chen dai-chen added 0.3 and removed 0.2 labels Feb 28, 2024
@dai-chen dai-chen force-pushed the implement-adaptive-bloom-filter branch from 2343d4c to 11be2f9 March 8, 2024 00:47
@dai-chen dai-chen marked this pull request as ready for review March 12, 2024 16:45
@penghuo
Collaborator

penghuo commented Mar 13, 2024

The Adaptive BloomFilter algorithm dynamically adjusts to varying cardinalities by initially creating 10 BloomFilters, each with a doubled number of expected number of items (NDV, number of distinct values) as candidates.

By default, the Adaptive BloomFilter can handle at most 1024*1024 NDV? If the user already knows the range of NDV, they should use the classic BloomFilter, right?

@dai-chen
Collaborator Author

Yes. For the recommendation, I updated here: #251 (comment). I will redo the benchmark and post the test results and conclusions once all PRs are merged. Thanks!

Collaborator

@penghuo penghuo left a comment

Thx!

@dai-chen dai-chen merged commit 8cdc171 into opensearch-project:main Mar 13, 2024
4 checks passed
@dai-chen dai-chen deleted the implement-adaptive-bloom-filter branch March 13, 2024 18:50