feat: Runtime bloom filter in hash join operator #13147

mbutrovich · 2024-10-28T21:46:33Z

Which issue does this PR close?

Closes #.

Rationale for this change

Research literature describes bloom filters as a cheap pre-check to probe before performing a (possibly) more expensive hash table lookup. This may become more important if we introduce spilling support for hash join, where a page will need to be fetched from disk to perform a hash table lookup. I see larger changes like SIP #13054, but this is a much more naïve idea.

What changes are included in this PR?

Use fastbloom in the hash join executor to build a filter on the build side, then check the bloom filter first during the probe. The bloom filter is not yet tuned for size (fixed at 8192 bytes, which may not be ideal) or for the number of hash functions, and my Rust is still pretty rudimentary.
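To make the build-then-probe flow described above concrete, here is a minimal, std-only sketch. It does not use the fastbloom API: `TinyBloom`, `bit_positions`, `build_side`, and `probe` are hypothetical names invented for illustration, keys are assumed to be plain `u64` values, and the hash table is a simple `HashMap` rather than DataFusion's `JoinHashMap`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Derive `num_hashes` bit positions for `key` via double hashing.
fn bit_positions(key: u64, num_bits: u64, num_hashes: u32) -> impl Iterator<Item = u64> {
    let mut first = DefaultHasher::new();
    key.hash(&mut first);
    let a = first.finish();

    let mut second = DefaultHasher::new();
    (key, 0x9e37_79b9_7f4a_7c15_u64).hash(&mut second);
    let b = second.finish() | 1; // force an odd step, so it is never zero

    (0..u64::from(num_hashes)).map(move |i| a.wrapping_add(i.wrapping_mul(b)) % num_bits)
}

/// Hypothetical fixed-size Bloom filter (illustration only, not fastbloom).
pub struct TinyBloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl TinyBloom {
    pub fn new(num_bytes: usize, num_hashes: u32) -> Self {
        let words = (num_bytes / 8).max(1);
        Self { bits: vec![0; words], num_bits: words as u64 * 64, num_hashes }
    }

    pub fn insert(&mut self, key: u64) {
        for p in bit_positions(key, self.num_bits, self.num_hashes) {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// `false` means "definitely absent"; `true` means "possibly present".
    pub fn contains(&self, key: u64) -> bool {
        bit_positions(key, self.num_bits, self.num_hashes)
            .all(|p| (self.bits[(p / 64) as usize] & (1u64 << (p % 64))) != 0)
    }
}

/// Build phase: populate the hash table and the filter from the build-side keys.
pub fn build_side(
    keys: &[u64],
    num_bytes: usize,
    num_hashes: u32,
) -> (HashMap<u64, Vec<usize>>, TinyBloom) {
    let mut table: HashMap<u64, Vec<usize>> = HashMap::new();
    let mut bloom = TinyBloom::new(num_bytes, num_hashes);
    for (row, &key) in keys.iter().enumerate() {
        table.entry(key).or_default().push(row);
        bloom.insert(key);
    }
    (table, bloom)
}

/// Probe phase: consult the filter first, then the hash table.
/// Returns (build_row, probe_row) pairs.
pub fn probe(
    table: &HashMap<u64, Vec<usize>>,
    bloom: &TinyBloom,
    probe_keys: &[u64],
) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    for (probe_row, key) in probe_keys.iter().enumerate() {
        if !bloom.contains(*key) {
            continue; // skip the hash table lookup entirely
        }
        if let Some(build_rows) = table.get(key) {
            matches.extend(build_rows.iter().map(|&build_row| (build_row, probe_row)));
        }
    }
    matches
}
```

A false positive in the filter only costs one extra hash table lookup; it cannot change the join result, because the hash table remains the source of truth.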

Are these changes tested?

Existing hash join tests.

Edit: Running TPC-H.

Are there any user-facing changes?

No.

mbutrovich · 2024-10-28T23:18:07Z

The JSON files got mixed up, and performance is not thrilling yet. Closing for now.

Dandandan · 2024-10-29T01:39:02Z

@mbutrovich I was already a bit suspicious... could you share the current performance numbers?

For it to work well, I think the bloom filter probably needs to be pushed down towards the other side as much as possible.

Dandandan · 2024-10-29T18:37:12Z

Maybe pushing it below RepartitionExec would already show some improvement, if the bloom filter is fast enough 🤔

findepi · 2024-10-31T10:14:42Z

In Trino a similar feature is called Dynamic Filters, but it doesn't use Bloom filters for filtering.
AFAIK, Bloom filters were considered (that was the initial idea), but didn't end up being the chosen implementation strategy. The implementation tracks "ranges" of allowed values instead. I am not very familiar with the reasoning, but it wasn't incidental.
@mbutrovich would it be worthwhile to familiarize yourself with Dynamic Filters in Trino and the reasoning behind their design decisions? Their Slack is probably the best place to ask.

Dandandan · 2024-10-31T10:51:35Z

Thanks @findepi

Another approach in a single-node environment could be reusing the hash table from the join as the filter. This way no extra memory is allocated and there is no overhead for building the filter (in addition to it being more accurate). @mbutrovich that might be something worth trying?
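A rough sketch of what reusing the join's hash table as the filter could look like, assuming the build-side table can be shared behind an `Arc` with an upstream operator. `BuildTable`, `SharedKeyFilter`, and `selection` are hypothetical names for illustration and are not DataFusion APIs.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the join's build-side hash table (key -> build row indices).
pub type BuildTable = HashMap<u64, Vec<usize>>;

/// A filter that holds a shared reference to the existing table
/// instead of building a second data structure.
pub struct SharedKeyFilter {
    table: Arc<BuildTable>,
}

impl SharedKeyFilter {
    pub fn new(table: Arc<BuildTable>) -> Self {
        Self { table }
    }

    /// Return the indices of probe rows whose key exists on the build side.
    /// The check is exact, so there are no false positives.
    pub fn selection(&self, probe_keys: &[u64]) -> Vec<usize> {
        probe_keys
            .iter()
            .enumerate()
            .filter(|(_, key)| self.table.contains_key(*key))
            .map(|(row, _)| row)
            .collect()
    }
}
```

An upstream operator could apply `selection` to each batch before repartitioning, so rows that cannot match never reach the join. Cloning the `Arc` only bumps a reference count; the table itself is not copied.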

The main benefit of a bloom filter is saving memory. That might make sense in a distributed environment or when saving a filter to disk, but maybe not so much for pushing down a filter from a hash join.
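The memory-versus-accuracy trade-off can be estimated with the textbook Bloom filter approximation, where a filter of `m` bits holding `n` keys with `k` hash functions has a false-positive rate of roughly `(1 - e^(-k*n/m))^k`. The helpers below are a hypothetical sketch for reasoning about a fixed-size filter such as the 8192-byte one in this PR; they are not part of fastbloom or DataFusion.

```rust
/// Approximate false-positive rate of a Bloom filter with `num_bits` bits,
/// `num_keys` inserted keys, and `num_hashes` hash functions.
pub fn false_positive_rate(num_bits: f64, num_keys: f64, num_hashes: f64) -> f64 {
    (1.0 - (-num_hashes * num_keys / num_bits).exp()).powf(num_hashes)
}

/// Number of hash functions that minimizes the approximation above
/// for a given filter size and key count: (m / n) * ln 2.
pub fn optimal_num_hashes(num_bits: f64, num_keys: f64) -> f64 {
    (num_bits / num_keys) * std::f64::consts::LN_2
}
```

With the size fixed, the rate degrades quickly as the build side grows: once the number of keys exceeds the number of bits, almost every bit is set and the filter rejects close to nothing.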

Dandandan · 2024-10-31T10:55:44Z

datafusion/physical-plan/src/joins/utils.rs

@@ -246,6 +268,11 @@ pub trait JoinHashMapType {
        let next_chain = self.get_list();
        for (row_idx, hash_value) in iter {
            // Get the hash and find it in the index
+            if let Some(bloom_filter) = self.get_bloom_filter() {
+                if !bloom_filter.contains(hash_value) {


So the main "problem" here, I think, is that the filter is not pushed down. bloom_filter.contains is probably about as expensive as hash_map.get, so creating and probing the filter only adds overhead without any benefit.

findepi · 2024-10-31

It could be adaptive, e.g. when the filter is observed not to filter anything out, it could disable itself (forever, or for a couple of batches).
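One way the adaptive idea could be sketched, assuming the operator can count how many rows the filter rejected in each batch. `AdaptiveFilter`, `min_reject_ratio`, and `window` are hypothetical names and parameters; this version disables the filter permanently, and the "for a couple of batches" variant is not shown.

```rust
/// Tracks how selective a filter is and switches it off when it stops paying for itself.
pub struct AdaptiveFilter {
    enabled: bool,
    probed: u64,
    rejected: u64,
    /// Minimum fraction of rows the filter must reject to stay enabled.
    min_reject_ratio: f64,
    /// Number of probed rows to accumulate before evaluating the ratio.
    window: u64,
}

impl AdaptiveFilter {
    pub fn new(min_reject_ratio: f64, window: u64) -> Self {
        Self { enabled: true, probed: 0, rejected: 0, min_reject_ratio, window }
    }

    pub fn is_enabled(&self) -> bool {
        self.enabled
    }

    /// Record one batch: `probed` rows were checked and `rejected` of them were filtered out.
    pub fn observe(&mut self, probed: u64, rejected: u64) {
        if !self.enabled {
            return;
        }
        self.probed += probed;
        self.rejected += rejected;
        if self.probed >= self.window {
            let ratio = self.rejected as f64 / self.probed as f64;
            if ratio < self.min_reject_ratio {
                self.enabled = false; // stays off for the rest of the query
            }
            // Start a fresh window so old batches do not mask a change in selectivity.
            self.probed = 0;
            self.rejected = 0;
        }
    }
}
```

The probe loop would check `is_enabled()` once per batch and skip the filter entirely when it returns `false`.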

Dandandan · 2024-10-31

Yes, it could be adaptive. However, what I'm saying is that because it is used directly in the hash join itself, there is no actual performance benefit.
It needs to be pushed down below repartition / aggregate / scan etc. to be of any benefit (#13054).

findepi · 2024-11-01

> It needs to be pushed down below repartition / aggregate / scan etc. to be of any benefit

That's a good point.

If the dynamic filter were range-based and we could push it down to the file scan, it could allow file- and row-group-level pruning in Parquet.
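A simplified sketch of range-based pruning, assuming an inner join on an `i64` key and a single min/max range collected from the build side (a real implementation might track several ranges). `KeyRange`, `RowGroupStats`, and `surviving_row_groups` are hypothetical names and do not correspond to the parquet crate's or Trino's APIs.

```rust
/// Inclusive range of join keys observed on the build side.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct KeyRange {
    pub min: i64,
    pub max: i64,
}

/// Compute the build-side key range; `None` if the build side is empty.
pub fn range_of(build_keys: &[i64]) -> Option<KeyRange> {
    let min = *build_keys.iter().min()?;
    let max = *build_keys.iter().max()?;
    Some(KeyRange { min, max })
}

/// Stand-in for the min/max statistics stored per Parquet row group.
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
}

/// A row group can contain matching keys only if its [min, max] overlaps the build range.
pub fn may_match(range: KeyRange, stats: &RowGroupStats) -> bool {
    stats.max >= range.min && stats.min <= range.max
}

/// Indices of row groups that still need to be read.
/// Assumes an inner join: an empty build side prunes everything.
pub fn surviving_row_groups(range: Option<KeyRange>, groups: &[RowGroupStats]) -> Vec<usize> {
    match range {
        None => Vec::new(),
        Some(r) => groups
            .iter()
            .enumerate()
            .filter(|(_, stats)| may_match(r, stats))
            .map(|(idx, _)| idx)
            .collect(),
    }
}
```

Unlike a Bloom filter, a range can be compared directly against file and row-group statistics, so whole chunks of data can be skipped without reading any rows.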

mbutrovich and others added 6 commits October 15, 2024 15:54

Add bloom filter to hash join.

d4664d7

Add bloom filter to limit batch processing.

4916190

Prototype. Fails tests.

767933a

Passes tests.

0a61856

Merge branch 'apache:main' into hash_join_bloom

2e3ae6d

Swap out the hash function.

2c3e571

github-actions bot added the physical-expr Physical Expressions label Oct 28, 2024

mbutrovich changed the title ~~feat: runtime bloom filter in hash join operator~~ feat: Runtime bloom filter in hash join operator Oct 28, 2024

mbutrovich closed this Oct 28, 2024

Dandandan reviewed Oct 31, 2024
