feat: Runtime bloom filter in hash join operator #13147
Conversation
JSON files got mixed up, and performance is not thrilling yet. Closing for now.
@mbutrovich was already a bit suspicious... could you share the current performance? For it to work well, I think one probably needs to push the bloom filter down towards the other side as much as possible.
Maybe pushing it below
In Trino a similar feature is called Dynamic Filters, but it doesn't use Bloom filters for filtering.
Thanks @findepi. Another approach in a single-node environment could be re-using the hash table from the join as the filter; this way no extra memory is allocated and there is no overhead for building the filter (in addition to being more accurate). @mbutrovich, that might be something worth trying? The main benefit of a bloom filter is saving memory, which might make sense in a distributed environment or when saving a filter to disk, but maybe not so much for pushing down a filter from a hash join. A sketch of that alternative follows below.
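A hedged sketch of the "reuse the hash table" alternative (types and names are hypothetical, not DataFusion's actual JoinHashMap API): the probe side simply asks the existing build-side table whether a hash is present, so the table itself acts as the membership filter and no separate structure is built.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the join's build-side hash table:
/// hash value -> build-side row indices. Illustrative only.
type BuildSideTable = HashMap<u64, Vec<u32>>;

/// Pushed-down probe: reuse the already-built hash table as the filter,
/// so no bloom filter needs to be built, stored, or kept in sync.
fn build_side_may_contain(table: &BuildSideTable, probe_hash: u64) -> bool {
    table.contains_key(&probe_hash)
}
```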
@@ -246,6 +268,11 @@ pub trait JoinHashMapType {
        let next_chain = self.get_list();
        for (row_idx, hash_value) in iter {
            // Get the hash and find it in the index
            if let Some(bloom_filter) = self.get_bloom_filter() {
                if !bloom_filter.contains(hash_value) {
So the main "problem" here, I think, is not pushing down the filter: bloom_filter.contains is probably about as expensive as hash_map.get, so building and probing the filter only adds overhead while providing no benefit.
It could be adaptive, e.g. when the filter is observed not to filter out anything, it could disable itself (forever, or for a couple of batches). A rough sketch of that idea is below.
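A minimal sketch of an adaptive probe, assuming hypothetical names and thresholds (nothing here is part of this PR): count how often the filter actually rejects a row, and stop consulting it once its observed selectivity is too low.

```rust
/// Hypothetical adaptive wrapper around a bloom-filter probe. If the filter
/// is not rejecting a useful fraction of rows, it turns itself off so the
/// probe side falls straight through to the hash table lookup.
struct AdaptiveBloomProbe<F: Fn(u64) -> bool> {
    contains: F,    // the underlying bloom-filter probe
    probed: u64,    // rows checked so far
    filtered: u64,  // rows the filter rejected
    enabled: bool,
}

impl<F: Fn(u64) -> bool> AdaptiveBloomProbe<F> {
    fn new(contains: F) -> Self {
        Self { contains, probed: 0, filtered: 0, enabled: true }
    }

    /// Returns true if the row might match and should go to the hash table.
    fn maybe_contains(&mut self, hash: u64) -> bool {
        if !self.enabled {
            return true; // filter disabled: always defer to the hash table
        }
        self.probed += 1;
        let hit = (self.contains)(hash);
        if !hit {
            self.filtered += 1;
        }
        // Illustrative policy: every 8192 probes, if fewer than 10% of rows
        // were filtered out, the probe is pure overhead, so disable it.
        if self.probed % 8192 == 0 && self.filtered * 10 < self.probed {
            self.enabled = false;
        }
        hit
    }
}
```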
Yes, it could be adaptive; however, what I'm saying is that because it is used directly in the hash join itself, there is no actual performance benefit.
It needs to be pushed down below repartition / aggregate / scan etc. to be of any benefit (#13054).
It needs to be pushed down below repartition / aggregate / scan etc. to be of any benefit
That's a good point.
If the dynamic filter were range-based and we could push it down to the file scan, it could allow file- and row-group-level pruning in Parquet. A sketch of that idea follows.
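For illustration, a hedged sketch of a range-based dynamic filter (all names are hypothetical and this does not show DataFusion's actual pruning API): collect the min/max of the build-side join keys and compare them against per-row-group column statistics at scan time.

```rust
/// Hypothetical range filter derived from the build side of a hash join.
/// A scan could compare this against Parquet row-group statistics to skip
/// row groups whose [min, max] cannot contain any matching key.
#[derive(Debug, Clone, Copy)]
struct KeyRange {
    min: i64,
    max: i64,
}

impl KeyRange {
    /// Build the range from the build-side join keys (None if empty).
    fn from_keys(keys: &[i64]) -> Option<Self> {
        let (first, rest) = keys.split_first()?;
        let mut range = KeyRange { min: *first, max: *first };
        for &k in rest {
            range.min = range.min.min(k);
            range.max = range.max.max(k);
        }
        Some(range)
    }

    /// True if a row group with these column statistics could contain a match.
    fn overlaps(&self, rg_min: i64, rg_max: i64) -> bool {
        rg_max >= self.min && rg_min <= self.max
    }
}
```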
Which issue does this PR close?
Closes #.
Rationale for this change
Research literature describes how bloom filters can be a nice filter to probe before performing a (possibly) more expensive hash table lookup. This may become more important if we introduce spilling support for hash join, where a page will need to be fetched from disk to perform a hash table lookup. I see larger changes like SIP #13054, but this is a much more naïve idea.
What changes are included in this PR?
Use fastbloom in the hash join executor to build a filter on the build side, then check the bloom filter first during probe. The bloom filter is not yet tuned for size (fixed at 8192 bytes, which may not be ideal) or number of hash functions, and my Rust is still pretty rudimentary.
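As a rough sketch of the build-then-probe flow (a hand-rolled bit array stands in for fastbloom so the example is self-contained; the PR itself uses the fastbloom crate inside the real hash-join code paths, and all names below are illustrative):

```rust
use std::collections::HashMap;

/// Minimal stand-in bloom filter: a fixed 8192-byte bit array with a single
/// "hash function" (the row hash itself, reduced modulo the number of bits).
struct SimpleBloom {
    bits: Vec<u64>, // 8192 bytes = 1024 u64 words
}

impl SimpleBloom {
    fn new() -> Self {
        Self { bits: vec![0u64; 1024] }
    }

    fn slot(&self, hash: u64) -> (usize, u64) {
        let bit = (hash as usize) % (self.bits.len() * 64);
        (bit / 64, 1u64 << (bit % 64))
    }

    /// Build side: insert every build-row hash while building the hash table.
    fn insert(&mut self, hash: u64) {
        let (word, mask) = self.slot(hash);
        self.bits[word] |= mask;
    }

    /// Probe side: cheap pre-check before the hash table lookup.
    fn contains(&self, hash: u64) -> bool {
        let (word, mask) = self.slot(hash);
        self.bits[word] & mask != 0
    }
}

/// Probe one hash: skip the (possibly more expensive) hash table lookup when
/// the bloom filter says the hash definitely is not on the build side.
fn probe(
    filter: &SimpleBloom,
    hash_map: &HashMap<u64, Vec<u32>>,
    hash: u64,
) -> Option<Vec<u32>> {
    if !filter.contains(hash) {
        return None;
    }
    hash_map.get(&hash).cloned()
}
```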
Are these changes tested?
Existing hash join tests.
Edit: Running TPC-H.
Are there any user-facing changes?
No.