Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Runtime bloom filter in hash join operator #13147

Closed
wants to merge 6 commits into from

Conversation

mbutrovich
Copy link

@mbutrovich mbutrovich commented Oct 28, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Research literature describes how bloom filters can be a nice filter to probe before performing a (possibly) more expensive hash table lookup. This may become more important if we introduce spilling support for hash join, where a page will need to be fetched from disk to perform a hash table lookup. I see larger changes like SIP #13054, but this is a much more naïve idea.

What changes are included in this PR?

Use fastbloom in hash join executor to build a filter on the build side, then during probe check the bloom filter first. Bloom filter is not tuned for size yet (fixed to 8192 bytes, which may not be ideal) or number of hash functions, and my Rust is still pretty rudimentary.

Are these changes tested?

Existing hash join tests.

Edit: Running TPC-H.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-expr Physical Expressions label Oct 28, 2024
@mbutrovich mbutrovich changed the title feat: runtime bloom filter in hash join operator feat: Runtime bloom filter in hash join operator Oct 28, 2024
@mbutrovich
Copy link
Author

JSON files got mixed up, performance is not thrilling yet. Closing for now.

@mbutrovich mbutrovich closed this Oct 28, 2024
@Dandandan
Copy link
Contributor

@mbutrovich was already a bit suspicious...could you share the current performance?

For it to work great I think one probably needs to push the bloom filter down towards the other side as much as possible.

@Dandandan
Copy link
Contributor

Maybe pushing it below RepartitionExec already would show some improvement, if the bloomfilter is fast enough 🤔

@findepi
Copy link
Member

findepi commented Oct 31, 2024

In Trino a similar feature is called Dynamic Filters, but it doesn't use Bloom filters for filtering.
AFAIK, Bloom filters were considered (that was the initial idea), but didn't end up being the chosen implementation strategy. The implementation tracks "ranges" of allowed values instead. I am not very familiar with the reasoning, but it wasn't incidental.
@mbutrovich would it be worthwhile to familiarize with Dynamic Filters in Trino and reasoning behind design decisions? Their slack is probably best place to ask.

@Dandandan
Copy link
Contributor

Dandandan commented Oct 31, 2024

Thanks @findepi

Another approach in a single node environment could be re-using the hashtable from the join as filter, this way no extra memory is created and no overhead for building the filter (in addition to being more accurate). @mbutrovich that might be something worth trying?

The main benefit from a bloom filter is saving memory - that might make sense in a distributed environment or when saving a filter to disk, but maybe not so much for pushing down a filter from hash join.

@@ -246,6 +268,11 @@ pub trait JoinHashMapType {
let next_chain = self.get_list();
for (row_idx, hash_value) in iter {
// Get the hash and find it in the index
if let Some(bloom_filter) = self.get_bloom_filter() {
if !bloom_filter.contains(hash_value) {
Copy link
Contributor

@Dandandan Dandandan Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the main "problem" here I think is not pushing down the filter, bloom_filter.contains probably is about as expensive as hash_map.get, so only more overhead is created to create / probe the filter while having no benefit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could be adaptive. eg when the filter is observed not to filter out stuff, it could disable itself (for ever, or "for couple batches")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it could be adaptive, however what I'm saying is that because it is directly used in hashjoin itself, there is no actual performance benefit.
It needs to be pushed down below repartition / aggregate / scan etc. to be of any benefit (#13054)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be pushed down below repartition / aggregate / scan etc. to be of any benefit

That's a good point.

if the dynamic filter was range-based and we could push it down to the file scan, it could allow file- and row-group-level pruning in Parquet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants