[C++][Compute] Introduce Bloom filters to hash join #30736

asfimport · 2022-01-03T20:37:35Z

Bloom filters are a common way to improve the performance of hash joins in which many rows on the probe side have no match on the build side. A Bloom filter is cheaper to probe than the hash join's hash table, so it can often eliminate such rows early in the processing pipeline at lower cost. The trade-off is that it returns false positives for a reasonably small percentage of inputs.

This task introduces a register-blocked Bloom filter data structure: a practical variant of the Bloom filter concept that is specifically tuned for query processing related to hash joins, and that is both more space-efficient and less costly than using a hash table for filtering. The data structure should support parallel construction from a vector of exec batches accumulated in memory, as well as vectorized lookup and filtering for a single exec batch. It should not impose a limit on the size of the Bloom filter (the number of inserted hashes), which requires using 64-bit hashes for larger inputs. It should be verified that build and probe costs are reasonably low and that the false positive rate is at most a few percent (which should be acceptable for query processing).
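To make the idea concrete, here is a minimal single-threaded sketch of a register-blocked Bloom filter. It is not the implementation from the pull request: the class name `BlockedBloomFilterSketch`, the helper `Mix64`, the choice of four bits per key, and the sizing of roughly 16 bits per expected key are all assumptions made for illustration. Each 64-bit hash selects one 64-bit block (which fits in a register) and sets a few bits inside that block, so a probe touches a single word of memory. Parallel construction is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: a 64-bit mixer (splitmix64-style finalizer) that
// stands in for whatever hash function the join would actually use.
inline uint64_t Mix64(uint64_t x) {
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// Hypothetical sketch of a register-blocked Bloom filter.
class BlockedBloomFilterSketch {
 public:
  // Assumed sizing: about 16 bits per expected key, with the number of
  // 64-bit blocks rounded up to a power of two.
  explicit BlockedBloomFilterSketch(std::size_t expected_keys) {
    std::size_t num_blocks = 1;
    while (num_blocks * 64 < expected_keys * 16) num_blocks <<= 1;
    blocks_.assign(num_blocks, 0);
  }

  // Sets the key's bits in the single block chosen by its hash.
  void Insert(uint64_t hash) { blocks_[BlockIndex(hash)] |= Mask(hash); }

  // Returns false only if the hash was definitely never inserted.
  bool MayContain(uint64_t hash) const {
    const uint64_t mask = Mask(hash);
    return (blocks_[BlockIndex(hash)] & mask) == mask;
  }

  // Batch probe: returns the indices of rows whose hashes may match.
  std::vector<std::size_t> Filter(const std::vector<uint64_t>& hashes) const {
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < hashes.size(); ++i) {
      if (MayContain(hashes[i])) selected.push_back(i);
    }
    return selected;
  }

 private:
  // Four 6-bit fields from the low 24 bits each pick one bit of the block.
  static uint64_t Mask(uint64_t hash) {
    uint64_t mask = 0;
    for (int i = 0; i < 4; ++i) {
      mask |= uint64_t{1} << ((hash >> (6 * i)) & 63);
    }
    return mask;
  }

  // The remaining high bits pick the block, so large filters need 64-bit hashes.
  std::size_t BlockIndex(uint64_t hash) const {
    return static_cast<std::size_t>(hash >> 24) & (blocks_.size() - 1);
  }

  std::vector<uint64_t> blocks_;
};
```

In a join, the build side would call `Insert` for every key hash, and the probe side would call `Filter` on each batch of hashes and look up only the surviving rows in the hash table. Because the mask bits and the block index come from disjoint parts of the hash, a 32-bit hash would leave only 8 bits for block selection in this layout, which illustrates why larger inputs call for 64-bit hashes.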

Reporter: Michal Nowakiewicz / @michalursa
Assignee: Michal Nowakiewicz / @michalursa

Related issues:

[C++][Compute] Implement Bloom filter pushdown between hash joins (blocks)
[C++] Query engine umbrella issue (is a child of)

PRs and other links:

GitHub Pull Request #12067

Note: This issue was originally created as ARROW-15239. Please see the migration documentation for further details.

asfimport · 2022-01-14T12:20:24Z

Krisztian Szucs / @kszucs:
Since the PR is in draft I'm postponing it to 8.0

asfimport · 2022-03-24T08:19:58Z

Weston Pace / @westonpace:
Issue resolved by pull request 12067
#12067

asfimport closed this as completed Mar 24, 2022

This was referenced Jan 11, 2023

[C++] Query engine umbrella issue #28385

Open

[C++][Compute] Implement Bloom filter pushdown between hash joins #30973

Closed
