Fix writing of parquet bloom filters #23604
Conversation
@@ -309,7 +309,7 @@ private DataStreams getDataStreams()
    getDataStreamsCalled = true;
Bloom filter writing logic is changed to always collect a bloom filter and
discard it at the end if a dictionary is present
So dictionaries and bloom filters are exclusive? Is there a potential perf degradation?
Current logic was failing to convert dictionary pages into bloom filter
when the fallback from dictionary to plain encoding happened after first page.
Do we know the reason?
The current logic assumed the fallback would only happen on the first page, i.e. that there would be only one page.
So dictionaries and bloom filters are exclusive?
The initial PR was a bit too restrictive; I've tweaked it to discard the bloom filter only if all pages are dictionary encoded, rather than on the mere presence of a dictionary.
Is there a potential perf degradation?
We will pay the extra cost of building a bloom filter even if the column turns out to be fully dictionary encoded. This applies only to columns with bloom filters configured, though.
The problem with the fallback-based approach was that it would be more complicated to make ValueWriter remember all previously dictionary-encoded pages and decode them back to plain values to populate the bloom filter. It's simpler to populate the bloom filter separately and discard it if it's redundant with the dictionary.
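The "populate unconditionally, discard at flush if redundant" approach described above can be sketched as follows. This is a hypothetical illustration, not the actual Trino writer: `SimpleColumnWriter` and its methods are invented names, and a `HashSet` stands in for a real bloom filter.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

class SimpleColumnWriter
{
    // Stand-in for a real bloom filter; a HashSet keeps the sketch simple.
    private final Set<Long> bloomCandidate = new HashSet<>();
    private boolean onlyDictionaryEncodedPages = true;

    void writePage(long[] values, boolean pageIsDictionaryEncoded)
    {
        // Populate unconditionally, independent of any dictionary-to-plain
        // fallback happening inside the values writer.
        for (long value : values) {
            bloomCandidate.add(value);
        }
        onlyDictionaryEncodedPages &= pageIsDictionaryEncoded;
    }

    // At flush time the filter is kept only if some page fell back to plain
    // encoding; a fully dictionary-encoded column already supports pruning.
    Optional<Set<Long>> finishAndGetBloomFilter()
    {
        return onlyDictionaryEncodedPages ? Optional.empty() : Optional.of(bloomCandidate);
    }
}
```

Because the filter is built independently of the encoding decisions, the writer never needs to decode earlier dictionary pages back into plain values after a late fallback.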
Thank you!
return ImmutableList.of(new BufferData(
        dataStreams.data(),
        dataStreams.dictionaryPageSize(),
        isOnlyDictionaryEncodingPages ? Optional.empty() : dataStreams.bloomFilter(),
Why can't we always have both? It seems like we should be able to perform bloom filtering even in the presence of dictionaries?
For fully dictionary encoded columns, the reader is already able to perform row-group pruning based on dictionary entries. Having a bloom filter doesn't give us anything extra.
For fully dictionary encoded columns, the reader is already able to perform row-group pruning based on dictionary entries
I thought that bloom filters could be more efficient at filtering for tactical (e.g. single value search) queries.
It's a lookup in a set vs a lookup in a bloom filter. The CPU difference will be barely noticeable compared to everything else that goes on in the parquet reader. Reading a bloom filter takes extra reads from the file and can produce false positives.
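The trade-off above can be made concrete with a toy bloom filter. This is purely illustrative (not Trino or Parquet code): the 64-bit size and the two hash choices are arbitrary assumptions, whereas a dictionary lookup is an exact set-membership test with no false positives.

```java
import java.util.BitSet;

class TinyBloomFilter
{
    // Deliberately tiny; real Parquet bloom filters are split-block filters
    // sized from an expected distinct-value count and false-positive rate.
    private final BitSet bits = new BitSet(64);

    void add(long value)
    {
        // Two cheap, arbitrary hash positions per value.
        bits.set(Math.floorMod(Long.hashCode(value), 64));
        bits.set(Math.floorMod(Long.hashCode(value * 31), 64));
    }

    boolean mightContain(long value)
    {
        // No false negatives, but an absent value may still hit set bits
        // left by other values: a false positive.
        return bits.get(Math.floorMod(Long.hashCode(value), 64))
                && bits.get(Math.floorMod(Long.hashCode(value * 31), 64));
    }
}
```

A dictionary-backed `Set.contains` answers membership exactly, so for a fully dictionary-encoded column the probabilistic filter adds I/O without adding pruning power.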
Description
The current logic failed to convert dictionary pages into a bloom filter
when the fallback from dictionary to plain encoding happened after the first page.
Bloom filter writing logic is changed to always collect a bloom filter and
discard it at the end if only dictionary-encoded pages are present. This avoids
the need to depend on the values writer fallback mechanism.
Additional context and related issues
Fixes #22701
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: