Fix writing of parquet bloom filters #23604

Merged: 3 commits merged into trinodb:master on Sep 30, 2024

Conversation

@raunaqmorarka (Member) commented on Sep 30, 2024

Description

The current logic was failing to convert dictionary pages into the bloom filter
when the fallback from dictionary to plain encoding happened after the first page.
The bloom filter writing logic is changed to always collect a bloom filter and
discard it at the end if only dictionary-encoded pages are present. This avoids
the need to depend on the values writer fallback mechanism.
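
A minimal sketch of the approach (illustrative only; the class and method names below are not the actual Trino writer types, and a HashSet stands in for a real bloom filter):

    import java.util.HashSet;
    import java.util.Optional;
    import java.util.Set;

    class BloomFilterColumnWriterSketch
    {
        // Stand-in for a real bloom filter, to keep the sketch self-contained.
        private final Set<Long> bloomFilter = new HashSet<>();
        private boolean allPagesDictionaryEncoded = true;

        void writePage(long[] values, boolean pageIsDictionaryEncoded)
        {
            // Populate the filter for every page, regardless of encoding, so a
            // fallback from dictionary to plain encoding after the first page
            // does not lose values from earlier pages.
            for (long value : values) {
                bloomFilter.add(value);
            }
            allPagesDictionaryEncoded &= pageIsDictionaryEncoded;
        }

        Optional<Set<Long>> finish()
        {
            // If every page stayed dictionary encoded, the dictionary already
            // supports row-group pruning, so the collected filter is discarded.
            return allPagesDictionaryEncoded ? Optional.empty() : Optional.of(bloomFilter);
        }
    }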

Additional context and related issues

Fixes #22701

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Delta, Iceberg
* Fix writing of bloom filter columns in parquet files. ({issue}`22701`)

@@ -309,7 +309,7 @@ private DataStreams getDataStreams()
getDataStreamsCalled = true;
Member commented:

Bloom filter writing logic is changed to always collect a bloom filter and
discard it at the end if dictionary is present

So dictionaries and bloom filters are exclusive? Is there a potential perf degradation?

Current logic was failing to convert dictionary pages into bloom filter
when the fallback from dictionary to plain encoding happened after first page.

Do we know the reason?

Contributor commented:

The current logic assumed that the fallback would only happen on the first page, or that there would be only one page.

Member Author commented:

So dictionaries and bloom filters are exclusive?

The initial PR was a bit too restrictive; I've tweaked it to discard the bloom filter only if all pages are dictionary encoded, rather than just on the presence of a dictionary.

Is there a potential perf degradation?

We will pay the extra cost of building a bloom filter even if the column turns out to be fully dictionary encoded. This will only be on columns configured for bloom filters, though.

The problem with the fallback-based approach was that it would be more complicated to make ValueWriter remember all previous dictionary-encoded pages and decode them back to plain values to populate the bloom filter. It's simpler to just populate the bloom filter separately and discard it if it's redundant with the dictionary.
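
For contrast, a rough sketch of what the fallback-based approach would have required (hypothetical names, not Trino's actual ValueWriter API): every dictionary-encoded page would have to be retained so its ids could be decoded back to values once the encoding falls back.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class FallbackBasedWriterSketch
    {
        // All previously written dictionary-encoded pages must be retained,
        // because a later fallback needs their plain values for the bloom filter.
        private final List<int[]> bufferedDictionaryIdPages = new ArrayList<>();
        private long[] dictionary = new long[0];
        private final Set<Long> bloomFilter = new HashSet<>(); // stand-in for a real bloom filter

        void writeDictionaryEncodedPage(int[] dictionaryIds, long[] currentDictionary)
        {
            bufferedDictionaryIdPages.add(dictionaryIds);
            dictionary = currentDictionary;
        }

        void onFallbackToPlainEncoding()
        {
            // Replay every buffered page, decoding ids back to plain values.
            for (int[] page : bufferedDictionaryIdPages) {
                for (int id : page) {
                    bloomFilter.add(dictionary[id]);
                }
            }
            bufferedDictionaryIdPages.clear();
        }
    }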

@jkylling (Contributor) left a comment:

Thank you!

return ImmutableList.of(new BufferData(
        dataStreams.data(),
        dataStreams.dictionaryPageSize(),
        isOnlyDictionaryEncodingPages ? Optional.empty() : dataStreams.bloomFilter(),
Member commented:

Why can't we always have both? It seems like we should be able to perform bloom filtering even in the presence of dictionaries?

Member Author commented:

For fully dictionary encoded columns, the reader is already able to perform row-group pruning based on dictionary entries. Having a bloom filter doesn't give us anything extra.
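
To illustrate the point, a minimal sketch of dictionary-based row-group pruning (illustrative only, not Trino's actual parquet reader code): when the whole column chunk is dictionary encoded, the dictionary page is an exact set of the values present, so an equality predicate can rule out the row group without a bloom filter.

    import java.util.Set;

    final class DictionaryPruningSketch
    {
        private DictionaryPruningSketch() {}

        static boolean canSkipRowGroup(Set<Long> dictionaryValues, long predicateValue)
        {
            // Exact membership test over the dictionary entries; unlike a
            // bloom filter, this has no false positives and needs no extra reads.
            return !dictionaryValues.contains(predicateValue);
        }
    }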

Member commented:

For fully dictionary encoded columns, the reader is already able to perform row-group pruning based on dictionary entries

I thought that bloom filters could be more efficient at filtering for tactical queries (e.g. single-value searches).

Member Author commented:

It's a lookup in a set vs. a lookup in a bloom filter. The CPU difference will be barely noticeable compared to everything else that goes on in the parquet reader. Reading the bloom filter takes extra reads from the file and potentially has false positives.

@raunaqmorarka merged commit 11dea27 into trinodb:master on Sep 30, 2024
92 of 93 checks passed
@raunaqmorarka deleted the bloom-fix branch on September 30, 2024 18:30
@github-actions bot added this to the 460 milestone on Sep 30, 2024
@mosabua mentioned this pull request on Oct 1, 2024
Development

Successfully merging this pull request may close these issues.

iceberg - dictionaryPagesSize and bloomFilter cannot both be set
3 participants