Fix writing of parquet bloom filters #23604
Conversation
@@ -309,7 +309,7 @@ private DataStreams getDataStreams()
    getDataStreamsCalled = true;
Bloom filter writing logic is changed to always collect a bloom filter and
discard it at the end if a dictionary is present
So dictionaries and bloom filters are exclusive? Is there a potential perf degradation?
Current logic was failing to convert dictionary pages into bloom filter
when the fallback from dictionary to plain encoding happened after first page.
Do we know the reason?
The current logic assumed the fallback would only happen on the first page, i.e. that there would be only one page.
So dictionaries and bloom filters are exclusive?
The initial PR was a bit too restrictive; I've tweaked it to discard the bloom filter only if all pages are dictionary encoded, rather than on the mere presence of a dictionary.
Is there a potential perf degradation?
We will pay the extra cost of building a bloom filter even if the column turns out to be fully dictionary encoded. This applies only to columns with bloom filters configured, though.
The problem with the fallback-based approach was that it would be more complicated to make ValueWriter remember all previously dictionary-encoded pages and decode them back to plain values to populate the bloom filter. It's simpler to populate the bloom filter separately and discard it if it's redundant with the dictionary.
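The "populate unconditionally, discard at flush if redundant" approach described above can be sketched as follows. This is a hypothetical illustration, not the actual Trino writer: `SimpleColumnWriter` and its methods are invented names, and a `HashSet` stands in for a real bloom filter.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

class SimpleColumnWriter
{
    // Stand-in for a real bloom filter; a HashSet keeps the sketch simple.
    private final Set<Long> bloomCandidate = new HashSet<>();
    private boolean onlyDictionaryEncodedPages = true;

    void writePage(long[] values, boolean pageIsDictionaryEncoded)
    {
        // Populate unconditionally, independent of any dictionary-to-plain
        // fallback happening inside the values writer.
        for (long value : values) {
            bloomCandidate.add(value);
        }
        onlyDictionaryEncodedPages &= pageIsDictionaryEncoded;
    }

    // At flush time the filter is kept only if some page fell back to plain
    // encoding; a fully dictionary-encoded column already supports pruning.
    Optional<Set<Long>> finishAndGetBloomFilter()
    {
        return onlyDictionaryEncodedPages ? Optional.empty() : Optional.of(bloomCandidate);
    }
}
```

Because the filter is built independently of the encoding decisions, the writer never needs to decode earlier dictionary pages back into plain values after a late fallback.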
Thank you!
return ImmutableList.of(new BufferData(
        dataStreams.data(),
        dataStreams.dictionaryPageSize(),
        isOnlyDictionaryEncodingPages ? Optional.empty() : dataStreams.bloomFilter(),
Why can't we always have both? It seems like we should be able to perform bloom filtering even in the presence of dictionaries?
For fully dictionary encoded columns, the reader is already able to perform row-group pruning based on dictionary entries. Having a bloom filter doesn't give us anything extra.
For fully dictionary encoded columns, the reader is already able to perform row-group pruning based on dictionary entries
I thought that bloom filters could be more efficient at filtering for tactical (e.g. single value search) queries.
It's a lookup in a set vs a lookup in a bloom filter. The CPU difference will be barely noticeable compared to everything else that goes on in the parquet reader. Reading a bloom filter takes extra reads from the file and can produce false positives.
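The trade-off above can be made concrete with a toy bloom filter. This is purely illustrative (not Trino or Parquet code): the 64-bit size and the two hash choices are arbitrary assumptions, whereas a dictionary lookup is an exact set-membership test with no false positives.

```java
import java.util.BitSet;

class TinyBloomFilter
{
    // Deliberately tiny; real Parquet bloom filters are split-block filters
    // sized from an expected distinct-value count and false-positive rate.
    private final BitSet bits = new BitSet(64);

    void add(long value)
    {
        // Two cheap, arbitrary hash positions per value.
        bits.set(Math.floorMod(Long.hashCode(value), 64));
        bits.set(Math.floorMod(Long.hashCode(value * 31), 64));
    }

    boolean mightContain(long value)
    {
        // No false negatives, but an absent value may still hit set bits
        // left by other values: a false positive.
        return bits.get(Math.floorMod(Long.hashCode(value), 64))
                && bits.get(Math.floorMod(Long.hashCode(value * 31), 64));
    }
}
```

A dictionary-backed `Set.contains` answers membership exactly, so for a fully dictionary-encoded column the probabilistic filter adds I/O without adding pruning power.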
Description
The current logic failed to convert dictionary pages into a bloom filter
when the fallback from dictionary to plain encoding happened after the first page.
Bloom filter writing logic is changed to always collect a bloom filter and
discard it at the end if only dictionary-encoded pages are present. This avoids
the need to depend on the values writer fallback mechanism.
Additional context and related issues
Fixes #22701
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: