
PARQUET-2366: Optimize random seek during rewriting #1174

Merged 10 commits into apache:master from optimize-random-seek on Oct 30, 2023

Conversation

@ConeyLiu (Contributor) commented Oct 17, 2023

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

The ColumnIndex, OffsetIndex, and BloomFilter are stored at the end of the file, so we need four random seeks when rewriting a single column chunk. We found this can hurt rewrite performance heavily for files with many columns (~1000). In this PR, we read the ColumnIndex, OffsetIndex, and BloomFilter into a cache to avoid the random seeks. We saw roughly a 60x performance improvement in our production environment for files with about one thousand columns.
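A minimal sketch of the idea (a hypothetical helper, not the PR's actual code): before rewriting the column chunks of a row group, read all of its ColumnIndex, OffsetIndex, and BloomFilter structures up front into per-column maps, grouped by index type so the reads proceed in file order, using the existing ParquetFileReader accessors:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

// Hypothetical per-row-group cache of index structures, keyed by column path.
final class BlockIndexCache {
  final Map<ColumnPath, ColumnIndex> columnIndexes = new HashMap<>();
  final Map<ColumnPath, OffsetIndex> offsetIndexes = new HashMap<>();
  final Map<ColumnPath, BloomFilter> bloomFilters = new HashMap<>();

  // Reads all indexes of one block grouped by index type, instead of jumping
  // between the column-index, offset-index, and bloom-filter sections once
  // per column chunk. A value may be null if the column has no such index.
  static BlockIndexCache readAll(ParquetFileReader reader, BlockMetaData block) throws IOException {
    BlockIndexCache cache = new BlockIndexCache();
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      cache.columnIndexes.put(chunk.getPath(), reader.readColumnIndex(chunk));
    }
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      cache.offsetIndexes.put(chunk.getPath(), reader.readOffsetIndex(chunk));
    }
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      cache.bloomFilters.put(chunk.getPath(), reader.readBloomFilter(chunk));
    }
    return cache;
  }
}

The rewriter can then serve all index lookups for that row group from the maps while copying column chunks, instead of seeking back to the index sections for every chunk.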

@ConeyLiu (Contributor, Author):

Hi, @wgtmac @gszadovszky please help to review this when you are free. Thanks a lot.

@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();

List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader, columnsInOrder, descriptorsMap);
@ConeyLiu (Contributor, Author) commented on this line:

I could add an option for this if anyone is concerned about memory usage. This caches the metadata for only one block at a time, which should be smaller than file writing, which needs to cache all blocks' metadata.

@wgtmac (Member):

Thanks for adding this! The change looks reasonable to me. I would suggest adding a new class specifically to cache and read these indexes. The new class would have methods like readBloomFilter(), readColumnIndex() and readOffsetIndex() for a specific column path, and could be configured to cache required columns in advance. With this new class, we can do more optimizations, including evicting consumed items from the cache and using async I/O to prefetch items. We can split these into separate patches. For the first one, we may simply add the new class without any caching (i.e. no behavior change). WDYT?
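A rough sketch of the kind of abstraction suggested here (the name ColumnIndexReader and its method signatures are hypothetical, not from the PR):

import java.io.IOException;
import java.util.Set;

import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

// Hypothetical reader/cache for per-column index structures. Implementations
// may read lazily, prefetch asynchronously, or cache a whole row group in
// advance; callers only rely on the value being ready when read.
interface ColumnIndexReader {

  // Declares which columns will be requested, so an implementation can
  // prefetch or cache them in advance.
  void setRequiredColumns(Set<ColumnPath> paths);

  ColumnIndex readColumnIndex(ColumnPath path) throws IOException;

  OffsetIndex readOffsetIndex(ColumnPath path) throws IOException;

  BloomFilter readBloomFilter(ColumnPath path) throws IOException;
}

The first patch could ship a trivial implementation that simply forwards to ParquetFileReader (no caching, no behavior change), with caching and prefetching strategies added behind the same interface later.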

@ConeyLiu (Contributor, Author):

@wgtmac thanks for your suggestions. Do you mean reading the indexes column by column to reduce the memory footprint? The suggested way should use less memory. From my understanding, the indexes are stored as follows:

// column index
block1_col1_column_index
...
block1_coln_column_index
block2_col1_column_index
...
block2_coln_column_index
...

// offset index
block1_col1_offset_index
...
block1_coln_offset_index
block2_col1_offset_index
...
block2_coln_offset_index
...

// bloom index
block1_col1_bloom_index
...
block1_coln_bloom_index
block2_col1_bloom_index
...
block2_coln_bloom_index
...

So the problem would be that we still need random seeks within a single row group (3 * number of columns). Async I/O should help with the random seek performance. With this PR, we only need 3 random seeks per row group (column pruning aside).

@wgtmac (Member):

Do you mean to read the indexes column by column to reduce memory footprint?

No, my suggested interface does not restrict any implementation detail; the indexes just need to be ready when readXXX() is called. You can still read all indexes at once (controlled by a config). We could also configure it to release any consumed index object to reduce the memory footprint.
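A hedged illustration of the "release consumed index objects" idea, reusing the hypothetical ColumnIndexReader interface sketched above (again, not the PR's code): if each index is read exactly once during rewriting, the cache can drop an entry as soon as it is handed out.

import java.io.IOException;
import java.util.Map;
import java.util.Set;

import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

// One-shot cache: each index is returned once and then released, so memory is
// freed as soon as the rewriter has consumed the entry.
final class ConsumingIndexReader implements ColumnIndexReader {
  private final Map<ColumnPath, ColumnIndex> columnIndexes;
  private final Map<ColumnPath, OffsetIndex> offsetIndexes;
  private final Map<ColumnPath, BloomFilter> bloomFilters;

  ConsumingIndexReader(
      Map<ColumnPath, ColumnIndex> columnIndexes,
      Map<ColumnPath, OffsetIndex> offsetIndexes,
      Map<ColumnPath, BloomFilter> bloomFilters) {
    this.columnIndexes = columnIndexes;
    this.offsetIndexes = offsetIndexes;
    this.bloomFilters = bloomFilters;
  }

  @Override
  public void setRequiredColumns(Set<ColumnPath> paths) {
    // No-op in this sketch: the maps were populated up front.
  }

  @Override
  public ColumnIndex readColumnIndex(ColumnPath path) throws IOException {
    return columnIndexes.remove(path); // remove() drops the reference after use
  }

  @Override
  public OffsetIndex readOffsetIndex(ColumnPath path) throws IOException {
    return offsetIndexes.remove(path);
  }

  @Override
  public BloomFilter readBloomFilter(ColumnPath path) throws IOException {
    return bloomFilters.remove(path);
  }
}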

@wgtmac (Member) left a comment:

Thanks for the quick response! I have left some comments. Let me know what you think.

@wgtmac (Member) left a comment:

Thanks for the change! This is looking good to me.

@wgtmac (Member) left a comment:

Thanks! LGTM.

@wgtmac (Member) commented Oct 26, 2023

cc @gszadovszky @ggershinsky @shangxinli if this interests you.

@wgtmac merged commit 514cc6c into apache:master on Oct 30, 2023
9 checks passed
@ConeyLiu (Contributor, Author):

Thanks @wgtmac @gszadovszky

@ConeyLiu deleted the optimize-random-seek branch on October 30, 2023 at 02:45.