
PARQUET-2366: Optimize random seek during rewriting #1174

Merged 10 commits into apache:master from optimize-random-seek on Oct 30, 2023

Conversation

@ConeyLiu (Contributor) commented Oct 17, 2023

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

The ColumnIndex, OffsetIndex, and BloomFilter are stored at the end of the file, so we need four random seeks when rewriting a single column chunk. We found this can hurt rewrite performance heavily for files with many columns (~1000). In this PR, we read the ColumnIndex, OffsetIndex, and BloomFilter into a cache to avoid the random seeks. We saw roughly a 60x performance improvement in our production environment for files with about one thousand columns.
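A minimal sketch of the idea (a hypothetical helper, not the PR's actual code): before rewriting the column chunks of a row group, read all of its ColumnIndex, OffsetIndex, and BloomFilter structures up front into per-column maps, grouped by index type so the reads proceed in file order, using the existing ParquetFileReader accessors:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

// Hypothetical per-row-group cache of index structures, keyed by column path.
final class BlockIndexCache {
  final Map<ColumnPath, ColumnIndex> columnIndexes = new HashMap<>();
  final Map<ColumnPath, OffsetIndex> offsetIndexes = new HashMap<>();
  final Map<ColumnPath, BloomFilter> bloomFilters = new HashMap<>();

  // Reads all indexes of one block grouped by index type, instead of jumping
  // between the column-index, offset-index, and bloom-filter sections once
  // per column chunk. A value may be null if the column has no such index.
  static BlockIndexCache readAll(ParquetFileReader reader, BlockMetaData block) throws IOException {
    BlockIndexCache cache = new BlockIndexCache();
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      cache.columnIndexes.put(chunk.getPath(), reader.readColumnIndex(chunk));
    }
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      cache.offsetIndexes.put(chunk.getPath(), reader.readOffsetIndex(chunk));
    }
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      cache.bloomFilters.put(chunk.getPath(), reader.readBloomFilter(chunk));
    }
    return cache;
  }
}

The rewriter can then serve all index lookups for that row group from the maps while copying column chunks, instead of seeking back to the index sections for every chunk.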

@ConeyLiu (Contributor, Author):

Hi, @wgtmac @gszadovszky please help to review this when you are free. Thanks a lot.

@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();

List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader, columnsInOrder, descriptorsMap);
@ConeyLiu (Contributor, Author) commented on this line:

I could add an option for this if anyone is concerned about memory usage. This caches the metadata for only one block at a time, which should be smaller than file writing, which needs to cache all blocks' metadata.

@wgtmac (Member):

Thanks for adding this! The change looks reasonable to me. I would suggest adding a new class specifically to cache and read these indexes. The new class would have methods like readBloomFilter(), readColumnIndex() and readOffsetIndex() for a specific column path, and could be configured to cache required columns in advance. With this new class, we can do more optimizations, including evicting consumed items from the cache and using async I/O to prefetch items. We can split these into separate patches. For the first one, we may simply add the new class without any caching (i.e. no behavior change). WDYT?
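A rough sketch of the kind of abstraction suggested here (the name ColumnIndexReader and its method signatures are hypothetical, not from the PR):

import java.io.IOException;
import java.util.Set;

import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

// Hypothetical reader/cache for per-column index structures. Implementations
// may read lazily, prefetch asynchronously, or cache a whole row group in
// advance; callers only rely on the value being ready when read.
interface ColumnIndexReader {

  // Declares which columns will be requested, so an implementation can
  // prefetch or cache them in advance.
  void setRequiredColumns(Set<ColumnPath> paths);

  ColumnIndex readColumnIndex(ColumnPath path) throws IOException;

  OffsetIndex readOffsetIndex(ColumnPath path) throws IOException;

  BloomFilter readBloomFilter(ColumnPath path) throws IOException;
}

The first patch could ship a trivial implementation that simply forwards to ParquetFileReader (no caching, no behavior change), with caching and prefetching strategies added behind the same interface later.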

@ConeyLiu (Contributor, Author):

@wgtmac thanks for your suggestions. Do you mean reading the indexes column by column to reduce the memory footprint? The suggested way should use less memory. From my understanding, the indexes are stored as follows:

// column index
block1_col1_column_index
...
block1_coln_column_index
block2_col1_column_index
...
block2_coln_column_index
...

// offset index
block1_col1_offset_index
...
block1_coln_offset_index
block2_col1_offset_index
...
block2_coln_offset_index
...

// bloom index
block1_col1_bloom_index
...
block1_coln_bloom_index
block2_col1_bloom_index
...
block2_coln_bloom_index
...

So the problem would be that we still need random seeks within a single row group (3 * number of columns). Async I/O should help with the random seek performance. With this PR, we only need 3 random seeks per row group (column pruning aside).

@wgtmac (Member):

Do you mean to read the indexes column by column to reduce memory footprint?

No, my suggested interface does not restrict any implementation detail; the indexes just need to be ready when readXXX() is called. You can still read all indexes at once (controlled by a config). We could also configure it to release any consumed index object to reduce the memory footprint.
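A hedged illustration of the "release consumed index objects" idea, reusing the hypothetical ColumnIndexReader interface sketched above (again, not the PR's code): if each index is read exactly once during rewriting, the cache can drop an entry as soon as it is handed out.

import java.io.IOException;
import java.util.Map;
import java.util.Set;

import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

// One-shot cache: each index is returned once and then released, so memory is
// freed as soon as the rewriter has consumed the entry.
final class ConsumingIndexReader implements ColumnIndexReader {
  private final Map<ColumnPath, ColumnIndex> columnIndexes;
  private final Map<ColumnPath, OffsetIndex> offsetIndexes;
  private final Map<ColumnPath, BloomFilter> bloomFilters;

  ConsumingIndexReader(
      Map<ColumnPath, ColumnIndex> columnIndexes,
      Map<ColumnPath, OffsetIndex> offsetIndexes,
      Map<ColumnPath, BloomFilter> bloomFilters) {
    this.columnIndexes = columnIndexes;
    this.offsetIndexes = offsetIndexes;
    this.bloomFilters = bloomFilters;
  }

  @Override
  public void setRequiredColumns(Set<ColumnPath> paths) {
    // No-op in this sketch: the maps were populated up front.
  }

  @Override
  public ColumnIndex readColumnIndex(ColumnPath path) throws IOException {
    return columnIndexes.remove(path); // remove() drops the reference after use
  }

  @Override
  public OffsetIndex readOffsetIndex(ColumnPath path) throws IOException {
    return offsetIndexes.remove(path);
  }

  @Override
  public BloomFilter readBloomFilter(ColumnPath path) throws IOException {
    return bloomFilters.remove(path);
  }
}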

@wgtmac (Member) left a comment:

Thanks for the quick response! I have left some comments. Let me know what you think.

@wgtmac (Member) left a comment:

Thanks for the change! This is looking good to me.

@wgtmac (Member) left a comment:

Thanks! LGTM.

@wgtmac (Member) commented Oct 26, 2023

cc @gszadovszky @ggershinsky @shangxinli if this interests you.

@wgtmac merged commit 514cc6c into apache:master on Oct 30, 2023
9 checks passed
@ConeyLiu (Contributor, Author):

Thanks @wgtmac @gszadovszky

@ConeyLiu deleted the optimize-random-seek branch on October 30, 2023 at 02:45.