GH-41431: [C++][Parquet][Dataset] Fix repeated scan on encrypted dataset #41550

Merged
merged 2 commits into from
May 8, 2024

wgtmac (Member) commented May 6, 2024

Rationale for this change

When a Parquet dataset is reused to create multiple scanners, FileMetaData objects are cached to avoid parsing them again. However, this caused problems with encrypted files, because internal file decryptors were no longer created when cached FileMetaData objects were reused.
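
To make the failure mode concrete, here is a minimal toy model of it. It is only a sketch: `ToyDecryptor`, `ToyMetadata`, `ToyReader`, and their member functions are hypothetical stand-ins invented for illustration and are not Arrow or Parquet APIs. It assumes the decryptor is built only on the parsing path, as the description above suggests.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not Arrow/Parquet types.
struct ToyDecryptor {
  std::string key;
};

struct ToyMetadata {
  bool encrypted = false;
};

struct ToyReader {
  std::shared_ptr<ToyMetadata> metadata;
  std::shared_ptr<ToyDecryptor> decryptor;  // held only by the reader

  // First scan: parsing the footer builds both metadata and decryptor.
  static ToyReader OpenByParsing(bool encrypted, const std::string& key) {
    ToyReader reader;
    reader.metadata = std::make_shared<ToyMetadata>();
    reader.metadata->encrypted = encrypted;
    if (encrypted) {
      reader.decryptor = std::make_shared<ToyDecryptor>(ToyDecryptor{key});
    }
    return reader;
  }

  // Later scans: cached metadata is reused, so no parsing happens and
  // nothing on this path creates a decryptor.
  static ToyReader OpenFromCache(std::shared_ptr<ToyMetadata> cached) {
    assert(cached != nullptr);
    ToyReader reader;
    reader.metadata = std::move(cached);
    return reader;
  }

  // An encrypted file cannot be read without a decryptor.
  bool CanReadRowGroups() const {
    return !metadata->encrypted || decryptor != nullptr;
  }
};
```

In this model, a reader from `OpenByParsing(true, key)` can read row groups, while a second reader from `OpenFromCache` on the same metadata cannot, which mirrors the repeated-scan failure.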

What changes are included in this PR?

Expose file_decryptor from FileMetaData and set it properly.

Are these changes tested?

Yes, the test was modified to reproduce the issue and confirm that it is fixed.

Are there any user-facing changes?

No.

github-actions bot commented May 6, 2024

⚠️ GitHub issue #41431 has been automatically assigned in GitHub to PR creator.

wgtmac (Member Author) commented May 6, 2024

@pitrou @jorisvandenbossche Would you mind taking a look at this?

@@ -373,6 +373,7 @@ class SerializedFile : public ParquetFileReader::Contents {

   void set_metadata(std::shared_ptr<FileMetaData> metadata) {
     file_metadata_ = std::move(metadata);
+    file_decryptor_ = file_metadata_->file_decryptor();
Member

This is confusing: here we set file_decryptor_ from file_metadata_, but in ParseUnencryptedFileMetadata and ParseMetaDataOfEncryptedFileWithPlaintextFooter we set file_metadata_ from file_decryptor_. Can we please make this consistent?

Member Author

https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L775-L780

I think these go in two different directions when opening a Parquet reader:

  • In the ParseXXX functions, we parse the footer to create file_decryptor_ and file_metadata_. We need to set file_decryptor_ on file_metadata_ so that both SerializedFile and FileMetaData hold the decryptor.
  • In the set_metadata function, we already have the cached FileMetaData but are creating a new SerializedFile whose file_decryptor_ is null. Therefore we need to get file_decryptor_ from file_metadata_.

Does that make sense?
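
The two directions described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyDecryptor`, `ToyMetadata`, `ToyReader`, `ParseFooter`, and `set_decryptor` are hypothetical names, not the actual `SerializedFile` or `FileMetaData` code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not Arrow/Parquet types.
struct ToyDecryptor {
  std::string key;
};

class ToyMetadata {
 public:
  void set_decryptor(std::shared_ptr<ToyDecryptor> decryptor) {
    decryptor_ = std::move(decryptor);
  }
  const std::shared_ptr<ToyDecryptor>& decryptor() const { return decryptor_; }

 private:
  std::shared_ptr<ToyDecryptor> decryptor_;
};

class ToyReader {
 public:
  // Direction 1 (parse path): the reader creates the decryptor while
  // parsing the footer and hands a reference to the metadata.
  void ParseFooter(const std::string& key) {
    decryptor_ = std::make_shared<ToyDecryptor>(ToyDecryptor{key});
    metadata_ = std::make_shared<ToyMetadata>();
    metadata_->set_decryptor(decryptor_);
  }

  // Direction 2 (cache path): the metadata already exists, so the
  // reader pulls the decryptor out of it instead of creating one.
  void set_metadata(std::shared_ptr<ToyMetadata> metadata) {
    assert(metadata != nullptr);
    metadata_ = std::move(metadata);
    decryptor_ = metadata_->decryptor();
  }

  const std::shared_ptr<ToyMetadata>& metadata() const { return metadata_; }
  const std::shared_ptr<ToyDecryptor>& decryptor() const { return decryptor_; }

 private:
  std::shared_ptr<ToyMetadata> metadata_;
  std::shared_ptr<ToyDecryptor> decryptor_;
};
```

With this shape, a second `ToyReader` that receives the first reader's metadata through `set_metadata` ends up sharing the same decryptor object, at the cost of keeping the pointer in two places.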

Member

I think this raises the question: do we need a file_decryptor_ field here? We could just get it from the metadata every time.

Member Author

Good question! Let me consolidate them.

Member Author

I have removed file_decryptor_ from SerializedFile and SerializedRowGroup. Let me know what you think, @pitrou.
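
As a rough sketch of what this consolidation means, the toy model below keeps the decryptor only on the metadata and lets the reader delegate to it. The names `ToyDecryptor`, `ToyMetadata`, and `ToyReader` are hypothetical and chosen for illustration; the real change to `SerializedFile` and `SerializedRowGroup` may differ in detail.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not Arrow/Parquet types.
struct ToyDecryptor {
  std::string key;
};

class ToyMetadata {
 public:
  explicit ToyMetadata(std::shared_ptr<ToyDecryptor> decryptor)
      : decryptor_(std::move(decryptor)) {}

  const std::shared_ptr<ToyDecryptor>& decryptor() const { return decryptor_; }

 private:
  std::shared_ptr<ToyDecryptor> decryptor_;
};

class ToyReader {
 public:
  explicit ToyReader(std::shared_ptr<ToyMetadata> metadata)
      : metadata_(std::move(metadata)) {
    assert(metadata_ != nullptr);
  }

  // No decryptor member here: the metadata is the single owner, so a
  // reader built from cached metadata cannot end up with a null copy.
  const std::shared_ptr<ToyDecryptor>& decryptor() const {
    return metadata_->decryptor();
  }

 private:
  std::shared_ptr<ToyMetadata> metadata_;
};
```

Because there is only one place the decryptor lives, any number of readers built from the same cached metadata see the same decryptor, and there is no second field to keep in sync.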

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on May 6, 2024
mapleFU (Member) left a comment

+1 for this. Subsetting or caching file metadata may also be a normal Parquet use case, in which the metadata is built differently than when it is read directly from the file.

wgtmac (Member Author) commented May 8, 2024

I'll merge this. Thanks @mapleFU and @pitrou for the review!

wgtmac merged commit 5385926 into apache:main on May 8, 2024
33 of 38 checks passed
wgtmac removed the "awaiting committer review" label on May 8, 2024

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5385926.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

wgtmac (Member Author) commented May 9, 2024

@raulcd Should we port this to 16.1.0?

raulcd (Member) commented May 9, 2024

> @raulcd Should we port this to 16.1.0?

I'll try to add it

raulcd pushed a commit that referenced this pull request May 9, 2024
…set (#41550)

* GitHub Issue: #41431

Authored-by: Gang Wu
Signed-off-by: Gang Wu
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…d dataset (apache#41550)