GH-41431: [C++][Parquet][Dataset] Fix repeated scan on encrypted dataset #41550
@pitrou @jorisvandenbossche Would you mind taking a look at this?
cpp/src/parquet/file_reader.cc (outdated)
@@ -373,6 +373,7 @@ class SerializedFile : public ParquetFileReader::Contents {

   void set_metadata(std::shared_ptr<FileMetaData> metadata) {
     file_metadata_ = std::move(metadata);
+    file_decryptor_ = file_metadata_->file_decryptor();
This is confusing: here we set `file_decryptor_` from `file_metadata_`, but in `ParseUnencryptedFileMetadata` and `ParseMetaDataOfEncryptedFileWithPlaintextFooter` we set `file_metadata_` from `file_decryptor_`. Can we please make this consistent?
https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L775-L780
I think these go in two different directions when opening a Parquet reader:
- For the `ParseXXX` functions, we parse the footer to create `file_decryptor_` and `file_metadata_`. We need to set `file_decryptor_` on `file_metadata_` so that both `SerializedFile` and `FileMetaData` have a copy of the decryptor.
- For the `set_metadata` function, we already have the cached `FileMetaData` but need to create a `SerializedFile` whose `file_decryptor_` is null. Therefore we need to get `file_decryptor_` from `file_metadata_`.
Does that make sense?
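For readers following the thread, the two directions described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Decryptor`, `Metadata`, `Reader`, and `ParseFooter` are hypothetical stand-ins, not the real `InternalFileDecryptor`, `FileMetaData`, `SerializedFile`, or `ParseXXX` code in Arrow.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for illustration; NOT the real Arrow/Parquet classes.
struct Decryptor {};

class Metadata {
 public:
  void set_file_decryptor(std::shared_ptr<Decryptor> d) { decryptor_ = std::move(d); }
  const std::shared_ptr<Decryptor>& file_decryptor() const { return decryptor_; }

 private:
  std::shared_ptr<Decryptor> decryptor_;
};

class Reader {
 public:
  // Direction 1: parse the footer, create both objects, then hand the
  // decryptor to the metadata so that both hold a copy.
  void ParseFooter() {
    file_decryptor_ = std::make_shared<Decryptor>();
    file_metadata_ = std::make_shared<Metadata>();
    file_metadata_->set_file_decryptor(file_decryptor_);
  }

  // Direction 2: reuse cached metadata. The reader's own decryptor starts
  // out null, so it is taken from the metadata instead.
  void set_metadata(std::shared_ptr<Metadata> metadata) {
    file_metadata_ = std::move(metadata);
    file_decryptor_ = file_metadata_->file_decryptor();
  }

  const std::shared_ptr<Metadata>& metadata() const { return file_metadata_; }
  const std::shared_ptr<Decryptor>& file_decryptor() const { return file_decryptor_; }

 private:
  std::shared_ptr<Metadata> file_metadata_;
  std::shared_ptr<Decryptor> file_decryptor_;
};
```

In this sketch, a second `Reader` that calls `set_metadata(first.metadata())` ends up sharing the decryptor that the first reader created while parsing the footer.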
I think this raises the question: do we need a `file_decryptor_` field here? We could just get it from the metadata every time.
Good question! Let me consolidate them.
I have removed `file_decryptor_` from `SerializedFile` and `SerializedRowGroup`. Let me know WDYT. @pitrou
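One way the consolidated design could look, as a rough sketch only: the reader keeps no decryptor field and always asks the metadata for it. As before, `Decryptor`, `Metadata`, `Reader`, and `ParseFooter` are hypothetical names used for illustration and do not reflect the actual code in this PR.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for illustration; NOT the real Arrow/Parquet classes.
struct Decryptor {};

class Metadata {
 public:
  void set_file_decryptor(std::shared_ptr<Decryptor> d) { decryptor_ = std::move(d); }
  const std::shared_ptr<Decryptor>& file_decryptor() const { return decryptor_; }

 private:
  std::shared_ptr<Decryptor> decryptor_;
};

class Reader {
 public:
  // Fresh open: the decryptor is created once and stored only on the metadata.
  void ParseFooter() {
    auto decryptor = std::make_shared<Decryptor>();
    file_metadata_ = std::make_shared<Metadata>();
    file_metadata_->set_file_decryptor(std::move(decryptor));
  }

  // Cached open: nothing to copy, because the metadata already owns the decryptor.
  void set_metadata(std::shared_ptr<Metadata> metadata) {
    file_metadata_ = std::move(metadata);
  }

  const std::shared_ptr<Metadata>& metadata() const { return file_metadata_; }

  // Single source of truth: look the decryptor up through the metadata.
  std::shared_ptr<Decryptor> file_decryptor() const {
    if (!file_metadata_) {
      return nullptr;
    }
    return file_metadata_->file_decryptor();
  }

 private:
  std::shared_ptr<Metadata> file_metadata_;
};
```

With only one place holding the decryptor, the fresh-parse and cached-metadata paths cannot drift apart, which is the inconsistency raised earlier in this thread.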
+1 for this. Subsetting or caching file metadata may also be a normal Parquet use case, where the metadata is built differently than when it is read directly from the file.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5385926. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
@raulcd Should we port this to 16.1.0?
I'll try to add it
…set (#41550)

### Rationale for this change

When parquet dataset is reused to create multiple scanners, `FileMetaData` objects are cached to avoid parsing them again. However, these caused issues on encrypted files since internal file decryptors were no longer created by cached `FileMetaData` objects.

### What changes are included in this PR?

Expose file_decryptor from FileMetaData and set it properly.

### Are these changes tested?

Yes, modify the test to reproduce the issue and assure fixed.

### Are there any user-facing changes?

No.

* GitHub Issue: #41431

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
Rationale for this change
When a Parquet dataset is reused to create multiple scanners, `FileMetaData` objects are cached to avoid parsing them again. However, this caused issues on encrypted files, since the internal file decryptors were not created when cached `FileMetaData` objects were reused.
What changes are included in this PR?
Expose `file_decryptor` from `FileMetaData` and set it properly.
Are these changes tested?
Yes, the test was modified to reproduce the issue and verify the fix.
Are there any user-facing changes?
No.