GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData #41580
The general idea LGTM.
I saw a related issue: https://issues.apache.org/jira/browse/PARQUET-1648. It seems that parquet-mr does not use it yet.
Can parquet-mr read that?
🤔 At least this patch is OK, and it seems other implementations have it in Thrift, so it doesn't break the standard.
I agree. It should not block us from implementing this.
@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?
Created apache/parquet-testing#49
@clee704 Sorry for the delay; I've updated this patch myself. This patch LGTM and I'm willing to merge it before August 6th. Would you mind re-checking this?
I will check the CPython build.
I will merge this week if there are no negative comments.
+1 on the C++ side
@pitrou Would you mind taking a look at this? I'm planning to merge this PR this week if there are no negative comments.
TEST_F(TestInt32Writer, WriteKeyValueMetadata) {
  auto writer = this->BuildWriter();
  writer->AddKeyValueMetadata(KeyValueMetadata::Make({"hello"}, {"world"}));
Can you add a test where AddKeyValueMetadata is called twice?
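A test of that kind might look roughly like the toy model below, written in Python for brevity. It assumes that a second call adds its pairs to those from the first call instead of replacing them; that behavior and the `ToyColumnWriter` class are assumptions made for illustration, not the Parquet C++ API.

```python
class ToyColumnWriter:
    """Hypothetical stand-in for a column writer; not the Parquet C++ class."""

    def __init__(self):
        self._key_value_metadata = None  # nothing attached yet

    def add_key_value_metadata(self, pairs):
        # Assumption: repeated calls accumulate. dict.update overwrites the
        # value of a key that is passed again.
        if self._key_value_metadata is None:
            self._key_value_metadata = dict(pairs)
        else:
            self._key_value_metadata.update(pairs)

    def key_value_metadata(self):
        return self._key_value_metadata


writer = ToyColumnWriter()
writer.add_key_value_metadata({"hello": "world"})
writer.add_key_value_metadata({"bye": "earth"})
result = writer.key_value_metadata()
```

The point of such a test is to pin down whether the second call merges with or replaces the first, so the expectation should follow whatever the real writer documents.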
cpp/src/parquet/metadata.cc
Outdated
@@ -135,6 +135,36 @@ std::shared_ptr<Statistics> MakeColumnStats(const format::ColumnMetaData& meta_d
  throw ParquetException("Can't decode page statistics for selected column type");
}

template <typename Metadata>
std::shared_ptr<KeyValueMetadata> CopyKeyValueMetadata(const Metadata& source) {
Perhaps call this FromThriftKeyValueMetadata?
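For readers unfamiliar with the conversion being named here, a minimal Python sketch of the idea follows: a Thrift-style list of key-value entries (key required, value optional) is turned into parallel lists of keys and values. The names `ThriftKeyValue` and `from_thrift_key_value_metadata` are hypothetical, and mapping an absent value to an empty string is an assumption, not something this thread confirms.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ThriftKeyValue:
    """Hypothetical mirror of a Thrift KeyValue: key required, value optional."""
    key: str
    value: Optional[str] = None


def from_thrift_key_value_metadata(
    source: Optional[List[ThriftKeyValue]],
) -> Optional[Tuple[List[str], List[str]]]:
    # An unset list maps to "no metadata" rather than to an empty container.
    if source is None:
        return None
    keys = [kv.key for kv in source]
    # Assumption: an absent value is represented as an empty string.
    values = [kv.value if kv.value is not None else "" for kv in source]
    return keys, values


converted = from_thrift_key_value_metadata(
    [ThriftKeyValue("hello", "world"), ThriftKeyValue("flag")]
)
```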
Sorry, I approved, but the two comments should probably be addressed before merging this.
Thanks all, merged!
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 2767dc5. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.
@pytest.fixture(scope='module')
def parquet_test_datadir():
    result = os.environ.get('PARQUET_TEST_DATA')
    if not result:
        raise RuntimeError('Please point the PARQUET_TEST_DATA environment '
                           'variable to the test data directory')
    return pathlib.Path(result)
This is the first time we introduce a pyarrow test that requires this to be set up. Do we want to require that strictly, or should we skip the test if the env variable is not set?
In any case, this caused a failure in one of the nightly crossbow builds, which doesn't have this env variable set (python-emscripten, https://github.com/ursacomputing/crossbow/actions/runs/10463078918/job/28974494645#step:7:15654)
Either is OK for me, but I'd prefer to read that 🤔
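The "skip instead of fail" option raised above could be sketched as follows. The helper name `find_parquet_test_datadir` is hypothetical; it only separates the environment lookup from the decision about what to do when the variable is missing.

```python
import os
import pathlib
from typing import Mapping, Optional


def find_parquet_test_datadir(
    environ: Mapping[str, str] = os.environ,
) -> Optional[pathlib.Path]:
    """Return the PARQUET_TEST_DATA directory, or None if unset or empty."""
    result = environ.get("PARQUET_TEST_DATA")
    if not result:
        return None
    return pathlib.Path(result)


# In a pytest fixture, the None case could become a skip instead of an error:
#
#     @pytest.fixture(scope='module')
#     def parquet_test_datadir():
#         path = find_parquet_test_datadir()
#         if path is None:
#             pytest.skip('PARQUET_TEST_DATA is not set')
#         return path
```

Skipping keeps builds without the test data green, at the cost of silently not running the test there; raising, as the fixture in this PR does, makes a misconfigured environment visible.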
…43786)
### Rationale for this change
Starting with #41580, the pyarrow tests now also rely on a file in the parquet-testing submodule. And the path to that directory is controlled by `PARQUET_TEST_DATA`, which appears to be set wrongly in the wheel test scripts, causing all wheel builds to fail at the moment.
* GitHub Issue: #43785
Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>

…on emscripten (#43906)
### Rationale for this change
The following PR:
- #41580
Made mandatory for a test the requirement to have `PARQUET_TEST_DATA` env defined. This is currently not available from `python_test_emscripten.sh` as we require to mount the filesystem for both Node and ChromeDriver.
### What changes are included in this PR?
Skip the test that requires `PARQUET_TEST_DATA` for emscripten.
### Are these changes tested?
Via archery
### Are there any user-facing changes?
No
* GitHub Issue: #43905
* GitHub Issue: #43868
Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Rationale for this change
The Parquet standard allows reading/writing key-value metadata from/to ColumnChunkMetaData, but there is no way to do that with Parquet C++.
What changes are included in this PR?
Support reading/writing key-value metadata from/to ColumnChunkMetaData with Parquet C++ reader/writer. Support reading key-value metadata from ColumnChunkMetaData with pyarrow.parquet.
Are these changes tested?
Yes, unit tests are added.
Are there any user-facing changes?
Yes.
`--print-key-value-metadata` option is used.