GH-37111: [C++][Parquet] Dataset: Fixing Schema Cast #37793
Conversation
@bkietz I've rewritten the fix; it is much clearer and comes with tests. The previous fix was ugly and wrong, sorry about that.
Nice! Just a small comment
@@ -104,11 +104,12 @@ parquet::ArrowReaderProperties MakeArrowReaderProperties(
   return arrow_properties;
 }

 template <typename M>
 Result<std::shared_ptr<SchemaManifest>> GetSchemaManifest(
Nice! I wonder when that became unnecessary
  auto manifest = std::make_shared<SchemaManifest>();
  const std::shared_ptr<const ::arrow::KeyValueMetadata>& key_value_metadata = nullptr;
😬
@@ -703,6 +707,25 @@ TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingStringC
   CountRowGroupsInFragment(fragment, {0, 3}, equal(field_ref("x"), literal("a")));
 }

 TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingDurationColumn) {
   auto table = TableFromJSON(schema({field("t", duration(TimeUnit::NANO))}),
Would you mind also checking time32, since that was mentioned in the issue? If it isn't fixed by this patch, then we should open a follow-up issue to address it.
Suggested change:
-  auto table = TableFromJSON(schema({field("t", duration(TimeUnit::NANO))}),
+  for (auto type : {duration(TimeUnit::NANO), time32(TimeUnit::SECOND)}) {
+    auto table = TableFromJSON(schema({field("t", type)}),
Also, please describe the bug in GH-37111 and how this test checks it
I think this is a different problem; after checking, this patch does not fix it. I'll create another issue and try to fix it before 14.0. I've created an issue: #37799
@bkietz I've written a test for Timestamp on my local machine:
TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingTimestampColumn) {
auto table = TableFromJSON(schema({field("t", time32(TimeUnit::SECOND))}),
{
R"([{"t": 1}])",
R"([{"t": 2}, {"t": 3}])",
});
TableBatchReader table_reader(*table);
SCOPED_TRACE("TestParquetFileFormatScan.PredicatePushdownRowGroupFragmentsUsingTimestampColumn");
ASSERT_OK_AND_ASSIGN(
auto buffer,
ParquetFormatHelper::Write(
&table_reader, ArrowWriterProperties::Builder().store_schema()->build()));
auto source = std::make_shared<FileSource>(buffer);
SetSchema({field("t", time32(TimeUnit::SECOND))});
ASSERT_OK_AND_ASSIGN(auto fragment, format_->MakeFragment(*source));
auto expr = equal(field_ref("t"), literal(::arrow::Time32Scalar(1, TimeUnit::SECOND)));
CountRowGroupsInFragment(fragment, {0}, expr);
}
The test failed because:
I'll open a separate PR for this.
LGTM, thanks!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 008d277. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

Parquet and Arrow have two schemas:

1. Parquet has a `SchemaElement` [1]; it is language- and implementation-independent. Parquet Arrow extracts this schema and decodes it.
2. Parquet Arrow stores the Arrow schema, and possibly `field_id`, in `key_value_metadata` [2] when `store_schema` is enabled. When deserializing, it first parses the `SchemaElement` [1], and if the self-defined `key_value_metadata` exists, it uses `key_value_metadata` to override [1].

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L356
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1033

The bug arises because the dataset does not use the `key_value_metadata` from `Metadata` when parsing the `SchemaManifest`. For duration, when `store_schema` is enabled, the writer stores `Int64` as the physical type and records `::arrow::Duration` in `key_value_metadata`. Since there is no `equal(Duration, i64)` kernel, this raises the not-implemented error.

### What changes are included in this PR?

Pass `key_value_metadata` when building the `SchemaManifest`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

bugfix

* Closes: apache#37111

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
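The override mechanism described in the rationale can be modeled in a few lines of plain Python. This is a sketch only: `resolve_arrow_schema` and the dict-based schemas are hypothetical stand-ins, not Arrow's actual API; the `ARROW:schema` key is the one Arrow uses in Parquet key-value metadata when `store_schema` is enabled.

```python
# Illustrative model of the store_schema round-trip; not Arrow's real API.
ARROW_SCHEMA_KEY = "ARROW:schema"


def resolve_arrow_schema(physical_schema, key_value_metadata):
    """Prefer logical types stored under ARROW:schema; fall back to the
    types inferred from the physical Parquet schema."""
    stored = (key_value_metadata or {}).get(ARROW_SCHEMA_KEY)
    resolved = dict(physical_schema)
    if stored is not None:
        # Stored Arrow logical types override the inferred physical types.
        resolved.update(stored)
    return resolved


# A duration column is written with physical type int64, while the stored
# Arrow schema remembers it was duration[ns].
physical = {"t": "int64"}
kv_meta = {ARROW_SCHEMA_KEY: {"t": "duration[ns]"}}

resolve_arrow_schema(physical, kv_meta)  # {'t': 'duration[ns]'}
resolve_arrow_schema(physical, None)     # the bug: schema stays {'t': 'int64'}
```

The fixed code path corresponds to the first call (metadata passed through); the buggy `SchemaManifest` construction corresponded to the second, so the scan saw `int64` and failed to find an `equal(Duration, i64)` kernel.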