
GH-37111: [C++][Parquet] Dataset: Fixing Schema Cast #37793

Merged

Conversation

@mapleFU (Member) commented Sep 19, 2023

Rationale for this change

Parquet and Arrow have two schemas:

  1. Parquet has a SchemaElement[1], which is language- and implementation-independent. Parquet Arrow extracts this schema and decodes it.
  2. Parquet Arrow stores the Arrow schema, and possibly field_id, in key_value_metadata[2] when store_schema is enabled. When deserializing, it first parses the SchemaElement[1], and if the self-defined key_value_metadata exists, it uses the key_value_metadata to override [1].

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L356
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1033

The bug arises because, when the dataset parses the SchemaManifest, it does not use the key_value_metadata from the file Metadata.

For duration, when store_schema is enabled, the writer stores Int64 as the physical type and records ::arrow::Duration in key_value_metadata. Since there is no equal(Duration, i64) kernel, this raises a NotImplemented error.
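The two-schema reconstruction described above can be sketched in plain Python. This is an illustration only, not Arrow's actual C++ API: the function name reconstruct_schema and the dict-based schema representation are hypothetical, though "ARROW:schema" is the real metadata key Arrow uses. The physical Parquet schema says int64; the Arrow schema stashed in key_value_metadata says duration[ns]; skipping the override (the buggy path) leaves the dataset layer comparing duration against int64.

```python
# Illustrative sketch (plain Python, not Arrow's real implementation) of
# rebuilding the logical schema from the physical Parquet schema plus the
# Arrow schema stored in key_value_metadata.

def reconstruct_schema(physical_schema, key_value_metadata):
    """Start from the physical Parquet types, then let the Arrow schema
    stored under the "ARROW:schema" key override them."""
    schema = dict(physical_schema)
    arrow_schema = key_value_metadata.get("ARROW:schema")
    if arrow_schema is not None:
        schema.update(arrow_schema)
    return schema

physical = {"t": "int64"}                     # what Parquet physically stores
kv = {"ARROW:schema": {"t": "duration[ns]"}}  # written when store_schema is on

with_override = reconstruct_schema(physical, kv)      # the fixed path
without_override = reconstruct_schema(physical, {})   # the buggy path

print(with_override["t"])     # duration[ns]
print(without_override["t"])  # int64
```

With the override applied, the dataset sees duration[ns] and the filter expression type-checks; without it, equality between duration and int64 has no kernel.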

What changes are included in this PR?

Set the key_value_metadata when building the SchemaManifest.

Are these changes tested?

Yes

Are there any user-facing changes?

bugfix

@mapleFU (Member, Author) commented Sep 19, 2023

@bkietz I've rewritten the fix; it is much clearer and comes with a test. The previous fix was ugly and wrong, sorry about that.

@github-actions github-actions bot added the awaiting review Awaiting review label Sep 19, 2023
@mapleFU mapleFU changed the title GH-37111: [C++] Dataset: Fixing Parquet Schema Cast GH-37111: [C++][Parquet] Dataset: Fixing Schema Cast Sep 19, 2023
@bkietz (Member) left a comment

Nice! Just a small comment

@@ -104,11 +104,12 @@ parquet::ArrowReaderProperties MakeArrowReaderProperties(
return arrow_properties;
}

template <typename M>
Result<std::shared_ptr<SchemaManifest>> GetSchemaManifest(
@bkietz (Member):

Nice! I wonder when that became unnecessary

auto manifest = std::make_shared<SchemaManifest>();
const std::shared_ptr<const ::arrow::KeyValueMetadata>& key_value_metadata = nullptr;
@bkietz (Member):

😬

@@ -703,6 +707,25 @@ TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingStringC
CountRowGroupsInFragment(fragment, {0, 3}, equal(field_ref("x"), literal("a")));
}

TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingDurationColumn) {
auto table = TableFromJSON(schema({field("t", duration(TimeUnit::NANO))}),
@bkietz (Member):

Would you mind also checking time32, since that was mentioned in the issue? If it isn't fixed by this patch, then we should open a follow-up issue to address it.

Suggested change
auto table = TableFromJSON(schema({field("t", duration(TimeUnit::NANO))}),
for (auto type : {duration(TimeUnit::NANO), time32(TimeUnit::SECOND)}) {
auto table = TableFromJSON(schema({field("t", type)}),

Also, please describe the bug in GH-37111 and how this test checks it

@mapleFU (Member, Author) commented Sep 20, 2023

I think this is a different problem; I checked, and this patch does not fix it. I'll open another issue and try to fix it before 14.0. I've created the issue: #37799

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Sep 20, 2023
@mapleFU (Member, Author) commented Sep 20, 2023

@bkietz I've written a test for time32 on my local machine:

TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingTimestampColumn) {
  auto table = TableFromJSON(schema({field("t", time32(TimeUnit::SECOND))}),
                             {
                                 R"([{"t": 1}])",
                                 R"([{"t": 2}, {"t": 3}])",
                             });
  TableBatchReader table_reader(*table);
  SCOPED_TRACE("TestParquetFileFormatScan.PredicatePushdownRowGroupFragmentsUsingTimestampColumn");
  ASSERT_OK_AND_ASSIGN(
      auto buffer,
      ParquetFormatHelper::Write(
          &table_reader, ArrowWriterProperties::Builder().store_schema()->build()));
  auto source = std::make_shared<FileSource>(buffer);
  SetSchema({field("t", time32(TimeUnit::SECOND))});
  ASSERT_OK_AND_ASSIGN(auto fragment, format_->MakeFragment(*source));

  auto expr = equal(field_ref("t"), literal(::arrow::Time32Scalar(1, TimeUnit::SECOND)));
  CountRowGroupsInFragment(fragment, {0}, expr);
}

The test fails with:

 Function 'equal' has no kernel matching input types (time32[s], time32[ms])

  1. Parquet has no seconds unit for time, so time32[s] is converted and stored as time32[ms].
  2. In our expression system, this leads to equal(time32[s], time32[ms]), for which no kernel exists.

I'll open a separate PR for this.
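The unit mismatch behind that error can be sketched in plain Python. This is an illustration only: the cast_time helper and UNIT_FACTOR table are hypothetical stand-ins for Arrow's cast machinery. Parquet stores time32[s] values rescaled to milliseconds, so comparing the stored value against a seconds-based literal misfires unless one side is first cast to the other's unit.

```python
# Illustrative sketch (plain Python) of why equal(time32[s], time32[ms])
# needs a unit-aware cast before comparison.

UNIT_FACTOR = {"s": 1, "ms": 1000}  # ticks per second for each unit

def cast_time(value, from_unit, to_unit):
    """Rescale a time value between units (assumes an exact conversion)."""
    return value * UNIT_FACTOR[to_unit] // UNIT_FACTOR[from_unit]

stored_ms = cast_time(1, "s", "ms")  # 1 s is written to Parquet as 1000 ms
literal_s = 1                        # the filter literal, in seconds

# Naive comparison across units: 1000 != 1, so the predicate misfires.
naive_equal = stored_ms == literal_s

# Unit-aware comparison: cast the literal to ms first, then compare.
unit_aware_equal = stored_ms == cast_time(literal_s, "s", "ms")

print(naive_equal)       # False
print(unit_aware_equal)  # True
```

In Arrow terms, the expression system would need to insert such a cast (or a matching kernel) before comparing the two time32 units, which is what the follow-up issue tracks.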

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 20, 2023
@bkietz (Member) left a comment

LGTM, thanks!

@bkietz bkietz merged commit 008d277 into apache:main Sep 20, 2023
@bkietz bkietz removed the awaiting change review Awaiting change review label Sep 20, 2023
@github-actions github-actions bot added the awaiting merge Awaiting merge label Sep 20, 2023
@mapleFU mapleFU deleted the parquet-dataset/use-key-value-metadata-correctly branch September 20, 2023 12:11
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 008d277.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023

* Closes: apache#37111

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Successfully merging this pull request may close these issues.

Dataset filtering from disk broken for duration type
2 participants