Dataset filtering from disk broken for duration type #37111
Note that it's complaining about a comparison between microseconds and seconds
Which is IMO a good thing, because you don't want to compare different time units. You might want to round the …
@felipecrv I agree... except that nowhere in the code I presented is the type …
Also, I found a weird thing.
The first time it would raise the exception, but later it works 🤔 Can you retry first as a workaround? I'll try to see why …
```cpp
using TestDurationParquetIO = TestParquetIO<::arrow::DurationType>;

TEST_F(TestDurationParquetIO, Roundtrip) {
  std::vector<bool> is_valid = {true, true, false, true};
  std::vector<int64_t> values = {1, 2, 3, 4};

  std::shared_ptr<Array> int_array, duration_arr;
  ::arrow::ArrayFromVector<::arrow::Int64Type, int64_t>(::arrow::int64(), is_valid,
                                                        values, &int_array);
  ::arrow::ArrayFromVector<::arrow::DurationType, int64_t>(
      ::arrow::duration(TimeUnit::NANO), is_valid, values, &duration_arr);

  // When the original Arrow schema isn't stored, a Duration array comes
  // back as int64 (how it is stored in Parquet)
  this->RoundTripSingleColumn(duration_arr, int_array, default_arrow_writer_properties());

  // When the original Arrow schema is stored, the Duration array type is preserved
  const auto arrow_properties =
      ::parquet::ArrowWriterProperties::Builder().store_schema()->build();
  this->RoundTripSingleColumn(duration_arr, duration_arr, arrow_properties);
}
```

The Duration type is stored as int64 in Parquet, and when the number is this small it's deduced as int8. So that's why we get int8. I'll see how I can fix that.
I've found the root cause. It's caused by the implementation of …
I've submitted a fix here: #37734. You can give it a try.
Awesome! Does this also fix the case where …
Hmm, I'm not sure, let me check tonight; currently I'm working toward fixing the duration type bug, so I'm not sure. You can first use a "cast" as a workaround, since the merged patch will be released in 14.0.0 or even 15.0, which could be a long time.
@mattaubury I think they may have different root causes. Let me explain.
cc @bkietz |
### Rationale for this change

Parquet and Arrow have two schemas:

1. Parquet has a `SchemaElement` [1]; it's language- and implementation-independent. Parquet Arrow extracts this schema and deduces the Arrow schema from it.
2. Parquet Arrow stores the Arrow schema and possibly `field_id` in `key_value_metadata` [2] when `store_schema` is enabled. When deserializing, it first parses the `SchemaElement` [1], and if the self-defined `key_value_metadata` exists, it uses the `key_value_metadata` to override [1].

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L356
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1033

The bug arises because, when the dataset parses the `SchemaManifest`, it doesn't use the `key_value_metadata` from `Metadata`, which causes the problem. For duration, when `store_schema` is enabled, the writer stores `Int64` as the physical type and adds an `::arrow::Duration` entry in `key_value_metadata`. Since there is no `equal(Duration, i64)` kernel, this raises the not-implemented error.

### What changes are included in this PR?

Set the `key_value_metadata` when the dataset parses the `SchemaManifest`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Bugfix

* Closes: #37111

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Describe the bug, including details regarding any error messages, version, and platform.
Using pyarrow-12.0.1 on RHEL8 Intel.

Dataset filters work when the table is read into memory, but break when the table is a referenced Parquet file. The following code demonstrates the issue:
The assert is not reached; instead we get the error:
Superficially, it appears as though the duration scalar is being prematurely unboxed.
The issue is not present with other value types (such as `int64`, `timestamp`, `date32`), but a potentially related issue occurs with `value_type = pa.time32("s")`:

Component(s)
Python