Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Dataset cannot filter timestamp with time_unit SECOND #37799

Closed
mapleFU opened this issue Sep 20, 2023 · 5 comments · Fixed by #37949
Closed

[C++][Parquet] Dataset cannot filter timestamp with time_unit SECOND #37799

mapleFU opened this issue Sep 20, 2023 · 5 comments · Fixed by #37949

Comments

@mapleFU
Copy link
Member

mapleFU commented Sep 20, 2023

Describe the enhancement requested

This is a part of #37111 .

TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingTimestampColumn) {
  for (auto time_unit : {TimeUnit::MILLI, TimeUnit::SECOND}) {
    auto table = TableFromJSON(schema({field("t", time32(time_unit))}),
                               {
                                   R"([{"t": 1}])",
                                   R"([{"t": 2}, {"t": 3}])",
                               });
    TableBatchReader table_reader(*table);
    SCOPED_TRACE(
        "TestParquetFileFormatScan.PredicatePushdownRowGroupFragmentsUsingTimestampColumn");
    ASSERT_OK_AND_ASSIGN(
        auto buffer,
        ParquetFormatHelper::Write(
            &table_reader, ArrowWriterProperties::Builder().store_schema()->build()));
    auto source = std::make_shared<FileSource>(buffer);
    SetSchema({field("t", time32(time_unit))});
    ASSERT_OK_AND_ASSIGN(auto fragment, format_->MakeFragment(*source));

    auto expr =
        equal(field_ref("t"), literal(::arrow::Time32Scalar(1, time_unit)));
    CountRowGroupsInFragment(fragment, {0}, expr);
  }
}

With the reproduce code above, TimeUnit::MILLI will pass the test, but TimeUnit::SECOND will suffer from

function 'equal' has no kernel matching input types (time32[s], time32[ms])

This is because:

  1. During writing, SECOND will be cast to other type, see [1]
  2. During reading, we don't support eq(time32[s], time32[ms)

[1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L231-L238

Component(s)

C++, Parquet

@mapleFU
Copy link
Member Author

mapleFU commented Sep 26, 2023

@bkietz I think the root cause is that:

  1. Parquet internal will do some "cast". it doesn't support some types
  2. After doing the cast, the data type might not equal to origin type
  3. And when applying expression, the expression failed to apply.

So, we can have some fixing:

  1. Trying to "Cast" Parquet type to origin type before applying the expression
  2. Support Equal as more as possible(in this case, supports eq(time32[s], time32[ms))

What would you think is ok to fix this issue?

@bkietz
Copy link
Member

bkietz commented Sep 27, 2023

I think in the presence of an origin schema, I think it's an error to prefer the inferred type. I'm a little surprised we don't already prefer the origin schema in all cases.

Relatedly but as a separate issue, I think it's reasonable to support a common type between time32[s] and time32[ms]. That would be added to CommonTemporal in codegen_internal.cc https://github.com/bkietz/arrow/blob/ef1e1b84c32351729890eeb11c78b88fa852da94/cpp/src/arrow/compute/kernels/codegen_internal.cc#L248-L258

@mapleFU
Copy link
Member Author

mapleFU commented Sep 29, 2023

Sorry for delay for days, I caught a cold previously, let me finish it today

bkietz pushed a commit that referenced this issue Oct 2, 2023
…ting (#37949)

### Rationale for this change

The original problem in mentioned in #37799

1. `SECOND` in Parquet would always cast to `MILLS`
2. `eq` not support `eq(time32[s], time32[ms)`

This patch is as advice in #37799 (comment) . We tent to add time32 and time64 in `CommonTemporal`.

### What changes are included in this PR?

Support time32 and time64 with different time unit in `arrow::compute::internal::CommonTemporal`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

bugfix

* Closes: #37799

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@bkietz bkietz added this to the 14.0.0 milestone Oct 2, 2023
@mapleFU
Copy link
Member Author

mapleFU commented Oct 3, 2023

I think in the presence of an origin schema, I think it's an error to prefer the inferred type. I'm a little surprised we don't already prefer the origin schema in all cases.

I think in a dataset ( Could be a iceberg dataset nowadays) with Parquet, we might meet the problems below:

  1. Parquet Schema cast arrow type to an other type.
  2. The file in schema has a schema evolution, like i32 -> i64, etc..

The two cases might all has schema not match problem. Should we:

  1. Implicit cast the schema before evalutate the expr?
  2. If expr meets problem, LOG ERROR or return fault?

@bkietz

@bkietz
Copy link
Member

bkietz commented Oct 3, 2023

I believe currently the only way to support a schema evolution (or other mismatch) less trivial than added columns or removed columns is to consider the fragments as separate datasets, cast (project) their batches independently, then use a union node to unify the streams. Adding implicit casts which work around this in some cases won't fix the general problem.

With that said, there are some projections which could be reasonably pushed down into the reader for some formats but not all. For example, the parquet reader can dictionary encode as it reads with no significant difference in performance whereas the IPC reader would need to spend much more CPU time to accomplish the same. I'm not sure what the API for specifying which projection pushdowns a format supports would look like.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…64 casting (apache#37949)

### Rationale for this change

The original problem in mentioned in apache#37799

1. `SECOND` in Parquet would always cast to `MILLS`
2. `eq` not support `eq(time32[s], time32[ms)`

This patch is as advice in apache#37799 (comment) . We tent to add time32 and time64 in `CommonTemporal`.

### What changes are included in this PR?

Support time32 and time64 with different time unit in `arrow::compute::internal::CommonTemporal`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

bugfix

* Closes: apache#37799

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…64 casting (apache#37949)

### Rationale for this change

The original problem in mentioned in apache#37799

1. `SECOND` in Parquet would always cast to `MILLS`
2. `eq` not support `eq(time32[s], time32[ms)`

This patch is as advice in apache#37799 (comment) . We tent to add time32 and time64 in `CommonTemporal`.

### What changes are included in this PR?

Support time32 and time64 with different time unit in `arrow::compute::internal::CommonTemporal`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

bugfix

* Closes: apache#37799

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…64 casting (apache#37949)

### Rationale for this change

The original problem in mentioned in apache#37799

1. `SECOND` in Parquet would always cast to `MILLS`
2. `eq` not support `eq(time32[s], time32[ms)`

This patch is as advice in apache#37799 (comment) . We tent to add time32 and time64 in `CommonTemporal`.

### What changes are included in this PR?

Support time32 and time64 with different time unit in `arrow::compute::internal::CommonTemporal`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

bugfix

* Closes: apache#37799

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…64 casting (apache#37949)

### Rationale for this change

The original problem in mentioned in apache#37799

1. `SECOND` in Parquet would always cast to `MILLS`
2. `eq` not support `eq(time32[s], time32[ms)`

This patch is as advice in apache#37799 (comment) . We tent to add time32 and time64 in `CommonTemporal`.

### What changes are included in this PR?

Support time32 and time64 with different time unit in `arrow::compute::internal::CommonTemporal`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

bugfix

* Closes: apache#37799

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants