[C++][Parquet] Dataset cannot filter timestamp with time_unit SECOND #37799
Comments
@bkietz I think the root cause is that:
So, we could apply some fixes:
Which fix do you think is appropriate for this issue?
In the presence of an origin schema, I think it's an error to prefer the inferred type. I'm a little surprised we don't already prefer the origin schema in all cases. Relatedly, but as a separate issue, I think it's reasonable to support a common type between
Sorry for the delay of several days; I caught a cold. Let me finish it today.
…ting (#37949)

### Rationale for this change

The original problem is mentioned in #37799:

1. `SECOND` in Parquet would always be cast to `MILLIS`
2. `eq` does not support `eq(time32[s], time32[ms])`

This patch follows the advice in #37799 (comment). We intend to add time32 and time64 to `CommonTemporal`.

### What changes are included in this PR?

Support time32 and time64 with different time units in `arrow::compute::internal::CommonTemporal`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

bugfix

* Closes: #37799

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
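To make the idea of a common type for time32 and time64 with different units concrete, here is a minimal standalone sketch. It is not the code from the PR and does not use the real `arrow::` API: `TimeUnit`, `TimeKind`, `TimeType`, `IsValid`, and `CommonTimeType` are hypothetical stand-ins, and the rule shown (take the finest unit, then the storage width that unit requires) is an assumption about how such a resolution could work.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for Arrow's types; NOT the real arrow:: API.
// Units are ordered from coarsest to finest.
enum class TimeUnit { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 };
enum class TimeKind { TIME32, TIME64 };

struct TimeType {
  TimeKind kind;
  TimeUnit unit;
};

inline bool operator==(const TimeType& a, const TimeType& b) {
  return a.kind == b.kind && a.unit == b.unit;
}

// time32 stores SECOND or MILLI; time64 stores MICRO or NANO.
inline bool IsValid(const TimeType& t) {
  const bool coarse = t.unit == TimeUnit::SECOND || t.unit == TimeUnit::MILLI;
  return (t.kind == TimeKind::TIME32) == coarse;
}

// Assumed rule: pick the finest unit of the two inputs, then the
// storage kind that can hold that unit.
inline TimeType CommonTimeType(const TimeType& a, const TimeType& b) {
  assert(IsValid(a) && IsValid(b));
  const TimeUnit finest = std::max(a.unit, b.unit);
  const bool fits32 =
      finest == TimeUnit::SECOND || finest == TimeUnit::MILLI;
  return TimeType{fits32 ? TimeKind::TIME32 : TimeKind::TIME64, finest};
}
```

Under this rule, `time32[s]` and `time32[ms]` resolve to `time32[ms]`, so both sides of a comparison can be cast losslessly before the kernel runs.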
I think in a dataset (which could be an Iceberg dataset nowadays) with Parquet, we might run into the problems below:
Both cases may have a schema mismatch problem. Should we:
I believe currently the only way to support a schema evolution (or other mismatch) less trivial than added columns or removed columns is to consider the fragments as separate datasets, cast (project) their batches independently, then use a union node to unify the streams. Adding implicit casts which work around this in some cases won't fix the general problem.

With that said, there are some projections which could be reasonably pushed down into the reader for some formats but not all. For example, the parquet reader can dictionary encode as it reads with no significant difference in performance whereas the IPC reader would need to spend much more CPU time to accomplish the same. I'm not sure what the API for specifying which projection pushdowns a format supports would look like.
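As a rough illustration of the "cast each fragment independently, then union" approach, here is a toy sketch in plain C++. It does not use the dataset or Acero APIs: `Fragment`, `CastFragment`, and `UnionFragments` are hypothetical names, a fragment is modeled as a single time column with one unit, and only lossless casts to a finer unit are handled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model, NOT the Arrow dataset API.
enum class TimeUnit { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 };

inline std::int64_t TicksPerSecond(TimeUnit unit) {
  switch (unit) {
    case TimeUnit::SECOND: return 1;
    case TimeUnit::MILLI:  return 1000;
    case TimeUnit::MICRO:  return 1000000;
    case TimeUnit::NANO:   return 1000000000;
  }
  return 1;
}

// One fragment: a single time-of-day column stored in its own unit.
struct Fragment {
  TimeUnit unit;
  std::vector<std::int64_t> values;
};

// Cast (project) one fragment to the target unit. Only casts to an
// equal or finer unit are allowed, so no precision is lost.
inline Fragment CastFragment(const Fragment& fragment, TimeUnit target) {
  assert(target >= fragment.unit);
  const std::int64_t factor =
      TicksPerSecond(target) / TicksPerSecond(fragment.unit);
  Fragment out{target, {}};
  out.values.reserve(fragment.values.size());
  for (std::int64_t v : fragment.values) out.values.push_back(v * factor);
  return out;
}

// Cast every fragment on its own, then concatenate the results,
// mimicking a union of independently projected streams.
inline Fragment UnionFragments(const std::vector<Fragment>& fragments,
                               TimeUnit target) {
  Fragment unified{target, {}};
  for (const Fragment& fragment : fragments) {
    const Fragment casted = CastFragment(fragment, target);
    unified.values.insert(unified.values.end(), casted.values.begin(),
                          casted.values.end());
  }
  return unified;
}
```

The point of the design is that each fragment carries its own explicit cast, so no implicit cast has to be guessed by the scanner; the cost is that the caller must know each fragment's physical type up front.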
Describe the enhancement requested
This is part of #37111.
With the reproduction code above, `TimeUnit::MILLI` passes the test, but `TimeUnit::SECOND` fails. This is because:
1. `SECOND` is cast to another type, see [1]
2. `eq(time32[s], time32[ms])` is not supported
[1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L231-L238
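To illustrate why comparing `time32[s]` with `time32[ms]` needs a common unit, here is a small standalone sketch. It is not Arrow's `eq` kernel: `TimeValue`, `RescaleToFiner`, and `TimeEquals` are hypothetical names, and the approach (rescale both operands to the finer unit, then compare the integers) is an assumption about what a fix has to achieve.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, NOT the real arrow::compute API.
enum class TimeUnit { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 };

inline std::int64_t TicksPerSecond(TimeUnit unit) {
  switch (unit) {
    case TimeUnit::SECOND: return 1;
    case TimeUnit::MILLI:  return 1000;
    case TimeUnit::MICRO:  return 1000000;
    case TimeUnit::NANO:   return 1000000000;
  }
  return 1;
}

// A time-of-day value together with the unit it is stored in.
struct TimeValue {
  std::int64_t value;
  TimeUnit unit;
};

// Rescale to an equal or finer unit; this direction is lossless.
inline std::int64_t RescaleToFiner(const TimeValue& v, TimeUnit target) {
  assert(target >= v.unit);
  return v.value * (TicksPerSecond(target) / TicksPerSecond(v.unit));
}

// Compare two values by first bringing both to the finer of the two
// units. Comparing the raw integers directly would be wrong.
inline bool TimeEquals(const TimeValue& a, const TimeValue& b) {
  const TimeUnit common = a.unit > b.unit ? a.unit : b.unit;
  return RescaleToFiner(a, common) == RescaleToFiner(b, common);
}
```

For example, one second stored as `{1, SECOND}` equals `{1000, MILLI}` under `TimeEquals`, whereas the raw integers 1 and 1000 differ.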
Component(s)
C++, Parquet