Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect conversion of pyarrow interval value to datafusion literal #665

Closed
Tracked by #727
timsaucer opened this issue May 5, 2024 · 2 comments · Fixed by #728
Closed
Tracked by #727

Incorrect conversion of pyarrow interval value to datafusion literal #665

timsaucer opened this issue May 5, 2024 · 2 comments · Fixed by #728
Labels
bug Something isn't working

Comments

@timsaucer
Copy link
Contributor

Describe the bug
When creating a literal interval value from a pyarrow scalar, the values for month, day, and nanoseconds are not correctly assigned in the literal values. The following minimal example will reproduce. This appears to be limited to datafusion-python and not the rust implementation.

To Reproduce

print("Setting 1 month interval:")
pa_interval = pa.scalar((1, 0, 0), type=pa.month_day_nano_interval())
print("pa_interval:", pa_interval)

lit_interval = lit(pa_interval)
print("lit_interval:", lit_interval)

df.select(lit_interval).limit(1).show()

print("Setting 1 day interval:")
pa_interval = pa.scalar((0, 1, 0), type=pa.month_day_nano_interval())
print("pa_interval:", pa_interval)

lit_interval = lit(pa_interval)
print("lit_interval:", lit_interval)

df.select(lit_interval).limit(1).show()

print("Setting 1 nanosecond interval:")
pa_interval = pa.scalar((0, 0, 1), type=pa.month_day_nano_interval())
print("pa_interval:", pa_interval)

lit_interval = lit(pa_interval)
print("lit_interval:", lit_interval)

df.select(lit_interval).limit(1).show()

Produces the following result:

Setting 1 month interval:
pa_interval: MonthDayNano(months=1, days=0, nanoseconds=0)
lit_interval: Expr(IntervalMonthDayNano("1"))
DataFrame()
+-------------------------------------------------------+
| IntervalMonthDayNano("1")                             |
+-------------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 0.000000001 secs |
+-------------------------------------------------------+
Setting 1 day interval:
pa_interval: MonthDayNano(months=0, days=1, nanoseconds=0)
lit_interval: Expr(IntervalMonthDayNano("4294967296"))
DataFrame()
+-------------------------------------------------------+
| IntervalMonthDayNano("4294967296")                    |
+-------------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 4.294967296 secs |
+-------------------------------------------------------+
Setting 1 nanosecond interval:
pa_interval: MonthDayNano(months=0, days=0, nanoseconds=1)
lit_interval: Expr(IntervalMonthDayNano("18446744073709551616"))
DataFrame()
+-------------------------------------------------------+
| IntervalMonthDayNano("18446744073709551616")          |
+-------------------------------------------------------+
| 0 years 0 mons 1 days 0 hours 0 mins 0.000000000 secs |
+-------------------------------------------------------+

Expected behavior
When setting an interval value of 1 month in pyarrow, it should show up as 1 month in the datafusion data frame, and so on for the other values.

Additional context
Add any other context about the problem here.

@timsaucer timsaucer added the bug Something isn't working label May 5, 2024
@Michael-J-Ward
Copy link
Contributor

It's probably related to this issue in arrow-rs: Rust Interval definition is incorrect.

Here's a godbolt link demonstrating the "1 month becomes 1 nanosecond" example. (I based that on a comment in a similar thread in duckdb-wasm).

I would suspect that if all code paths use the same impl, then datafusion-python wouldn't notice it, but perhaps that's wrong, or maybe not all code-paths use arrow-rs?

The error occurs in the pyo3 magic as we cross the python -> rust bridge. Notice the python side assert's pyarrow.Scalar, but then the rust-side receives a datafusion::ScalarValue. (aside: is this magic type conversion intentional?)

python side:

def literal(value):
if not isinstance(value, pa.Scalar):
value = pa.scalar(value)
return Expr.literal(value)

rust side:

#[staticmethod]
pub fn literal(value: ScalarValue) -> PyExpr {
lit(value).into()
}

The error is already present before the rust-method is invoked, adding print statements on both sides of the bridge:

converting: MonthDayNano(months=1, days=0, nanoseconds=0)
converting: IntervalMonthDayNano("1")

@timsaucer
Copy link
Contributor Author

TODO: If the PR #666 merges in before this issues is corrected, the following examples in the examples/tpch folder will need to be updated

  • q01_pricing_summary_report.py
  • q04_order_priority_checking.py
  • q05_local_supplier_volume.py
  • q06_forecasting_revenue_change.py
  • q10_returned_item_reporting.py
  • q12_ship_mode_order_priority.py
  • q14_promotion_effect.py
  • q15_top_supplier.py
  • q20_potential_part_promotion.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants