Add support for month & year intervals #2797

Merged
andygrove merged 19 commits into apache:master from bg_month_interval on Jul 12, 2022

Conversation

avantgardnerio
Contributor

Which issue does this PR close?

Closes #2796.

Rationale for this change

To pass the TPC-H benchmarks, DataFusion needs to support month and year date intervals.

What changes are included in this PR?

A refactoring of the datetime module to:

  1. Support IntervalYearMonth in addition to the currently supported IntervalDayTime
  2. Perform some non-trivial date modulo math (that I hopefully got correct)
  3. Add tests to ensure modulo math edge cases work correctly
  4. Switch to an arguably more readable railroad error handling pattern to take better advantage of the error propagation (?) operator.
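
For readers unfamiliar with the month arithmetic mentioned in items 2 and 4, here is a minimal sketch. It is not the code from this PR: it assumes plain (year, month, day) integers rather than chrono's NaiveDate, and the names `add_months` and `days_in_month` are hypothetical. It uses Euclidean division so that negative deltas wrap correctly, clamps the day to the length of the target month, and propagates failures with the `?` operator.

```rust
/// Gregorian leap-year rule.
fn is_leap_year(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

/// Number of days in `month` (1-12) of `year`, or an error for a bad month.
fn days_in_month(year: i32, month: u32) -> Result<u32, String> {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => Ok(31),
        4 | 6 | 9 | 11 => Ok(30),
        2 if is_leap_year(year) => Ok(29),
        2 => Ok(28),
        _ => Err(format!("month out of range: {}", month)),
    }
}

/// Hypothetical helper: shift a date by `delta` months (may be negative),
/// clamping the day to the last valid day of the resulting month.
fn add_months(year: i32, month: u32, day: u32, delta: i32) -> Result<(i32, u32, u32), String> {
    // Validate the input date; `?` forwards a bad-month error unchanged.
    if day == 0 || day > days_in_month(year, month)? {
        return Err(format!("day out of range: {}", day));
    }
    // Count months from year 0 with a zero-based month index, guarding overflow.
    let total = year
        .checked_mul(12)
        .and_then(|m| m.checked_add(month as i32 - 1))
        .and_then(|m| m.checked_add(delta))
        .ok_or_else(|| "month arithmetic overflowed".to_string())?;
    // Euclidean division keeps the month index in 0..12 for negative totals too.
    let new_year = total.div_euclid(12);
    let new_month = total.rem_euclid(12) as u32 + 1;
    let new_day = day.min(days_in_month(new_year, new_month)?);
    Ok((new_year, new_month, new_day))
}
```

With this sketch, adding one month to January 31 lands on the last day of February, and subtracting one month from a January date rolls back into December of the previous year.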

Are there any user-facing changes?

More queries should work. Nothing that worked previously should break.

@github-actions github-actions bot added core Core DataFusion crate physical-expr Physical Expressions labels Jun 26, 2022
@codecov-commenter

codecov-commenter commented Jun 26, 2022

Codecov Report

Merging #2797 (8ba1efe) into master (364c9cc) will increase coverage by 0.09%.
The diff coverage is 93.94%.

@@            Coverage Diff             @@
##           master    #2797      +/-   ##
==========================================
+ Coverage   85.25%   85.34%   +0.09%     
==========================================
  Files         275      276       +1     
  Lines       49001    49281     +280     
==========================================
+ Hits        41774    42061     +287     
+ Misses       7227     7220       -7     
Impacted Files Coverage Δ
datafusion/common/src/scalar.rs 75.77% <ø> (ø)
datafusion/physical-expr/src/expressions/mod.rs 100.00% <ø> (ø)
...tafusion/physical-expr/src/expressions/datetime.rs 86.80% <89.44%> (+54.14%) ⬆️
datafusion/core/tests/sql/timestamp.rs 100.00% <100.00%> (ø)
datafusion/optimizer/src/simplify_expressions.rs 82.00% <100.00%> (+0.02%) ⬆️
datafusion/physical-expr/src/expressions/delta.rs 100.00% <100.00%> (ø)
datafusion/expr/src/logical_plan/plan.rs 74.31% <0.00%> (-0.20%) ⬇️
datafusion/expr/src/window_frame.rs 93.27% <0.00%> (+0.84%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@alamb alamb left a comment

Thank you for the contribution @avantgardnerio -- this is looking like a great start. I don't fully understand some of the calculations but I may be misunderstanding

cc @ovr

};

// Add interval
let posterior = match scalar {
Contributor

It would be great to eventually put some/all of this logic into the arrow-rs kernels -- for example

https://docs.rs/arrow/16.0.0/arrow/compute/kernels/temporal/index.html

or perhaps https://docs.rs/arrow/16.0.0/arrow/compute/kernels/arithmetic/fn.add.html

That would also likely result in columnar execution support (aka adding a column of integers)

Maybe we can (I will file a ticket) start with kernels in datafusion and then port them to arrow-rs

NaiveDate::from_ymd(target.year(), target.month(), day)
}

fn chrono_add_months(dt: NaiveDate, delta: i32) -> NaiveDate {
Contributor Author

I still think I would like to PR this to chrono when I get a chance, and we can remove it from datafusion.

Contributor

👍 - I like filing a ticket / PR in the target repo and then leaving a link in the comments of DataFusion

Btw I poked around and found this code in arrow2 (from @jorgecarleitao ) that is similar: https://docs.rs/arrow2/latest/src/arrow2/temporal_conversions.rs.html#342-368

Contributor

@alamb alamb left a comment

Thanks @avantgardnerio I think this PR's code is looking good

The only thing I think it needs before merging is tests covering more of the date/time conversion logic. Once that is done I can write up a "move this to arrow" type ticket as well

@avantgardnerio
Contributor Author

Sorry this stalled out. I was really close on the subqueries, and I think I have them all working now, so I'll jump back on this.

@avantgardnerio
Contributor Author

Added unit tests. Will file PRs to respective dependency repos as well.

  • chrono - I can figure out where to send the PR
  • arrow - any idea what file / impl the functions listed as TODO should go into?

@avantgardnerio
Contributor Author

I PRed the chrono repo.

@avantgardnerio
Contributor Author

avantgardnerio commented Jul 7, 2022

Once that is done I can write up a "move this to arrow" type ticket

@alamb , do you know where (in arrow-rs), I can PR these functions to:

  • fn create_day_time(days: i32, millis: i32) -> i64
  • fn create_month_day_nano(months: i32, days: i32, nanos: i64) -> i128

?
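
To make the question concrete, here is a minimal sketch of what the two functions could look like. The signatures come from the list above; the bit layout is an assumption (days in the upper 32 bits and milliseconds in the lower 32 bits of the `i64`; months, then days, then nanoseconds from high to low bits of the `i128`) and should be checked against the Arrow implementation before relying on it. The `split_*` helpers are hypothetical and exist only to show the round trip.

```rust
/// Pack days and milliseconds into one i64.
/// Assumed layout: days in bits 32..64, millis in bits 0..32.
fn create_day_time(days: i32, millis: i32) -> i64 {
    // Mask to 32 bits so sign extension of a negative value cannot spill over.
    let d = (days as u64 & u32::MAX as u64) << 32;
    let m = millis as u64 & u32::MAX as u64;
    (d | m) as i64
}

/// Pack months, days and nanoseconds into one i128.
/// Assumed layout: months in bits 96..128, days in bits 64..96, nanos in bits 0..64.
fn create_month_day_nano(months: i32, days: i32, nanos: i64) -> i128 {
    let m = (months as u128 & u32::MAX as u128) << 96;
    let d = (days as u128 & u32::MAX as u128) << 64;
    let n = nanos as u128 & u64::MAX as u128;
    (m | d | n) as i128
}

/// Hypothetical inverse of `create_day_time`: returns (days, millis).
fn split_day_time(value: i64) -> (i32, i32) {
    // Truncating casts keep only the low 32 bits and restore the sign.
    ((value >> 32) as i32, value as i32)
}

/// Hypothetical inverse of `create_month_day_nano`: returns (months, days, nanos).
fn split_month_day_nano(value: i128) -> (i32, i32, i64) {
    ((value >> 96) as i32, (value >> 64) as i32, value as i64)
}
```

Masking each field before shifting is what keeps negative components from corrupting their neighbours, which is the main thing a test for these helpers would want to cover.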

@avantgardnerio
Contributor Author

avantgardnerio commented Jul 7, 2022

or perhaps https://docs.rs/arrow/16.0.0/arrow/compute/kernels/arithmetic/fn.add.html

Okay, I think I understand this. Ideally math_op would look something like this:

pub fn math_op<LT, RT, F>(
    left: &PrimitiveArray<LT>,
    right: &PrimitiveArray<RT>,
    op: F,
) -> Result<PrimitiveArray<LT>>
where
    LT: ArrowNumericType,
    RT: ArrowNumericType,
    F: Fn(LT::Native, RT::Native) -> LT::Native,
{

So that we could add an array of IntervalXXX to an array of DateXXX. Unfortunately, that results in combinatorial hell ATM:

error[E0277]: cannot add `i16` to `i8`
   --> arrow/src/compute/kernels/arithmetic.rs:741:51
    |
741 |         _ => typed_math_op!(left, right, |a, b| a + b),
    |                                                   ^ no implementation for `i8 + i16`
    |
    = help: the trait `Add<i16>` is not implemented for `i8`
    = help: the following other types implement trait `Add<Rhs>`:
              <&'a f32 as Add<Complex<f32>>>
              <&'a f32 as Add<f32>>
              <&'a f64 as Add<Complex<f64>>>
              <&'a f64 as Add<f64>>
              <&'a i128 as Add<BigInt>>
              <&'a i128 as Add<Complex<i128>>>
              <&'a i128 as Add<i128>>
              <&'a i16 as Add<BigInt>>
            and 176 others

🤔
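
The two-type idea itself can be illustrated without arrow. The following is a minimal sketch over plain slices of `Option` values, not the arrow-rs kernel: `math_op` here is a hypothetical stand-in that borrows the name from the signature above, and nulls are modelled as `None`. The point is that the caller supplies a closure for each concrete (left, right) pair, so no blanket `Add` implementation between the two native types is needed.

```rust
/// Hypothetical slice-based stand-in for a kernel with different left and
/// right element types. The result keeps the left element type.
fn math_op<L, R, F>(
    left: &[Option<L>],
    right: &[Option<R>],
    op: F,
) -> Result<Vec<Option<L>>, String>
where
    L: Copy,
    R: Copy,
    F: Fn(L, R) -> L,
{
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    // A null on either side produces a null in the output.
    Ok(left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => Some(op(*l, *r)),
            _ => None,
        })
        .collect())
}
```

For example, the left side could hold dates as `i32` days and the right side intervals as `i64` milliseconds, with a closure such as `|d: i32, ms: i64| d + (ms / 86_400_000) as i32` doing the conversion.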

@avantgardnerio
Contributor Author

I think I see how to PR this into the arrow kernels. I'll try that shortly.

@avantgardnerio
Contributor Author

What this is looking like when done inside arrow-rs: apache/arrow-rs@master...spaceandtimelabs:arrow-rs:bg_date_math

@alamb
Contributor

alamb commented Jul 8, 2022

@alamb , do you know where (in arrow-rs), I can PR these functions to:

I recommend https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/temporal.rs

@alamb
Contributor

alamb commented Jul 8, 2022

For what it is worth apache/arrow-rs@master...spaceandtimelabs:arrow-rs:bg_date_math looks pretty good to me 👍

Contributor

@alamb alamb left a comment

I think this is looking quite good -- thank you @avantgardnerio . I like the idea of getting the test coverage and then expanding support by adding kernels to arrow.

@@ -44,6 +44,7 @@ arrow = { version = "17.0.0", features = ["prettyprint"] }
blake2 = { version = "^0.10.2", optional = true }
blake3 = { version = "1.0", optional = true }
chrono = { version = "0.4", default-features = false }
chronoutil = "0.2.3"
Contributor

I am somewhat concerned about adding dependencies (and this particular crate seems to be maintained by a single person) -- however, since it is so small (and MIT licensed), we could also just inline the functions we care about.

Any thoughts @andygrove or @thinkharderdev ?

https://github.com/olliemath/chronoutil/blob/main/src/delta.rs

Contributor

I haven't reviewed the PR so am not sure how much of that crate is used, but I agree that if it is a small amount of code it is better to just copy the code rather than add another dependency.

Contributor Author

I inlined it in my latest push. If we want to go back to including it in Cargo.toml, it's an easy revert.

My $0.02 would be to leave it in Cargo.toml, so that we can get updates if the author fixes something, and so that RAT checkers can see declaratively that we depend on that code, rather than trying to use heuristics such as semantic code or binary diffs.

Contributor

@alamb alamb left a comment

I think this is good to go from my perspective. Does anyone else care to review?

cc @ovr this may be interesting / relevant to your project
cc @paddyhoran

@andygrove andygrove merged commit 5a63c87 into apache:master Jul 12, 2022
@avantgardnerio avantgardnerio deleted the bg_month_interval branch July 12, 2022 17:22
@alamb
Contributor

alamb commented Jul 12, 2022

🐱 🚀

Labels
core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions
Development

Successfully merging this pull request may close these issues.

Unable to work with month intervals
5 participants