-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for non-u64 types for Window Bound #3916
Conversation
…ada-ai/arrow-datafusion into feature/window_bound_scalar # Conflicts: # datafusion/common/src/lib.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @mustafasrepo -- these PRs are very impressive; I am very excited to see additional date/time functionality added to DataFusion and thus I try and prioritize reviewing such PRs.
One thing that would make it easier to review PRs such as this would be to break them up into multiple smaller PRs. The benefits of doing so are:
- decreases the contiguous focus time (hard to find some days) needed to review them
- increases the number of people who can review / merge them (we have many reviewers / committers, but only a few who feel comfortable reviewing a PR that touches so much of the codebase)
For example in this case multiple PRs could be:
- Update sqlparser-rs
- Update parser and logical planning (but error if anything other than a uint64 was passed)
- move the location of test utils / code to common
- physical planner / physical expr changes
The only question I have about this PR that I think needs to be addressed before merging is the type coercion (see comment inline)
impl_distinct_cases_op!($LHS, $RHS, $OPERATION) | ||
// Binary operations on arguments with different types: | ||
(ScalarValue::Date32(Some(days)), _) => { | ||
let value = date32_add(*days, $RHS, get_sign!($OPERATION))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is a generic macro I am surprised to see date32_add
-- this would likely be surprising to someone if they tried to use this macro for mul
or div
I think
}; | ||
} | ||
|
||
#[inline] | ||
pub fn date32_add(days: i32, scalar: &ScalarValue, sign: i32) -> Result<i32> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I would like to consolidate date / time arithmetic into a single location, and ideally that location is arrow-rs.
Thus I think we should be using the date math functions in arrow-rs -- specifically https://docs.rs/arrow/25.0.0/arrow/datatypes/struct.Date32Type.html
I see that this code is simply moved in this PR, but in general what do you think?
/// if `column_type` is Timestamp the result is casted to Interval type. The reason is that | ||
/// Operation between Timestamps is not meaningful, However operation between Timestamp and | ||
/// Interval is valid. For basic types `column_type` is indeed the resulting type. | ||
fn convert_to_column_type( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This operation is typically called "coercion" and is handled in datafusion for other expression types here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/type_coercion.rs#L18-L32 (and submodule).
Did you consider doing this conversion as part of coercion?
One reason to do it as part of the normal coercion is that then the proper types will be present for operations such as constant folding / constant propagation. This might allow for expressions like
COUNT(*) OVER (ORDER BY ts RANGE BETWEEN INTERVAL '1' DAY + INTERVAL '1' DAY PRECEDING AND INTERVAL '3 DAY' FOLLOWING),
Eventually
Thank you Andrew for your careful review and comments (as always). We will go through them and update the PR soon. |
What is your thinking regarding moving coercion of out physical planning and into the coercion pass? Is that something you plan to do? |
@mustafasrepo is trying to solidify his understanding of your suggestion. He will probably circle back tomorrow to make sure he gets what you meant right -- and then follow through if you guys are on the same page. He may ask for some clarifying directions of you on how to do this. |
Hi @alamb, I sent a commit moving coercion from physical planning to the optimizer type_coercion. Is this what you had in mind or I misunderstood what you meant? I can change the implementation according to your suggestions. Thanks for your feedback. |
This is exactly right -- thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR is ready to go -- thank you @mustafasrepo .
I will plan to merge it tomorrow unless anyone else would like to offer suggestions or needs more time to review it
Thanks again @mustafasrepo and @ozankabak -- this is epic work |
Benchmark runs are scheduled for baseline = 4e29835 and contender = 3940e36. 3940e36 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
* Support for non u64 values inside window frames * timestamp handling is added * get_rank removed * Tidy up datetime arithmetic, get rid of unwraps * new test for validity, during window frame creation is added * window_bound is changed to Expr (apache#6) * give fixed commit hash as dependency * remove locked flag * Simplify some frame bound conversion functions * Make things work the new sqlparser * Upgrade sqlparser version to 0.26 * Minor changes * type coercion for window frames moved to the optimizer. Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Support for non-u64 types for Window Bound
Which issue does this PR close?
Closes #3571 .
Rationale for this change
With this change we are adding support for queries like below.
What changes are included in this PR?
assert_contains
andassert_not_contains
macros are used in tests at different places. These macros were duplicated where they were used, we have moved them under https://github.com/synnada-ai/arrow-datafusion/blob/feature/window_bound_scalar/datafusion/common/src/test_util.rs to prevent code duplication..add
and.sub
api ofScalarValue
enum.ScalarValue
types. Arithmetic operations are done only between same types (fordatetime
types arithmetic operations are done only withINTERVAL
types.)during window frame creation. Since we are supporting more reach variants such as
it is not easy to do rejection during creation. We now do calculations for whether window frame is valid after physical schema information is available. Where we can do validation calculations with appropriate types. This doesn’t change anything for end users.
Expr
type. We have incorporated those changes. However, possibly they are written with some new capabilities in mind. We just did the bare minimum to support the current test suit.Are there any user-facing changes?
N.A