Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for non-u64 types for Window Bound #3916

Merged
merged 19 commits into from
Oct 26, 2022
Merged

Support for non-u64 types for Window Bound #3916

merged 19 commits into from
Oct 26, 2022

Conversation

mustafasrepo
Copy link
Contributor

@mustafasrepo mustafasrepo commented Oct 21, 2022

Support for non-u64 types for Window Bound

Which issue does this PR close?

Closes #3571 .

Rationale for this change

With this change we are adding support for queries like below.

SELECT 
	COUNT(*) OVER(ORDER BY ts RANGE BETWEEN '1 DAY' PRECEDING AND '1 DAY' FOLLOWING) as cnt 
FROM t

What changes are included in this PR?

SELECT
      COUNT(c1) OVER (ORDER BY c2 RANGE BETWEEN 1 PRECEDING AND 2 PRECEDING)
      FROM aggregate_test_100

during window frame creation. Since we are supporting more reach variants such as

SELECT 
	COUNT(*) OVER(ORDER BY ts RANGE BETWEEN '1 DAY' PRECEDING AND '1 MONTH' PRECEDING) as cnt 
FROM t

it is not easy to do rejection during creation. We now do calculations for whether window frame is valid after physical schema information is available. Where we can do validation calculations with appropriate types. This doesn’t change anything for end users.

  • sqlparser had also changes in Expr type. We have incorporated those changes. However, possibly they are written with some new capabilities in mind. We just did the bare minimum to support the current test suit.

Are there any user-facing changes?

N.A

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sql SQL Planner labels Oct 21, 2022
@mustafasrepo mustafasrepo changed the title Feature/window bound scalar Support for non-u64 types for Window Bound Oct 21, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @mustafasrepo -- these PRs are very impressive; I am very excited to see additional date/time functionality added to DataFusion and thus I try and prioritize reviewing such PRs.

One thing that would make it easier to review PRs such as this would be to break them up into multiple smaller PRs. The benefits of doing so are:

  1. decreases the contiguous focus time (hard to find some days) needed to review them
  2. increases the number of people who can review / merge them (we have many reviewers / committers, but only a few who feel comfortable reviewing a PR that touches so much of the codebase)

For example in this case multiple PRs could be:

  1. Update sqlparser-rs
  2. Update parser and logical planning (but error if anything other than a uint64 was passed)
  3. move the location of test utils / code to common
  4. physical planner / physical expr changes

The only question I have about this PR that I think needs to be addressed before merging is the type coercion (see comment inline)

cc @waitingkuo @avantgardnerio @ovr

datafusion/common/src/delta.rs Show resolved Hide resolved
impl_distinct_cases_op!($LHS, $RHS, $OPERATION)
// Binary operations on arguments with different types:
(ScalarValue::Date32(Some(days)), _) => {
let value = date32_add(*days, $RHS, get_sign!($OPERATION))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is a generic macro I am surprised to see date32_add -- this would likely be surprising to someone if they tried to use this macro for mul or div I think

};
}

#[inline]
pub fn date32_add(days: i32, scalar: &ScalarValue, sign: i32) -> Result<i32> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I would like to consolidate date / time arithmetic into a single location, and ideally that location is arrow-rs.

Thus I think we should be using the date math functions in arrow-rs -- specifically https://docs.rs/arrow/25.0.0/arrow/datatypes/struct.Date32Type.html

I see that this code is simply moved in this PR, but in general what do you think?

datafusion/common/src/scalar.rs Outdated Show resolved Hide resolved
datafusion/common/src/test_util.rs Show resolved Hide resolved
datafusion/expr/src/window_frame.rs Show resolved Hide resolved
datafusion/expr/src/window_frame.rs Outdated Show resolved Hide resolved
datafusion/expr/src/window_frame.rs Outdated Show resolved Hide resolved
/// if `column_type` is Timestamp the result is casted to Interval type. The reason is that
/// Operation between Timestamps is not meaningful, However operation between Timestamp and
/// Interval is valid. For basic types `column_type` is indeed the resulting type.
fn convert_to_column_type(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This operation is typically called "coercion" and is handled in datafusion for other expression types here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/type_coercion.rs#L18-L32 (and submodule).

Did you consider doing this conversion as part of coercion?

One reason to do it as part of the normal coercion is that then the proper types will be present for operations such as constant folding / constant propagation. This might allow for expressions like

                COUNT(*) OVER (ORDER BY ts RANGE BETWEEN INTERVAL '1' DAY + INTERVAL '1' DAY PRECEDING AND INTERVAL '3 DAY' FOLLOWING),

Eventually

@ozankabak
Copy link
Contributor

Thank you Andrew for your careful review and comments (as always). We will go through them and update the PR soon.

@alamb
Copy link
Contributor

alamb commented Oct 24, 2022

What is your thinking regarding moving coercion of out physical planning and into the coercion pass? Is that something you plan to do?

@ozankabak
Copy link
Contributor

ozankabak commented Oct 25, 2022

@mustafasrepo is trying to solidify his understanding of your suggestion. He will probably circle back tomorrow to make sure he gets what you meant right -- and then follow through if you guys are on the same page. He may ask for some clarifying directions of you on how to do this.

@github-actions github-actions bot added the optimizer Optimizer rules label Oct 25, 2022
@mustafasrepo
Copy link
Contributor Author

mustafasrepo commented Oct 25, 2022

What is your thinking regarding moving coercion of out physical planning and into the coercion pass? Is that something you plan to do?

Hi @alamb, I sent a commit moving coercion from physical planning to the optimizer type_coercion. Is this what you had in mind or I misunderstood what you meant? I can change the implementation according to your suggestions. Thanks for your feedback.

@alamb
Copy link
Contributor

alamb commented Oct 25, 2022

Hi @alamb, I sent a commit moving coercion from physical planning to the optimizer type_coercion. Is this what you had in mind or I misunderstood what you meant? I can change the implementation according to your suggestions. Thanks for your feedback.

This is exactly right -- thanks

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is ready to go -- thank you @mustafasrepo .

I will plan to merge it tomorrow unless anyone else would like to offer suggestions or needs more time to review it

@alamb alamb added the api change Changes the API exposed to users of the crate label Oct 25, 2022
@alamb alamb merged commit 3940e36 into apache:master Oct 26, 2022
@alamb
Copy link
Contributor

alamb commented Oct 26, 2022

Thanks again @mustafasrepo and @ozankabak -- this is epic work

@ursabot
Copy link

ursabot commented Oct 26, 2022

Benchmark runs are scheduled for baseline = 4e29835 and contender = 3940e36. 3940e36 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Dandandan pushed a commit to yuuch/arrow-datafusion that referenced this pull request Nov 5, 2022
* Support for non u64 values inside window frames

* timestamp handling is added

* get_rank removed

* Tidy up datetime arithmetic, get rid of unwraps

* new test for validity, during window frame creation is added

* window_bound is changed to Expr (apache#6)

* give fixed commit hash as dependency

* remove locked flag

* Simplify some frame bound conversion functions

* Make things work the new sqlparser

* Upgrade sqlparser version to 0.26

* Minor changes

* type coercion for window frames moved to the optimizer.

Co-authored-by: Mehmet Ozan Kabak <[email protected]>
@mustafasrepo mustafasrepo deleted the feature/window_bound_scalar branch November 28, 2022 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hardcoded u64 for WindowFrameBound fields
5 participants