Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom window frame #3475

Closed
wants to merge 5 commits into from
Closed

Custom window frame #3475

wants to merge 5 commits into from

Conversation

mustafasrepo
Copy link
Contributor

Datafusion First PR

export POSTGRES_DB=test
export POSTGRES_USER=postgres
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432

Which issue does this PR close?

We Implement Window Function Calls of Postgres and improve the situation on #360.

We started to contribute to Datafusion just now. Since we are creating a PR for the project for the first time, we would like to get your ideas to close this issue completely with a partial implementation for now. You can see which cases we cover in integration tests.

Rationale for this change

Datafusion did not support window call functions, it is on the roadmap.

What changes are included in this PR?

For now, we implemented ROWS and RANGE modes supporting PRECEDING and FOLLOWING.

As a draft, we currently do not support

  • GROUPS mode
  • Timestamp ranges i.e RANGE BETWEEN '1 day' PRECEDING AND '10 days' FOLLOWING
  • Frame exclusion, i.e EXCLUDE CURRENT ROW

Next steps

  • GROUPS mode implementation by extending calculate_current_window method.
  • Frame exclusion, by logic planner extension and adapting calculate_current_window method.

Observations

  • Since some aggregation functions only use f64, there are numerical problems with statistical aggregation functions like CORR(x,y), and they can be enhanced to support other DataTypes similar to SUM(x) aggregation.

    Also, evaluation() of the CovarianceAccumulator should be

    @ -374,12 +374,6 @@ impl Accumulator for CovarianceAccumulator {
      };
    
      if count <= 1 {
    -      return Err(DataFusionError::Internal(
    -          "At least two values are needed to calculate covariance".to_string(),
    -      ));
    -  }
    -
    -  if self.count == 0 {
          Ok(ScalarValue::Float64(None))
      } else {
          Ok(ScalarValue::Float64(Some(self.algo_const / count as f64)))
    

    /

    to become the same as PostgreSQL. However, these issues are separate from this PR. We did not use CORR(x,y) because of these problems.

  • Since the sorting is unstable, some queries output different results than PostgreSQL. We use only unique columns for ORDER BY clauses while testing ROWS mode.

An example query:

SELECT c2, c3,
   SUM(c2) OVER(ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation2,
   SUM(c3) OVER(ORDER BY c2 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation3,
   SUM(c3) OVER(ORDER BY c1 ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) as summation4
FROM test
LIMIT 10;

Outputs in Datafusion as

+----+-----+------------+------------+------------+
| c2 | c3  | summation2 | summation3 | summation4 |
+----+-----+------------+------------+------------+
| 1  | 12  | 2          | 132        | -13        |
| 1  | 120 | 3          | 203        | 263        |
| 1  | 71  | 4          | 118        | 447        |
| 1  | -85 | 5          | 154        | 37         |
| 1  | 36  | 5          | 43         | 358        |
| 1  | -99 | 5          | 48         | -81        |
| 1  | 125 | 5          | -31        | 247        |
| 1  | -8  | 5          | 111        | 215        |
| 1  | 57  | 5          | 3          | 238        |
| 1  | -72 | 5          | 140        | 247        |
+----+-----+------------+------------+------------+

and in PostgreSQL as

| c2  | c3  | summation2 | summation3 | summation4 |
| --- | --- | ---------- | ---------- | ---------- |
| 1   | -85 | 2          | -49        | -251       |
| 1   | 36  | 3          | 71         | 330        |
| 1   | 120 | 4          | 46         | 284        |
| 1   | -25 | 5          | 149        | -184       |
| 1   | 103 | 5          | 305        | -15        |
| 1   | 71  | 5          | 323        | 251        |
| 1   | 54  | 5          | 286        | 48         |
| 1   | 83  | 5          | 255        | 166        |
| 1   | -56 | 5          | 222        | -79        |
| 1   | 70  | 5          | 180        | -233       |
+-----+-----+------------+------------+------------+
  • There is a minor problem in the logical planner, it should run
SELECT 
 SUM(c2) OVER(ORDER BY c5, c6 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum
FROM test;

without a problem, however, it produces

Error: Plan("With window frame of type RANGE, the order by expression must be of length 1, got 2")

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sql SQL Planner labels Sep 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants