Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ORDER BY an aliased column #5067

Merged
merged 5 commits into from
Feb 6, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 25, 2023

Draft:

Which issue does this PR close?

Closes #4854

Rationale for this change

Column resolution for ORDER BY expressions are incorrect

What changes are included in this PR?

  1. Make the output rewrite general (they were incorrect for projections above aggregates)
  2. Add comments
  3. Add tests

Are these changes tested?

Yes,

Are there any user-facing changes?

bug fix

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jan 25, 2023
@alamb alamb force-pushed the alamb/support_groupby_orderby branch 3 times, most recently from 2cd0a0c to dbbbc25 Compare January 27, 2023 19:08
@@ -36,6 +36,12 @@ use datafusion_sql::planner::{ContextProvider, ParserOptions, SqlToRel};

use rstest::rstest;

#[cfg(test)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to enable debug logging in the sql tests

/// Remember that:
/// 1. given a projection with exprs: [a, b + c]
/// 2. t produces an output schema with two columns "a", "b + c"
fn rewrite_in_terms_of_projection(expr: Expr, projection: &Projection) -> Result<Expr> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the fix -- it has slightly different semantics than the raw aggregate case.

/// Note The SQL planner always puts a `Projection` at the output of
/// an aggregate, the other paths such as LogicalPlanBuilder can
/// create a Sort directly above an Aggregate
fn rewrite_in_terms_of_aggregate(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the same logic, just extracted into a function and using rewrite_expr to make it easier to read

@alamb
Copy link
Contributor Author

alamb commented Jan 27, 2023

Ok -- I think this PR is ready to review once the other PRs it depends on are ready

@alamb alamb force-pushed the alamb/support_groupby_orderby branch from 907964b to 6d85c1f Compare January 30, 2023 21:33
aggr_expr,
distinct_group_exprs: &distinct_group_exprs,
})
/// Rewrites a sort expression in terms of the output of the previous [`LogicalPlan`]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new more general algorithm that works for LogicalPlan::Projection and LogicalPlan::Aggregation

// if that doesn't work, try to match the expression as an
// output column -- however first it must be "normalized"
// (e.g. "c1" --> "t.c1") because that normalization is done
// at the input of the aggregate.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand why this is needed, but several tests fail if this logic is removed and the previous logic did it as well

Copy link
Member

@jackwener jackwener Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe(Just to offer a possibility) it's related with qualifier.

In the comment:

/// 1. given a projection with exprs: [a, b + c]
/// 2. t produces an output schema with two columns "a", "b + c"

But in fact, in the plan, [a, b + c] can be [table1.a, "table2.b+c"].
output schema also may be with qualifier like above.
Due to qualifier, some equations are not true, like 'a' == 't1.a' will be false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in fact, in the plan, [a, b + c] can be [table1.a, "table2.b+c"].

Yes I think you are correct. I think what I found was confusing is that the qualifier is added "on demand" in several places and there isn't a clear cut line between "as written" and "qualified"

@@ -136,24 +162,24 @@ mod test {

let cases = vec![
TestCase {
desc: "c1 --> t.c1",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why direct aggregate creation doesn't need this rewrite anymore. All the other tests in DataFusion CI pass and I wrote the unit tests last week to document the existing behavior in #5088

If reviewers prefer to keep the old behavior here, that is easy as the first commit in this PR, 0c16ef1 actually keeps all these tests passing with the existing "aggregate" code.

// expect that this is not an ambiguous reference
let expected =
"Sort: birth_date ASC NULLS LAST\
\n Projection: AVG(person.age) AS value, datetrunc(Utf8(\"month\"), person.birth_date) AS birth_date\
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shows the plan is using the expression datetrunc rather than the original birth_date grouping column

@alamb alamb marked this pull request as ready for review January 31, 2023 10:50
@alamb
Copy link
Contributor Author

alamb commented Jan 31, 2023

This PR is now ready for review.

I don't know if @houqp or @andygrove have some time to review it too as I think they were involved with the order by / group by rewrites

Comment on lines 2318 to 2322
fn select_groupby_orderby() {
let sql = r#"SELECT
avg(age) AS "value",
date_trunc('month', birth_date) AS "birth_date"
FROM person GROUP BY birth_date ORDER BY birth_date;
Copy link
Member

@jackwener jackwener Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest add more test about qualified column, I'm worried that it may cause some problem.
Because I also try to fix this problem, but meet some barrier about qualified column.

such as use date_trunc('month', person.birth_date) AS "birth_date"
instead of
date_trunc('month', birth_date) AS "birth_date"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good idea -- Added in ba563ad

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review @jackwener

Comment on lines 2318 to 2322
fn select_groupby_orderby() {
let sql = r#"SELECT
avg(age) AS "value",
date_trunc('month', birth_date) AS "birth_date"
FROM person GROUP BY birth_date ORDER BY birth_date;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good idea -- Added in ba563ad

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very nice job🚀 I reviewed this PR intermittently for a few days.
Look great to me.
cc @mingmwang @liukun4515

@alamb
Copy link
Contributor Author

alamb commented Feb 5, 2023

Thank you for the review @jackwener

I will plan to merge this PR tomorrow (in about 24 hours) unless anyone else would like time to review

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 🥳

@Dandandan Dandandan merged commit 6eb0e36 into apache:master Feb 6, 2023
@ursabot
Copy link

ursabot commented Feb 6, 2023

Benchmark runs are scheduled for baseline = b09c1ee and contender = 6eb0e36. 6eb0e36 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb alamb deleted the alamb/support_groupby_orderby branch February 6, 2023 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can not ORDER BY an aliased group column
4 participants