Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can not ORDER BY an aliased group column #4854

Closed
alamb opened this issue Jan 8, 2023 · 12 comments · Fixed by #5067
Closed

Can not ORDER BY an aliased group column #4854

alamb opened this issue Jan 8, 2023 · 12 comments · Fixed by #5067
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@alamb
Copy link
Contributor

alamb commented Jan 8, 2023

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
using datafusion-cli load some data:

❯ create or replace table t as select column1 as value, column2 as time from (select * from (values
  (1, timestamp '2022-01-01 00:00:00'),
  (2, timestamp '2022-01-01 01:00:00'),
  (3, timestamp '2022-01-02 00:00:00')
) as sq) as sq
;

0 rows in set. Query took 0.001 seconds.
❯ select *  from t;
+-------+---------------------+
| value | time                |
+-------+---------------------+
| 1     | 2022-01-01T00:00:00 |
| 2     | 2022-01-01T01:00:00 |
| 3     | 2022-01-02T00:00:00 |
+-------+---------------------+
3 rows in set. Query took 0.001 seconds.
❯ ;

GROUP BY "time" works:

select
  sum(value) AS "value",
  date_trunc('hour',time) AS "time"
FROM t
GROUP BY time;
+-------+---------------------+
| value | time                |
+-------+---------------------+
| 3     | 2022-01-02T00:00:00 |
| 2     | 2022-01-01T01:00:00 |
| 1     | 2022-01-01T00:00:00 |
+-------+---------------------+
3 rows in set. Query took 0.002 seconds.

However, GROUP BY "time" ORDER BY "time" does not work:

select
  sum(value) AS "value",
  date_trunc('month',time) AS "time"
FROM t
GROUP BY time
ORDER BY time;

SchemaError(AmbiguousReference { qualifier: Some("t"), name: "time" })

Expected behavior
I expect the query to run and produce ordered results like postgres:

postgres=# select
  sum(value) AS "value",
  date_trunc('month',time) AS "time"
FROM t
GROUP BY time
ORDER BY time;
 value |        time         
-------+---------------------
     3 | 2022-01-01 00:00:00
     2 | 2022-01-01 00:00:00
     1 | 2022-01-01 00:00:00
(3 rows)

Additional context

Note I was confused about why this query wasn't actually aggregating on the truncated date (as in why are there 3 output rows rather than 1 output row). The reason is that the GROUP BY time is (correctly) evaluated before the date_trunc function evaluated in the select list.

To group by the truncated date, you need to group by date_trunc('month',time) AS "time" , which you can do using 2

select
  sum(value) AS "value",
  date_trunc('month',time) AS "time"
FROM t
GROUP BY 2
ORDER BY 2;
+-------+---------------------+
| value | time                |
+-------+---------------------+
| 6     | 2022-01-01T00:00:00 |
+-------+---------------------+
1 row in set. Query took 0.005 seconds.
@alamb alamb added bug Something isn't working help wanted Extra attention is needed labels Jan 8, 2023
@alamb alamb changed the title can not ORDER BY a an aliased group column Can not ORDER BY an aliased group column Jan 8, 2023
@crepererum
Copy link
Contributor

Note that ORDER BY uses the columns generated by GROUP BY, NOT the original columns. This can be checked w/ postgres:

postgres=# select
  -sum(value) AS "value",
  date_trunc('month',time) AS "time"
FROM t
GROUP BY time
ORDER BY value;
 value |        time         
-------+---------------------
    -3 | 2022-01-01 00:00:00
    -2 | 2022-01-01 00:00:00
    -1 | 2022-01-01 00:00:00
(3 rows)

Also note that columns that are NOT part of the GROUP BY output cannot be used in ORDER BY:

postgres=# select
  date_trunc('month',time) AS "time"
FROM t                              
GROUP BY time
ORDER BY value;
ERROR:  column "t.value" must appear in the GROUP BY clause or be used in an aggregate function
LINE 5: ORDER BY value;

Looking at the current LogicalPlanBuilder::sort code it seems that the "missing column"-handling or the rewrite_sort_cols_by_aggs bit in there are somewhat broken.

@alamb
Copy link
Contributor Author

alamb commented Jan 10, 2023

cc @jackwener who may have some ideas

FWIW the logical order of operations is like this (from https://learnsql.com/blog/sql-order-of-operation)

image

So as you are describing yes the columns which are available to ORDER are the output of the GROUP BY -- so that means either GROUP BY columns or aggregates (or some expression of them)

For example, this is valid:

SELECT a,b, sum(c)
FROM foo
GROUP BY a, b
ORDER BY a + b; -- note it is order by a+b (an expression of the group columns)
-- also note the ORDER BY can not contain `c`

@jackwener
Copy link
Member

jackwener commented Jan 11, 2023

I debug this bug.

Before apply Sort, the plan is:

Projection: SUM(t.value) AS value, datetrunc(Utf8("month"), t.time) AS time
  Aggregate: groupBy=[[t.time]], aggr=[[SUM(t.value)]]
    TableScan: t

Error in rewrite_sort_cols_by_aggs

@jackwener
Copy link
Member

jackwener commented Jan 11, 2023

I have found this BUG.

It's because rewrite_sort_col_by_aggs. When meet LogicalProject(), it will go through it, and continue find Aggregate.
So, it will bind sort time -> sort t.time. When meet sum(value) AS "value" in projection, it cause AmbiguousReference.

@crepererum
Copy link
Contributor

I came up with a similar conclusion, but honestly wasn't able to puzzle the pieces together and figure out which part exactly should be changed. I leave the fix to @jackwener then 😊

@jackwener
Copy link
Member

I will try to fix it.🚀

@jackwener
Copy link
Member

So sorry for late do it. Recent I am busy working🥲.
I've been investigating this for a while. But I find It isn't easy thing.

image

In standard sql, order by is after project. BUT, in fact, reality is not so.
order by is from project.

but sometime order by is from child of project. For example:

spark-sql> explain extended select uuid from hudi_test order by price+1;
== Parsed Logical Plan ==
'Sort [('price + 1) ASC NULLS FIRST], true
+- 'Project ['uuid]
   +- 'UnresolvedRelation [hudi_test], [], false

== Analyzed Logical Plan ==
uuid: int
Project [uuid#46]
+- Sort [(price#48 + cast(1 as double)) ASC NULLS FIRST], true
   +- Project [uuid#46, price#48]
      +- SubqueryAlias spark_catalog.default.hudi_test
         +- Relation default.hudi_test[_hoodie_commit_time#41,_hoodie_commit_seqno#42,_hoodie_record_key#43,_hoodie_partition_path#44,_hoodie_file_name#45,uuid#46,name#47,price#48] parquet

we can find order by is before project.

So, when exist aggregation. like above

select
  sum(value) AS "value",
  date_trunc('month',time) AS "time"
FROM t
GROUP BY time
ORDER BY time;

two condition: orderby-project-agg / project-orderby-agg.

In this sql, order-by: time, time exist in agg, so datafusion use project-orderby-agg, but in fact, time is from project.

We should prefer to use orderby-project-agg and if fail then use project-orderby-agg

@alamb
Copy link
Contributor Author

alamb commented Jan 23, 2023

Hi @jackwener -- you are right this is tricky. I think the correct semantics are to resolve the ORDER BY in terms of the output schema of the stage (not the output of the select list)

spark-sql> explain extended select uuid from hudi_test order by price+1;

As I recall the way postgres handled this case was to add price+1 to the select list but mark it "hidden" so it was removed from the final select list

Maybe we can also special case using the postgres model of "resolve using select list" and if that is not possible, try and "pull" up relevant columns through the Projection. For example, if the input Plan to order_by https://github.com/apache/arrow-datafusion/blob/master/datafusion/sql/src/query.rs#L144 was

Projection(column1)
  TableScan(column1, column2)

And the sort was by column2:

Today the code will try to put the sort above the projection:

Sort(column2) <-- will error / resolve to the incorrect alias
  Projection(column1)
    TableScan(column1, column2)

Perhaps we could put the Sort below the Projection like

Projection(column1)
  Sort(coumn2)
    TableScan(column1, column2)

🤔

I can take a shot at doing this if you wanted. A few of our users have hit this so I am incentivized to try and help this

Here are some other examples from postgres

postgres=# create table foo as values (1, 2), (3, 4), (5, 6);
SELECT 3
postgres=# select * from foo;
 column1 | column2 
---------+---------
       1 |       2
       3 |       4
       5 |       6
(3 rows)


postgres=# select column1 from foo order by column2;
 column1 
---------
       1
       3
       5
(3 rows)

postgres=# select column1 from foo order by column1 + column2;
 column1 
---------
       1
       3
       5
(3 rows)

But for trickier stuff postgres requires the expressions to appear directly in the select list

postgres=# select distinct column1 from foo order by column1 + column2;
ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: select distinct column1 from foo order by column1 + column2;

@jackwener
Copy link
Member

I can take a shot at doing this if you wanted.

Thanks you❤️

@alamb
Copy link
Contributor Author

alamb commented Jan 25, 2023

👍 @jackwener I plan on working on this later today

@alamb
Copy link
Contributor Author

alamb commented Jan 26, 2023

I am making progress on this ticket -- I will provide a better update tomorrow.

@alamb
Copy link
Contributor Author

alamb commented Jan 27, 2023

I believe I have a solution here #5067 (but I broke out some mechanical changes in #5092 and #5088 which need to be reviewed first)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants