Can not ORDER BY an aliased group column #4854
Comments
Note that
postgres=# select
-sum(value) AS "value",
date_trunc('month',time) AS "time"
FROM t
GROUP BY time
ORDER BY value;
value | time
-------+---------------------
-3 | 2022-01-01 00:00:00
-2 | 2022-01-01 00:00:00
-1 | 2022-01-01 00:00:00
(3 rows)
Also note that columns that are NOT part of the
postgres=# select
date_trunc('month',time) AS "time"
FROM t
GROUP BY time
ORDER BY value;
ERROR: column "t.value" must appear in the GROUP BY clause or be used in an aggregate function
LINE 5: ORDER BY value;
Looking at the current |
cc @jackwener who may have some ideas
FWIW the logical order of operations is like this (from https://learnsql.com/blog/sql-order-of-operation)
So as you are describing yes the columns which are available to ORDER are the output of the GROUP BY -- so that means either
For example, this is valid:
SELECT a,b, sum(c)
FROM foo
GROUP BY a, b
ORDER BY a + b; -- note it is order by a+b (an expression of the group columns)
-- also note the ORDER BY can not contain `c` |
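The valid query above can be tried with a small sketch like the one below. It uses Python's built-in `sqlite3` module only as a convenient stand-in engine; the rows in `foo` are made up for illustration, and SQLite is more permissive than postgres about columns outside the GROUP BY, so this only demonstrates the accepted case.

```python
import sqlite3

# Hypothetical contents for the `foo` table from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    [(5, 1, 10), (1, 1, 20), (1, 1, 30), (2, 1, 40)],
)

# ORDER BY an expression built only from the GROUP BY columns.
rows = conn.execute(
    "SELECT a, b, sum(c) FROM foo GROUP BY a, b ORDER BY a + b"
).fetchall()
print(rows)
```

Each group is ordered by its own `a + b`, which is well defined because both columns are part of the group key.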
I debugged this bug. Before apply
Projection: SUM(t.value) AS value, datetrunc(Utf8("month"), t.time) AS time
  Aggregate: groupBy=[[t.time]], aggr=[[SUM(t.value)]]
    TableScan: t
Error in |
I have found this BUG. It's because |
I came up with a similar conclusion, but honestly wasn't able to puzzle the pieces together and figure out which part exactly should be changed. I leave the fix to @jackwener then 😊 |
I will try to fix it.🚀 |
So sorry for getting to this late. I have been busy with work recently🥲. In standard sql, but sometime
spark-sql> explain extended select uuid from hudi_test order by price+1;
== Parsed Logical Plan ==
'Sort [('price + 1) ASC NULLS FIRST], true
+- 'Project ['uuid]
   +- 'UnresolvedRelation [hudi_test], [], false
== Analyzed Logical Plan ==
uuid: int
Project [uuid#46]
+- Sort [(price#48 + cast(1 as double)) ASC NULLS FIRST], true
   +- Project [uuid#46, price#48]
      +- SubqueryAlias spark_catalog.default.hudi_test
         +- Relation default.hudi_test[_hoodie_commit_time#41,_hoodie_commit_seqno#42,_hoodie_record_key#43,_hoodie_partition_path#44,_hoodie_file_name#45,uuid#46,name#47,price#48] parquet
we can find
So, when exist
select
sum(value) AS "value",
date_trunc('month',time) AS "time"
FROM t
GROUP BY time
ORDER BY time;
there are two cases: orderby-project-agg / project-orderby-agg. In this SQL, we should prefer to use |
Hi @jackwener -- you are right, this is tricky. I think the correct semantics are to resolve the ORDER BY in terms of the output schema of the stage (not the output of the select list).
As I recall the way postgres handled this case was to add
Maybe we can also special case using the postgres model of "resolve using select list" and if that is not possible, try and "pull" up relevant columns through the Projection. For example, if the input Plan to
And the sort was by
Today the code will try to put the sort above the projection:
Perhaps we could put the Sort below the Projection like
🤔 I can take a shot at doing this if you wanted. A few of our users have hit this so I am incentivized to try and help with this.
Here are some other examples from postgres:
postgres=# create table foo as values (1, 2), (3, 4), (5, 6);
SELECT 3
postgres=# select * from foo;
column1 | column2
---------+---------
1 | 2
3 | 4
5 | 6
(3 rows)
postgres=# select column1 from foo order by column2;
column1
---------
1
3
5
(3 rows)
postgres=# select column1 from foo order by column1 + column2;
column1
---------
1
3
5
(3 rows)
But for trickier stuff postgres requires the expressions to appear directly in the select list:
postgres=# select distinct column1 from foo order by column1 + column2;
ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: select distinct column1 from foo order by column1 + column2; |
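The difference between placing the Sort above or below the Projection, as discussed in the comment above, can be sketched with a toy model. This is plain Python, not DataFusion code: `project` and `sort` are hypothetical stand-ins for the plan nodes, and the rows are made up so that the effect of the sort is visible.

```python
# Hypothetical rows for a table shaped like `foo` above.
foo = [
    {"column1": 1, "column2": 6},
    {"column1": 3, "column2": 4},
    {"column1": 5, "column2": 2},
]


def project(rows, columns):
    """Keep only the named columns (stand-in for a Projection node)."""
    return [{c: r[c] for c in columns} for r in rows]


def sort(rows, key):
    """Order rows by one column (stand-in for a Sort node)."""
    return sorted(rows, key=lambda r: r[key])


# Sort above Projection: column2 has already been dropped, so the sort
# cannot resolve it.
try:
    sort(project(foo, ["column1"]), "column2")
    sort_above_ok = True
except KeyError:
    sort_above_ok = False

# Sort below Projection: sort on the input schema, then drop column2.
sort_below = project(sort(foo, "column2"), ["column1"])
print(sort_above_ok, sort_below)
```

In this model the second shape is what lets `select column1 from foo order by column2` succeed even though `column2` is not in the select list.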
Thank you❤️ |
👍 @jackwener I plan on working on this later today |
I am making progress on this ticket -- I will provide a better update tomorrow. |
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Using datafusion-cli, load some data:
GROUP BY "time"
works:
However,
GROUP BY "time" ORDER BY "time"
does not work:
Expected behavior
I expect the query to run and produce ordered results like postgres:
Additional context
Note I was confused about why this query wasn't actually aggregating on the truncated date (as in why are there 3 output rows rather than 1 output row). The reason is that the GROUP BY time is (correctly) evaluated before the date_trunc function evaluated in the select list.
To group by the truncated date, you need to group by date_trunc('month',time) AS "time", which you can do using 2
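Why grouping on the raw column yields 3 rows while grouping on the truncated value yields 1 can be sketched in plain Python. The timestamps below are assumptions (three distinct instants in January 2022, consistent with the output shown earlier in the thread), and `month_start` and `sum_by` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows of table `t`: (time, value).
t = [
    (datetime(2022, 1, 1), 1),
    (datetime(2022, 1, 2), 2),
    (datetime(2022, 1, 3), 3),
]


def month_start(ts):
    """Rough equivalent of date_trunc('month', ts)."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def sum_by(rows, key):
    """Sum `value` per group, where `key` maps a timestamp to its group."""
    totals = defaultdict(int)
    for ts, value in rows:
        totals[key(ts)] += value
    return dict(totals)


# GROUP BY time: group on the raw column first, truncate afterwards in the
# select list. The three distinct timestamps stay in three groups.
by_raw_time = sum_by(t, lambda ts: ts)
projected = [(value, month_start(ts)) for ts, value in by_raw_time.items()]

# GROUP BY the truncated expression: all three rows share one group.
by_month = sum_by(t, month_start)
print(len(projected), by_month)
```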