-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL: Allow sorting of groups by aggregates #38042
Conversation
Introduce client-side, custom sorting of groups based on aggregate functions. To allow this, the Analyzer has been extended to push down to underlying Aggregate, aggregate function and the Querier has been extended to identify the case and consume the results in order and sort them based on the given columns. The underlying QueryContainer has been slightly modified to allow a view of the underlying values being extracted as the columns used for sorting might not be requested by the user. The PR also adds minor tweaks, mainly related to tree output. Close elastic#35118
Pinging @elastic/es-search |
There are still some things that need to be added in a future PR, mainly:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, Really important feature!
Left some comments and additionally:
- Please add tests with ordering on multiple columns (tie breaker)
- Please add tests with ordering using number indexes (e.g.: ORDER BY 1, 2)
- Please add tests with descending ordering
- Please remove the paragraph from the limitations doc page.
aggWithAlias | ||
SELECT MAX(salary) AS m FROM test_emp GROUP BY gender ORDER BY m; | ||
|
||
multipleAggsThatGetRewrittenWithoutAlias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean Rewritten
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max and avg and count get rewritten/replaced with a Stats agg.
} | ||
|
||
// 2. find first Aggregate child and update it | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove empty line
@@ -310,11 +310,31 @@ private static boolean checkGroupByOrder(LogicalPlan p, Set<Failure> localFailur | |||
Aggregate a = (Aggregate) child; | |||
|
|||
Map<Expression, Node<?>> missing = new LinkedHashMap<>(); | |||
|
|||
// track aggs and non-aggs - to keep the final modifier, use an array | |||
final Expression[] aggAndNonAgg = new Expression[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a boolean[]
instead here, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because if found, the expression is used in the validation message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, missed that part
|
||
// 2. find first Aggregate child and update it | ||
|
||
final AtomicBoolean found = new AtomicBoolean(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is Atomic needed here? is there some kind of concurrency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not it's an utility class for storing state outside a lamba expression. Since the class is already part of the JDK (so it's likely to be inlined) it's quite convenient. The alternative is to create our own Holder class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it thx!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would opt for a boolean[] found = new boolean[] {false}
instead, to avoid confusion and because AtomicBoolean has special semantics for the compiler to achieve the atomic operations in CPU level which is a bit too much, imho, just for convenience.
/** | ||
* Listener used for local sorting (typically due to aggregations used inside `ORDER BY`). | ||
* | ||
* This listener consumes the whole result set, sorts it in memory then send the paginates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...then sends the paginated results back to the client
// if the sort points to an agg, consider it only if there's grouping | ||
if (ob.child() instanceof Aggregate) { | ||
Aggregate a = (Aggregate) ob.child(); | ||
AtomicBoolean foundAggregate = new AtomicBoolean(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here too? why AtomicBoolean?
Querier scroller = new Querier(session.client(), session.configuration()); | ||
scroller.query(Rows.schema(output), queryContainer, index, listener); | ||
Querier scroller = new Querier(session); | ||
// List<ExpressionId> ids = queryContainer.columns(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover?
public void testGroupByOrderByNonKey() { | ||
assertEquals("1:52: Cannot order by non-grouped column [a], expected [bool]", | ||
error("SELECT AVG(int) a FROM test GROUP BY bool ORDER BY a")); | ||
// public void testGroupByOrderByNonKey() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover, please remove.
@@ -64,22 +64,22 @@ public void testExplainWithWhere() throws IOException { | |||
assertThat(readLine(), startsWith("----------")); | |||
assertThat(readLine(), startsWith("With[{}]")); | |||
assertThat(readLine(), startsWith("\\_Project[[?*]]")); | |||
assertThat(readLine(), startsWith(" \\_Filter[i = 2#")); | |||
assertThat(readLine(), startsWith(" \\_UnresolvedRelation[[][index=test],null,Unknown index [test]]")); | |||
assertThat(readLine(), startsWith(" \\_Filter[Equals[?i")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and in the other places: Can't we also still check the 2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can for unresolved items but for resolved ones not really since the tests are ran in random order and the ids change between runs. Fwiw, the final generated query is tested and the value checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important at all but can't we use endsWith
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because 2 is a literal with its own id.
// Custom sorting/ordering on aggregates | ||
// | ||
|
||
countWithImplicitGroupBy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find the value of having all these different cases without group by, as only 1 value is returned and the ORDER BY doesn't really apply, or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed but there's a dedicated rule in the optimizer for this case that excludes the order (as there's no physical group for it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff. Left few minor comments
@@ -310,11 +310,31 @@ private static boolean checkGroupByOrder(LogicalPlan p, Set<Failure> localFailur | |||
Aggregate a = (Aggregate) child; | |||
|
|||
Map<Expression, Node<?>> missing = new LinkedHashMap<>(); | |||
|
|||
// track aggs and non-aggs - to keep the final modifier, use an array | |||
final Expression[] aggAndNonAgg = new Expression[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't have been more clear/easier to manage with two separate Expression
s named accordingly? (ie agg/nonAgg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so because one would have two arrays - this is really a hack for the setting the state outside a lamba expression. I'll probably go for a Holder
class which be more meaningful in intent.
/** | ||
* Listener used for local sorting (typically due to aggregations used inside `ORDER BY`). | ||
* | ||
* This listener consumes the whole result set, sorts it in memory then send the paginates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"send AND paginates..." you meant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"sends the paginated" response. Fixed.
LocalAggregationSorterListener(ActionListener<SchemaRowSet> listener, List<Tuple<Integer, Comparator>> sortingColumns, int limit) { | ||
this.listener = listener; | ||
|
||
this.data = new PriorityQueue<Tuple<List<?>, Integer>>(Math.max(limit, 100)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extract the 100
value as a constant to be easier to change in the future? (very likely we will change it or allow the user to configure a default value)
Querier scroller = new Querier(session.client(), session.configuration()); | ||
scroller.query(Rows.schema(output), queryContainer, index, listener); | ||
Querier scroller = new Querier(session); | ||
// List<ExpressionId> ids = queryContainer.columns(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I've removed them
} | ||
|
||
public void testGroupByOrderByFunctionOverKey() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why removing this method? Why not only adding the new one?
@@ -312,14 +312,19 @@ public void testUnsupportedTypeInOrder() { | |||
error("SELECT * FROM test ORDER BY unsupported")); | |||
} | |||
|
|||
public void testGroupByOrderByNonKey() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why removing this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a bit of reworking.
One idea for a test (from the odbc tests): |
|
assertEquals("1:44: Cannot order by non-grouped column [MAX(int)], expected [int]", | ||
error("SELECT int FROM test GROUP BY int ORDER BY MAX(int)")); | ||
public void testGroupByOrderByAggAndGroupedColumn() { | ||
assertEquals("1:49: Cannot order by aggregated [MAX(int)] and non-aggregated [int] columns" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test where ORDER BY
has more than two elements?
I would also add the |
@matriv see number 2) in my initial comment. |
@costin yep read the comment, I was just thinking to have like an autonomous complete PR. But I'm totally fine to add the proper docs once this bit is figured out and implemented correctly. Personally, if no LIMIT is provided by the user, I'd favour to have a documented implicit limit rather than exhausting available memory and throw an Exception. On the other hand if the latter can implemented easily with the use of a CircuitBreaker maybe it's a nice option too. |
I've created a separate issue for this since it needs backporting to 6.x. It's not just the message but also the fact that the Analyzer rule throws an exception (with an incorrect message). |
In case the LIMIT is not specified, the hard limit (256) is enforced however the moment an entry is dropped, an exception is thrown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Great feature.
@elasticmachine run elasticsearch-ci/packaging-sample |
Introduce client-side sorting of groups based on aggregate functions. To allow this, the Analyzer has been extended to push down to underlying Aggregate, aggregate function and the Querier has been extended to identify the case and consume the results in order and sort them based on the given columns. The underlying QueryContainer has been slightly modified to allow a view of the underlying values being extracted as the columns used for sorting might not be requested by the user. The PR also adds minor tweaks, mainly related to tree output. Close elastic#35118 (cherry picked from commit 783c9ed)
Introduce client-side sorting of groups based on aggregate functions. To allow this, the Analyzer has been extended to push down to underlying Aggregate, aggregate function and the Querier has been extended to identify the case and consume the results in order and sort them based on the given columns. The underlying QueryContainer has been slightly modified to allow a view of the underlying values being extracted as the columns used for sorting might not be requested by the user. The PR also adds minor tweaks, mainly related to tree output. Close #35118 (cherry picked from commit 783c9ed)
Introduce client-side, custom sorting of groups based on aggregate
functions. To allow this, the Analyzer has been extended to push down
to underlying Aggregate, aggregate function and the Querier has been
extended to identify the case and consume the results in order and sort
them based on the given columns.
The underlying QueryContainer has been slightly modified to allow a view
of the underlying values being extracted as the columns used for sorting
might not be requested by the user.
The PR also adds minor tweaks, mainly related to tree output.
Close #35118