-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL: Fix ORDER BY on aggregates and GROUPed BY fields #51894
Conversation
Previously, in the in-memory sorting module `LocalAggregationSorterListener` only the aggregate functions where used (grabbed by the `sortingColumns`). As a consequence, if the ORDER BY was also using columns of the GROUP BY clause, (especially in the case of higher priority - before the aggregate functions) wrong results were produced. E.g.: ``` SELECT gender, MAX(salary) AS max FROM test_emp GROUP BY gender ORDER BY gender, max ``` Add all columns of the ORDER BY to the `sortingColumns` so that the `LocalAggregationSorterListener` can use the correct comparators in the underlying PriorityQueue used to implement the in-memory sorting. Fixes: elastic#50355
Pinging @elastic/es-search (:Search/SQL) |
I'm thinking that the only optimisation we can do is maybe skip the columns after the last aggregate functions in the ORDER BY, For example in
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments.
comp = s.missing() == Sort.Missing.FIRST ? Comparator.nullsFirst(comp) : Comparator.nullsLast(comp); | ||
|
||
tuple = new Tuple<>(Integer.valueOf(atIndex), comp); | ||
customSort = Boolean.TRUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not breaking early, if there's the first AggregateSort found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot break after the first AggregateSort. Maybe we could break after the last AggregateSort.
If we have:
SELECT f1, f2, f3, MAX(f4) as max, MIN(f5) as min
FROM test
GROUP BY f1, f2, f3
ORDER BY f1, max, f2, min, f3
we cannot break after max
, we could break after min
.
I'd rather leave the fix as is and introduce this optimisation in a separate PR where it's properly tested that it works.
(Needs some carefully chosen data set to test this ordering case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. That's a simple loop that, when finds an AggregateSort, will set customSort
to TRUE
. It doesn't really matter what's after in the list of sort
s because it doesn't change the value of customSort
.
Also, I meant breaking from inside the loop not from the method...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, misunderstood you there, sure we should break once the 1st is found.
break; | ||
} | ||
} | ||
if (atIndex==-1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
atIndex == -1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow I broke the formatting :(
throw new SqlIllegalArgumentException("Cannot find backing column for ordering aggregation [{}]", s); | ||
} | ||
// assemble a comparator for it | ||
Comparator comp = s.direction()==Sort.Direction.ASC ? Comparator.naturalOrder():Comparator.reverseOrder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparator comp = s.direction() == Sort.Direction.ASC ? Comparator.naturalOrder() : Comparator.reverseOrder();
} | ||
// assemble a comparator for it | ||
Comparator comp = s.direction()==Sort.Direction.ASC ? Comparator.naturalOrder():Comparator.reverseOrder(); | ||
comp = s.missing()==Sort.Missing.FIRST ? Comparator.nullsFirst(comp):Comparator.nullsLast(comp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comp = s.missing() == Sort.Missing.FIRST ? Comparator.nullsFirst(comp) : Comparator.nullsLast(comp);
@@ -8,13 +8,22 @@ | |||
import java.util.Objects; | |||
|
|||
public class ScoreSort extends Sort { | |||
public ScoreSort(Direction direction, Missing missing) { | |||
|
|||
final String id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you explore the idea of having the id
as part of Sort
, as I'm seeing a lot of repeated code? (skimming through this PR's changes, the id
seems to be introduced in all classes inheriting Sort
). I am probably missing something that prevents this approach to be used, would love to hear the reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided not to, because 2 classes the AttributeSort
and AggregateSort
can extract the id from their attribute variables. On the other hand 3 classes need to store it, I just chose the 1st approach with forcing to implement the public String id()
getter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have done it differently. I feel like id
and id()
should belong to the same base class - Sort
, with the help of an additional constructor (that receives the id as an argument). Sort
would have also provided a default implementation for id()
(to return the id
) and the other classes would have overriden the default implementation of id()
where appropriate. Also, these "special" cases classes would have called the aforementioned freshly added constructor in Sort
.
The main bothering aspect for me is that there is a required id()
method in Sort
, but the id
itself (that could be related to this id()
method) lives in the inheriting classes.
You can leave it as is, I just wanted to point out something that doesn't feel right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
it LGTM, but there are aspects I didn't take time to delve into. |
I might be missing something so bear with me. // if a sort item is not in the list, it is assumed the sorting happened in ES
// and the results are left as is (by using the row ordering), otherwise it is
// sorted based on the given criteria.
//
// Take for example ORDER BY a, x, b, y
// a, b - are sorted in ES
// x, y - need to be sorted client-side
// sorting on x kicks in, only if the values for a are equal. this means the query
gets translated into What's incorrect with this approach (an example would help). Thanks. |
Take for example this query that you mentioned:
We receive the rows ordered by
The test fails with the commented out line (comparator on the 1st column) and the first row returned Another solution, instead of passing all the comparators, would be to change the implementation of the queue. When a comparator is encountered for column |
Here is a proposed solution: https://gist.github.com/matriv/1f3d6dc3150e5ff231598ec4c68be8b1 |
The approach above doesn't work. The problem is that we have for every result row all the returned columns (can be more than the ones used for ordering) and we lack the information of which rows were involved in ordering. As far as I can think, we need this information, because we need to apply the AggSorting on column Maybe you have some better idea though, or maybe I'm missing something in my whole approach. |
@astefan @costin After discussion, pushed a slightly different approach, where we don't pass the comparators for ES pre-ordered columns, but just the indices of those columns (on the output rows) with a Added a couple of integ tests for histograms and and a couple more unit tests for the AggSortingQueue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for your patience while looking into this one.
if (comparator != null) { | ||
int result = comparator.compare(vl, vr); | ||
// if things are equals, move to the next comparator | ||
// if things are not equal: return the comparison result, | ||
// or else: move to the next comparator to solve the tie. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: or else -> else or otherwise
@costin Thx for your help to tackle the issue correctly! |
Previously, in the in-memory sorting module `LocalAggregationSorterListener` only the aggregate functions where used (grabbed by the `sortingColumns`). As a consequence, if the ORDER BY was also using columns of the GROUP BY clause, (especially in the case of higher priority - before the aggregate functions) wrong results were produced. E.g.: ``` SELECT gender, MAX(salary) AS max FROM test_emp GROUP BY gender ORDER BY gender, max ``` Add all columns of the ORDER BY to the `sortingColumns` so that the `LocalAggregationSorterListener` can use the correct comparators in the underlying PriorityQueue used to implement the in-memory sorting. Fixes: #50355 (cherry picked from commit be680af)
Previously, in the in-memory sorting module `LocalAggregationSorterListener` only the aggregate functions where used (grabbed by the `sortingColumns`). As a consequence, if the ORDER BY was also using columns of the GROUP BY clause, (especially in the case of higher priority - before the aggregate functions) wrong results were produced. E.g.: ``` SELECT gender, MAX(salary) AS max FROM test_emp GROUP BY gender ORDER BY gender, max ``` Add all columns of the ORDER BY to the `sortingColumns` so that the `LocalAggregationSorterListener` can use the correct comparators in the underlying PriorityQueue used to implement the in-memory sorting. Fixes: #50355 (cherry picked from commit be680af)
Previously, in the in-memory sorting module `LocalAggregationSorterListener` only the aggregate functions where used (grabbed by the `sortingColumns`). As a consequence, if the ORDER BY was also using columns of the GROUP BY clause, (especially in the case of higher priority - before the aggregate functions) wrong results were produced. E.g.: ``` SELECT gender, MAX(salary) AS max FROM test_emp GROUP BY gender ORDER BY gender, max ``` Add all columns of the ORDER BY to the `sortingColumns` so that the `LocalAggregationSorterListener` can use the correct comparators in the underlying PriorityQueue used to implement the in-memory sorting. Fixes: #50355 (cherry picked from commit be680af)
Previously, in the in-memory sorting module
LocalAggregationSorterListener
only the aggregate functions where used(grabbed by the
sortingColumns
). As a consequence, if the ORDER BYwas also using columns of the GROUP BY clause, (especially in the case
of higher priority - before the aggregate functions) wrong results were
produced. E.g.:
Add all columns of the ORDER BY to the
sortingColumns
so that theLocalAggregationSorterListener
can use the correct comparators inthe underlying PriorityQueue used to implement the in-memory sorting.
Fixes: #50355