Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Fix ORDER BY on aggregates and GROUPed BY fields #51894

Merged
merged 10 commits into from
Feb 12, 2020

Conversation

matriv
Copy link
Contributor

@matriv matriv commented Feb 4, 2020

Previously, in the in-memory sorting module
LocalAggregationSorterListener only the aggregate functions where used
(grabbed by the sortingColumns). As a consequence, if the ORDER BY
was also using columns of the GROUP BY clause, (especially in the case
of higher priority - before the aggregate functions) wrong results were
produced. E.g.:

SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max

Add all columns of the ORDER BY to the sortingColumns so that the
LocalAggregationSorterListener can use the correct comparators in
the underlying PriorityQueue used to implement the in-memory sorting.

Fixes: #50355

Previously, in the in-memory sorting module
`LocalAggregationSorterListener` only the aggregate functions where used
(grabbed by the `sortingColumns`). As a consequence, if the ORDER BY
was also using columns of the GROUP BY clause, (especially in the case
of higher priority - before the aggregate functions) wrong results were
produced. E.g.:
```
SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max
```

Add all columns of the ORDER BY to the `sortingColumns` so that the
`LocalAggregationSorterListener` can use the correct comparators in
the underlying PriorityQueue used to implement the in-memory sorting.

Fixes: elastic#50355
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/SQL)

@matriv
Copy link
Contributor Author

matriv commented Feb 4, 2020

I'm thinking that the only optimisation we can do is maybe skip the columns after the last aggregate functions in the ORDER BY, For example in

SELECT f1, f2, f3, MIN(f4) AS min, MAX(f5) AS max FROM test
GROUP BY f1, f2, f3
ORDER BY f1, max, f2, min, f3

The f3 can be skipped since the rows are processed already with the ordering defined by f1, f2, f3.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.

comp = s.missing() == Sort.Missing.FIRST ? Comparator.nullsFirst(comp) : Comparator.nullsLast(comp);

tuple = new Tuple<>(Integer.valueOf(atIndex), comp);
customSort = Boolean.TRUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not breaking early, if there's the first AggregateSort found?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot break after the first AggregateSort. Maybe we could break after the last AggregateSort.
If we have:

SELECT f1, f2, f3, MAX(f4) as max, MIN(f5) as min
FROM test
GROUP BY f1, f2, f3
ORDER BY f1, max, f2, min, f3

we cannot break after max, we could break after min.

I'd rather leave the fix as is and introduce this optimisation in a separate PR where it's properly tested that it works.
(Needs some carefully chosen data set to test this ordering case)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. That's a simple loop that, when finds an AggregateSort, will set customSort to TRUE. It doesn't really matter what's after in the list of sorts because it doesn't change the value of customSort.
Also, I meant breaking from inside the loop not from the method...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, misunderstood you there, sure we should break once the 1st is found.

break;
}
}
if (atIndex==-1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atIndex == -1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow I broke the formatting :(

throw new SqlIllegalArgumentException("Cannot find backing column for ordering aggregation [{}]", s);
}
// assemble a comparator for it
Comparator comp = s.direction()==Sort.Direction.ASC ? Comparator.naturalOrder():Comparator.reverseOrder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparator comp = s.direction() == Sort.Direction.ASC ? Comparator.naturalOrder() : Comparator.reverseOrder();

}
// assemble a comparator for it
Comparator comp = s.direction()==Sort.Direction.ASC ? Comparator.naturalOrder():Comparator.reverseOrder();
comp = s.missing()==Sort.Missing.FIRST ? Comparator.nullsFirst(comp):Comparator.nullsLast(comp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comp = s.missing() == Sort.Missing.FIRST ? Comparator.nullsFirst(comp) : Comparator.nullsLast(comp);

@@ -8,13 +8,22 @@
import java.util.Objects;

public class ScoreSort extends Sort {
public ScoreSort(Direction direction, Missing missing) {

final String id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you explore the idea of having the id as part of Sort, as I'm seeing a lot of repeated code? (skimming through this PR's changes, the id seems to be introduced in all classes inheriting Sort). I am probably missing something that prevents this approach to be used, would love to hear the reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided not to, because 2 classes the AttributeSort and AggregateSort can extract the id from their attribute variables. On the other hand 3 classes need to store it, I just chose the 1st approach with forcing to implement the public String id() getter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have done it differently. I feel like id and id() should belong to the same base class - Sort, with the help of an additional constructor (that receives the id as an argument). Sort would have also provided a default implementation for id() (to return the id) and the other classes would have overriden the default implementation of id() where appropriate. Also, these "special" cases classes would have called the aforementioned freshly added constructor in Sort.

The main bothering aspect for me is that there is a required id() method in Sort, but the id itself (that could be related to this id() method) lives in the inheriting classes.

You can leave it as is, I just wanted to point out something that doesn't feel right to me.

@matriv matriv requested a review from astefan February 5, 2020 12:15
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bpintea
Copy link
Contributor

bpintea commented Feb 7, 2020

it LGTM, but there are aspects I didn't take time to delve into.

@costin
Copy link
Member

costin commented Feb 8, 2020

I might be missing something so bear with me.
The Comparator does not sort on all fields on purpose - to quote the implementation:

// if a sort item is not in the list, it is assumed the sorting happened in ES
// and the results are left as is (by using the row ordering), otherwise it is 
// sorted based on the given criteria.
//
// Take for example ORDER BY a, x, b, y
// a, b - are sorted in ES
// x, y - need to be sorted client-side
// sorting on x kicks in, only if the values for a are equal.

this means the query

SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max

gets translated into ORDER BY gender followed by a client-side order by max which acts as a tie-breaker inside equal gender.
So from the AggSortingQueue, gender has an equality comparator which only triggers the max one.

What's incorrect with this approach (an example would help). Thanks.

@matriv
Copy link
Contributor Author

matriv commented Feb 8, 2020

Take for example this query that you mentioned:

SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max

We receive the rows ordered by gender.
Currently (before this PR) the AggSortingQueue will only have a comparator for index 1 (the max field). The lessThan() that does the "magic", iterates over the comparators, the 1st one it finds is the comparator for max and it applies it independently of the values of the 2 rows for the gender (position 0), so it messes up the ordering. Checkout the following example:

@SuppressWarnings("rawtypes")
public void testAggSorting_TwoFields_LALA() {
    List<Tuple<Integer, Comparator>> tuples = new ArrayList<>(2);
//        tuples.add(new Tuple<>(0, Comparator.reverseOrder()));
    tuples.add(new Tuple<>(1, Comparator.reverseOrder()));
    Querier.AggSortingQueue queue = new AggSortingQueue(10, tuples);

    for (int i = 1; i <= 100; i++) {
        queue.insertWithOverflow(new Tuple<>(Arrays.asList(100 - i + 1, i), i));
    }
    List<List<?>> results = queue.asList();

    assertEquals(10, results.size());
    for (int i = 0; i < 10; i++) {
        assertEquals(100 - i, results.get(i).get(0));
        assertEquals(i + 1, results.get(i).get(1));
    }
}

The test fails with the commented out line (comparator on the 1st column) and the first row returned
is [1, 100] instead of [100, 1].

Another solution, instead of passing all the comparators, would be to change the implementation of the queue. When a comparator is encountered for column i, it needs to be applied only if the values for all the previous columns [0 ... i - 1] are the same, otherwise do nothing and leave ordering as is. Or maybe another approach where we iterate over the columns and not the comparators. I'll check it out if it makes sense, keeping in mind the performance.

@matriv
Copy link
Contributor Author

matriv commented Feb 9, 2020

Here is a proposed solution: https://gist.github.com/matriv/1f3d6dc3150e5ff231598ec4c68be8b1
Need to check the NULLS FIRST, NULLS LAST handling too though.

@matriv
Copy link
Contributor Author

matriv commented Feb 9, 2020

The approach above doesn't work. The problem is that we have for every result row all the returned columns (can be more than the ones used for ordering) and we lack the information of which rows were involved in ordering. As far as I can think, we need this information, because we need to apply the AggSorting on column n only if all columns [0 .. n-1] are equal. Therefore we need to pass this info to the AggSortingQueue. We could do something like passing only the integer indices of the columns involved in the ORDER BY. Then use this info to check that values for columns [0 . .n-1]are equal and then proceed on using the AggSorting comparators. So we also need to do this equals check on the previous columns, which leads me to prefer passing all the comparators involved instead of those column indices and use Objects.equals(leftValue, rightValue().

Maybe you have some better idea though, or maybe I'm missing something in my whole approach.

@matriv matriv requested a review from astefan February 11, 2020 18:10
@matriv
Copy link
Contributor Author

matriv commented Feb 11, 2020

@astefan @costin After discussion, pushed a slightly different approach, where we don't pass the comparators for ES pre-ordered columns, but just the indices of those columns (on the output rows) with a null comparator and keep Objects.equals() to find ties.

Added a couple of integ tests for histograms and and a couple more unit tests for the AggSortingQueue.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for your patience while looking into this one.

if (comparator != null) {
int result = comparator.compare(vl, vr);
// if things are equals, move to the next comparator
// if things are not equal: return the comparison result,
// or else: move to the next comparator to solve the tie.
Copy link
Member

@costin costin Feb 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: or else -> else or otherwise

@matriv
Copy link
Contributor Author

matriv commented Feb 11, 2020

@costin Thx for your help to tackle the issue correctly!

@matriv matriv merged commit be680af into elastic:master Feb 12, 2020
@matriv matriv deleted the fix-50355 branch February 12, 2020 08:37
matriv added a commit that referenced this pull request Feb 12, 2020
Previously, in the in-memory sorting module
`LocalAggregationSorterListener` only the aggregate functions where used
(grabbed by the `sortingColumns`). As a consequence, if the ORDER BY
was also using columns of the GROUP BY clause, (especially in the case
of higher priority - before the aggregate functions) wrong results were
produced. E.g.:
```
SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max
```

Add all columns of the ORDER BY to the `sortingColumns` so that the
`LocalAggregationSorterListener` can use the correct comparators in
the underlying PriorityQueue used to implement the in-memory sorting.

Fixes: #50355
(cherry picked from commit be680af)
matriv added a commit that referenced this pull request Feb 12, 2020
Previously, in the in-memory sorting module
`LocalAggregationSorterListener` only the aggregate functions where used
(grabbed by the `sortingColumns`). As a consequence, if the ORDER BY
was also using columns of the GROUP BY clause, (especially in the case
of higher priority - before the aggregate functions) wrong results were
produced. E.g.:
```
SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max
```

Add all columns of the ORDER BY to the `sortingColumns` so that the
`LocalAggregationSorterListener` can use the correct comparators in
the underlying PriorityQueue used to implement the in-memory sorting.

Fixes: #50355
(cherry picked from commit be680af)
matriv added a commit that referenced this pull request Feb 12, 2020
Previously, in the in-memory sorting module
`LocalAggregationSorterListener` only the aggregate functions where used
(grabbed by the `sortingColumns`). As a consequence, if the ORDER BY
was also using columns of the GROUP BY clause, (especially in the case
of higher priority - before the aggregate functions) wrong results were
produced. E.g.:
```
SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max
```

Add all columns of the ORDER BY to the `sortingColumns` so that the
`LocalAggregationSorterListener` can use the correct comparators in
the underlying PriorityQueue used to implement the in-memory sorting.

Fixes: #50355
(cherry picked from commit be680af)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: ORDER BY column and aggregate orders only by aggregate
6 participants