Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Allow sorting of groups by aggregates #38042

Merged
merged 12 commits into from
Feb 1, 2019
Merged

Conversation

costin
Copy link
Member

@costin costin commented Jan 30, 2019

Introduce client-side, custom sorting of groups based on aggregate
functions. To allow this, the Analyzer has been extended to push down
to underlying Aggregate, aggregate function and the Querier has been
extended to identify the case and consume the results in order and sort
them based on the given columns.
The underlying QueryContainer has been slightly modified to allow a view
of the underlying values being extracted as the columns used for sorting
might not be requested by the user.

The PR also adds minor tweaks, mainly related to tree output.

Close #35118

Introduce client-side, custom sorting of groups based on aggregate
functions. To allow this, the Analyzer has been extended to push down
to underlying Aggregate, aggregate function and the Querier has been
extended to identify the case and consume the results in order and sort
them based on the given columns.
The underlying QueryContainer has been slightly modified to allow a view
of the underlying values being extracted as the columns used for sorting
might not be requested by the user.

The PR also adds minor tweaks, mainly related to tree output.

Close elastic#35118
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@costin
Copy link
Member Author

costin commented Jan 30, 2019

There are still some things that need to be added in a future PR, mainly:

  1. move the metrics from the transport layer to the planExecutor so in case of sorting, the stats are properly incremented (since multiple requests are done).
  2. figure out the policy for establishing the maximum number of items to be sorted. Currently the PR uses an arbitrary number (100) however this requires investigation.
    While initially I was favoring the idea of the user being required to specify a limit - LIMIT 100 I think that's quite annoying and in most cases we'll have an upper limit anyway.
    So another variant is to always allocate the higher limit in case no LIMIT is specified and throw an exception at runtime if the limit is reached to tell the user that they have to specify a manual LIMIT.

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, Really important feature!
Left some comments and additionally:

  1. Please add tests with ordering on multiple columns (tie breaker)
  2. Please add tests with ordering using number indexes (e.g.: ORDER BY 1, 2)
  3. Please add tests with descending ordering
  4. Please remove the paragraph from the limitations doc page.

aggWithAlias
SELECT MAX(salary) AS m FROM test_emp GROUP BY gender ORDER BY m;

multipleAggsThatGetRewrittenWithoutAlias
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean Rewritten?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max and avg and count get rewritten/replaced with a Stats agg.

}

// 2. find first Aggregate child and update it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove empty line

@@ -310,11 +310,31 @@ private static boolean checkGroupByOrder(LogicalPlan p, Set<Failure> localFailur
Aggregate a = (Aggregate) child;

Map<Expression, Node<?>> missing = new LinkedHashMap<>();

// track aggs and non-aggs - to keep the final modifier, use an array
final Expression[] aggAndNonAgg = new Expression[2];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a boolean[] instead here, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because if found, the expression is used in the validation message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, missed that part


// 2. find first Aggregate child and update it

final AtomicBoolean found = new AtomicBoolean(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Atomic needed here? is there some kind of concurrency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not it's an utility class for storing state outside a lamba expression. Since the class is already part of the JDK (so it's likely to be inlined) it's quite convenient. The alternative is to create our own Holder class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it thx!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would opt for a boolean[] found = new boolean[] {false} instead, to avoid confusion and because AtomicBoolean has special semantics for the compiler to achieve the atomic operations in CPU level which is a bit too much, imho, just for convenience.

/**
* Listener used for local sorting (typically due to aggregations used inside `ORDER BY`).
*
* This listener consumes the whole result set, sorts it in memory then send the paginates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...then sends the paginated results back to the client

// if the sort points to an agg, consider it only if there's grouping
if (ob.child() instanceof Aggregate) {
Aggregate a = (Aggregate) ob.child();
AtomicBoolean foundAggregate = new AtomicBoolean(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too? why AtomicBoolean?

Querier scroller = new Querier(session.client(), session.configuration());
scroller.query(Rows.schema(output), queryContainer, index, listener);
Querier scroller = new Querier(session);
// List<ExpressionId> ids = queryContainer.columns();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover?

public void testGroupByOrderByNonKey() {
assertEquals("1:52: Cannot order by non-grouped column [a], expected [bool]",
error("SELECT AVG(int) a FROM test GROUP BY bool ORDER BY a"));
// public void testGroupByOrderByNonKey() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover, please remove.

@@ -64,22 +64,22 @@ public void testExplainWithWhere() throws IOException {
assertThat(readLine(), startsWith("----------"));
assertThat(readLine(), startsWith("With[{}]"));
assertThat(readLine(), startsWith("\\_Project[[?*]]"));
assertThat(readLine(), startsWith(" \\_Filter[i = 2#"));
assertThat(readLine(), startsWith(" \\_UnresolvedRelation[[][index=test],null,Unknown index [test]]"));
assertThat(readLine(), startsWith(" \\_Filter[Equals[?i"));
Copy link
Contributor

@matriv matriv Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in the other places: Can't we also still check the 2?

Copy link
Member Author

@costin costin Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can for unresolved items but for resolved ones not really since the tests are ran in random order and the ids change between runs. Fwiw, the final generated query is tested and the value checked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important at all but can't we use endsWith?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because 2 is a literal with its own id.

// Custom sorting/ordering on aggregates
//

countWithImplicitGroupBy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find the value of having all these different cases without group by, as only 1 value is returned and the ORDER BY doesn't really apply, or am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed but there's a dedicated rule in the optimizer for this case that excludes the order (as there's no physical group for it).

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff. Left few minor comments

@@ -310,11 +310,31 @@ private static boolean checkGroupByOrder(LogicalPlan p, Set<Failure> localFailur
Aggregate a = (Aggregate) child;

Map<Expression, Node<?>> missing = new LinkedHashMap<>();

// track aggs and non-aggs - to keep the final modifier, use an array
final Expression[] aggAndNonAgg = new Expression[2];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't have been more clear/easier to manage with two separate Expressions named accordingly? (ie agg/nonAgg)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because one would have two arrays - this is really a hack for the setting the state outside a lamba expression. I'll probably go for a Holder class which be more meaningful in intent.

/**
* Listener used for local sorting (typically due to aggregations used inside `ORDER BY`).
*
* This listener consumes the whole result set, sorts it in memory then send the paginates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"send AND paginates..." you meant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"sends the paginated" response. Fixed.

LocalAggregationSorterListener(ActionListener<SchemaRowSet> listener, List<Tuple<Integer, Comparator>> sortingColumns, int limit) {
this.listener = listener;

this.data = new PriorityQueue<Tuple<List<?>, Integer>>(Math.max(limit, 100)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract the 100 value as a constant to be easier to change in the future? (very likely we will change it or allow the user to configure a default value)

Querier scroller = new Querier(session.client(), session.configuration());
scroller.query(Rows.schema(output), queryContainer, index, listener);
Querier scroller = new Querier(session);
// List<ExpressionId> ids = queryContainer.columns();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these still needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I've removed them

}

public void testGroupByOrderByFunctionOverKey() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing this method? Why not only adding the new one?

@@ -312,14 +312,19 @@ public void testUnsupportedTypeInOrder() {
error("SELECT * FROM test ORDER BY unsupported"));
}

public void testGroupByOrderByNonKey() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing this test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a bit of reworking.

@astefan
Copy link
Contributor

astefan commented Jan 31, 2019

One idea for a test (from the odbc tests): select gender as g, sum(salary) as s from test_emp group by 1 order by 2 desc limit 3;

@astefan
Copy link
Contributor

astefan commented Jan 31, 2019

sql> select gender as g, min(salary) as min, max(salary) as max from test_emp group by 1 order by 2, 3, 2, 4 desc limit 3;
Bad request [line 1:103: Invalid %d specified in OrderBy (valid range is [1, %d])]

assertEquals("1:44: Cannot order by non-grouped column [MAX(int)], expected [int]",
error("SELECT int FROM test GROUP BY int ORDER BY MAX(int)"));
public void testGroupByOrderByAggAndGroupedColumn() {
assertEquals("1:49: Cannot order by aggregated [MAX(int)] and non-aggregated [int] columns"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test where ORDER BY has more than two elements?

@matriv
Copy link
Contributor

matriv commented Jan 31, 2019

I would also add the 100 implicit limit to the docs.

@costin
Copy link
Member Author

costin commented Jan 31, 2019

@matriv see number 2) in my initial comment.

@matriv
Copy link
Contributor

matriv commented Jan 31, 2019

@costin yep read the comment, I was just thinking to have like an autonomous complete PR. But I'm totally fine to add the proper docs once this bit is figured out and implemented correctly. Personally, if no LIMIT is provided by the user, I'd favour to have a documented implicit limit rather than exhausting available memory and throw an Exception. On the other hand if the latter can implemented easily with the use of a CircuitBreaker maybe it's a nice option too.

@costin
Copy link
Member Author

costin commented Feb 1, 2019

sql> select gender as g, min(salary) as min, max(salary) as max from test_emp group by 1 order by 2, 3, 2, 4 desc limit 3;
Bad request [line 1:103: Invalid %d specified in OrderBy (valid range is [1, %d])]

I've created a separate issue for this since it needs backporting to 6.x. It's not just the message but also the fact that the Analyzer rule throws an exception (with an incorrect message).
See #38208

@costin
Copy link
Member Author

costin commented Feb 1, 2019

@costin yep read the comment, I was just thinking to have like an autonomous complete PR. But I'm totally fine to add the proper docs once this bit is figured out and implemented correctly. Personally, if no LIMIT is provided by the user, I'd favour to have a documented implicit limit rather than exhausting available memory and throw an Exception. On the other hand if the latter can implemented easily with the use of a CircuitBreaker maybe it's a nice option too.

In case the LIMIT is not specified, the hard limit (256) is enforced however the moment an entry is dropped, an exception is thrown.
The idea being that we can sort 256 entries however more than that and we lose some records - and appropriately tell the user instead of silently applying that.

@costin
Copy link
Member Author

costin commented Feb 1, 2019

@astefan @matriv I've addressed more comments, beefed up the test suite (with some tie breakers) and addressed point 2).
There will be a follow-up PR regarding the move of metrics and one for the docs to add some examples.

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!!

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great feature.

@costin
Copy link
Member Author

costin commented Feb 1, 2019

@elasticmachine run elasticsearch-ci/packaging-sample

@costin costin merged commit 783c9ed into elastic:master Feb 1, 2019
costin added a commit to costin/elasticsearch that referenced this pull request Feb 2, 2019
Introduce client-side sorting of groups based on aggregate
functions. To allow this, the Analyzer has been extended to push down
to underlying Aggregate, aggregate function and the Querier has been
extended to identify the case and consume the results in order and sort
them based on the given columns.
The underlying QueryContainer has been slightly modified to allow a view
of the underlying values being extracted as the columns used for sorting
might not be requested by the user.

The PR also adds minor tweaks, mainly related to tree output.

Close elastic#35118

(cherry picked from commit 783c9ed)
costin added a commit that referenced this pull request Feb 2, 2019
Introduce client-side sorting of groups based on aggregate
functions. To allow this, the Analyzer has been extended to push down
to underlying Aggregate, aggregate function and the Querier has been
extended to identify the case and consume the results in order and sort
them based on the given columns.
The underlying QueryContainer has been slightly modified to allow a view
of the underlying values being extracted as the columns used for sorting
might not be requested by the user.

The PR also adds minor tweaks, mainly related to tree output.

Close #35118

(cherry picked from commit 783c9ed)
@costin costin deleted the fix-35118 branch July 6, 2020 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants