Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Disallow queries with inner LIMIT that cannot be executed in ES #75960

Merged
merged 10 commits into from
Aug 9, 2021

Conversation

Luegg
Copy link
Contributor

@Luegg Luegg commented Aug 2, 2021

Resolves #75830

This PR only disallows subqueries with LIMIT if they cannot be translated into an semantically equivalent ES query. Thus, queries like SELECT languages FROM (SELECT * FROM test_emp LIMIT 10) that can be flattened correctly are still valid. Compared to disallowing all subqueries with LIMIT, this will only break queries that might return wrong results.

Additionally, this change also uncovered an issue with TOP. Because TOP is part of the querySpecification syntax, the LogicalPlanBuilder creates it as a child of OrderBy nodes. Thus, the query SELECT TOP 10 * FROM test_emp ORDER BY salary used to produce the logical plan

\ OrderBy
 \ Limit
  \ EsRelation[test_emp]

the semnatically correct plan would be

\ Limit
 \ OrderBy
  \ EsRelation[test_emp]

This also had the effect that some queries with both TOP and LIMIT were not rejected by the parser.

@Luegg Luegg changed the title Disallow queries with inner LIMIT that cannot be executed correctly SQL: Disallow queries with inner LIMIT that cannot be executed in ES Aug 2, 2021
@Luegg Luegg requested review from astefan, bpintea and costin August 2, 2021 15:03
@Luegg Luegg marked this pull request as ready for review August 2, 2021 15:03
@Luegg Luegg added the :Analytics/SQL SQL querying label Aug 2, 2021
@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Aug 2, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

if (unary.child() instanceof Limit) {
Limit limit = (Limit) unary.child();

return limit.replaceChildrenSameSize(Arrays.asList(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Collections#singletonList() might work too.

Comment on lines -156 to -157
selectOrderByLimitSameOrderBy
SELECT * FROM (SELECT * FROM (SELECT * FROM test_emp LIMIT 10) ORDER BY emp_no) ORDER BY emp_no LIMIT 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this (type of) sequence of sub-selects couldn't be flattened safely. The outer ones don't change the ordering and out of multiple limitations, the smallest value could be determined (there's already code doing this comparison, IIRC?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this particular query cannot be flattened as well because it also first limits and then sorts.

But I think you're right that there is a special case where the outer ORDER BY is redundant and could be eliminated as in SELECT * FROM (SELECT * FROM test_emp ORDER BY emp_no LIMIT 10) ORDER BY emp_no. This query should be equivalent to SELECT * FROM (SELECT * FROM test_emp ORDER BY emp_no LIMIT 10) which can safely be flattened.

I think the same principle applies for redundant filters as well: SELECT * FROM (SELECT * FROM test_emp WHERE emp_no > 10 LIMIT 10) WHERE emp_no > 10. And maybe also GROUP BYs.

I have the feeling this would best be implemented as an additional optimization though. A rule that eliminates redundant WHERE and ORDER BY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that in this particular case, the query could even be flattened because the order is not specified in the inner query. This would be a somewhat surprising behavior but probably aligned with SQL's "unordered bag of tuples" semantics.

return;
}
if (p instanceof OrderExec) {
if (hasLimit.get() && orderBy.get() != null && ((OrderExec) p).order().equals(orderBy.get()) == false) {
Copy link
Contributor

@bpintea bpintea Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the removed query comment: can't we keep this "pass-through" check? I.e. chaining multiple equal orderings if they aren't combined with other clauses?

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments on improving the complexity of the validation rule and the top parser.

@@ -119,6 +124,21 @@ public LogicalPlan visitQueryNoWith(QueryNoWithContext ctx) {
return plan;
}

private LogicalPlan pullUpTopClauseFromQuerySpecification(LogicalPlan plan) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a more focused approach in handling top. TOP is handled through visitQuerySpecification while ORDER BY is handled inside queryNoWith hence why one can check if the LIMIT exists in case of an OrderBy and do the switch then - replace the child of the Limit with the new OrderBy which points to the Limit child.


for (LimitExec limit : limits) {
failures.add(
fail(limit, "LIMIT or TOP cannot be used in a subquery if outer query contains GROUP BY, WHERE, ORDER BY or PIVOT")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please order the upper cased keywords alphabetically (for consistency).

Set<LimitExec> limits = new TreeSet<>(Comparator.comparing(l -> l.source().text()));

plan.forEachDown(p -> {
if (p instanceof OrderExec || p instanceof FilterExec || p instanceof PivotExec || p instanceof AggregateExec) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to do double iteration - the main forEach already traverses the tree so for each matching node doing another p.forEachDown is unnecessary.
Further more I don't think it's useful to the user to see all the instances where the same type of error occurs since a query with ORDER BY and a WHERE clause will report the error twice (once per entity).

I favor the initial approach since it's O(n) - traverse the tree once. (this could be optimized by ending the traversal on the first error however it's a rare enough case that we can ignore it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the single iteration.

Also reporting only one error shouldn't make too much of a difference but I was more thinking about the cases where there is more than one offending Limit as in SELECT * FROM (SELECT * FROM (SELECT * FROM test_emp LIMIT 10) LIMIT 20) WHERE languages > 5. Since this seems to be a rather far-fetched case, the simpler one-error implementation should be fine.

@Luegg Luegg force-pushed the fix/limitInInnerQueries branch from 2c263c4 to 264e042 Compare August 3, 2021 14:07
@Luegg
Copy link
Contributor Author

Luegg commented Aug 3, 2021

@elasticmachine run elasticsearch-ci/bwc

1 similar comment
@Luegg
Copy link
Contributor Author

Luegg commented Aug 3, 2021

@elasticmachine run elasticsearch-ci/bwc

@Luegg
Copy link
Contributor Author

Luegg commented Aug 4, 2021

@costin and @bpintea I've addressed your comments and had a look at what Bogdan mentioned about queries with redundant sort orders (SELECT * FROM (SELECT * FROM test_emp ORDER BY emp_no LIMIT 10) ORDER BY emp_no). In theory, we could allow such queries but I'm not sure it's worth it.

The last commit is an attempt to allow these queries and also to address #76013 . Basically, it introduces a PushDownAndCombineOrderBy that folds multiple order bys if possible. The same logic could probably also be implemented in the QueryFolder but to me it looked more like an optimization.

@@ -165,6 +167,9 @@ public LogicalPlan optimize(LogicalPlan verified) {
new PushDownAndCombineFilters()
);

Batch pushDownOrderBy = new Batch("Push Down Order By",
new PushDownAndCombineOrderBy());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to run isolated from PushDownAndCombineFilters in order to avoid interfering push downs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the side effect? Keeping this rule in the main loop has the benefit of picking up plans where some orderby might be removed that would otherwise prevent the pushdown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the optimizer would never converge to a stable plan in some cases as OrderBy and Filter would alternately be pushed down.

@@ -701,6 +701,10 @@ protected PhysicalPlan rule(OrderExec plan) {
EsQueryExec exec = (EsQueryExec) plan.child();
QueryContainer qContainer = exec.queryContainer();

if (qContainer.sort().isEmpty() == false) {
throw new SqlIllegalArgumentException("QueryContainer already defines sort order");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be defensive here because the naive approach to just prepend additional sort orders leads to #76013 .

Alternatively, FoldOrderBy (and Verifier) could take over the responsibility to merge OrderBys from PushDownAndCombineOrderBy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, FoldOrderBy (and Verifier) could take over the responsibility to merge OrderBys from PushDownAndCombineOrderBy.

I like this better.

@Luegg Luegg requested review from costin and bpintea August 4, 2021 08:35
Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last commit is an attempt to allow these queries and also to address #76013 .

Please address that through a different PR (and +1 at enhancing FoldOrderBy instead of a separate rule) to keep things clean.
Otherwise LGTM.

@@ -165,6 +167,9 @@ public LogicalPlan optimize(LogicalPlan verified) {
new PushDownAndCombineFilters()
);

Batch pushDownOrderBy = new Batch("Push Down Order By",
new PushDownAndCombineOrderBy());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the side effect? Keeping this rule in the main loop has the benefit of picking up plans where some orderby might be removed that would otherwise prevent the pushdown.

UnaryPlan child = (UnaryPlan) plan.child();
Holder<Boolean> hasAggregate = new Holder<>(false);
if (child instanceof Filter) {
child.forEachExpressionDown(e -> hasAggregate.set(hasAggregate.get() || Functions.isAggregate(e)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is compact but it also triggers the set/get for each expression.

if (Function.isAggregate(e)) {
    hasAggregate.set(true);
}

Is more verbose but sane.

}

if (hasAggregate.get() == false) {
return child.replaceChildrenSameSize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could take advantage of the replaceChild method.

return plan;
}

private List<Order> combineOrders(List<Order> parentOrders, List<Order> childOrders) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me how this method combines orders?

Source source = source(ctx.ORDER(), endContext);
List<Order> order = visitList(ctx.orderBy(), Order.class);

if (plan instanceof Limit) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -701,6 +701,10 @@ protected PhysicalPlan rule(OrderExec plan) {
EsQueryExec exec = (EsQueryExec) plan.child();
QueryContainer qContainer = exec.queryContainer();

if (qContainer.sort().isEmpty() == false) {
throw new SqlIllegalArgumentException("QueryContainer already defines sort order");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, FoldOrderBy (and Verifier) could take over the responsibility to merge OrderBys from PushDownAndCombineOrderBy.

I like this better.

@Luegg
Copy link
Contributor Author

Luegg commented Aug 9, 2021

I've reverted the last commit and will try to address #76013 in a separate PR. This means that we will have a temporary regression as will no longer support queries with redundant sorts (e.g. SELECT * FROM (SELECT * FROM test_emp ORDER BY emp_no LIMIT 10) ORDER BY emp_no). Since it seems to be quite a special case, I would say that's justifiable.

@Luegg Luegg merged commit 5e8100b into elastic:master Aug 9, 2021
@Luegg Luegg deleted the fix/limitInInnerQueries branch August 9, 2021 07:44
@Luegg Luegg added the auto-backport Automatically create backport pull requests when merged label Aug 9, 2021
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Aug 9, 2021
…lastic#75960)

* Disallow queries with inner limits that cannot be executed correctly

* address review comments

* introduce PushDownAndCombineOrderBy optimization

* Revert "introduce PushDownAndCombineOrderBy optimization"

8d8d992

* use replaceChild
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Aug 9, 2021
…lastic#75960)

* Disallow queries with inner limits that cannot be executed correctly

* address review comments

* introduce PushDownAndCombineOrderBy optimization

* Revert "introduce PushDownAndCombineOrderBy optimization"

8d8d992

* use replaceChild
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
7.14
7.x

elasticsearchmachine added a commit that referenced this pull request Aug 9, 2021
…75960) (#76222)

* Disallow queries with inner limits that cannot be executed correctly

* address review comments

* introduce PushDownAndCombineOrderBy optimization

* Revert "introduce PushDownAndCombineOrderBy optimization"

8d8d992

* use replaceChild

Co-authored-by: Lukas Wegmann <[email protected]>
elasticsearchmachine added a commit that referenced this pull request Aug 9, 2021
…75960) (#76223)

* Disallow queries with inner limits that cannot be executed correctly

* address review comments

* introduce PushDownAndCombineOrderBy optimization

* Revert "introduce PushDownAndCombineOrderBy optimization"

8d8d992

* use replaceChild

Co-authored-by: Lukas Wegmann <[email protected]>
@Luegg Luegg mentioned this pull request Aug 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/SQL SQL querying auto-backport Automatically create backport pull requests when merged >bug Team:QL (Deprecated) Meta label for query languages team v7.14.1 v7.15.0 v8.0.0-alpha2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: LIMIT in subquery with outer WHERE or ORDER BY gives wrong result
6 participants