
decorrelate_where_in reports error when optimizing limit subquery #5808

Closed
Tracked by #5483 ...
HaoYang670 opened this issue Mar 31, 2023 · 6 comments · Fixed by #6457
Labels
bug Something isn't working

Comments

@HaoYang670
Contributor

Describe the bug

decorrelate_where_in currently only supports Predicate as the top-level plan in sub-queries; otherwise it returns an error:
https://github.com/apache/arrow-datafusion/blob/667f19ebad216b7592af5a91b70a24fb21c3bb64/datafusion/optimizer/src/decorrelate_where_in.rs#L151-L152

However, for a limit subquery the top-level plan might be Limit, which makes decorrelate_where_in fail.

To Reproduce

Set skip_failed_rules to false and run the test support_limit_subquery; the test fails with the error message:

Error: Context("decorrelate_where_in", Internal("Optimizer rule 'decorrelate_where_in' failed due to unexpected error: a projection is required at datafusion/optimizer/src/decorrelate_where_in.rs:152\ncaused by\nError during planning: Could not coerce into Projection! at datafusion/expr/src/logical_plan/plan.rs:1293"))
test sql::subqueries::support_limit_subquery ... FAILED
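
For anyone reproducing this outside the Rust test suite: the option can presumably also be toggled from a SQL session (for example datafusion-cli) with a SET statement, and a query whose IN subquery has a top-level LIMIT should hit the same code path. A minimal sketch, assuming tables t1 and t2 like the ones in the examples further down this thread:

-- Surface optimizer failures instead of silently skipping the failed rule
SET datafusion.optimizer.skip_failed_rules = false;

-- The subquery's top-level plan is a Limit, so decorrelate_where_in errors
SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 limit 10);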

Expected behavior

No error should be generated. At the very least, we could let decorrelate_where_in return Ok(None).

Additional context

No response

HaoYang670 added the bug label on Mar 31, 2023
@HaoYang670
Contributor Author

Hi @avantgardnerio @mingmwang, could you please take a look if you have time?

@mingmwang
Contributor

mingmwang commented Mar 31, 2023

Sure, I will take a look. It is tricky to support de-correlating correlated In/Exists subqueries that contain Limit/OrderBy clauses. I remember SparkSQL reports an error in this case; PostgreSQL does not report an error, but it does not de-correlate either and just keeps the nested subquery. Hyper handles this in a more advanced way.

In DataFusion, if we want to support this, we need to think about and test all the different cases carefully:

-- Expected behavior: can be de-correlated, limit must be removed
explain  
SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE t2.id = t1.id limit 1);

-- Expected behavior: can be de-correlated, should keep the inner limit and must remove the outer limit
explain  
SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT * FROM (SELECT * FROM t2 limit 10) as t2 WHERE t2.id = t1.id limit 1);

-- Expected behavior: can be de-correlated, must keep the limit
explain 
SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 limit 10);

-- Expected behavior: can not be de-correlated, must keep limit
explain  
SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 where t1.name = t2.name limit 10);
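
As a rough illustration of the first case (not the literal plan the rule would emit), the correlated EXISTS should turn into a semi join and the inner LIMIT 1 can simply be dropped, because EXISTS only asks whether at least one matching row exists:

-- Approximate shape of the de-correlated plan for the first query above
--   LeftSemi Join: t1.id = t2.id
--     TableScan: t1
--     TableScan: t2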

@mingmwang
Contributor

The reason it is tricky is that a subquery can be thought of as a specific kind of nested loop join whose join condition is very particular and contains the limit. The de-correlation process essentially pushes down the joins and converts the nested loop join into a normal join without the limit in the join condition. This changes the evaluation ordering of the original operators, so removing or keeping the limit in the re-written query affects correctness.
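
To make the evaluation-ordering point concrete with the fourth query above: its LIMIT 10 applies per outer row of t1, so a naive flattening that simply drops the limit is not equivalent (hypothetical sketch, for illustration only):

-- NOT equivalent to the fourth query above: with the per-outer-row LIMIT 10
-- removed, a t1 row can also match on an id that only appears after the
-- first 10 t2 rows sharing its name.
SELECT t1.id, t1.name FROM t1
WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.name = t2.name AND t1.id = t2.id);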

@avantgardnerio
Contributor

"take a look if you have time"

Unfortunately my availability is low right now. If @mingmwang's claim is correct (which I have no reason to doubt) that:

SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 where t1.name = t2.name limit 10)

can not be de-correlated

then I think we'll need to have the ability to execute plans even if this rule fails (i.e. nested loop execution). I don't think I ever intended it to decorrelate all subqueries - it was designed to hit the 80% case and get TPC-H working.

At the time, returning an error was considered the proper thing to do. The API changed so now the rule needs to be updated to plumb Ok(None) down through all the layers of recursion, which can be verbose and non-trivial.

My recommendation at the time (which I would still assert) is that it would make the lives of optimizer rule authors considerably simpler if we added a DataFusionError::CanNotOptimize error and simply returned it in this case; it would be treated the same as Ok(None), keeping the code readable and simplifying the plumbing.

@avantgardnerio
Contributor

Taking a look now... how do I "Set skip_failed_rules to false"?

@HaoYang670
Contributor Author
