decorrelate_where_in reports error when optimizing limit subquery #5808
Comments
Hi @avantgardnerio @mingmwang, could you please take a look if you have time?
Sure, I will take a look. It is tricky to decorrelate correlated IN/EXISTS subqueries that contain a limit. In DataFusion, if we want to support this, we need to think through and test all the different cases carefully:
-- Expected behavior: can be de-correlated, limit must be removed
explain
SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE t2.id = t1.id limit 1);
-- Expected behavior: can be de-correlated, should keep the inner limit and must remove the outer limit
explain
SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT * FROM (SELECT * FROM t2 limit 10) as t2 WHERE t2.id = t1.id limit 1);
-- Expected behavior: can be de-correlated, must keep the limit
explain
SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 limit 10);
-- Expected behavior: can not be de-correlated, must keep limit
explain
SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 where t1.name = t2.name limit 10);
The reason this is tricky: a subquery can be thought of as a specific kind of nested loop join whose join condition is very specific and contains the limit. De-correlation can be viewed as pushing down the joins and converting the nested loop join into a normal join that no longer has the limit as part of the join condition. This changes the evaluation order of the original operators, so removing or keeping the limit in the rewritten query affects correctness.
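To make the last case concrete (the correlated IN subquery with a limit), here is a minimal sketch in plain Rust. It does not use DataFusion; the `Row` type and both function names are hypothetical and exist only for illustration. It compares nested-loop evaluation, where the limit is applied per outer row, with a naive rewrite that applies the limit once and then semi-joins.

```rust
/// Hypothetical in-memory row (id, name); not a DataFusion type.
type Row = (i32, &'static str);

/// Nested-loop semantics of
/// `t1.id IN (SELECT t2.id FROM t2 WHERE t1.name = t2.name LIMIT n)`:
/// the subquery, including its limit, is re-evaluated for every outer row.
fn nested_loop_in(t1: &[Row], t2: &[Row], limit: usize) -> Vec<Row> {
    t1.iter()
        .filter(|(id, name)| {
            t2.iter()
                .filter(|(_, n2)| n2 == name) // correlated predicate
                .take(limit) // limit applies per outer row
                .any(|(id2, _)| id2 == id) // IN check
        })
        .copied()
        .collect()
}

/// Naive rewrite (an assumption for illustration, not what DataFusion does):
/// apply the limit once to t2, then semi-join on both id and name.
fn naive_semi_join(t1: &[Row], t2: &[Row], limit: usize) -> Vec<Row> {
    let limited: Vec<Row> = t2.iter().take(limit).copied().collect();
    t1.iter()
        .filter(|(id, name)| limited.iter().any(|(id2, n2)| id2 == id && n2 == name))
        .copied()
        .collect()
}
```

With `t1 = t2 = [(1, "a"), (2, "b")]` and a limit of 1, the nested-loop version keeps both rows while the naive rewrite keeps only the first, which is why the limit cannot simply be carried into the join in this case.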
Unfortunately my availability is low right now. If @mingmwang's claim is correct (which I have no reason to doubt) that:
then I think we'll need to have the ability to execute plans even if this rule fails (i.e. nested loop execution). I don't think I ever intended it to decorrelate all subqueries - it was designed to hit the 80% case and get TPC-H working. At the time, returning an error was considered the proper thing to do. The API changed so now the rule needs to be updated to plumb My recommendation at the time (which I would still assert) is that it would make the life of optimizer rule authors considerably simpler if we add a
Taking a look now... how do I:
?
Hi @avantgardnerio, you could just set the default value to false: , and rerun
Describe the bug
decorrelate_where_in currently only supports Predicate as the top-level plan in subqueries; otherwise it returns an error: https://github.com/apache/arrow-datafusion/blob/667f19ebad216b7592af5a91b70a24fb21c3bb64/datafusion/optimizer/src/decorrelate_where_in.rs#L151-L152
However, for a limit subquery, the top-level plan might be Limit, which causes decorrelate_where_in to fail.
To Reproduce
Set skip_failed_rules to false and run the test support_limit_subquery. The test fails with an error.
Expected behavior
No error should be generated. At the very least, decorrelate_where_in could return Ok(None).
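The Ok(None) suggestion can be sketched as follows. This is a simplified model, not DataFusion code: the `Plan` enum, `try_decorrelate`, and `optimize` are hypothetical names, and the "rewrite" is only a stand-in. The point is the convention that Ok(None) means "this rule does not apply, keep the original plan", while Err is reserved for real failures.

```rust
/// Hypothetical, heavily simplified plan tree; not DataFusion's LogicalPlan.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Projection(Box<Plan>),
    Limit { fetch: usize, input: Box<Plan> },
}

/// Sketch of a rewrite rule applied to a subquery plan.
/// Ok(Some(p)): rewritten; Ok(None): rule does not apply; Err: genuine failure.
fn try_decorrelate(subquery: &Plan) -> Result<Option<Plan>, String> {
    match subquery {
        // Supported top-level shape: return a placeholder "rewritten" plan.
        Plan::Projection(input) => Ok(Some((**input).clone())),
        // Unsupported top-level shape (e.g. Limit): decline instead of erroring.
        _ => Ok(None),
    }
}

/// Caller side: fall back to the unchanged plan when the rule declines.
fn optimize(subquery: Plan) -> Result<Plan, String> {
    Ok(try_decorrelate(&subquery)?.unwrap_or(subquery))
}
```

Under this convention a subquery whose top-level plan is Limit passes through the optimizer unchanged, so the query can still run even with failed-rule skipping turned off.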
Additional context
No response