
decorrelate_where_in reports error when optimizing limit subquery #5808

Closed
Tracked by #5483 ...
HaoYang670 opened this issue Mar 31, 2023 · 6 comments · Fixed by #6457
Labels
bug Something isn't working

Comments

@HaoYang670
Contributor

Describe the bug

decorrelate_where_in currently only supports Predicate as the top-level plan in sub-queries; otherwise it returns an error:
https://github.com/apache/arrow-datafusion/blob/667f19ebad216b7592af5a91b70a24fb21c3bb64/datafusion/optimizer/src/decorrelate_where_in.rs#L151-L152

However, for a limit subquery the top-level plan might be Limit, which makes decorrelate_where_in fail.

To Reproduce

Set skip_failed_rules to false and run the test support_limit_subquery; the test fails with the error message:

Error: Context("decorrelate_where_in", Internal("Optimizer rule 'decorrelate_where_in' failed due to unexpected error: a projection is required at datafusion/optimizer/src/decorrelate_where_in.rs:152\ncaused by\nError during planning: Could not coerce into Projection! at datafusion/expr/src/logical_plan/plan.rs:1293"))
test sql::subqueries::support_limit_subquery ... FAILED
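
For anyone reproducing this outside the Rust test suite: the option can presumably also be toggled from a SQL session (for example datafusion-cli) with a SET statement, and a query whose IN subquery has a top-level LIMIT should hit the same code path. A minimal sketch, assuming tables t1 and t2 like the ones in the examples further down this thread:

-- Surface optimizer failures instead of silently skipping the failed rule
SET datafusion.optimizer.skip_failed_rules = false;

-- The subquery's top-level plan is a Limit, so decorrelate_where_in errors
SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 limit 10);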

Expected behavior

No error should be generated. At the very least, we could let decorrelate_where_in return Ok(None).

Additional context

No response

HaoYang670 added the bug label on Mar 31, 2023
@HaoYang670
Contributor Author

Hi @avantgardnerio @mingmwang, could you please take a look if you have time?

@mingmwang
Contributor

mingmwang commented Mar 31, 2023

Sure, I will take a look. It is tricky to support de-correlating correlated In/Exists subqueries that contain Limit/OrderBy clauses. I remember SparkSQL reports an error in this case; PostgreSQL does not report an error, but it does not de-correlate either and just keeps the nested subquery. Hyper handles this in a more advanced way.

In DataFusion, if we want to support this, we need to think about and test all the different cases carefully:

-- Expected behavior: can be de-correlated, limit must be removed
explain  
SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE t2.id = t1.id limit 1);

-- Expected behavior: can be de-correlated, should keep the inner limit and must remove the outer limit
explain  
SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT * FROM (SELECT * FROM t2 limit 10) as t2 WHERE t2.id = t1.id limit 1);

-- Expected behavior: can be de-correlated, must keep the limit
explain 
SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 limit 10);

-- Expected behavior: can not be de-correlated, must keep limit
explain  
SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 where t1.name = t2.name limit 10);
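
As a rough illustration of the first case (not the literal plan the rule would emit), the correlated EXISTS should turn into a semi join and the inner LIMIT 1 can simply be dropped, because EXISTS only asks whether at least one matching row exists:

-- Approximate shape of the de-correlated plan for the first query above
--   LeftSemi Join: t1.id = t2.id
--     TableScan: t1
--     TableScan: t2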

@mingmwang
Contributor

The reason it is tricky is that a subquery can be thought of as a specific kind of nested loop join whose join condition is very particular and contains the limit. The de-correlation process essentially pushes down the joins and converts the nested loop join into a normal join without the limit in the join condition. This changes the evaluation ordering of the original operators, so removing or keeping the limit in the re-written query affects correctness.
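
To make the evaluation-ordering point concrete with the fourth query above: its LIMIT 10 applies per outer row of t1, so a naive flattening that simply drops the limit is not equivalent (hypothetical sketch, for illustration only):

-- NOT equivalent to the fourth query above: with the per-outer-row LIMIT 10
-- removed, a t1 row can also match on an id that only appears after the
-- first 10 t2 rows sharing its name.
SELECT t1.id, t1.name FROM t1
WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.name = t2.name AND t1.id = t2.id);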

@avantgardnerio
Contributor

"take a look if you have time"

Unfortunately my availability is low right now. If @mingmwang's claim is correct (which I have no reason to doubt) that:

SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 where t1.name = t2.name limit 10)

can not be de-correlated

then I think we'll need to have the ability to execute plans even if this rule fails (i.e. nested loop execution). I don't think I ever intended it to decorrelate all subqueries - it was designed to hit the 80% case and get TPC-H working.

At the time, returning an error was considered the proper thing to do. The API changed so now the rule needs to be updated to plumb Ok(None) down through all the layers of recursion, which can be verbose and non-trivial.

My recommendation at the time (which I would still assert) is that it would make the lives of optimizer rule authors considerably simpler if we added a DataFusionError::CanNotOptimize error and simply returned it in this case; it would be treated the same as Ok(None), keeping the code readable and simplifying the plumbing.

@avantgardnerio
Contributor

Taking a look now... how do I "Set skip_failed_rules to false"?

@HaoYang670
Contributor Author
