Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Incorrect LEFT JOIN evaluation result on OR conditions #11203

Merged
merged 6 commits into from
Jul 3, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Jul 1, 2024

Which issue does this PR close?

Closes #10881.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jul 1, 2024
@@ -441,11 +442,11 @@ fn push_down_all_join(

// Extract from OR clause, generate new predicates for both side of join if possible.
// We only track the unpushable predicates above.
if left_preserved {
if on_left_preserved {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep_predicates, join_conditions are join filter but not predicates from top Filter, we should use on_lr_is_preserved instead of lr_is_preserved.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @viirya (and everyone else who reviewed and commented on this bug)

I went through the code carefully and I pushed a few more tests to this branch. I think it looks great and makes sense to me.

I will also make a new PR to tweak some comments as well

@alamb
Copy link
Contributor

alamb commented Jul 2, 2024

The CI error appears to be #11210 unrelated to this PR

left_push.extend(extract_or_clauses_for_join(&keep_predicates, left_schema));
left_push.extend(extract_or_clauses_for_join(&join_conditions, left_schema));
}
if right_preserved {
if on_right_preserved {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed the conditions, filter predicates and on expression can be pushed down with @ozankabak .
According to us, correct handling seems to be

    if left_preserved {
        left_push.extend(extract_or_clauses_for_join(&keep_predicates, left_schema));
    }
    if on_left_preserved {
        left_push.extend(extract_or_clauses_for_join(&join_conditions, left_schema));
    }
    if right_preserved {
        right_push.extend(extract_or_clauses_for_join(&keep_predicates, right_schema));
    }
    if on_right_preserved {
        right_push.extend(extract_or_clauses_for_join(&join_conditions, right_schema));
    }

where, for predicates and join conditions flag used is different. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand what is different about your suggestion.

My understanding of the code is that to breaks the predicates into:

  • keep_predicates predicates that must be left as a filter (above the join)
  • join_conditions predicates that must be left in the join (e.g. ON clause)
  • left_push predicates that can be pushed down to the left input
  • right_push predicates that can be pushed down to the right input

I think the code, as written, does push predicates from the ON clause down correctly when possible.

So in other words, I am not sure what changes you are suggesting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could be missing something but it seems like keep_predicates and join_conditions should be using preserved vs. on_preserved (respectively) since the former applies as a post-join filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For left join, left_preserved=true and on_left_preserved=false.
In current version of the code, nothing from keep_predicates and join_conditions can be inserted to the left_push since on_left_preserved flag is false. However, if we use the flag left_preserved to decide whether we can insert something from keep_predicates, into left_push. We could have inserted additional conditions to the left_push. In this sense, current code works correctly, however behaves suboptimal in some cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we wrongly mix join filter predicates and filter predicates into join_conditions. I separate them now so only on_filter_join_conditions will be considered with on_preserved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clarifications

In current version of the code, nothing from keep_predicates and join_conditions can be inserted to the left_push since on_left_preserved flag is false

I agree, but I think this is required for correctness. At least for join_conditions it is important not to push them down to the preserved side

For example, in this query I am pretty sure it is not correct to push the predicate on A.a below the join (the intuition is that all rows from A have to have at least one row out of the join). I believe this is what this PR fixes.

SELECT .. FROM A LEFT JOIN B ON (A.a > 5)

I think is correct to push predicates on B below (as any non matching rows would have been filtered in the join anyways)

SELECT .. FROM A LEFT JOIN B ON (B.b > 5)

We could have inserted additional conditions to the left_push. In this sense, current code works correctly, however behaves suboptimal in some cases.

Can you help me write an example? I can add it to this PR and then we can handle improving the case in a future PR.

Copy link
Member Author

@viirya viirya Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to add a test case with it.

EDIT: Seems it doesn't test against the corner case. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we found one example -- @mustafasrepo will post shortly

@alamb
Copy link
Contributor

alamb commented Jul 2, 2024

Looks like my comment #11203 (comment) was in progress when @viirya changed the code in 17b2d43

Can someone help me understand what additional case this would catch (and I'll go and write some tests that cover it)?

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you made a typo in your last commit (left an inline comment for it), but other than that I think it is now logically equivalent to @mustafasrepo's suggestion on #11203 (comment) (although his code seems to use a single join_conditions vector and seems more concise).

left_push.extend(extract_or_clauses_for_join(&keep_predicates, left_schema));
left_push.extend(extract_or_clauses_for_join(&join_conditions, left_schema));
}
if on_right_preserved {
if left_preserved {
Copy link
Contributor

@ozankabak ozankabak Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant right_preserved here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops

@mustafasrepo
Copy link
Contributor

Looks like my comment #11203 (comment) was in progress when @viirya changed the code in 17b2d43

Can someone help me understand what additional case this would catch (and I'll go and write some tests that cover it)?

I have constructed an example that triggers this test case (query is a bit weird. I have constructed it to trigger extract_or_clauses_for_join call):

explain SELECT e.emp_id, e.name, d.dept_name
FROM employees AS e
LEFT JOIN department AS d
ON e.emp_id = d.emp_id
WHERE ((dept_name != 'Engineering' AND e.name = 'Alice') OR (name != 'Alice' AND e.name = 'Carol'));

this query generates following plan with recent changes

+   logical_plan
+   01)Filter: d.dept_name != Utf8("Engineering") AND e.name = Utf8("Alice") OR e.name != Utf8("Alice") AND e.name = Utf8("Carol")
+   02)--Projection: e.emp_id, e.name, d.dept_name
+   03)----Left Join: e.emp_id = d.emp_id
+   04)------SubqueryAlias: e
+   05)--------Filter: employees.name = Utf8("Alice") OR employees.name != Utf8("Alice") AND employees.name = Utf8("Carol")
+   06)----------TableScan: employees projection=[emp_id, name]
+   07)------SubqueryAlias: d
+   08)--------TableScan: department projection=[emp_id, dept_name]

previously it was generating

+   logical_plan
+   01)Filter: d.dept_name != Utf8("Engineering") AND e.name = Utf8("Alice") OR e.name != Utf8("Alice") AND e.name = Utf8("Carol")
+   02)--Projection: e.emp_id, e.name, d.dept_name
+   03)----Left Join: e.emp_id = d.emp_id
+   04)------SubqueryAlias: e
+   05)--------TableScan: employees projection=[emp_id, name]
+   06)------SubqueryAlias: d
+   07)--------TableScan: department projection=[emp_id, dept_name]

where both are correct. However, first plan is a bit better

@viirya
Copy link
Member Author

viirya commented Jul 2, 2024

I think you made a typo in your last commit (left an inline comment for it), but other than that I think it is now logically equivalent to @mustafasrepo's suggestion on #11203 (comment) (although his code seems to use a single join_conditions vector and seems more concise).

Yea, I think it is correct. I made the change to make it more obvious to more people. Because they need to figure out that join_conditions can only contains parent Filter predicates for inner join and for inner join, both sides are preserved (either preserved or on_preserved). So for join_conditions, we can simply use on_preserved.

I can use the suggestion if you think it is better.

@alamb
Copy link
Contributor

alamb commented Jul 2, 2024

where both are correct. However, first plan is a bit better

Thank you @mustafasrepo - that makes sense to me. Let's also add that example to the tests (I can do so in a follow on PR or if we are going to change this PR again we can do so)

@mustafasrepo
Copy link
Contributor

I can use the suggestion if you think it is better.

As long as we have the functionality, both versions are fine.

@viirya
Copy link
Member Author

viirya commented Jul 2, 2024

Thanks @mustafasrepo @ozankabak @alamb . I added the test case.

1 Alice HR
3 Carol Engineering

# Push down OR conditions on Filter through LEFT JOIN if possible
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty fancy 🚀

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great collaboration!

@ozankabak
Copy link
Contributor

Should we go ahead and merge? Are we waiting for more feedback?

@alamb alamb merged commit 03c8db0 into apache:main Jul 3, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 3, 2024

🚀 I agree it was time to merge

Thanks again everyone

@viirya viirya deleted the fix_join_filter_pushdown branch July 3, 2024 16:17
comphead pushed a commit to comphead/arrow-datafusion that referenced this pull request Jul 8, 2024
…1203)

* fix: Incorrect LEFT JOIN evaluation result on OR conditions

* Add a few more test cases

* Don't push join filter predicates into join_conditions

* Add test case and fix typo

* Add test case

---------

Co-authored-by: Andrew Lamb <[email protected]>
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 11, 2024
…1203)

* fix: Incorrect LEFT JOIN evaluation result on OR conditions

* Add a few more test cases

* Don't push join filter predicates into join_conditions

* Add test case and fix typo

* Add test case

---------

Co-authored-by: Andrew Lamb <[email protected]>
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 12, 2024
…1203)

* fix: Incorrect LEFT JOIN evaluation result on OR conditions

* Add a few more test cases

* Don't push join filter predicates into join_conditions

* Add test case and fix typo

* Add test case

---------

Co-authored-by: Andrew Lamb <[email protected]>
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…1203)

* fix: Incorrect LEFT JOIN evaluation result on OR conditions

* Add a few more test cases

* Don't push join filter predicates into join_conditions

* Add test case and fix typo

* Add test case

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect LEFT JOIN evaluation result on OR conditions
4 participants