Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Incorrect LEFT JOIN evaluation result on OR conditions #11203

Merged
merged 6 commits into from
Jul 3, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions datafusion/optimizer/src/push_down_filter.rs
Original file line number Diff line number Diff line change
Expand Up @@ -424,8 +424,9 @@ fn push_down_all_join(
}
}

let (on_left_preserved, on_right_preserved) = on_lr_is_preserved(join.join_type)?;

if !on_filter.is_empty() {
let (on_left_preserved, on_right_preserved) = on_lr_is_preserved(join.join_type)?;
for on in on_filter {
if on_left_preserved && can_pushdown_join_predicate(&on, left_schema)? {
left_push.push(on)
Expand All @@ -441,11 +442,11 @@ fn push_down_all_join(

// Extract from OR clause, generate new predicates for both side of join if possible.
// We only track the unpushable predicates above.
if left_preserved {
if on_left_preserved {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep_predicates, join_conditions are join filter but not predicates from top Filter, we should use on_lr_is_preserved instead of lr_is_preserved.

left_push.extend(extract_or_clauses_for_join(&keep_predicates, left_schema));
left_push.extend(extract_or_clauses_for_join(&join_conditions, left_schema));
}
if right_preserved {
if on_right_preserved {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed the conditions, filter predicates and on expression can be pushed down with @ozankabak .
According to us, correct handling seems to be

    if left_preserved {
        left_push.extend(extract_or_clauses_for_join(&keep_predicates, left_schema));
    }
    if on_left_preserved {
        left_push.extend(extract_or_clauses_for_join(&join_conditions, left_schema));
    }
    if right_preserved {
        right_push.extend(extract_or_clauses_for_join(&keep_predicates, right_schema));
    }
    if on_right_preserved {
        right_push.extend(extract_or_clauses_for_join(&join_conditions, right_schema));
    }

where, for predicates and join conditions flag used is different. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand what is different about your suggestion.

My understanding of the code is that to breaks the predicates into:

  • keep_predicates predicates that must be left as a filter (above the join)
  • join_conditions predicates that must be left in the join (e.g. ON clause)
  • left_push predicates that can be pushed down to the left input
  • right_push predicates that can be pushed down to the right input

I think the code, as written, does push predicates from the ON clause down correctly when possible.

So in other words, I am not sure what changes you are suggesting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could be missing something but it seems like keep_predicates and join_conditions should be using preserved vs. on_preserved (respectively) since the former applies as a post-join filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For left join, left_preserved=true and on_left_preserved=false.
In current version of the code, nothing from keep_predicates and join_conditions can be inserted to the left_push since on_left_preserved flag is false. However, if we use the flag left_preserved to decide whether we can insert something from keep_predicates, into left_push. We could have inserted additional conditions to the left_push. In this sense, current code works correctly, however behaves suboptimal in some cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we wrongly mix join filter predicates and filter predicates into join_conditions. I separate them now so only on_filter_join_conditions will be considered with on_preserved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clarifications

In current version of the code, nothing from keep_predicates and join_conditions can be inserted to the left_push since on_left_preserved flag is false

I agree, but I think this is required for correctness. At least for join_conditions it is important not to push them down to the preserved side

For example, in this query I am pretty sure it is not correct to push the predicate on A.a below the join (the intuition is that all rows from A have to have at least one row out of the join). I believe this is what this PR fixes.

SELECT .. FROM A LEFT JOIN B ON (A.a > 5)

I think is correct to push predicates on B below (as any non matching rows would have been filtered in the join anyways)

SELECT .. FROM A LEFT JOIN B ON (B.b > 5)

We could have inserted additional conditions to the left_push. In this sense, current code works correctly, however behaves suboptimal in some cases.

Can you help me write an example? I can add it to this PR and then we can handle improving the case in a future PR.

Copy link
Member Author

@viirya viirya Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to add a test case with it.

EDIT: Seems it doesn't test against the corner case. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we found one example -- @mustafasrepo will post shortly

right_push.extend(extract_or_clauses_for_join(&keep_predicates, right_schema));
right_push.extend(extract_or_clauses_for_join(&join_conditions, right_schema));
}
Expand Down
73 changes: 73 additions & 0 deletions datafusion/sqllogictest/test_files/join.slt
Original file line number Diff line number Diff line change
Expand Up @@ -793,3 +793,76 @@ DROP TABLE companies

statement ok
DROP TABLE leads

# create tables
statement ok
CREATE TABLE employees(emp_id INT, name VARCHAR);

statement ok
CREATE TABLE department(emp_id INT, dept_name VARCHAR);

statement ok
INSERT INTO employees (emp_id, name) VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');

statement ok
INSERT INTO department (emp_id, dept_name) VALUES (1, 'HR'), (3, 'Engineering'), (4, 'Sales');

query TT
EXPLAIN SELECT e.emp_id, e.name, d.dept_name
FROM employees AS e
LEFT JOIN department AS d
ON (e.name = 'Alice' OR e.name = 'Bob');
----
logical_plan
01)Left Join: Filter: e.name = Utf8("Alice") OR e.name = Utf8("Bob")
02)--SubqueryAlias: e
03)----TableScan: employees projection=[emp_id, name]
04)--SubqueryAlias: d
05)----TableScan: department projection=[dept_name]
physical_plan
01)ProjectionExec: expr=[emp_id@1 as emp_id, name@2 as name, dept_name@0 as dept_name]
02)--NestedLoopJoinExec: join_type=Right, filter=name@0 = Alice OR name@0 = Bob
03)----MemoryExec: partitions=1, partition_sizes=[1]
04)----MemoryExec: partitions=1, partition_sizes=[1]

query ITT
SELECT e.emp_id, e.name, d.dept_name
FROM employees AS e
LEFT JOIN department AS d
ON (e.name = 'Alice' OR e.name = 'Bob');
----
1 Alice HR
2 Bob HR
1 Alice Engineering
2 Bob Engineering
1 Alice Sales
2 Bob Sales
3 Carol NULL

query ITT
SELECT e.emp_id, e.name, d.dept_name
FROM employees e
LEFT JOIN department d
ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
----
1 Alice NULL
2 Bob NULL
3 Carol NULL

query ITT
SELECT e.emp_id, e.name, d.dept_name
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'NotExist');
----
1 Alice HR
1 Alice Engineering
1 Alice Sales
2 Bob NULL
3 Carol NULL

statement ok
DROP TABLE employees

statement ok
DROP TABLE department
Loading