Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashJoin with mode PartitionMode:CollectLeft has bug and can produce wrong result #4230

Closed
mingmwang opened this issue Nov 16, 2022 · 6 comments · Fixed by #4219
Closed

HashJoin with mode PartitionMode:CollectLeft has bug and can produce wrong result #4230

mingmwang opened this issue Nov 16, 2022 · 6 comments · Fixed by #4219
Labels
bug Something isn't working

Comments

@mingmwang
Copy link
Contributor

Describe the bug
A clear and concise description of what the bug is.
The join result is wrong.

To Reproduce
Steps to reproduce the behavior:

#[tokio::test]
async fn equijoin_left_and_condition_from_right() -> Result<()> {
    let ctx = create_join_context("t1_id", "t2_id")?;
    let sql =
        "SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON t1_id = t2_id AND t2_name >= 'y' ORDER BY t1_id";
    let res = ctx.create_logical_plan(sql);
    assert!(res.is_ok());
    let actual = execute_to_batches(&ctx, sql).await;
    let expected = vec![
        "+-------+---------+---------+",
        "| t1_id | t1_name | t2_name |",
        "+-------+---------+---------+",
        "| 11    | a       | z       |",
        "| 22    | b       | y       |",
        "| 33    | c       |         |",
        "| 44    | d       |         |",
        "+-------+---------+---------+",
    ];
    assert_batches_eq!(expected, &actual);

    Ok(())
}

Run this test with PartitionMode:CollectLeft, the test will be failed.

Expected behavior
A clear and concise description of what you expected to happen.

Additional context
Add any other context about the problem here.

@mingmwang mingmwang added the bug Something isn't working label Nov 16, 2022
@mingmwang
Copy link
Contributor Author

@alamb @tustvold

@mingmwang
Copy link
Contributor Author

expected:

[
"+-------+---------+---------+",
"| t1_id | t1_name | t2_name |",
"+-------+---------+---------+",
"| 11 | a | z |",
"| 22 | b | y |",
"| 33 | c | |",
"| 44 | d | |",
"+-------+---------+---------+",
]
actual:

[
"+-------+---------+---------+",
"| t1_id | t1_name | t2_name |",
"+-------+---------+---------+",
"| 11 | a | |",
"| 11 | a | z |",
"| 11 | a | |",
"| 11 | a | |",
"| 11 | a | |",
"| 11 | a | |",
"| 11 | a | |",
"| 11 | a | |",
"| 22 | b | |",
"| 22 | b | |",
"| 22 | b | |",
"| 22 | b | |",
"| 22 | b | |",
"| 22 | b | |",
"| 22 | b | |",
"| 22 | b | y |",
"| 33 | c | |",
"| 33 | c | |",
"| 33 | c | |",
"| 33 | c | |",
"| 33 | c | |",
"| 33 | c | |",
"| 33 | c | |",
"| 33 | c | |",
"| 44 | d | |",
"| 44 | d | |",
"| 44 | d | |",
"| 44 | d | |",
"| 44 | d | |",
"| 44 | d | |",
"| 44 | d | |",
"| 44 | d | |",
"+-------+---------+---------+",
]

@mingmwang
Copy link
Contributor Author

The root cause is that, for Left Out join, for each partition of the right side, they are running independently, each of them constructs the HashJoinStream and has the visited_left_side data structure, each of them will try to produce the unmatched right side and fill with null, which will cause duplications.

@mingmwang
Copy link
Contributor Author

mingmwang commented Nov 16, 2022

@yahoNanJing @liukun4515

@mingmwang
Copy link
Contributor Author

mingmwang commented Nov 16, 2022

For DataFusion, a possible fix is to maintain a global visited_left_side data structure.
And for Ballista, we can not rely on any global structure, because different partitions are different tasks and can run on different machines, we can not run Collect Left mode for LeftJoin/Left Anti/FullJoin.

@mingmwang
Copy link
Contributor Author

I will fix the bug in the new JoinSelection Rule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
1 participant