-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix hash join for nested types #11232
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good. It'd be better if we can add some e2e tests in sqllogictest.
@viirya Good suggestion. I tried added some sqllogictest, they work. But as far as I can tell they are not using HashJoinExec (even if the exact same query joining on only field is). Is there some way to force a HashJoin or help me understand why hashjoin is not used? |
Hmm, what join operator it is using? I think HashJoin is used by default. |
Just using
|
Reading the explain more closely I see that the reason for it not using HashJoin seems to be that it fails to extract an equi-join condition when it is joining on structs. |
Seems like we need to add struct here: https://github.com/eejbyfeldt/datafusion/blob/2b851c5bbdaef7d1d78d91833d2a7e8d94809bc8/datafusion/expr/src/utils.rs#L830-L833 for hash join to be used. |
# Join on struct | ||
query ?? | ||
select join_t3.s3, join_t4.s4 | ||
from join_t3 | ||
inner join join_t4 on join_t3.s3 = join_t4.s4 | ||
---- | ||
{id: 2} {id: 2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add an explain
query to make sure it is using HashJoin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for this query. The one using IS NOT DISTINCT FROM
is still not using hash join. But that does not seem to be related to the struct, since it the same if just joining on a int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so the query using IS NOT DISTINCT FROM
is actually not testing HashJoin path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly not. If we made some other improvements like #11272 it would.
But for not I just changed the test case to use EXCEPT
instead and test that path. (The test case fails on master without the fixes in the branches)
77ba24a
to
47a2e02
Compare
Co-authored-by: Liang-Chi Hsieh <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work @eejbyfeldt
* Fixes to 10749 and generalization * Add e2e tests for joins on struct * PR comments * Add Struct to can_hash method * Add explain query as well * Use EXCEPT to trigger failure * Update datafusion/sqllogictest/test_files/joins.slt Co-authored-by: Liang-Chi Hsieh <[email protected]> --------- Co-authored-by: Liang-Chi Hsieh <[email protected]>
* Fixes to 10749 and generalization * Add e2e tests for joins on struct * PR comments * Add Struct to can_hash method * Add explain query as well * Use EXCEPT to trigger failure * Update datafusion/sqllogictest/test_files/joins.slt Co-authored-by: Liang-Chi Hsieh <[email protected]> --------- Co-authored-by: Liang-Chi Hsieh <[email protected]>
Which issue does this PR close?
Does some follow up to #10749 and it PR #11117 and follow up #11149.
Closes #11233
Rationale for this change
In #11117 hash join was extended to handled nested types. The fixed contained a small issues around handling of
null_equals_null
where it only used the new compare function whennull_equals_null
was true. But it also used theOperator::Eq
which does not consider nulls equal.What changes are included in this PR?
This address the 2 described issues. Makes hash join also support nested types when
null_equals_null
is false and use the correctOperator
when it is true.Are these changes tested?
New test cases.
Are there any user-facing changes?
Yes, it fixes bugs in unreleased features.