fix: Supported nested types in HashJoin #735
Conversation
Force-pushed from 3c8c238 to 781a685
@@ -192,6 +192,13 @@ class CometJoinSuite extends CometTestBase {

// DataFusion HashJoin LeftAnti has bugs in handling nulls and is disabled for now.
// left.join(right, left("_2") === right("_1"), "leftanti")

// Full join: struct key
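The diff above adds a join test keyed on a struct column. The underlying requirement is that a hash join's key type must provide structural equality and hashing. As a language-agnostic illustration (a toy sketch, not Comet or DataFusion code; all names here are hypothetical), a Scala case class gives exactly that, so a minimal hash join over struct-like keys can be written as:

```scala
// Hypothetical sketch: a hash join keyed on a struct-like case class.
// Case classes provide structural equals/hashCode, which is what the
// build/probe phases of a hash join need from nested key types.
case class Key(a: Int, b: String)

object StructKeyJoinSketch {
  def hashJoin[V, W](
      left: Seq[(Key, V)],
      right: Seq[(Key, W)]): Seq[(Key, V, W)] = {
    // Build side: hash table keyed on the struct.
    val table = right.groupBy(_._1)
    // Probe side: look up each left key via structural equality.
    for {
      (k, v) <- left
      (_, w) <- table.getOrElse(k, Seq.empty)
    } yield (k, v, w)
  }

  def main(args: Array[String]): Unit = {
    val left = Seq(Key(1, "x") -> "L1", Key(2, "y") -> "L2")
    val right = Seq(Key(1, "x") -> "R1")
    val joined = hashJoin(left, right)
    assert(joined == Seq((Key(1, "x"), "L1", "R1")))
    println("ok")
  }
}
```

This is why nested key types that lack well-defined equality semantics in the engine (such as map) cannot be join keys, as discussed further down in the thread.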
Thanks for adding the test. I wonder if we should make this more comprehensive to cover structs containing different types, nulls, and nested structs?
Also, what happens with structs containing unsupported types such as array and map? Do we still fall back for those? It would be good to have a test for this case as well.
Good call, will add more test cases. Created #797 to make it easier to create nulls of struct type.
> Also, what happens with structs containing unsupported types such as array and map?

Do you mean unsupported by Comet here? I think the answer for map is that Spark does not support joining on map, so it will fail during the Spark planning stage. I will add a test for array, but I guess that requires something like #793 first, since we do not support reading arrays from Parquet. Or is there some other way of testing with arrays?
Good point about map not being supported by Spark.
I think we should fall back for array for now because we don't really support array yet.
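The fallback decision discussed here amounts to a recursive check over the join-key type: structs are supported if all their fields are, while arrays and maps force a fallback to Spark. A minimal sketch of that idea over a toy type ADT (hypothetical names; not Comet's actual `QueryPlanSerde` code):

```scala
// Hypothetical sketch of a recursive join-key type-support check.
// The DType ADT is a toy stand-in for Spark SQL's DataType hierarchy.
sealed trait DType
case object IntT extends DType
case object StringT extends DType
case class StructT(fields: Seq[DType]) extends DType
case class ArrayT(elem: DType) extends DType
case class MapT(key: DType, value: DType) extends DType

object JoinKeySupport {
  // A struct key is supported only if every field (recursively) is;
  // array and map keys are rejected so the plan falls back to Spark.
  def supported(t: DType): Boolean = t match {
    case IntT | StringT      => true
    case StructT(fields)     => fields.forall(supported)
    case _: ArrayT | _: MapT => false
  }

  def main(args: Array[String]): Unit = {
    assert(supported(StructT(Seq(IntT, StructT(Seq(StringT))))))
    assert(!supported(StructT(Seq(ArrayT(IntT)))))
    println("ok")
  }
}
```

A nested struct is accepted only when the recursion bottoms out in supported leaf types, which matches the "structs containing unsupported types" concern raised earlier in the thread.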
We do fall back currently. Is there some way to add a test for that, or is it even desirable?
I think we could improve our test framework to make it easier to test for fallback, but it is possible with code like this (it must be used after calling collect on a DataFrame). I think we can improve the tests in a future PR.

val str =
  new ExtendedExplainInfo().generateExtendedInfo(df.queryExecution.executedPlan)
assert(str.contains(expectedMessage))
Force-pushed from 781a685 to 7fac289
Review comment on spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (outdated; resolved)
LGTM. Thanks @eejbyfeldt
Needs to be ignored for the same reason as the other specs in DynamicPartitionPruningSuiteBase. Previously we did not hit the issue because we did not support the join being performed.
Force-pushed from ae9c6ca to c4f2eee
* fix: Supported nested types in HashJoin
* Update diffs to ignore new failing specs. Needs to be ignored for the same reason as the other specs in the DynamicPartitionPruningSuiteBase, but previously we did not hit the issue as we did not support the join being done.
* Improve type support check
* Improve tests
* Remove unneeded supportedDataType guard

(cherry picked from commit 9d4afc1)
Which issue does this PR close?
Closes #621.
Rationale for this change
Allows more joins to be executed by Comet.
What changes are included in this PR?
The latest DataFusion contains a fix for joins with struct keys, so this PR removes that limitation. The Spark diffs had to be updated to exclude one more test case, for the same reason as the other tests in that test suite.
How are these changes tested?
Existing and new tests.