fix: Supported nested types in HashJoin #735

eejbyfeldt · 2024-07-29T07:28:24Z

Which issue does this PR close?

Closes #621 .

Rationale for this change

Allow more joins to be executed by Comet.

What changes are included in this PR?

The latest DataFusion contains a fix for doing joins with struct keys, so this PR removes that limitation. The Spark diffs had to be updated to exclude one more test case, for the same reason as the other tests in that test suite.
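As a rough illustration of what a recursive "is this join key type supported" check can look like, here is a minimal sketch. It is not the code in QueryPlanSerde.scala: `KeyType`, its cases, and `JoinKeySupport.isSupported` are hypothetical stand-ins for Spark's `DataType` hierarchy so that the example runs without Spark on the classpath. It assumes struct keys are accepted when every field is accepted, and that array keys still fall back, as discussed in the review below.

```scala
// Hypothetical stand-in for Spark's DataType hierarchy (assumption, not Comet code).
sealed trait KeyType
case object IntKey extends KeyType
case object StringKey extends KeyType
final case class StructKey(fields: List[KeyType]) extends KeyType
final case class ArrayKey(element: KeyType) extends KeyType
final case class MapKey(key: KeyType, value: KeyType) extends KeyType

object JoinKeySupport {
  // Returns true when a join key of this type could be handled natively;
  // false means the join should fall back to Spark.
  def isSupported(t: KeyType): Boolean = t match {
    case IntKey | StringKey => true
    // A struct is supported only if all of its fields are, including nested structs.
    case StructKey(fields) => fields.forall(isSupported)
    // Arrays are assumed to fall back for now.
    case ArrayKey(_) => false
    // Spark itself rejects map join keys, so this case is only here for completeness.
    case MapKey(_, _) => false
  }
}
```

With this shape, a struct nested inside a struct is accepted, while a struct that contains an array anywhere in its fields is rejected as a whole.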

How are these changes tested?

Existing and new tests.

andygrove · 2024-08-08T00:15:51Z

spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala

@@ -192,6 +192,13 @@ class CometJoinSuite extends CometTestBase {

          // DataFusion HashJoin LeftAnti has bugs in handling nulls and is disabled for now.
          // left.join(right, left("_2") === right("_1"), "leftanti")
+
+          // Full join: struct key


Thanks for adding the test. I wonder if we should make this more comprehensive to cover structs containing different types, nulls, and nested structs?

Also, what happens with structs containing unsupported types such as array and map? Do we still fall back for those? It would be good to have a test for this case as well.

Good call, I will add more test cases. Created #797 to make it easier to create nulls of struct type.

Also, what happens with structs containing unsupported types such as array and map?

Do you mean unsupported by Comet here? I think the answer for map is that Spark does not support joining on map, so it will fail during the Spark planning stage. I will add a test for array, but I guess that requires something like #793 first, since we do not support reading arrays from Parquet. Or is there some other way of testing with arrays?

Good point about map not being supported by Spark.

I think we should fall back for array for now because we don't really support array yet.

We do fall back currently. Is there some way to add a test for that, and is it even desired?

I think we could improve our test framework to make it easier to test for fallback, but it is possible with code like this (it must be used after calling collect on a DataFrame). I think we can improve the tests in a future PR.

val str = new ExtendedExplainInfo().generateExtendedInfo(df.queryExecution.executedPlan)
assert(str.contains(expectedMessage))

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

andygrove

LGTM. Thanks @eejbyfeldt


* fix: Supported nested types in HashJoin
* Update diffs ignore new failing specs

  Needs to be ignored for the same reason as the other specs in the DynamicPartitionPruningSuiteBase. But previously we did not hit the issue as we did not support the join being done.
* Improve type support check
* Improve tests
* Remove unneeded supportedDataType guard

(cherry picked from commit 9d4afc1)

eejbyfeldt force-pushed the i621-nested-type-in-hash-join branch from 3c8c238 to 781a685 on August 7, 2024 07:30

eejbyfeldt marked this pull request as ready for review August 7, 2024 17:54

andygrove reviewed Aug 8, 2024


Kimahriman mentioned this pull request Aug 8, 2024

feat: CreateArray support #793

Merged

eejbyfeldt mentioned this pull request Aug 8, 2024

feat: Add support for null literal with struct type #797

Merged

viirya reviewed Aug 8, 2024


spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

eejbyfeldt force-pushed the i621-nested-type-in-hash-join branch from 781a685 to 7fac289 on August 12, 2024 07:00

andygrove reviewed Aug 12, 2024


spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (outdated)

andygrove approved these changes Aug 12, 2024


eejbyfeldt added 5 commits August 13, 2024 09:26

fix: Supported nested types in HashJoin

1e93be4

Update diffs ignore new failing specs

9b334e0

Needs to be ignored for the same reason as the other specs in the DynamicPartitionPruningSuiteBase. But previously we did not hit the issue as we did not support the join being done.

Improve type support check

bd3e92d

Improve tests

c1ab4f4

Remove unneeded supportedDataType guard

c4f2eee

eejbyfeldt force-pushed the i621-nested-type-in-hash-join branch from ae9c6ca to c4f2eee on August 13, 2024 07:26

andygrove merged commit 9d4afc1 into apache:main Aug 13, 2024
74 checks passed
