Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Supported nested types in HashJoin #735

Merged
merged 5 commits into from
Aug 13, 2024

Conversation

eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Jul 29, 2024

Which issue does this PR close?

Closes #621 .

Rationale for this change

Allow more joins to be executed by comet.

What changes are included in this PR?

The latest Datafusion contains fix for doing joins with struct keys, so this PR removes that limitation. The Spark diffs had to be upgrade to exclude one more test case for the same reason as ther other tests in that test suite.

How are these changes tested?

Existing and new tests.

@eejbyfeldt eejbyfeldt force-pushed the i621-nested-type-in-hash-join branch from 3c8c238 to 781a685 Compare August 7, 2024 07:30
@eejbyfeldt eejbyfeldt marked this pull request as ready for review August 7, 2024 17:54
@@ -192,6 +192,13 @@ class CometJoinSuite extends CometTestBase {

// DataFusion HashJoin LeftAnti has bugs in handling nulls and is disabled for now.
// left.join(right, left("_2") === right("_1"), "leftanti")

// Full join: struct key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test. I wonder if we should make this more comprehensive to cover structs containing different types, nulls, and nested structs?

Also, what happens with structs containing unsupported types such as array and map? Do we still fall back for those? It would be good to have a test for this case as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call will add more test cases. Created #797 to make it easier to create nulls of struct type.

Also, what happens with structs containing unsupported types such as array and map?

Do you mean unsupported by comet here? I think the answer for map is that spark does not support joining on map so it will fail during the Spark planning stage. I will add a test for array by I guess that requires something like #793 first since we do not support reading arrays from parquet. Or is there some other way testing with arrays?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about map not being supported by Spark.

I think we should fall back for array for now because we don't really support array yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do fallback currently. Is there someway or even desired to add a test for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could improve our test framework to make it easier to test for fallback but it is possible with code like this (must be used after calling collect on a DataFrame). I think we can improve the tests in a future PR.

 val str =
          new ExtendedExplainInfo().generateExtendedInfo(df.queryExecution.executedPlan)
        assert(str.contains(expectedMessage))

@eejbyfeldt eejbyfeldt force-pushed the i621-nested-type-in-hash-join branch from 781a685 to 7fac289 Compare August 12, 2024 07:00
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @eejbyfeldt

Needs to be ignored for the same reason as the other specs in the
DynamicPartitionPruningSuiteBase. But previously we did not hit the
issue as we did not support the join being done.
@eejbyfeldt eejbyfeldt force-pushed the i621-nested-type-in-hash-join branch from ae9c6ca to c4f2eee Compare August 13, 2024 07:26
@andygrove andygrove merged commit 9d4afc1 into apache:main Aug 13, 2024
74 checks passed
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* fix: Supported nested types in HashJoin

* Update diffs ignore new failing specs

Needs to be ignored for the same reason as the other specs in the
DynamicPartitionPruningSuiteBase. But previously we did not hit the
issue as we did not support the join being done.

* Improve type support check

* Improve tests

* Remove unneeded supportedDataType guard

(cherry picked from commit 9d4afc1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support nested types in HashJoin
3 participants