-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32330][SQL] Preserve shuffled hash join build side partitioning #29130
Conversation
cc @maropu, @cloud-fan, @gatorsmile and @sameeragarwal if you guys can help take a look. Thanks! |
Test build #125951 has finished for PR 29130 at commit
|
@cloud-fan and @imback82 - I was not aware of #28676 before making this PR. After checking #28676, TLDR is I think we are solving different issues and there's no code conflict between these two. #28676: preserve broadcast hash join build side partitioning for inner join if the stream side is hash partitioned. @imback82 It's a great idea that I never thought about it before. I bet in production, out users should have hit this issue before, but I think our action was just asking them to disable broadcast join (SMJ on small table, instead of broadcasting it - the cost of it is small, as the table should be small enough to be broadcasted), then partitioning info gets propagated through query plan, and the followed shuffle can be saved. But I think #28676 handles the thing automatically on spark side, which should be better. this PR: preserve shuffled hash join build side partitioning, which is a much smaller trivial change compared to handle broadcast hash join. Because for required children distribution, shuffled hash join is same as sort merge join, shuffled hash join output partitioning should be same as sort merge join (except it cannot handle full outer join case). We found this issue when our users did shuffled hash join on bucketed tables, and had followed join / group-by on build side. We have run this in production for more than one years. There's no config needed for this feature, and IMO it could be enabled by default. What do you think? Thanks. |
@@ -47,6 +47,18 @@ case class ShuffledHashJoinExec( | |||
"buildDataSize" -> SQLMetrics.createSizeMetric(sparkContext, "data size of build side"), | |||
"buildTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to build hash map")) | |||
|
|||
override def outputPartitioning: Partitioning = joinType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exactly the same as SMJ. Shall we create a common trait ShuffleJoin
to put it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan - there's an extra case for sort merge join to handle full outer join. I am thinking to handle all other join types in parent trait ShuffleJoin
, and override outputPartitioning
in SortMergeJoinExec
to handle the extra FullOuter
? What do you think?
But for me it's kind of weird that ShuffleJoin
not handle FullOuter
as shuffled FullOuter
join is one kind of ShuffleJoin
. But if ShuffleJoin
handles FullOuter
, it seems to be also weird that ShuffledHashJoinExec
extends it.
Wondering what do you think? The change itself is easy. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does shuffle hash join not support FullOuter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does shuffle hash join not support FullOuter?
@cloud-fan sorry if I miss anything, but isn't this true now? Given current spark implementation for hash join, stream side looks up in build side hash map, it can handle non-matching keys from stream side if there's no match in build side hash map. But it cannot handle non-matching keys from build side, as there's no info persisted from stream side.
I feel an interesting followup could be to handle full outer join in shuffled hash join, where when looking up stream side keys from build side HashedRelation
. Mark this info inside build side HashedRelation
, and after reading all rows from stream side, output all non-matching rows from build side based on modified HashedRelation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simply move SortMergeJoinExec.outputPartitioning
to the parent trait? It works for ShuffledHashJoinExec
as well, as the planner guarantees ShuffledHashJoinExec.joinType
won't be FullOuter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I agree. Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 as well
|
||
test("SPARK-32330: Preserve shuffled hash join build side partitioning") { | ||
withSQLConf( | ||
SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: set it to "-1" to make the intention (turning off broadcast join) clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK. Thanks!
Test build #126058 has finished for PR 29130 at commit
|
retest this please |
Test build #126081 has finished for PR 29130 at commit
|
thanks, merging to master! |
Thank you @cloud-fan for review! |
What changes were proposed in this pull request?
Currently
ShuffledHashJoin.outputPartitioning
inherits fromHashJoin.outputPartitioning
, which only preserves stream side partitioning (HashJoin.scala
):This loses build side partitioning information, and causes extra shuffle if there's another join / group-by after this join.
Example:
Current physical plan (having an extra shuffle on
k1
before aggregate)Ideal physical plan (no shuffle on
k1
before aggregate)This can be fixed by overriding
outputPartitioning
method inShuffledHashJoinExec
, similar toSortMergeJoinExec
.In addition, also fix one typo in
HashJoin
, as that code path is shared between broadcast hash join and shuffled hash join.Why are the changes needed?
To avoid shuffle (for queries having multiple joins or group-by), for saving CPU and IO.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit test in
JoinSuite
.