[SPARK-32330][SQL] Preserve shuffled hash join build side partitioning #29130
```diff
@@ -31,6 +31,7 @@ import org.apache.spark.sql.catalyst.expressions.{Ascending, GenericRow, SortOrd
 import org.apache.spark.sql.catalyst.plans.logical.Filter
 import org.apache.spark.sql.execution.{BinaryExecNode, FilterExec, SortExec, SparkPlan}
 import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
+import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
 import org.apache.spark.sql.execution.joins._
 import org.apache.spark.sql.execution.python.BatchEvalPythonExec
 import org.apache.spark.sql.internal.SQLConf
@@ -1086,4 +1087,21 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
     assert(df2.join(df1, "id").collect().isEmpty)
   }
 
+  test("SPARK-32330: Preserve shuffled hash join build side partitioning") {
+    withSQLConf(
+      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
+      SQLConf.SHUFFLE_PARTITIONS.key -> "2",
+      SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
+      val df1 = spark.range(10).select($"id".as("k1"))
+      val df2 = spark.range(30).select($"id".as("k2"))
+      Seq("inner", "cross").foreach(joinType => {
+        val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
+          .queryExecution.executedPlan
+        assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
+        // No extra shuffle before aggregate
+        assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
+      })
+    }
+  }
 }
```

Inline review comment on the `SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50"` line:

nit: set it to "-1" to make the intention (turning off broadcast join) clear?

Ah OK. Thanks!
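To see why the test expects exactly two `ShuffleExchangeExec` nodes, here is a toy model (my own stand-in types, not Spark code): a planner inserts an exchange only when a child's output partitioning does not already satisfy the operator's requirement, so a join that preserves its build side partitioning saves the shuffle before the aggregate.

```scala
// Toy model, not Spark code: sketches why the test above expects exactly two
// ShuffleExchangeExec nodes once shuffled hash join reports its partitioning.
case class HashPart(keys: Set[String], numPartitions: Int)

// Number of exchanges in a join-then-aggregate query, depending on whether
// the join reports a usable output partitioning to the planner.
def exchangesForJoinThenAgg(
    joinOutput: Option[HashPart],
    aggRequirement: HashPart): Int = {
  val joinInputExchanges = 2 // both join inputs are shuffled on the join keys
  val aggExchange = joinOutput match {
    case Some(p) if p == aggRequirement => 0 // partitioning preserved: reuse it
    case _ => 1 // partitioning unknown: re-shuffle before the aggregate
  }
  joinInputExchanges + aggExchange
}
```

With the fix, the join reports a hash partitioning on the group-by key, so `groupBy($"k1")` needs no third shuffle; before the fix the aggregate forced one more exchange.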
This is exactly the same as SMJ. Shall we create a common trait `ShuffleJoin` to put it in?

@cloud-fan - there's an extra case in sort merge join to handle full outer join. I am thinking of handling all other join types in a parent trait `ShuffleJoin`, and overriding `outputPartitioning` in `SortMergeJoinExec` to handle the extra `FullOuter` case. What do you think?

To me it's kind of weird that `ShuffleJoin` would not handle `FullOuter`, since a shuffled `FullOuter` join is one kind of `ShuffleJoin`. But if `ShuffleJoin` does handle `FullOuter`, it seems equally weird that `ShuffledHashJoinExec` extends it. Wondering what you think? The change itself is easy. Thanks.
Why does shuffled hash join not support `FullOuter`?

@cloud-fan sorry if I'm missing anything, but isn't this true now? In the current Spark implementation of hash join, the stream side probes the build side hash map. It can handle non-matching keys from the stream side (a stream row with no match in the build side hash map), but it cannot handle non-matching keys from the build side, since no information from the stream side is persisted.

I feel an interesting follow-up could be to support full outer join in shuffled hash join: when looking up stream side keys in the build side `HashedRelation`, mark the matched entries inside the build side `HashedRelation`, and after reading all rows from the stream side, output all non-matching rows from the build side based on the modified `HashedRelation`.
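The follow-up idea above can be sketched with plain Scala collections (a hedged stand-in for Spark's `HashedRelation`, not the real implementation): track which build-side rows found a match during stream-side probing, then emit the unmatched build-side rows to complete a full outer join within one partition.

```scala
// Hedged sketch of full outer hash join using plain collections. "matched"
// plays the role of match info recorded inside the build-side relation.
def fullOuterHashJoin[K, L, R](
    build: IndexedSeq[(K, R)],
    stream: Seq[(K, L)]): Seq[(Option[L], Option[R])] = {
  // Hash "relation": key -> indices of build rows, plus a matched flag per row.
  val byKey = build.indices.groupBy(i => build(i)._1)
  val matched = Array.fill(build.length)(false)
  val out = Seq.newBuilder[(Option[L], Option[R])]
  for ((k, l) <- stream) byKey.get(k) match {
    case Some(idxs) =>
      for (i <- idxs) { matched(i) = true; out += ((Some(l), Some(build(i)._2))) }
    case None =>
      out += ((Some(l), None)) // stream-side row with no build-side match
  }
  // After the stream side is exhausted, output build-side non-matches.
  for (i <- build.indices if !matched(i)) out += ((None, Some(build(i)._2)))
  out.result()
}
```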
Can we simply move `SortMergeJoinExec.outputPartitioning` to the parent trait? It works for `ShuffledHashJoinExec` as well, since the planner guarantees `ShuffledHashJoinExec.joinType` won't be `FullOuter`.

@cloud-fan I agree. Updated.
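The agreed design can be sketched with a simplified, self-contained model (my own stand-in types, not Spark source): `outputPartitioning` lives in a common `ShuffleJoin` trait, and `ShuffledHashJoinExec` inherits it safely because the planner never hands it a `FullOuter` join.

```scala
// Simplified stand-ins for Spark's join types and partitionings; a sketch of
// the agreed design, not the actual Spark implementation.
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

sealed trait Partitioning { def numPartitions: Int }
case class HashPartitioning(keys: Seq[String], numPartitions: Int) extends Partitioning
case class PartitioningCollection(parts: Seq[Partitioning]) extends Partitioning {
  def numPartitions: Int = parts.head.numPartitions
}
case class UnknownPartitioning(numPartitions: Int) extends Partitioning

// Common parent trait: roughly SortMergeJoinExec.outputPartitioning hoisted up.
trait ShuffleJoin {
  def joinType: JoinType
  def left: Partitioning
  def right: Partitioning

  def outputPartitioning: Partitioning = joinType match {
    case Inner      => PartitioningCollection(Seq(left, right)) // both sides usable
    case LeftOuter  => left                                     // outer side preserved
    case RightOuter => right
    case FullOuter  => UnknownPartitioning(left.numPartitions)  // nothing preserved
  }
}

// Shuffled hash join reuses the trait as-is; since the planner guarantees its
// joinType is never FullOuter, that branch is unreachable here.
case class ShuffledHashJoinSketch(joinType: JoinType, left: Partitioning, right: Partitioning)
  extends ShuffleJoin
```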