-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-35350][SQL] Add code-gen for left semi sort merge join #32528
Conversation
cc @cloud-fan and @maropu could you help take a look when you have time? Thanks. |
Kubernetes integration test starting |
Kubernetes integration test status failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left minor comments and it looks otherwise. Thank you, @c21
@@ -724,9 +749,32 @@ case class SortMergeJoinExec( | |||
""".stripMargin | |||
} | |||
|
|||
lazy val semiJoin = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about extracting this block as a private method like codegenXXXX
just like HashJoin
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu - yes I was thinking at the first place but worried about number of parameters to be too many. Refined the code a bit and updated now.
case _ => false | ||
} | ||
val inMemoryThreshold = | ||
if (onlyBufferFirstMatchedRow) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving this branch into the getInMemoryThreshold
side?
// Flag to only buffer first matched row, to avoid buffering unnecessary rows.
private lazy val onlyBufferFirstMatchedRow = (joinType, condition) match {
case (LeftSemi, None) => true
case _ => false
}
private def getInMemoryThreshold: Int = {
if (onlyBufferFirstMatchedRow) {
1
} else {
sqlContext.conf.sortMergeJoinExecBufferInMemoryThreshold
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, lazy val
can probably be def
as the logic is super simple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Actually the non-code-gen path can also depend on this, so I make it just a val
now.
Test build #138476 has finished for PR 32528 at commit
|
To ease for review, the change for all plan files is used by followed command:
None of them are updated manually. |
case _: InnerLike => innerJoin | ||
case LeftOuter | RightOuter => outerJoin | ||
case _: InnerLike => | ||
codegenInner(findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck, outputRow, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we pass beforeLoop.trim
so that we don't need to do it in all the 3 methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually after double checking, we do not need to do beforeLoop.trim
as beforeLoop
already has stripMargin
, and has no trailing spaces. Also updated to avoid repeated conditionCheck.trim
s""" | ||
|while ($findNextJoinRows) { | ||
| ${beforeLoop.trim} | ||
| boolean $hasOutputRow = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this flag if we are sure matchIterator
has at most one element?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan - matchIterator
will only has at most one element if join condition is empty. So yes we don't need this if join condition is empty. But consider the extra code is just a while loop check on hasOutputRow
, and set value of hasOutputRow
, I don't see much value to specialize another code-gen for left semi join without join condition. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, let's keep it
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #138500 has finished for PR 32528 at commit
|
thanks, merging to master! |
Test build #138506 has finished for PR 32528 at commit
|
Thank you @cloud-fan and @maropu for review! |
What changes were proposed in this pull request?
As title. This PR is to add code-gen support for LEFT SEMI sort merge join. The main change is to add
semiJoin
code path inSortMergeJoinExec.doProduce()
and introduceonlyBufferFirstMatchedRow
inSortMergeJoinExec.genScanner()
. The latter is for left semi sort merge join without condition. For this kind of query, we don't need to buffer all matched rows, but only the first one (this is same as non-code-gen code path).Example query:
Example of generated code for the query:
Why are the changes needed?
Improve query CPU performance. Test with one query:
Seeing 30% of run-time improvement:
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit test in
WholeStageCodegenSuite.scala
andExistenceJoinSuite.scala
.