[SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range #18257
Conversation
**BEFORE**
```
scala> spark.range(0,0,1).explain
== Physical Plan ==
*Range (0, 0, step=1, splits=8)
```

**AFTER**
```
scala> spark.range(0,0,1).explain
== Physical Plan ==
LocalTableScan <empty>, [id#0L]
```
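For context, `SparkContext.range` already treats these arguments as an empty range; a quick REPL check of the intended consistency (a sketch assuming a local `spark` session):

```scala
// SparkContext.range already returns nothing for (0, 0, 1),
// so SparkSession.range should match after this change:
spark.sparkContext.range(0, 0, 1).collect()  // Array()
spark.range(0, 0, 1).collect()               // should also be empty
```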
Test build #77865 has finished for PR 18257 at commit
The end result looks good to me. Thanks for fixing it! That said, I'd prefer fixing the actual integer overflow handling in
Thanks. For me, this optimizer does more than just that fix. You can make another PR for the underlying codegen if you want.
Interesting discussion.
Sure! First of all, this handles another invalid case, Range(0,0). Second, although we fix the generated code for the case reported in SPARK-21041, we still generate no-op code. Third, this PR removes the useless no-op codegen, Janino compilation, and so on. It will also work together with other optimizer rules like PropagateEmptyRelation. Finally, we can simplify the underlying logic later by preventing this kind of corner case as early as possible.
Hi, @cloud-fan and @gatorsmile.
@dongjoon-hyun thank you. IMHO, codegen can handle the first point. However, the second and third points make sense; codegen cannot achieve those.
Should we just invalidate such cases, like we do for the rule that step cannot be 0?
That's another good candidate. In this PR, I just didn't want to surprise users with behavior changes.
@dongjoon-hyun Conceptually, we should not fix correctness issues by adding optimization rules.
If so, I can remove SPARK-21041 from the title and PR description in order to reduce the scope. I can also remove the test case.
I want to make another PR for the bug fix. Is that okay with you all?
Hmm. Ah, sorry. I'll update this PR since the optimizer is not valid anymore.
@gatorsmile Are you sure we should raise an exception on an empty range, Range(0, 0)?
Yea, let's update this PR so that the previous discussions stay in the same thread.
Yep, I'm updating here, @wzhfy.
@dongjoon-hyun Based on our definition of range, since the default step is 1, I think range(0, 0) is invalid.
The problem is that Spark RDD has supported it that way until now. We cannot raise exceptions on Dataset operations inconsistently.
Sorry, it doesn't make sense to do this. Range is used primarily for testing, and it doesn't make sense to have an optimizer rule that removes it. If there is a correctness issue in it, we should fix that. And it is perfectly fine to generate no-op code if the query is no-op.
@rxin Yep, the scope has changed like that.
Thank you for the review again, @rednaxelafx, @kiszk, @wzhfy, @gatorsmile, @rxin.
```diff
@@ -482,11 +482,17 @@ case class Sort(

 /** Factory for constructing new `Range` nodes. */
 object Range {
-  def apply(start: Long, end: Long, step: Long, numSlices: Option[Int]): Range = {
+  def apply(start: Long, end: Long, step: Long, numSlices: Option[Int])
+    : LeafNode with MultiInstanceRelation = {
```
this signature looks weird...
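For context on the signature: `apply` can now return either a `Range` or a `LocalRelation`, so its result type must be a common supertype of both. A simplified, self-contained illustration (the traits and `makeRange` below are stand-ins, not the real Spark types):

```scala
// Why `LeafNode with MultiInstanceRelation` shows up: the factory can
// return either branch, so the declared type must cover both.
trait LeafNode
trait MultiInstanceRelation
case class Range(start: Long, end: Long) extends LeafNode with MultiInstanceRelation
case class LocalRelation() extends LeafNode with MultiInstanceRelation

def makeRange(start: Long, end: Long): LeafNode with MultiInstanceRelation =
  if (start == end) LocalRelation() else Range(start, end)
```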
```diff
     val output = StructType(StructField("id", LongType, nullable = false) :: Nil).toAttributes
-    new Range(start, end, step, numSlices, output)
+    if (start == end || (start < end ^ 0 < step)) {
+      LocalRelation(output)
```
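As a side note, the emptiness condition above relies on boolean XOR: the range is empty when `start == end`, or when the step points away from `end`. A minimal standalone sketch of the same check:

```scala
// `^` is boolean XOR in Scala: the condition holds exactly when the
// direction of the step contradicts the direction from start to end.
def isEmptyRange(start: Long, end: Long, step: Long): Boolean =
  start == end || ((start < end) ^ (0 < step))

assert(isEmptyRange(0, 0, 1))    // start == end
assert(isEmptyRange(5, 1, 1))    // descending bounds, positive step
assert(isEmptyRange(1, 5, -1))   // ascending bounds, negative step
assert(!isEmptyRange(1, 5, 1))   // ordinary non-empty range
```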
Can we have something like `Range.empty` to still return a `Range`?
Thank you for the review, @cloud-fan. In that case, which spec can we use for the empty `Range` returned from `Range.empty`? For the following cases, it seems we would need to remove one of the following preconditions again:

```scala
require(step != 0, s"step ($step) cannot be 0")
require(start != end, s"start ($start) cannot be equal to end ($end)")
require(start < end ^ step < 0, s"the sign of step ($step) is invalid for range ($start, $end)")
```
I'd agree that an empty `Range` seems a more reasonable choice than a `LocalRelation` to return for those cases.
What would the values of the parameters start, end, and step be?
It looks a bit weird to see a `LocalRelation` node when constructing a `Range` node.
If possible, I'd let the empty `Range` keep its wrong parameters as-is, so the logical plan stays consistent with the input query, and only turn it into an empty relation when planning the query.
Of course this involves a few changes in Range's planning. I'm not sure whether that's acceptable to others, so I'm OK with the current solution too. A rough sketch of the idea follows.
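As an illustration only (this is not what the PR ends up doing, and the strategy name is hypothetical): keep the degenerate parameters on the logical `Range` and substitute an empty scan at planning time.

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.{LocalTableScanExec, SparkPlan}

// Hypothetical planner strategy: the logical Range keeps its (possibly
// empty) parameters, and only the physical plan becomes an empty scan.
object EmptyRangeStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case r: logical.Range if r.start == r.end || ((r.start < r.end) ^ (0 < r.step)) =>
      LocalTableScanExec(r.output, Seq.empty) :: Nil
    case _ => Nil
  }
}
```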
Actually, in the first commit, I handled this invalid `Range` in a new optimizer rule, `RemoveInvalidRange`. That would be a similar approach.
I will wait for more comments on this. Thank you for the review.
Hmm, doing it in optimization doesn't sound OK to me either; it's not actually an optimization. Anyway, let's wait for other comments. Maybe there are other good ways to do it.
I like having the empty `Range` keep its wrong parameters as-is, too.
Then do you mean removing the `require` statements again? For invalid parameters, `require` raises exceptions now.
@kiszk What is the difference between your empty `Range` and an invalid `Range`? They look the same to me in your suggestion.
@cloud-fan For the signature, I used
Test build #77878 has finished for PR 18257 at commit
Test build #77883 has finished for PR 18257 at commit
Test build #77881 has finished for PR 18257 at commit
```diff
@@ -500,6 +504,8 @@ case class Range(
   extends LeafNode with MultiInstanceRelation {

+  require(step != 0, s"step ($step) cannot be 0")
+  require(start != end, s"start ($start) cannot be equal to end ($end)")
+  require(start < end ^ step < 0, s"the sign of step ($step) is invalid for range ($start, $end)")
```
If we allow returning an empty relation/empty range in `Range.apply`, why don't we allow it here too? Although it's a no-op, it looks like a legal query.
These precheck statements were added in response to review comments. They document the assumptions used in the implementation.
Is it? I saw the comment that we should generate no-op code if the query is a no-op. And you also said we cannot raise exceptions on Dataset operations inconsistently.
We already prevent them in the above factory method.
I see. You assume we should only use the `Range` factory to construct the nodes. How about we just improve the
The current codegen implementation already generates an empty result for the invalid case in general. This is about a trivial corner case. Is this corner case worth making the codegen implementation more complex?
Is it very hard? It seems we can just output a 0-partition RDD for invalid parameters in
I see, @cloud-fan. I'll update it like that.
```diff
-      sqlContext.sparkContext.parallelize(0 until numSlices, numSlices)
-        .map(i => InternalRow(i)) :: Nil
+      val rdd = if (start == end || (start < end ^ 0 < step)) {
+        new EmptyRDD[InternalRow](sqlContext.sparkContext)
```
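For intuition about the `EmptyRDD` branch above: an RDD with zero partitions launches no tasks at all, which is exactly the desired no-op. The same behavior is observable through the public `SparkContext.emptyRDD` API (a sketch assuming a local `spark` session):

```scala
// An empty RDD has zero partitions, so collect() returns immediately
// without scheduling any tasks.
val empty = spark.sparkContext.emptyRDD[Long]
assert(empty.partitions.isEmpty)
assert(empty.collect().isEmpty)
```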
Thank you for this idea.
Test build #77906 has finished for PR 18257 at commit
Seq("false", "true").foreach { value => | ||
withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> value) { | ||
assert(spark.sparkContext.range(start, end, 1).collect.length == 0) | ||
assert(spark.range(start, end, 1).collect.length == 0) |
Shall we also test the case `start = end`?
Sure.
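A hedged sketch of how the extended test could cover `start = end` alongside the overflow bounds (the exact committed test may differ):

```scala
// Hypothetical extension of the test above: both codegen settings,
// plus a start == end case that must also yield zero rows.
Seq("false", "true").foreach { value =>
  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> value) {
    Seq((0L, 0L), (Long.MaxValue - 3, Long.MinValue + 2)).foreach { case (start, end) =>
      assert(spark.sparkContext.range(start, end, 1).collect().length == 0)
      assert(spark.range(start, end, 1).collect().length == 0)
    }
  }
}
```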
Test build #77915 has started for PR 18257 at commit
Retest this please
Test build #77920 has finished for PR 18257 at commit
[SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range

## What changes were proposed in this pull request?

This PR fixes the inconsistency in `SparkSession.range`.

**BEFORE**
```scala
scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806)
```

**AFTER**
```scala
scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array()
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.

Author: Dongjoon Hyun <[email protected]>

Closes #18257 from dongjoon-hyun/SPARK-21041.

(cherry picked from commit a92e095)
Signed-off-by: Wenchen Fan <[email protected]>
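For reference, the BEFORE output is non-empty because of Long overflow in the range-size arithmetic: `end - start` wraps around to a small positive number, so the range looks non-empty. A minimal arithmetic check (illustrative only, not the exact Spark code path):

```scala
val start = Long.MaxValue - 3  //  9223372036854775804
val end   = Long.MinValue + 2  // -9223372036854775806
// Long subtraction wraps modulo 2^64, yielding 6 instead of a huge
// negative number, which made the old code treat the range as non-empty.
assert(end - start == 6L)
```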
Thanks, merging to master/2.2
Thank you, @cloud-fan!
## What changes were proposed in this pull request?

This PR fixes the inconsistency in `SparkSession.range`.

**BEFORE**
```scala
scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806)
```

**AFTER**
```scala
scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array()
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.