[SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range #18257

dongjoon-hyun · 2017-06-10T02:08:09Z

What changes were proposed in this pull request?

This PR fixes the inconsistency in SparkSession.range.

BEFORE

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806)

AFTER

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array()

How was this patch tested?

Pass the Jenkins with newly added test cases.

**BEFORE** ``` scala> spark.range(0,0,1).explain == Physical Plan == *Range (0, 0, step=1, splits=8) ``` **AFTER** ``` scala> spark.range(0,0,1).explain == Physical Plan == LocalTableScan <empty>, [id#0L] ```

SparkQA · 2017-06-10T04:31:23Z

Test build #77865 has finished for PR 18257 at commit 6ab9b6f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rednaxelafx · 2017-06-10T05:55:23Z

The end result looks good to me. Thanks for fixing it!

Although, I'd prefer fixing the actual integer overflow handling in RangeExec's codegen, too. Even though your fix will handle the problematic case for sure, it still feels a bit awkward to rely on an optimization for correctness.

dongjoon-hyun · 2017-06-10T07:31:52Z

Thanks. For me, this optimizer does more than the fixing. You can make another PR for the underlying codegen if you want.

kiszk · 2017-06-10T10:15:51Z

Interesting discussion.
@dongjoon-hyun Would it be possible to share among us your thought on why this optimizer does more than fixing the codegen?

dongjoon-hyun · 2017-06-10T15:11:28Z

Sure!

First of all, this handles another invalid case, Range(0,0).

Second, although we fix the generated codes in the reported in SPARK-21041, we still generate No-Op code.

Third, this PR removes the usuless No-Op codegen and Janino compilation and so on. This also will cowork with the other optimizer like PripagateEmptyRelation.

Finally, we can simplify the underlying logic later by preventing this kind of corner cases as eariler as possible.

dongjoon-hyun · 2017-06-10T15:57:18Z

Hi, @cloud-fan and @gatorsmile .
Could you review this PR?

kiszk · 2017-06-10T16:54:06Z

@dongjoon-hyun thank you. IMHO, the codegen can handle the first point. However, the second and third points make sense. The codegen cannot achieve.
Finally, it seems to be design decision about what is the Catalyst optimizer role. Only optimization, or optimization and correctness.
I would like to hear opinions of @cloud-fan and @gatorsmile, too.

wzhfy · 2017-06-10T17:55:41Z

Should we just invalidate such cases? Like we do for step cannot be 0.

dongjoon-hyun · 2017-06-10T17:59:29Z

That's also a good another candidate. In this PR, I just didn't want to surprise the users with the behavior's changes.

gatorsmile · 2017-06-10T18:06:52Z

@dongjoon-hyun Conceptually, we should not fix the correctness by adding optimization rules.

Fix the issues when building the Range.
Also need to add the extra checking logics here to block all the invalid cases.

dongjoon-hyun · 2017-06-10T18:10:24Z

If then, I can remove the SPARK-21041 from the title and PR description in order to reduce the scope. Also, I can remove the test case too.

dongjoon-hyun · 2017-06-10T18:11:31Z

I want to make another PR for the fixing bug. Is it okay for you all?

dongjoon-hyun · 2017-06-10T18:14:59Z

Hmm. Ah, sorry. I'll update this PR since the optimizer is not valid anymore.

dongjoon-hyun · 2017-06-10T18:33:08Z

@gatorsmile Are you sure to raise exception on empty range, Range(0, 0), really?

wzhfy · 2017-06-10T18:33:52Z

Yea, let's update this pr so that previous discussions stay in the same thread.

dongjoon-hyun · 2017-06-10T18:36:51Z

Yep, I'm updating here, @wzhfy .
I will close the SPARK-21044 as Invalid after this is merged.

wzhfy · 2017-06-10T19:00:23Z

@dongjoon-hyun Based on our definition of range, since the default step is 1, I think range(0, 0) is invalid.

dongjoon-hyun · 2017-06-10T19:17:41Z

The problem is Spark RDD supports that in that way until now. We cannot raise exceptions on Dataset operations inconsistently.

scala> sc.range(0,0).collect
res1: Array[Long] = Array()

scala> sc.range(1,10,-1).collect
res2: Array[Long] = Array()

rxin · 2017-06-10T19:27:21Z

Sorry it doesn't make sense to do this. Range is used primarily for testing, and it doesn't make sense to have an optimizer rule that removes it. If there is a correctness issue in it, we should fix that.

And it is perfectly fine to generate no-op code if the query is no-op.

dongjoon-hyun · 2017-06-10T19:30:20Z

@rxin . Yep. The scope is changed like that.

dongjoon-hyun · 2017-06-10T19:49:54Z

Thank you for review again, @rednaxelafx , @kiszk , @wzhfy , @gatorsmile , @rxin .
Now, the corner cases are replaced at the factory and the behavior are consistent.

cloud-fan · 2017-06-10T19:58:26Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

@@ -482,11 +482,17 @@ case class Sort(

 /** Factory for constructing new `Range` nodes. */
 object Range {
-  def apply(start: Long, end: Long, step: Long, numSlices: Option[Int]): Range = {
+  def apply(start: Long, end: Long, step: Long, numSlices: Option[Int])
+      : LeafNode with MultiInstanceRelation = {


this signature looks weird...

cloud-fan · 2017-06-10T19:59:10Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

    val output = StructType(StructField("id", LongType, nullable = false) :: Nil).toAttributes
-    new Range(start, end, step, numSlices, output)
+    if (start == end || (start < end ^ 0 < step)) {
+      LocalRelation(output)


Can we have something like Range.empty to still return a Range?

Thank you for review, @cloud-fan . In that case, which spec can we use for that empty Range which is returned from Range.empty? For the following cases, it seems that we need to remove one of the following precondition again.

require(step != 0, s"step ($step) cannot be 0") require(start != end, s"start ($step) cannot be equal to end ($end)") require(start < end ^ step < 0, s"the sign of step ($step) is invalid for range ($start, $end)")

I'd agree that an empty Range seems a more reasonable choose than a LocalRelation to return for those cases.

What is the values of parameters, start, end, step?

It looks a bit weird to see a LocalRelation node when constructing a Range node.

If possible, I'd let the empty Range keeps its wrong parameters as is. So the logical plan can be consistent with input query. And only turning it to empty relation when planning the query.

Of course this involves few change in Range's planning. I'm not sure if it's acceptable to others. So I'm ok with current solution too.

Actually, at the first commit, I handled this invalid 'Range' in a new optimizer, 'RemoveInvalidRange'. That would be a similar approach.

I will wait more comment on this~ Thank you for review.

Hmm, I think doing it in optimization sounds not ok for me too. It's not an optimization actually. Anyway, let's wait for other comments. Maybe there's other good ways to do.

I like empty Range keeps its wrong parameters as is, too.

Then do you mean removing the 'require' statements back? For the invalid parameters, 'require' raises exceptions now.

@kiszk . What is the difference in your empty Range and invalid Range? It looks to me the same in your suggestion.

dongjoon-hyun · 2017-06-10T21:39:45Z

@cloud-fan . For the signature, I used LogicalPlan instead.

SparkQA · 2017-06-10T22:34:01Z

Test build #77878 has finished for PR 18257 at commit fe9d98b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-11T00:15:54Z

Test build #77883 has finished for PR 18257 at commit 84b87b7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-11T00:20:12Z

Test build #77881 has finished for PR 18257 at commit c4ad4c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-06-11T02:30:59Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

@@ -500,6 +504,8 @@ case class Range(
  extends LeafNode with MultiInstanceRelation {

  require(step != 0, s"step ($step) cannot be 0")
+  require(start != end, s"start ($start) cannot be equal to end ($end)")
+  require(start < end ^ step < 0, s"the sign of step ($step) is invalid for range ($start, $end)")


If we allow to return an empty relation/empty range in Range.apply, why we won't allow it too here? Although it's no-op, but looks like it's legal query?

This precheck statements are added by review comments. This explains the assumptions used in the implementations.

Is it? I saw the comments we should generate no-op code if the query is no-op? And you also said we cannot raise exceptions on Dataset operations inconsistently?

We already prevent them in the above factory method.

I see. You assume we should only use factory Range to construct the nodes.

cloud-fan · 2017-06-12T00:33:54Z

I think Range is mostly for testing, and we don't need to worry about performance here, especially for empty range.

How about we just improve the Range to produce empty result for invalid parameters?

dongjoon-hyun · 2017-06-12T00:49:04Z

Current codegen implementation already generate empty result for the invalid case in general. This is about a trivial corner case. Is this corner case worth of making the codegen implement more complex?

cloud-fan · 2017-06-12T01:01:28Z

is it very hard? Seems we can just output 0-partition RDD for invalid parameters in RangeExec.

dongjoon-hyun · 2017-06-12T01:18:26Z

I see, @cloud-fan. I'll update like that.

dongjoon-hyun · 2017-06-12T02:47:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

-    sqlContext.sparkContext.parallelize(0 until numSlices, numSlices)
-      .map(i => InternalRow(i)) :: Nil
+    val rdd = if (start == end || (start < end ^ 0 < step)) {
+      new EmptyRDD[InternalRow](sqlContext.sparkContext)


Thank you for this idea.

SparkQA · 2017-06-12T05:19:39Z

Test build #77906 has finished for PR 18257 at commit 89dd7ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-06-12T05:29:57Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameRangeSuite.scala

+    Seq("false", "true").foreach { value =>
+      withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> value) {
+        assert(spark.sparkContext.range(start, end, 1).collect.length == 0)
+        assert(spark.range(start, end, 1).collect.length == 0)


Shall we also test the case start = end?

… test case.

SparkQA · 2017-06-12T05:58:36Z

Test build #77915 has started for PR 18257 at commit 46f60f0.

dongjoon-hyun · 2017-06-12T07:21:47Z

Retest this please

SparkQA · 2017-06-12T10:03:10Z

Test build #77920 has finished for PR 18257 at commit 46f60f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…Context.range ## What changes were proposed in this pull request? This PR fixes the inconsistency in `SparkSession.range`. **BEFORE** ```scala scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806) ``` **AFTER** ```scala scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array() ``` ## How was this patch tested? Pass the Jenkins with newly added test cases. Author: Dongjoon Hyun <[email protected]> Closes #18257 from dongjoon-hyun/SPARK-21041. (cherry picked from commit a92e095) Signed-off-by: Wenchen Fan <[email protected]>

cloud-fan · 2017-06-12T13:00:21Z

thanks, merging to master/2.2

dongjoon-hyun · 2017-06-12T14:05:50Z

Thank you, @cloud-fan !
Also, thank you for all reviewers.

…Context.range ## What changes were proposed in this pull request? This PR fixes the inconsistency in `SparkSession.range`. **BEFORE** ```scala scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806) ``` **AFTER** ```scala scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array() ``` ## How was this patch tested? Pass the Jenkins with newly added test cases. Author: Dongjoon Hyun <[email protected]> Closes apache#18257 from dongjoon-hyun/SPARK-21041.

[SPARK-21044][SPARK-21041][SQL] Add RemoveInvalidRange optimizer

6ab9b6f

**BEFORE** ``` scala> spark.range(0,0,1).explain == Physical Plan == *Range (0, 0, step=1, splits=8) ``` **AFTER** ``` scala> spark.range(0,0,1).explain == Physical Plan == LocalTableScan <empty>, [id#0L] ```

dongjoon-hyun changed the title ~~[SPARK-21044][SPARK-21041][SQL] Add RemoveInvalidRange optimizer~~ [SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range Jun 10, 2017

address comments

fe9d98b

cloud-fan reviewed Jun 10, 2017

View reviewed changes

Use LogicalPlan instead.

c4ad4c5

fix.

84b87b7

viirya reviewed Jun 11, 2017

View reviewed changes

address comments.

89dd7ad

dongjoon-hyun commented Jun 12, 2017

View reviewed changes

viirya reviewed Jun 12, 2017

View reviewed changes

Add start == end test case and remove redundant SparkContext.rage…

46f60f0

… test case.

asfgit closed this in a92e095 Jun 12, 2017

dongjoon-hyun deleted the SPARK-21041 branch June 12, 2017 14:06

[SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range #18257

[SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range #18257

Conversation

dongjoon-hyun commented Jun 10, 2017 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

SparkQA commented Jun 10, 2017

rednaxelafx commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

kiszk commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

kiszk commented Jun 10, 2017

wzhfy commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

gatorsmile commented Jun 10, 2017 • edited Loading

dongjoon-hyun commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

wzhfy commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

wzhfy commented Jun 10, 2017 • edited Loading

dongjoon-hyun commented Jun 10, 2017

rxin commented Jun 10, 2017 • edited Loading

dongjoon-hyun commented Jun 10, 2017

dongjoon-hyun commented Jun 10, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

viirya Jun 11, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dongjoon-hyun commented Jun 10, 2017

SparkQA commented Jun 10, 2017

SparkQA commented Jun 11, 2017

SparkQA commented Jun 11, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cloud-fan commented Jun 12, 2017

dongjoon-hyun commented Jun 12, 2017

cloud-fan commented Jun 12, 2017

dongjoon-hyun commented Jun 12, 2017

Choose a reason for hiding this comment

SparkQA commented Jun 12, 2017

viirya Jun 12, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Jun 12, 2017

dongjoon-hyun commented Jun 12, 2017

SparkQA commented Jun 12, 2017

cloud-fan commented Jun 12, 2017

dongjoon-hyun commented Jun 12, 2017

dongjoon-hyun commented Jun 10, 2017 •

edited

Loading

gatorsmile commented Jun 10, 2017 •

edited

Loading

wzhfy commented Jun 10, 2017 •

edited

Loading

rxin commented Jun 10, 2017 •

edited

Loading

viirya Jun 11, 2017 •

edited

Loading

viirya Jun 12, 2017 •

edited

Loading