
[SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range #18257

Closed (6 commits)

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jun 10, 2017

What changes were proposed in this pull request?

This PR fixes the inconsistency in SparkSession.range.

**BEFORE**
```
scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806)
```

**AFTER**
```
scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array()
```

How was this patch tested?

Pass the Jenkins with newly added test cases.

**BEFORE**
```
scala> spark.range(0,0,1).explain
== Physical Plan ==
*Range (0, 0, step=1, splits=8)
```

**AFTER**
```
scala> spark.range(0,0,1).explain
== Physical Plan ==
LocalTableScan <empty>, [id#0L]
```
@SparkQA

SparkQA commented Jun 10, 2017

Test build #77865 has finished for PR 18257 at commit 6ab9b6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx
Contributor

The end result looks good to me. Thanks for fixing it!

That said, I'd prefer to fix the actual integer overflow handling in RangeExec's codegen, too. Even though your fix handles the problematic case for sure, it still feels a bit awkward to rely on an optimization for correctness.

@dongjoon-hyun
Member Author

Thanks. In my view, this optimizer does more than just fix the bug. You are welcome to open another PR for the underlying codegen if you want.

@kiszk
Member

kiszk commented Jun 10, 2017

Interesting discussion.
@dongjoon-hyun Would you mind sharing your thoughts on why this optimizer does more than fixing the codegen?

@dongjoon-hyun
Member Author

Sure!

First, this handles another invalid case, Range(0, 0).

Second, even if we fix the generated code for the case reported in SPARK-21041, we still generate no-op code.

Third, this PR removes the useless no-op codegen, the Janino compilation, and so on. It also works together with other optimizers like PropagateEmptyRelation.

Finally, we can simplify the underlying logic later by screening out this kind of corner case as early as possible.
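The empty-range condition being discussed can be sketched outside of Spark. This is a minimal standalone illustration, not Spark's actual code; `isEmpty` is a hypothetical helper that mirrors the `start == end || (start < end ^ 0 < step)` check this PR relies on:

```scala
object EmptyRangeCheck {
  // A range is empty when start == end, or when the sign of step
  // does not move start toward end (expressed with xor).
  def isEmpty(start: Long, end: Long, step: Long): Boolean = {
    require(step != 0, s"step ($step) cannot be 0")
    start == end || (start < end ^ 0 < step)
  }

  def main(args: Array[String]): Unit = {
    // The SPARK-21041 case: end has wrapped below start, so the range is
    // rejected as empty up front and codegen never overflows.
    assert(isEmpty(Long.MaxValue - 3, Long.MinValue + 2, 1))
    assert(isEmpty(0, 0, 1))    // Range(0, 0)
    assert(isEmpty(1, 10, -1))  // wrong step sign
    assert(!isEmpty(0, 3, 1))   // normal ascending range
    assert(!isEmpty(10, 0, -1)) // normal descending range
    println("ok")
  }
}
```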

@dongjoon-hyun
Member Author

Hi, @cloud-fan and @gatorsmile .
Could you review this PR?

@kiszk
Member

kiszk commented Jun 10, 2017

@dongjoon-hyun thank you. IMHO, the codegen can handle the first point. However, the second and third points make sense; the codegen cannot achieve those.
Finally, it seems to be a design decision about the Catalyst optimizer's role: optimization only, or optimization and correctness.
I would like to hear the opinions of @cloud-fan and @gatorsmile, too.

@wzhfy
Contributor

wzhfy commented Jun 10, 2017

Should we just invalidate such cases, like we do for the rule that step cannot be 0?

@dongjoon-hyun
Member Author

That's also a good candidate. In this PR, I just didn't want to surprise users with behavior changes.

@gatorsmile
Member

gatorsmile commented Jun 10, 2017

@dongjoon-hyun Conceptually, we should not fix correctness by adding optimization rules.

  • Fix the issue when building the Range.
  • Also add extra checking logic here to block all the invalid cases.

@dongjoon-hyun
Member Author

If so, I can remove SPARK-21041 from the title and the PR description in order to reduce the scope. I can also remove the test case.

@dongjoon-hyun
Member Author

I want to make another PR to fix the bug. Is that okay with you all?

@dongjoon-hyun
Member Author

Hmm. Ah, sorry. I'll update this PR since the optimizer is not valid anymore.

@dongjoon-hyun
Member Author

@gatorsmile Do you really want to raise an exception on an empty range like Range(0, 0)?

@wzhfy
Contributor

wzhfy commented Jun 10, 2017

Yea, let's update this pr so that previous discussions stay in the same thread.

@dongjoon-hyun
Member Author

Yep, I'm updating here, @wzhfy .
I will close SPARK-21044 as Invalid after this is merged.

@wzhfy
Contributor

wzhfy commented Jun 10, 2017

@dongjoon-hyun Based on our definition of range, since the default step is 1, I think range(0, 0) is invalid.

@dongjoon-hyun
Member Author

The problem is that Spark RDD has supported it that way until now. We cannot inconsistently raise exceptions on the Dataset operations.

```
scala> sc.range(0,0).collect
res1: Array[Long] = Array()

scala> sc.range(1,10,-1).collect
res2: Array[Long] = Array()
```

@rxin
Contributor

rxin commented Jun 10, 2017

Sorry it doesn't make sense to do this. Range is used primarily for testing, and it doesn't make sense to have an optimizer rule that removes it. If there is a correctness issue in it, we should fix that.

And it is perfectly fine to generate no-op code if the query is no-op.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21044][SPARK-21041][SQL] Add RemoveInvalidRange optimizer [SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range Jun 10, 2017
@dongjoon-hyun
Member Author

@rxin Yep, the scope has been changed accordingly.

@dongjoon-hyun
Member Author

Thank you for review again, @rednaxelafx , @kiszk , @wzhfy , @gatorsmile , @rxin .
Now the corner cases are handled in the factory and the behavior is consistent.

```
@@ -482,11 +482,17 @@ case class Sort(

 /** Factory for constructing new `Range` nodes. */
 object Range {
-  def apply(start: Long, end: Long, step: Long, numSlices: Option[Int]): Range = {
+  def apply(start: Long, end: Long, step: Long, numSlices: Option[Int])
+    : LeafNode with MultiInstanceRelation = {
```
Contributor

this signature looks weird...

```
val output = StructType(StructField("id", LongType, nullable = false) :: Nil).toAttributes
new Range(start, end, step, numSlices, output)
if (start == end || (start < end ^ 0 < step)) {
  LocalRelation(output)
```
Contributor

Can we have something like Range.empty to still return a Range?

Member Author

Thank you for the review, @cloud-fan. In that case, which spec should the empty Range returned from Range.empty use? For the following cases, it seems we would need to remove one of the following preconditions again.

```
require(step != 0, s"step ($step) cannot be 0")
require(start != end, s"start ($start) cannot be equal to end ($end)")
require(start < end ^ step < 0, s"the sign of step ($step) is invalid for range ($start, $end)")
```
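To see which precondition each invalid case trips, here is a small standalone sketch; `checkRange` is a hypothetical helper bundling the three `require` statements above, not Spark API:

```scala
object RangePreconditions {
  // Hypothetical helper: the three preconditions under discussion.
  def checkRange(start: Long, end: Long, step: Long): Unit = {
    require(step != 0, s"step ($step) cannot be 0")
    require(start != end, s"start ($start) cannot be equal to end ($end)")
    require(start < end ^ step < 0,
      s"the sign of step ($step) is invalid for range ($start, $end)")
  }

  def main(args: Array[String]): Unit = {
    checkRange(0, 10, 1)   // valid ascending range
    checkRange(10, 0, -1)  // valid descending range
    // Each invalid case trips exactly one precondition:
    // zero step, empty range, and wrong step sign.
    for ((s, e, st) <- Seq((0L, 10L, 0L), (0L, 0L, 1L), (1L, 10L, -1L))) {
      val failed =
        try { checkRange(s, e, st); false }
        catch { case _: IllegalArgumentException => true }
      assert(failed, s"expected checkRange($s, $e, $st) to fail")
    }
    println("ok")
  }
}
```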

Member

I'd agree that an empty Range seems a more reasonable choice than a LocalRelation to return for those cases.

Member Author

What would the values of the parameters start, end, and step be?

Member

@viirya viirya Jun 11, 2017

It looks a bit weird to see a LocalRelation node when constructing a Range node.

If possible, I'd let the empty Range keep its wrong parameters as is, so the logical plan stays consistent with the input query, and only turn it into an empty relation when planning the query.

Of course this involves a few changes in Range's planning. I'm not sure if it's acceptable to others, so I'm ok with the current solution too.

Member Author

Actually, in the first commit, I handled this invalid `Range` in a new optimizer, `RemoveInvalidRange`. That would be a similar approach.

I will wait for more comments on this. Thank you for the review.

Member

Hmm, doing it in the optimizer doesn't sound right to me either; it's not actually an optimization. Anyway, let's wait for other comments. Maybe there are other good ways to do it.

Member

I also like the idea of the empty Range keeping its wrong parameters as is.

Member Author

Then do you mean removing the `require` statements again? For invalid parameters, `require` raises exceptions now.

Member Author

@kiszk What is the difference between your empty Range and an invalid Range? They look the same to me in your suggestion.

@dongjoon-hyun
Member Author

@cloud-fan . For the signature, I used LogicalPlan instead.

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77878 has finished for PR 18257 at commit fe9d98b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77883 has finished for PR 18257 at commit 84b87b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77881 has finished for PR 18257 at commit c4ad4c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -500,6 +504,8 @@ case class Range(
   extends LeafNode with MultiInstanceRelation {

   require(step != 0, s"step ($step) cannot be 0")
+  require(start != end, s"start ($start) cannot be equal to end ($end)")
+  require(start < end ^ step < 0, s"the sign of step ($step) is invalid for range ($start, $end)")
```
Member

If we allow returning an empty relation/empty range in Range.apply, why don't we allow it here too? Although it's a no-op, it looks like a legal query?

Member Author

These precheck statements were added based on review comments. They document the assumptions used in the implementation.

Member

Is it? I saw comments saying we should generate no-op code if the query is a no-op, and you also said we cannot inconsistently raise exceptions on Dataset operations.

Member Author

We already prevent them in the factory method above.

Member

I see. You assume we should only use the Range factory to construct these nodes.

@cloud-fan
Contributor

I think Range is mostly for testing, and we don't need to worry about performance here, especially for empty range.

How about we just improve the Range to produce empty result for invalid parameters?

@dongjoon-hyun
Member Author

The current codegen implementation already generates an empty result for invalid cases in general. This is about a trivial corner case. Is this corner case worth making the codegen implementation more complex?

@cloud-fan
Contributor

Is it very hard? It seems we can just output a 0-partition RDD for invalid parameters in RangeExec.

@dongjoon-hyun
Member Author

I see, @cloud-fan. I'll update like that.

```
sqlContext.sparkContext.parallelize(0 until numSlices, numSlices)
  .map(i => InternalRow(i)) :: Nil
val rdd = if (start == end || (start < end ^ 0 < step)) {
  new EmptyRDD[InternalRow](sqlContext.sparkContext)
```
Member Author

Thank you for this idea.
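The 0-partition idea can be illustrated without Spark. In this sketch (plain Scala with hypothetical names, not RangeExec itself), an empty range short-circuits to no elements, mirroring the EmptyRDD branch above, while BigInt arithmetic stands in for codegen's overflow-safe iteration:

```scala
object RangeEval {
  // Analogue of the RangeExec change: empty ranges yield nothing (like
  // EmptyRDD), so the loop never has to iterate an overflowing range.
  def elements(start: Long, end: Long, step: Long): Seq[Long] = {
    require(step != 0, s"step ($step) cannot be 0")
    if (start == end || (start < end ^ 0 < step)) Seq.empty
    else (BigInt(start) until BigInt(end) by BigInt(step)).map(_.toLong)
  }

  def main(args: Array[String]): Unit = {
    // The SPARK-21041 case produces an empty result, matching sc.range.
    assert(elements(Long.MaxValue - 3, Long.MinValue + 2, 1).isEmpty)
    assert(elements(0, 0, 1).isEmpty)   // Range(0, 0)
    assert(elements(1, 10, -1).isEmpty) // wrong step sign
    assert(elements(0, 3, 1) == Seq(0L, 1L, 2L))
    assert(elements(3, 0, -1) == Seq(3L, 2L, 1L))
    println("ok")
  }
}
```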

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77906 has finished for PR 18257 at commit 89dd7ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
Seq("false", "true").foreach { value =>
  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> value) {
    assert(spark.sparkContext.range(start, end, 1).collect.length == 0)
    assert(spark.range(start, end, 1).collect.length == 0)
```
Member

@viirya viirya Jun 12, 2017

Shall we also test the case start = end?

Member Author

Sure.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77915 has started for PR 18257 at commit 46f60f0.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77920 has finished for PR 18257 at commit 46f60f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 12, 2017
…Context.range

## What changes were proposed in this pull request?

This PR fixes the inconsistency in `SparkSession.range`.

**BEFORE**
```scala
scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806)
```

**AFTER**
```scala
scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array()
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.

Author: Dongjoon Hyun <[email protected]>

Closes #18257 from dongjoon-hyun/SPARK-21041.

(cherry picked from commit a92e095)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

thanks, merging to master/2.2

@asfgit asfgit closed this in a92e095 Jun 12, 2017
@dongjoon-hyun
Member Author

Thank you, @cloud-fan !
Also, thank you to all the reviewers.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-21041 branch June 12, 2017 14:06
dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017