[SPARK-1837] NumericRange should be partitioned in the same way as other... #776

kanzhang · 2014-05-14T21:28:40Z

... sequences

AmplabJenkins · 2014-05-14T21:32:59Z

Can one of the admins verify this patch?

witgo · 2014-05-15T06:06:45Z

core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala

        var r = nr
-        for (i <- 0 until numSlices) {
+        for ((start, end) <- positions(nr.length, numSlices)) {
+          val sliceSize = end - start
          slices += r.take(sliceSize).asInstanceOf[Seq[T]]


scala> dr.take(1)
res11: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0)
scala> dr.take(2)
res12: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0)
scala> dr.take(3)
res13: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0, 1.2)
scala> dr.take(4)
res14: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0, 1.2, 1.4, 1.5999999999999999)
scala> lr.take(1)
res15: scala.collection.immutable.NumericRange[Long] = NumericRange(1)
scala> lr.take(2)
res16: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4)
scala> lr.take(3)
res17: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4, 7)
scala> lr.take(4)
res18: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4, 7)

Why does (1D to 2D).by(0.2).take(2) return NumericRange(1.0)? Is this a bug?

Looks like a bug to me. This, in turn, is causing problems in RDD.

scala> sc.parallelize((1D to 2D).by(0.2), 2).collectPartitions
res15: Array[Array[Double]] = Array(Array(1.0, 1.2), Array(1.6, 1.8))

This issue has been reported to Scala and is still open: https://issues.scala-lang.org/browse/SI-8518

Do you want to wait on this to be fixed by Scala, or do you want to work around it for now?

I'd wait for Scala to fix it. That said, I'm open to a workaround (I just don't see one myself).

Alright, maybe we should wait for Scala then. By the way, for your original use case, was the range you wanted always (0 to numElements)? If so you can also try RDD.zipWithIndex.

@mateiz the use case wasn't mine, it was from the reporter of SPARK-1817. Btw, I think this PR can be committed independently of the Scala fix. It fixes the issue for other numeric ranges (e.g., Long), and will also work on Double once the Scala fix is in.

Won't this patch make it lose numbers out of Double ranges? Whereas the current implementation works.

@mateiz the current implementation would lose elements for all types of numeric ranges (including Long and Double) when we zip a numeric range with other sequences, because we partition numeric ranges differently from other sequences. This patch fixes it by partitioning numeric ranges at exactly the same indexes as we would on other sequences. However, we still depend on take and drop being implemented correctly on numeric ranges for things to work. The Scala bug affects take and drop on Double ranges, but not on other numeric ranges like Long (hence, the unit tests in this patch, which are based on Long ranges, are successful).
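To illustrate the mismatch described above, here is a minimal sketch. It assumes the old numeric-range path cut fixed-size slices (rounded up) with take and drop, while other sequences were cut at index boundaries; neither body is shown in full in this thread, so `fixedSizeSlices` and `indexSlices` are hypothetical stand-ins, not the actual Spark code.

```scala
object ZipAlignmentSketch {
  // Assumed old scheme for numeric ranges: a fixed slice size, rounded up,
  // peeled off the front with take/drop.
  def fixedSizeSlices[T](seq: Seq[T], numSlices: Int): List[Seq[T]] = {
    val sliceSize = (seq.length + numSlices - 1) / numSlices
    var rest = seq
    val out = List.newBuilder[Seq[T]]
    for (_ <- 0 until numSlices) {
      out += rest.take(sliceSize)
      rest = rest.drop(sliceSize)
    }
    out.result()
  }

  // Assumed scheme for other sequences: slice i covers the index range
  // [i * n / numSlices, (i + 1) * n / numSlices).
  def indexSlices[T](seq: Seq[T], numSlices: Int): List[Seq[T]] = {
    val n = seq.length.toLong
    (0 until numSlices).toList.map { i =>
      val start = (i * n / numSlices).toInt
      val end = ((i + 1) * n / numSlices).toInt
      seq.slice(start, end)
    }
  }
}
```

With a four-element Long range such as `1L to 10L by 3L` and three slices, the fixed-size scheme yields slice lengths 2, 2, 0, while the index-based scheme yields 1, 1, 2. Zipping partitions cut by different schemes would therefore pair slices of different sizes, which is the element loss this patch is meant to remove.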

Okay, that makes sense then; I didn't realize that we were already using drop and take. In that case we should merge this patch as is and maybe create a JIRA for Double ranges so people see it's a known issue. Made one other small comment on the patch.

mateiz · 2014-05-17T02:20:07Z

Jenkins, this is ok to test

AmplabJenkins · 2014-05-17T02:22:59Z

Merged build triggered.

AmplabJenkins · 2014-05-17T02:23:05Z

Merged build started.

AmplabJenkins · 2014-05-17T03:03:24Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-17T03:03:25Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15060/

mateiz · 2014-06-02T00:27:51Z

core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala

@@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
    if (numSlices < 1) {
      throw new IllegalArgumentException("Positive number of slices required")
    }
+    // Sequences need to be sliced at same positions for operations
+    // like RDD.zip() to behave as expected
+    def positions(length: Long, numSlices: Int) = {


This needs an explicit return type (e.g. : Seq[(Int, Int)])

Actually it would be better if this returned an Iterator, so that it doesn't materialize the whole sequence. You can do (0 until numSlices).iterator.map(...).

This needs an explicit return type (e.g. : Seq[(Int, Int)])

For binary compatibility?

This is just our style throughout the code. It makes it easier to avoid compatibility-breaking changes.
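The two review suggestions (an explicit return type, and an Iterator instead of a materialized sequence) could be combined roughly as follows. Only the signature `positions(length: Long, numSlices: Int)` appears in the diff; the body below is an assumption about how slice boundaries might be computed, and the wrapper object `PositionsSketch` is hypothetical.

```scala
object PositionsSketch {
  // Explicit return type, as requested in review. Returning an Iterator means
  // the (start, end) pairs are produced one at a time instead of being
  // collected into a sequence up front.
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
    (0 until numSlices).iterator.map { i =>
      // Assumed boundary formula: slice i covers [start, end).
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
}
```

For example, `PositionsSketch.positions(10, 3).toList` gives `List((0, 3), (3, 6), (6, 10))`, so the last slice absorbs the remainder.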

kanzhang · 2014-06-03T18:42:19Z

Updated the patch based on @mateiz's comments.

AmplabJenkins · 2014-06-03T18:42:58Z

Merged build triggered.

AmplabJenkins · 2014-06-03T18:43:08Z

Merged build started.

AmplabJenkins · 2014-06-03T20:14:01Z

Merged build finished.

AmplabJenkins · 2014-06-03T20:14:01Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15392/

mateiz · 2014-06-03T20:23:03Z

Jenkins, retest this please

AmplabJenkins · 2014-06-03T20:28:00Z

Merged build triggered.

AmplabJenkins · 2014-06-03T20:28:06Z

Merged build started.

AmplabJenkins · 2014-06-03T21:08:53Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-03T21:08:53Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15399/

kanzhang · 2014-06-13T17:25:46Z

@mateiz could you take another look at this when you get a chance? SPARK-1817 has been marked as resolved, but the fix for the original issue depends on this patch. Thx.

mateiz · 2014-06-14T19:56:30Z

Oh, sorry, I forgot to merge this after testing it. Jenkins, retest this please.

AmplabJenkins · 2014-06-14T19:59:39Z

Merged build triggered.

AmplabJenkins · 2014-06-14T19:59:48Z

Merged build started.

AmplabJenkins · 2014-06-14T20:43:29Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-14T20:43:30Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15790/

mateiz · 2014-06-14T21:32:30Z

Alright, merged this. Thanks!

…her... ... sequences

Author: Kan Zhang <[email protected]>

Closes apache#776 from kanzhang/SPARK-1837 and squashes the following commits:

e48f018 [Kan Zhang] [SPARK-1837] code refactoring
67c33b5 [Kan Zhang] minor change
403f9b1 [Kan Zhang] [SPARK-1837] NumericRange should be partitioned in the same way as other sequences

[SPARK-1837] NumericRange should be partitioned in the same way as ot…

403f9b1

…her sequences

kanzhang mentioned this pull request May 14, 2014

[SPARK-1817] RDD.zip() should verify partition sizes for each partition #760

Closed

witgo reviewed May 15, 2014

minor change

67c33b5

mateiz reviewed Jun 2, 2014

kanzhang mentioned this pull request Jun 2, 2014

[SPARK-1817] RDD.zip() should verify partition sizes for each partition #944

Closed

[SPARK-1837] code refactoring

e48f018

asfgit closed this in 7dd9fc6 Jun 14, 2014

kanzhang deleted the SPARK-1837 branch June 16, 2014 01:01
