SPARK-1939 Refactor takeSample method in RDD to use ScaSRS #916

dorx · 2014-05-29T22:26:59Z

Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.
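To make the "sampling rate > sample_size/total" idea concrete, here is a minimal sketch of how such a rate could be chosen for sampling without replacement and then checked against the 0.9999 target. The object name, the method names, and the Chernoff-style formula are assumptions for illustration; they are not taken from the patch, and its private method may use different constants.

```scala
object SamplingRateSketch {
  // Hypothetical helper: a rate p chosen so that a Chernoff lower-tail bound puts
  // P(Binomial(total, p) < sampleSize) at or below delta.
  def fraction(sampleSize: Int, total: Long, delta: Double = 1e-4): Double = {
    val f = sampleSize.toDouble / total
    val gamma = -math.log(delta) / total
    math.min(1.0, f + gamma + math.sqrt(gamma * gamma + 2 * gamma * f))
  }

  // Exact P(Binomial(n, p) >= s) for 0 < p < 1, accumulated in log space to avoid underflow.
  def successProbability(n: Int, p: Double, s: Int): Double = {
    val logOdds = math.log(p) - java.lang.Math.log1p(-p)
    var logPmf = n * java.lang.Math.log1p(-p) // log P(X = 0)
    var below = 0.0                           // running P(X < s)
    var k = 0
    while (k < s) {
      below += math.exp(logPmf)
      // Step from P(X = k) to P(X = k + 1).
      logPmf += math.log((n - k).toDouble / (k + 1)) + logOdds
      k += 1
    }
    1.0 - below
  }

  def main(args: Array[String]): Unit = {
    val (s, n) = (100, 10000) // hypothetical sample size and RDD size
    val p = fraction(s, n)
    println(s"rate = $p, naive rate = ${s.toDouble / n}")
    println(s"P(sample size >= $s) = ${successProbability(n, p, s)}")
  }
}
```

The exact binomial check plays the role that the unit test mentioned above plays for the private method: it confirms that the chosen rate oversamples enough to reach the requested size with the target probability.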

AmplabJenkins · 2014-05-29T22:27:58Z

Can one of the admins verify this patch?

mengxr · 2014-05-29T22:53:01Z

Jenkins, test this please.

mengxr · 2014-05-29T22:53:09Z

Jenkins, add to whitelist.

mengxr · 2014-05-29T22:56:46Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

+   * @return sample of specified size in an array
+   */
+  def takeSample(withReplacement: Boolean,
+                 num: Int,


use 4-space indentation

AmplabJenkins · 2014-05-29T22:57:59Z

Merged build triggered.

AmplabJenkins · 2014-05-29T22:58:07Z

Merged build started.

mengxr · 2014-05-29T22:59:02Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

@@ -402,10 +411,11 @@ abstract class RDD[T: ClassTag](
    }

    if (num > initialCount && !withReplacement) {
+      // special case not covered in computeFraction


When sampling without replacement, num cannot be greater than initialCount. What is this block for?

dorx · 2014-05-29

Legacy code to prevent overflow if initialCount = Integer.MAX_VALUE

mengxr · 2014-06-03

I don't think it can really prevent overflow. The fraction is chosen as 3 * INT_MAX / count, which means the expected sample size is 3 * INT_MAX > INT_MAX. So collect() will almost surely throw an exception.
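The arithmetic behind this objection can be spelled out in a few lines. This is only an illustration: the object and method names are made up here, and the count of 10^10 is an arbitrary example.

```scala
object LegacyFractionSketch {
  // Fraction as described in the comment above: 3 * INT_MAX / count.
  def legacyFraction(count: Long): Double = 3.0 * Int.MaxValue / count

  // Expected number of elements kept when each element is kept with probability `fraction`.
  def expectedSampleSize(fraction: Double, count: Long): Double = fraction * count

  def main(args: Array[String]): Unit = {
    val count = 10000000000L // hypothetical RDD size, 10^10
    val expected = expectedSampleSize(legacyFraction(count), count)
    // A JVM array cannot hold more than Int.MaxValue elements.
    println(s"expected sample size = $expected, array limit = ${Int.MaxValue}")
    println(s"fits in one array: ${expected <= Int.MaxValue}")
  }
}
```

Whatever the count, the expected size works out to about 3 * INT_MAX, so the collected sample would not fit in a single array.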

AmplabJenkins · 2014-05-29T23:36:15Z

Merged build finished.

AmplabJenkins · 2014-05-29T23:36:16Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15295/

Reviewer comments addressed:
- commons-math3 is now a test-only dependency, bumped up to v3.3
- comments added to explain what computeFraction is doing
- fixed the unit test for computeFraction to use BinomialDistribution for sampling without replacement
- stylistic fixes

AmplabJenkins · 2014-05-30T00:57:58Z

Merged build triggered.

AmplabJenkins · 2014-05-30T00:58:04Z

Merged build started.

dorx · 2014-05-30T00:58:27Z

@mengxr do your worst

AmplabJenkins · 2014-05-30T01:36:24Z

Merged build finished.

AmplabJenkins · 2014-05-30T01:36:24Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15297/

AmplabJenkins · 2014-06-13T00:37:15Z

Build started.

AmplabJenkins · 2014-06-13T00:42:06Z

Build triggered.

AmplabJenkins · 2014-06-13T00:42:15Z

Build started.

mengxr · 2014-06-13T00:56:34Z

LGTM. Thanks! Waiting for Jenkins ...

AmplabJenkins · 2014-06-13T01:08:21Z

Build finished.

AmplabJenkins · 2014-06-13T01:08:21Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15738/

AmplabJenkins · 2014-06-13T01:22:05Z

Merged build triggered.

AmplabJenkins · 2014-06-13T01:49:58Z

Merged build started.

AmplabJenkins · 2014-06-13T02:07:40Z

Build finished.

AmplabJenkins · 2014-06-13T02:07:40Z

Build finished.

AmplabJenkins · 2014-06-13T02:07:40Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15743/

AmplabJenkins · 2014-06-13T02:07:40Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15742/

AmplabJenkins · 2014-06-13T02:32:57Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-13T02:32:57Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15747/

mengxr · 2014-06-13T02:45:03Z

Merged. Thanks!

colorant · 2014-06-13T07:06:37Z

@dorx Do you think this works for an extremely large data set with a really small sample size, e.g. n = 1.0x10^10 with sample = 1? In that case the final adjusted fraction comes to around 1.2x10^-9, so in theory there is still a 99.99% chance of getting the sample. But Double has precision limits, and Random is, after all, not truly random, so do you think that is enough to guarantee a 99.99% chance under this extreme condition? I am asking because, in this very case, the original code's (3x(1+1)) / total gives a fraction of around 6x10^-10, about half that of the new code, and with that fraction it kept looping forever and never managed to return that 1 sample.

mengxr · 2014-06-13T09:28:46Z

@colorant Tried the following with the new implementation:

val rdd = sc.parallelize(0 until 1000000000, 10000).flatMap(i => Iterator.fill(10)(0)) // 10^10
rdd.takeSample(false, 1).size
rdd.takeSample(true, 1).size

Both worked well. We might need a better RNG for even smaller sampling probabilities. Another solution is to set a lower bound in computeFractionForSampleSize, e.g., 10^{-9}. I prefer the latter to avoid using expensive RNGs. Could you run some tests and derive a good lower bound? Thanks!
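A lower bound of the kind suggested here could be as simple as a clamp applied to whatever rate the existing computation returns. The sketch below uses the 10^{-9} floor mentioned above purely as a placeholder (the thread leaves the actual value open), and `withFloor` is a hypothetical helper, not part of the patch.

```scala
object FractionFloorSketch {
  // Placeholder floor taken from the example in the comment; the real value is undecided.
  val MinFraction: Double = 1e-9

  // Clamp a computed sampling rate into [MinFraction, 1.0].
  def withFloor(fraction: Double): Double =
    math.min(1.0, math.max(MinFraction, fraction))

  def main(args: Array[String]): Unit = {
    val total = 1000000000000L // hypothetical RDD size, 10^12
    val raw = 2.0e-12          // hypothetical computed rate that falls below the floor
    val clamped = withFloor(raw)
    // Expected number of sampled elements before and after clamping.
    println(s"raw: ${raw * total}, clamped: ${clamped * total}")
  }
}
```

The trade-off is that a floor raises the expected number of collected elements on very large RDDs, in exchange for never asking the RNG to resolve probabilities below the floor.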

dorx · 2014-06-13T20:59:19Z

@colorant Thanks for taking a look at this!

First of all let me just say that I ran Xiangrui's code but with ".fill(1000)" (so 100x in RDD size), and it was still able to select a sample with exactly one data point in one pass.

So there are a couple of things in play here. The smallest resolution handled by a Double is 2^(-1074) ~ 5e-324, so until we run into RDDs of size ~10^323, we in theory won't end up with a sampling rate of 0. Then it comes down to whether the random number generator is truly random and isn't biased against very small numbers. The two experiments Xiangrui and I ran seem to suggest that the java.util.Random object is able to produce small enough random numbers. However, we should definitely further investigate the quality of the RNG used to gauge sampling behavior at even smaller sampling rates.
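The two scales in play can be checked directly. The sketch relies on the documented behaviour of `java.util.Random.nextDouble()`, which builds its result from 53 random bits, so draws land on a grid of step 2^-53 (about 1.1e-16). Whether the sampler in this patch compares such a draw against the fraction is an assumption here, and the object and method names are hypothetical.

```scala
import java.util.Random

object RngResolutionSketch {
  // Smallest positive Double, 2^-1074.
  val smallestDouble: Double = java.lang.Math.scalb(1.0, -1074)

  // Grid step of java.util.Random.nextDouble(), 2^-53.
  val rngStep: Double = java.lang.Math.scalb(1.0, -53)

  // True if x is an exact multiple of 2^-53 (scaling by a power of two is exact).
  def onRngGrid(x: Double): Boolean = {
    val scaled = java.lang.Math.scalb(x, 53)
    scaled == math.floor(scaled)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val allOnGrid = Seq.fill(1000)(rng.nextDouble()).forall(onRngGrid)
    println(s"smallest Double = $smallestDouble, RNG step = $rngStep")
    println(s"all draws on the 2^-53 grid: $allOnGrid")
  }
}
```

Under that assumption, the rates of around 1e-9 discussed in this thread sit well above the grid step, while rates near or below 2^-53 would not be resolved by this generator even though a Double can represent them.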

One thing to note about this implementation is that at higher sampling rates, we are actually able to save memory by not caching as many samples as before in order to be able to guarantee the sample size in one try.
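As a rough illustration of the memory point, the sketch below compares the expected number of elements pulled back under the old rate quoted earlier in the thread, 3 x (num + 1) / total, with a rate of the form f + gamma + sqrt(gamma^2 + 2 x gamma x f), where f = num / total. That second formula, its delta of 1e-4, and all names are assumptions for illustration and may not match the merged code.

```scala
object OversamplingSketch {
  // Old rate as quoted in the thread: 3 * (num + 1) / total.
  def oldFraction(num: Int, total: Long): Double = 3.0 * (num + 1) / total

  // Assumed new rate (without replacement), from a Chernoff-style bound with failure rate delta.
  def newFraction(num: Int, total: Long, delta: Double = 1e-4): Double = {
    val f = num.toDouble / total
    val gamma = -math.log(delta) / total
    math.min(1.0, f + gamma + math.sqrt(gamma * gamma + 2 * gamma * f))
  }

  def main(args: Array[String]): Unit = {
    val (num, total) = (1000000, 100000000L) // hypothetical: 10^6 out of 10^8
    println(s"old expected sample: ${oldFraction(num, total) * total}")
    println(s"new expected sample: ${newFraction(num, total) * total}")
  }
}
```

For large requested sizes the assumed new rate stays close to num / total, whereas the old rate always asks for roughly three times the requested size.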

Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.

Author: Doris Xin <[email protected]>
Author: dorx <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#916 from dorx/takeSample and squashes the following commits:

5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request apache#2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS


SPARK-1939 Refactor takeSample method in RDD to use ScaSRS

1441977

mengxr reviewed May 29, 2014

edge cases

444e750

merge master

5b061ae

asfgit closed this in 1de1d70 Jun 13, 2014

dorx deleted the takeSample branch June 18, 2014 20:38

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021

Add Iceberg as a dep (apache#916)

8a7c1a9

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021

Revert "Add Iceberg as a dep (apache#916)"

d96abce

This reverts commit 8a7c1a9.

