Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-1939 Refactor takeSample method in RDD to use ScaSRS #916

Closed
wants to merge 21 commits into from

Conversation

dorx
Copy link
Contributor

@dorx dorx commented May 29, 2014

Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@mengxr
Copy link
Contributor

mengxr commented May 29, 2014

Jenkins, test this please.

@mengxr
Copy link
Contributor

mengxr commented May 29, 2014

Jenkins, add to whitelist.

* @return sample of specified size in an array
*/
def takeSample(withReplacement: Boolean,
num: Int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use 4-space indentation

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@@ -402,10 +411,11 @@ abstract class RDD[T: ClassTag](
}

if (num > initialCount && !withReplacement) {
// special case not covered in computeFraction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sample without replacement, num cannot be greater than initialCount. What is block for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legacy code to prevent overflow if initialCount = Integer.MAX_VALUE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it can really prevent overflow. The fraction is chosen as 3 * INT_MAX / count, which means the expect sample size is 3 * INT_MAX > INT_MAX. So collect() will throw an exception almost surely.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15295/

Reviewer comments addressed:
- commons-math3 is now a test-only dependency. bumped up to v3.3
- comments added to explain what computeFraction is doing
- fixed the unit for computeFraction to use BinomialDitro for without
replacement sampling
- stylistic fixes
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@dorx
Copy link
Contributor Author

dorx commented May 30, 2014

@mengxr do your worst

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15297/

@AmplabJenkins
Copy link

Build started.

@AmplabJenkins
Copy link

Build triggered.

@AmplabJenkins
Copy link

Build started.

@mengxr
Copy link
Contributor

mengxr commented Jun 13, 2014

LGTM. Thanks! Waiting for Jenkins ...

@AmplabJenkins
Copy link

Build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15738/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Build finished.

1 similar comment
@AmplabJenkins
Copy link

Build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15743/

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15742/

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15747/

@mengxr
Copy link
Contributor

mengxr commented Jun 13, 2014

Merged. Thanks!

@asfgit asfgit closed this in 1de1d70 Jun 13, 2014
@colorant
Copy link
Contributor

@dorx Do you think this works for extreme large data set with really small sample size? e.g. n = 1.0x10^10 while sample = 1 ? in that case, the final adjusted fraction lead to around 1.2x10^-9, by theory, there are still 99.99 chance to get sample. But since Double also has precision issue, and Random afterall is not true random. so do you think it is enough to guarantee 99.99 chance under this extreme condition? I am wondering about this is because, Actually, in the very case, the original code (3x(1+1)) / total will give a fraction around 6x10^-10, which is just about half size of the new code. And under that fraction value. it keep loop for ever and never did get a chance to return that 1 sample.

@mengxr
Copy link
Contributor

mengxr commented Jun 13, 2014

@colorant Tried the following with the new implementation:

val rdd = sc.parallelize(0 until 1000000000, 10000).flatMap(i => Iterator.fill(10)(0)) // 10^10
rdd.takeSample(false, 1).size
rdd.takeSample(true, 1).size

Both worked well. We might need a better RNG for even smaller sampling probabilities. Another solution is set a lower bound in comptueFractionForSampleSize, e.g, 10^{-9}. I prefer the latter to avoid using expensive RNGs. Could you run some tests and derive a good lower bound? Thanks!

@dorx
Copy link
Contributor Author

dorx commented Jun 13, 2014

@colorant Thanks for taking a look at this!

First of all let me just say that I ran Xiangrui's code but with ".fill(1000)" (so 100x in RDD size), and it was still able to select a sample with exactly one data point in one pass.

So there's a couple things in play here. The smallest resolution handled by a Double is 2^(-1074) ~ 5e-324, so before we run into RDDs of size ~10^323, we in theory won't run into have a sampling rate of 0. Then it comes down to whether the random number generator is truly random and isn't biased against very small numbers. The two experiments Xiangrui and I ran seem to suggest that the java.util.Random object is able to produce small enough random numbers. However, we should definitely further investigate the quality of the RNG used to gauge sampling behavior at even smaller sampling rates.

One thing to note about this implementation is that at higher sampling rates, we are actually able to save memory by not caching as many samples as before in order to be able to guarantee the sample size in one try.

@dorx dorx deleted the takeSample branch June 18, 2014 20:38
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.

Author: Doris Xin <[email protected]>
Author: dorx <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#916 from dorx/takeSample and squashes the following commits:

5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request apache#2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.

Author: Doris Xin <[email protected]>
Author: dorx <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#916 from dorx/takeSample and squashes the following commits:

5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request apache#2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
Agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
…org.apache.curator.framework.api.ProtectACLCreateModePathAndBytesable org.apache.curator.framework.api.CreateBuilder.creatingParentsIfNeeded()' (apache#916)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants