SPARK-1939 Refactor takeSample method in RDD to use ScaSRS #916
Can one of the admins verify this patch?
Jenkins, test this please.
Jenkins, add to whitelist.
* @return sample of specified size in an array
*/
def takeSample(withReplacement: Boolean,
num: Int,
use 4-space indentation
Merged build triggered.
Merged build started.
@@ -402,10 +411,11 @@ abstract class RDD[T: ClassTag](
    }

    if (num > initialCount && !withReplacement) {
      // special case not covered in computeFraction
When sampling without replacement, num cannot be greater than initialCount. What is this block for?
Legacy code to prevent overflow if initialCount = Integer.MAX_VALUE
I don't think it can really prevent overflow. The fraction is chosen as 3 * INT_MAX / count, which means the expected sample size is 3 * INT_MAX > INT_MAX. So collect() will almost surely throw an exception.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15295/
Reviewer comments addressed:
- commons-math3 is now a test-only dependency, bumped up to v3.3
- comments added to explain what computeFraction is doing
- fixed the unit test for computeFraction to use BinomialDistribution for sampling without replacement
- stylistic fixes
Merged build triggered.
Merged build started.
@mengxr do your worst
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15297/
Build started.
Build triggered.
Build started.
LGTM. Thanks! Waiting for Jenkins ...
Build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15738/
Merged build triggered.
Merged build started.
Build finished.
Build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15743/
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15742/
Merged build finished. All automated tests passed.
All automated tests passed.
Merged. Thanks!
@dorx Do you think this works for an extremely large data set with a really small sample size, e.g. n = 1.0x10^10 with sample = 1? In that case the final adjusted fraction comes to around 1.2x10^-9, and in theory there is still a 99.99% chance of getting a sample. But Double has precision limits, and Random is after all not truly random, so do you think this is enough to guarantee a 99.99% chance under this extreme condition? I am asking because, in this very case, the original code, (3x(1+1)) / total, gives a fraction of around 6x10^-10, which is about half that of the new code. With that fraction, it kept looping forever and never got a chance to return that 1 sample.
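The "in theory" figure in this question can be reproduced with a small calculation. The sketch below is illustrative Python and assumes an idealized model in which each of the n items is kept independently with the given fraction; `prob_at_least_one` is a hypothetical helper, and the two fractions are the approximate values quoted in the comment.

```python
import math


def prob_at_least_one(n: int, fraction: float) -> float:
    """P(at least one of n items is kept) when each is kept independently with `fraction`."""
    # P(none kept) = (1 - fraction)^n, evaluated in log space to limit rounding error.
    log_p_none = n * math.log1p(-fraction)
    return -math.expm1(log_p_none)


n = 10**10
for label, fraction in [("new, ~1.2e-9", 1.2e-9), ("old, ~6e-10", 6e-10)]:
    print(f"{label}: {prob_at_least_one(n, fraction):.6f}")
```

Under this model the new fraction gives a success probability above 0.9999 per pass, while the old one gives about 0.9975. Both are high, so a loop that never returns would have to come from something outside this idealized model, such as the behavior of the random number generator at very small rates.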
@colorant Tried the following with the new implementation:
Both worked well. We might need a better RNG for even smaller sampling probabilities. Another solution is to set a lower bound in
@colorant Thanks for taking a look at this! First of all, let me just say that I ran Xiangrui's code but with ".fill(1000)" (so 100x in RDD size), and it was still able to select a sample with exactly one data point in one pass. There are a couple of things in play here. The smallest positive value a Double can represent is 2^(-1074) ~ 5e-324, so until we reach RDDs of size ~10^323, we in theory won't run into a sampling rate of 0. Then it comes down to whether the random number generator is truly random and isn't biased against very small numbers. The two experiments Xiangrui and I ran seem to suggest that the java.util.Random object is able to produce small enough random numbers. However, we should definitely investigate the quality of the RNG further to gauge sampling behavior at even smaller sampling rates. One thing to note about this implementation is that at higher sampling rates, we actually save memory: we no longer need to cache as many samples as before to guarantee the sample size in one try.
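The Double limit mentioned above is easy to confirm, since Python floats are IEEE 754 doubles like Scala's Double. The snippet is an illustrative sketch, not part of the patch. The note about 53-bit generators is an added assumption about how uniform doubles are typically built (it holds for Python's `random.random()` and for `java.util.Random.nextDouble()`), not a claim made in this thread.

```python
import math
import sys

# Smallest positive (subnormal) double: 2^-1074, about 5e-324.
smallest = math.ldexp(1.0, -1074)
print("smallest positive double:", smallest)
print("equals min normal * epsilon:", smallest == sys.float_info.min * sys.float_info.epsilon)

# Anything well below that underflows to zero, which would mean a sampling rate of 0.
print("smallest / 4 underflows to 0.0:", smallest / 4 == 0.0)

# A rate of 1/n for an assumed, absurdly large n is still representable and nonzero.
n = 1e300
print("1/n is nonzero:", 1.0 / n > 0.0)

# Separate limit (assumption about the generator): a uniform generator that builds
# doubles from 53 random bits only returns multiples of 2^-53, about 1.1e-16.
rng_step = 2.0 ** -53
print("granularity of a 53-bit uniform double:", rng_step)
```

So the representable range of Double is unlikely to be the limiting factor; the resolution and quality of the generator are the more plausible constraints at very small rates.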
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.

Author: Doris Xin
Author: dorx
Author: Xiangrui Meng

Closes apache#916 from dorx/takeSample and squashes the following commits:

5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request apache#2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
This reverts commit 8a7c1a9.
…org.apache.curator.framework.api.ProtectACLCreateModePathAndBytesable org.apache.curator.framework.api.CreateBuilder.creatingParentsIfNeeded()' (apache#916)
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.
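The description does not spell out how a rate above sample_size/total is chosen. One way to obtain such a rate for sampling without replacement is a multiplicative Chernoff bound on the lower tail of a Binomial count. The sketch below is illustrative Python under that assumption, not the computeFraction code from this patch; `chernoff_fraction` and `prob_too_few` are hypothetical names, and the sizes are made up for the check.

```python
import math


def chernoff_fraction(sample_size: int, total: int, delta: float = 1e-4) -> float:
    """Rate p such that P(Binomial(total, p) < sample_size) <= delta, via a Chernoff bound.

    Solves exp(-(total*p - sample_size)^2 / (2*total*p)) = delta for p.
    """
    base = sample_size / total
    gamma = -math.log(delta) / total
    return min(1.0, base + gamma + math.sqrt(gamma * gamma + 2.0 * gamma * base))


def prob_too_few(sample_size: int, total: int, p: float) -> float:
    """Exact P(Binomial(total, p) < sample_size), with each term computed in log space."""
    if p >= 1.0:
        return 0.0 if sample_size <= total else 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    tail = 0.0
    for k in range(sample_size):
        log_pmf = (math.lgamma(total + 1) - math.lgamma(k + 1)
                   - math.lgamma(total - k + 1) + k * log_p + (total - k) * log_q)
        tail += math.exp(log_pmf)
    return tail


total, sample_size = 10_000, 100  # hypothetical sizes for illustration
p = chernoff_fraction(sample_size, total)
print("rate exceeds sample_size/total:", p > sample_size / total)
print("failure probability within 1e-4:", prob_too_few(sample_size, total, p) <= 1e-4)
```

The exact Binomial tail serves as an independent check that the bound-derived rate meets the 0.9999 success target for these sizes, in the same spirit as the unit test mentioned in the description.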