-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed #3610
Conversation
Can one of the admins verify this patch? |
@@ -43,7 +43,8 @@ class KMeans private ( | |||
private var runs: Int, | |||
private var initializationMode: String, | |||
private var initializationSteps: Int, | |||
private var epsilon: Double) extends Serializable with Logging { | |||
private var epsilon: Double, | |||
private var seed: Long = System.nanoTime()) extends Serializable with Logging { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you set the default in the one public constructor instead since that's where other defaults are set?
@nxwhite-str Thanks for the PR! Could you please update the title to start with "[SPARK-4749] [mllib]" to help with automated tagging? |
* @param initializationMode initialization model, either "random" or "k-means||" (default). | ||
* @param seed random seed value for cluster initialization | ||
*/ | ||
def train( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this to the beginning and make the one without seed call this?
LGTM except minor inline comments. |
ok to test |
Test build #24654 has started for PR 3610 at commit
|
Test build #24654 has finished for PR 3610 at commit
|
Test FAILed. |
failure in a streaming test...retesting |
Test build #551 has started for PR 3610 at commit
|
|
||
/** | ||
* Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1, | ||
* initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4}. | ||
* initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4, System.nanoTime()}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add parameter name: seed: System.nanoTime()
Test build #551 has finished for PR 3610 at commit
|
@nxwhite-str There are few minor comments left. Do you have time to update the PR? |
minor updates to apache#3610
Test build #25891 has started for PR 3610 at commit
|
Test build #25891 has finished for PR 3610 at commit
|
Test PASSed. |
Merged into master. Thanks! |
This implements the functionality for SPARK-4749 and provides units tests in Scala and PySpark Author: nate.crosswhite <[email protected]> Author: nxwhite-str <[email protected]> Author: Xiangrui Meng <[email protected]> Closes apache#3610 from nxwhite-str/master and squashes the following commits: a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed 7668124 [Xiangrui Meng] minor updates f8d5928 [nate.crosswhite] Addressing PR issues 277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test 616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API
This implements the functionality for SPARK-4749 and provides units tests in Scala and PySpark