
[SPARK-10780][ML] Add an initial model to kmeans #11119

Closed · wants to merge 47 commits

Conversation

@yinxusen (Contributor) commented Feb 8, 2016

https://issues.apache.org/jira/browse/SPARK-10780

This PR aims to add warm start (training from a user-supplied initial model) to the KMeans algorithm.
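
For readers outside the thread: warm start here means seeding training with the cluster centers of a previously fitted KMeansModel instead of the random or k-means|| initializers. A minimal usage sketch of the proposed API (setInitialModel is the new method under discussion; trainingDF is an assumed DataFrame with a features column):

import org.apache.spark.ml.clustering.KMeans

// First pass: a coarse model with the default k-means|| initialization.
val coarse = new KMeans().setK(4).setMaxIter(5).fit(trainingDF)

// Second pass: warm-start from the coarse model's cluster centers.
val refined = new KMeans()
  .setK(4)
  .setMaxIter(50)
  .setInitialModel(coarse)   // proposed in this PR
  .fit(trainingDF)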

SparkQA commented Feb 8, 2016

Test build #50935 has finished for PR 11119 at commit 36b1729.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Feb 8, 2016

Re the first question - I don't think this necessarily needs to be a code-generated param (although if we do end up having more shared params with templated types, we should definitely do codegen). For now, a hand-written HasInitialModel seems fine (although I'd put it in a separate file rather than tacking it onto the end of the generated code) - but that's just my personal opinion. Maybe @dbtsai can chime in too?
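
For reference, a rough sketch of what such a hand-written trait could look like in its own file (names, bounds, and visibility are assumptions; the actual signature was still being discussed):

import org.apache.spark.ml.Model
import org.apache.spark.ml.param.{Param, Params}

// Hand-written (not code-generated) shared trait for estimators that accept
// an initial model for warm start; T is the concrete Model type.
private[ml] trait HasInitialModel[T <: Model[T]] extends Params {

  // Concrete classes define the param so they can give an algorithm-specific doc.
  def initialModel: Param[T]

  /** @group getParam */
  final def getInitialModel: T = $(initialModel)
}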


/** @group setParam */
@Since("2.0.0")
def setInitialModel(value: Model[_]): this.type = {
Contributor:

If this is something we intend to be a general function, it should probably go in the HasInitialModel trait.

Contributor Author:

Sure, I'll try to move the setter into HasInitialModel.

Member:

This can go into the trait, but the pattern matching will be different, though. Are we just overwriting it here?

Contributor Author:

We can leave it here for now.

@dbtsai (Member) commented Feb 9, 2016

Agreed - for code-gen, if we want to do it this way, we would rather put them in a separate place. But it would be nice to extend the code-gen framework so it can use one codebase to handle generic types.

+@jkbradley @mengxr BTW, we still need to run the separate sbt code to do code-gen - why don't we do it at compile time using quasiquotes? This won't hurt performance since it happens at compile time.
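
For illustration, a quasiquote builds the same definition as a syntax tree rather than a concatenated string; a toy sketch (requires the scala-reflect dependency; the param shown is just an example, and this is not how SharedParamsCodeGen works today):

import scala.reflect.runtime.universe._

val paramName = TermName("maxIter")
val doc = "maximum number of iterations (>= 0)"

// Build the shared-param definition as a tree instead of a string.
val paramDef =
  q"""final val $paramName: org.apache.spark.ml.param.IntParam =
        new org.apache.spark.ml.param.IntParam(this, ${paramName.toString}, $doc)"""

println(showCode(paramDef))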

value match {
  case m: KMeansModel => set(initialModel, m)
  case other =>
    logInfo(s"KMeansModel required but ${other.getClass.getSimpleName} found.")
Member:

Let's do warning or error.
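
For reference, the error variant would be a one-line change to the match quoted above (a sketch using the same names, not the committed code):

value match {
  case m: KMeansModel => set(initialModel, m)
  case other => throw new IllegalArgumentException(
    s"KMeansModel required but ${other.getClass.getSimpleName} found.")
}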

SparkQA commented Feb 11, 2016

Test build #51090 has finished for PR 11119 at commit 166a6ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented Feb 12, 2016

@yinxusen I'll be away for Spark summit east. Gonna work on this again when I'm back. Thanks.

@yinxusen (Contributor Author) replied via email on Feb 12, 2016

Never mind, take your time.


@yinxusen (Contributor Author):

Ping @dbtsai Coming back? :)

@dbtsai (Member) commented Feb 23, 2016

Yes, but busy with work. :( Will start on it in a couple of days.

@yinxusen (Contributor Author) commented Mar 7, 2016

Ping @dbtsai

SparkQA commented Mar 7, 2016

Test build #52578 has finished for PR 11119 at commit f56e443.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2016

Test build #67156 has finished for PR 11119 at commit e529972.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen (Contributor Author):

@MLnick @dbtsai @sethah Any thoughts on the new version?

@sethah (Contributor) commented Oct 21, 2016

Related thought: if the model holds a pointer to its initialModel, then it will be serialized and shipped along with the model at prediction time. This will be inefficient for large models, and even if we cut the lineage, we needlessly double the size of the closure.

It seems best not to have the initialModel in the model at all, but then it is an edge case for params, and it's still nice to know how the model was initialized at fit time. Thoughts?

@yinxusen (Contributor Author):

How about the following:

  1. Since the newly generated model is derived from an estimator, the model should have the same params as its parent estimator, so there is no need to handle the model's params separately. We only need to care about the estimator's params, as we already do in the code with the params k and initMode.
  2. We can treat initialModel as a "transient" variable so that it is dropped during serialization, since there is no need to keep initialModel in a newly generated model.
  3. We can introduce a dummy model as a placeholder for initialModel; as a consequence, users still know how the model was initialized.

@jkbradley (Member) left a comment

There are quite a few algorithms where the Model does not contain all of the Params of its Estimator. This has been inconsistent, but I do think it's fine for the KMeansModel not to store the initialModel (except through its parent). Users can identify the initialization method of the model by looking at Model.parent.initialModel.

As far as serializing and shipping the initialModel accidentally, I don't think that has to be an issue. Currently, predictUDF in transform() is likely capturing the whole Model class, but it doesn't have to. We could change it to:

val tmpParent: MLlibKMeansModel = parentModel
val predictUDF = udf((vector: Vector) => tmpParent.predict(vector))

This is an issue throughout spark.ml because of the Predictor abstraction...which should probably be corrected as we add more support for initial models.

As far as saving and loading Models, I agree with your previous statements about not needing to save the initialModel in general. I do want us to save/load Model.parent eventually, at which time we could revisit this issue. But not storing initialModel as a Model Param would avoid this issue.

Also, your discussion has been much longer than this, so it would be great to document the decision in a public design doc which others can refer to when adding initialModel to other algorithms.


// Check that the number of clusters in the initial model matches the k param
val kOfInitialModel = $(initialModel).parentModel.clusterCenters.length
require(kOfInitialModel == $(k),
Member:

I'd recommend that this log a warning instead of causing a failure. If we use CrossValidator to select amongst initial models, then

@sethah (Contributor) commented Oct 24, 2016

@jkbradley Thanks for your thoughts. I agree it's a good idea to change the KMeans prediction function to not use the entire model in its closure, but that we need a more thorough solution when we generalize this to predictors.

Would you mind pointing me to an example of an algorithm which only copies some, but not all, of the estimator params?

Users can identify the initialization method of the model by looking at Model.parent.initialModel.

Sure, but will they? How will they know that kMeansModel.getInitialModel is invalid, and that they should instead call kMeansModel.parent.getInitialModel? Also, there is some coupling between initMode and initialModel. It's misleading to have:

val km = new KMeans().setInitialModel(...)
km.getInitMode
> "k-means||"

It's especially misleading to have a model that says initialModel is unset (or that doesn't even have an initialModel param) when it really was set, AND whose init mode reports some other value as well. Maybe we should automatically set initMode to something like "initialModel" in the setInitialModel method. That would give the following behavior:

val km = new KMeans().setInitialModel(...)
km.getInitMode
> "initialModel"
val model = km.fit(df)
model.getInitMode
> "initialModel"

That solves the problem of users having to know to access the initialization modes via its parent, and having conflicting initMode and initialModel. This makes sense since setting an initial model is really just another option for initMode.
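
A sketch of what that coupling could look like in the setter (assuming the MLlibKMeans alias and the K_MEANS_INITIAL_MODEL constant discussed later in this thread; not the committed code):

/** @group setParam */
def setInitialModel(value: KMeansModel): this.type = {
  // Keep k and initMode consistent with the supplied model.
  set(k, value.parentModel.clusterCenters.length)
  set(initMode, MLlibKMeans.K_MEANS_INITIAL_MODEL)
  set(initialModel, value)
}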

@jkbradley (Member):

Would you mind pointing me to an example of an algorithm which only copies some, but not all, of the estimator params?

ALS is a good example: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L98

[users identifying initialization method]

I agree it's misleading to have mismatched Params initialModel and initMode, especially if Model.initialModel does not exist. I'd say this is an ideal solution:

  • (in this PR) Have setInitialModel also set k, initMode, etc. (where we create a new initMode called "initialModel").
    • Calling setInitMode("initialModel") would probably need to throw an error. This is a minor issue IMO.
  • (in a follow-up PR) The above bullet point has one bigger issue: Setting initialModel via km.set(km.initialModel, initialModel) would bypass the setter method and therefore not set k, initMode, etc. appropriately. This issue with tied Params has appeared elsewhere in MLlib as well. We could implement a fix by having the Params.set method use Scala reflection to call the corresponding setter method. We'd just have to take extra care to test this well.
    • There are some Params in Models without matching setter methods. Those were added with the intention of having Estimator Params easily accessible from Models. We'll just have to keep these in mind when writing unit tests.

@sethah (Contributor) commented Oct 24, 2016

Ok, unless anyone has strong objections, it seems our plan moving forward with this PR should be:

  • Change the setInitialModel method to also set initMode to "initialModel"
  • Change the initMode param to support additional value "initialModel"
  • setInitMode throws an error when called with setInitMode("initialModel") and instructs user to use setInitialModel instead
  • Separate KMeansParams to extend KMeansModelParams and have an additional param initialModel
  • Update the read/write logic accordingly
  • Update tests
  • Create a follow up JIRA to address the case of calling set and bypassing the specific setInitialModel method

Let me know if I have missed something. @yinxusen Does this seem reasonable?

@jkbradley (Member):

setInitMode throws an error when called with setInitMode("initialModel") and instructs user to use setInitialModel instead

On second thought, for this one, it could be good to have it work as long as initialModel is already set.

Otherwise, that plan matches what I have in mind. Thanks!

@sethah (Contributor) commented Nov 3, 2016

@yinxusen Status update?

@yinxusen (Contributor Author) commented Nov 7, 2016

@sethah Sorry, I got stuck on other things. I'll update this PR tonight.

SparkQA commented Nov 8, 2016

Test build #68325 has finished for PR 11119 at commit 8516a2c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Nov 8, 2016

This is probably going to miss 2.1 since we are officially in QA now, just as an FYI.

SparkQA commented Nov 9, 2016

Test build #68368 has finished for PR 11119 at commit 6f169eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -414,6 +414,8 @@ object KMeans {
val RANDOM = "random"
@Since("0.8.0")
val K_MEANS_PARALLEL = "k-means||"
@Since("2.1.0")
val K_MEANS_INITIAL_MODEL = "initialModel"
Contributor:

Does it need to be public? This only serves a purpose when used with ML I think.

logWarning(s"initialModel is set, so initMode will be ignored. Clear initialModel first.")
}
if (value == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
logWarning(s"initMode of $value is not supported here, please use setInitialModel.")
Contributor:

From the discussion, I think we decided to throw an error for setInitMode("initialModel") if initialModel wasn't already set. If initialModel has been set, then we'd just update the initMode as normal.
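
In other words, roughly (a sketch of the agreed behavior, not the committed code):

def setInitMode(value: String): this.type = {
  if (value == MLlibKMeans.K_MEANS_INITIAL_MODEL && !isSet(initialModel)) {
    throw new IllegalArgumentException(
      s"initMode '$value' cannot be set directly; use setInitialModel instead.")
  }
  if (isSet(initialModel) && value != MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    // initialModel takes precedence; ignore the requested mode.
    logWarning(s"initialModel is set, so initMode '$value' will be ignored. Clear initialModel first.")
    this
  } else {
    set(initMode, value)
  }
}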

def setInitMode(value: String): this.type = set(initMode, value)
def setInitMode(value: String): this.type = {
  if (isSet(initialModel)) {
    logWarning(s"initialModel is set, so initMode will be ignored. Clear initialModel first.")
Contributor:

We say it will be ignored, but then still set it below.

@@ -124,7 +147,8 @@ class KMeansModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
val predictUDF = udf((vector: Vector) => predict(vector))
val tmpParent: MLlibKMeansModel = parentModel
Contributor:

maybe a comment would be useful? // avoid encapsulating the entire model in the closure
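
For example (a sketch of the transform with such a comment; column wiring assumed to match the existing method):

@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  // Capture only the underlying MLlib model, not the whole ml.KMeansModel
  // (and any initialModel it references), in the udf closure.
  val tmpParent: MLlibKMeansModel = parentModel
  val predictUDF = udf((vector: Vector) => tmpParent.predict(vector))
  dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
}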

if ($(k) != kOfInitialModel) {
  val previousK = $(k)
  set(k, kOfInitialModel)
  logWarning(s"Param K is set to $kOfInitialModel by the initialModel." +
Contributor:

nit: Maybe s"Param k was changed from $previousK to $kOfInitialModel to match the initialModel"
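
i.e. something like:

if ($(k) != kOfInitialModel) {
  val previousK = $(k)
  set(k, kOfInitialModel)
  logWarning(s"Param k was changed from $previousK to $kOfInitialModel to match the initialModel.")
}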

assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
}

test("Infer K from an initial model") {
Contributor:

So, the behavior is getting a bit confusing because we now have three params which are intertwined. For that reason, we should be very thorough with the tests. With my understanding of the behavior we decided on, the following tests should all pass. Can you tell me if this looks right to you?

test("initialModel params") {
    val initialK = 3
    val initialEstimator = new KMeans()
      .setK(initialK)
    val initialModel = initialEstimator.fit(dataset)

    val km = new KMeans()
      .setK(initialK + 1)
      .setInitialModel(initialModel)

    // initialModel sets k and init mode
    assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
    assert(km.getK === initialK)
    assert(km.getInitialModel.getK === initialK)

    // setting k is ignored
    km.setK(initialK + 1)
    assert(km.getK === initialK)

    // this should work since we already set initialModel
    km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)

    // this is ignored because initialModel is set
    km.setInitMode(MLlibKMeans.RANDOM)
    assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)

    km.clear(km.initialModel)
    // kmeans now accepts init mode
    km.setInitMode(MLlibKMeans.RANDOM)
    assert(km.getInitMode === MLlibKMeans.RANDOM)
    // kmeans should throw an error since we shouldn't be allowed to set init mode to "initialModel"
    intercept[IllegalArgumentException] {
      km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)
    }
}

* Param for KMeansModel to use for warm start.
* Whenever initialModel is set:
* 1. the initialModel k will override the param k;
* 2. the param initMode is set to initialModel and manually set is ignored;
Contributor:

  1. the param initMode is set to "initialModel" and manually setting initMode will be ignored

nit: Let's just remove the punctuation from the numbered list

* Params for KMeans
*/

private[clustering] trait KMeansInitialModelParams extends HasInitialModel[KMeansModel] {
Contributor:

If we follow the convention in ALS, then we should have KMeansModelParams and KMeansParams extends KMeansModelParams with .... I think it would be good to do the same here.
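
Following the ALS convention, the split might look roughly like this (a sketch; the shared params listed are assumptions based on the current KMeansParams, and HasInitialModel is the trait introduced in this PR):

import org.apache.spark.ml.param.Params
import org.apache.spark.ml.param.shared._

// Params shared by the KMeans estimator and KMeansModel
// (k, initMode, initSteps, etc. would be defined here).
private[clustering] trait KMeansModelParams extends Params with HasMaxIter
  with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol

// Params used only by the estimator, including the warm-start model.
private[clustering] trait KMeansParams extends KMeansModelParams
  with HasInitialModel[KMeansModel]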

* 3. other params are untouched.
* @group param
*/
final val initialModel: Param[KMeansModel] =
Contributor:

override final val
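
i.e., declare it with override so the KMeans-specific doc applies to the inherited param (sketch):

override final val initialModel: Param[KMeansModel] =
  new Param[KMeansModel](this, "initialModel", "A KMeansModel to use for warm start.")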

@sethah (Contributor) commented Nov 18, 2016

@yinxusen I took a look at the updates. Will you be able to create the design doc that Joseph mentioned?

@sethah (Contributor) commented Dec 7, 2016

ping?

@sethah (Contributor) commented Jan 10, 2017

@yinxusen Do you think you'll have time to work on this?

@sethah (Contributor) commented Feb 1, 2017

ping! I could take this over if needed :)

SparkQA commented Mar 22, 2017

Test build #75004 has finished for PR 11119 at commit 6f169eb.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

Do you guys mind if I propose to close this PR?
