
[SPARK-10780][ML] Add an initial model to kmeans #11119

Closed · wants to merge 47 commits

Conversation

@yinxusen (Contributor) commented Feb 8, 2016

https://issues.apache.org/jira/browse/SPARK-10780

This PR aims to add warm start (training from a user-supplied initial model) to the KMeans algorithm.
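
For readers outside the thread: warm start here means seeding training with the cluster centers of a previously fitted KMeansModel instead of the random or k-means|| initializers. A minimal usage sketch of the proposed API (setInitialModel is the new method under discussion; trainingDF is an assumed DataFrame with a features column):

import org.apache.spark.ml.clustering.KMeans

// First pass: a coarse model with the default k-means|| initialization.
val coarse = new KMeans().setK(4).setMaxIter(5).fit(trainingDF)

// Second pass: warm-start from the coarse model's cluster centers.
val refined = new KMeans()
  .setK(4)
  .setMaxIter(50)
  .setInitialModel(coarse)   // proposed in this PR
  .fit(trainingDF)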

SparkQA commented Feb 8, 2016

Test build #50935 has finished for PR 11119 at commit 36b1729.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Feb 8, 2016

Re the first question - I don't think this necessarily needs to be a code-generated param (although if we do end up having more shared params with templated types, we should definitely do codegen). For now, a hand-written HasInitialModel seems fine (although I'd put it in a separate file rather than tacking it onto the end of the generated code) - but that's just my personal opinion. Maybe @dbtsai can chime in too?
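
For reference, a rough sketch of what such a hand-written trait could look like in its own file (names, bounds, and visibility are assumptions; the actual signature was still being discussed):

import org.apache.spark.ml.Model
import org.apache.spark.ml.param.{Param, Params}

// Hand-written (not code-generated) shared trait for estimators that accept
// an initial model for warm start; T is the concrete Model type.
private[ml] trait HasInitialModel[T <: Model[T]] extends Params {

  // Concrete classes define the param so they can give an algorithm-specific doc.
  def initialModel: Param[T]

  /** @group getParam */
  final def getInitialModel: T = $(initialModel)
}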


/** @group setParam */
@Since("2.0.0")
def setInitialModel(value: Model[_]): this.type = {
Contributor:

If this is something we intend to be a general function, it should probably go in the HasInitialModel trait.

Contributor Author:

Sure, I'll try to move the setter into HasInitialModel.

Member:

This can go into the trait, but the pattern matching will be different, though. Are we just overwriting it here?

Contributor Author:

We can leave it here for now.

@dbtsai (Member) commented Feb 9, 2016

Agreed - for code-gen, if we want to do it this way, we would rather put them in a separate place. But it would be nice to extend the code-gen framework so it can use one codebase to handle generic types.

+@jkbradley @mengxr BTW, we still need to run the separate sbt code to do code-gen - why don't we do it at compile time using quasiquotes? This won't hurt performance since it happens at compile time.
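
For illustration, a quasiquote builds the same definition as a syntax tree rather than a concatenated string; a toy sketch (requires the scala-reflect dependency; the param shown is just an example, and this is not how SharedParamsCodeGen works today):

import scala.reflect.runtime.universe._

val paramName = TermName("maxIter")
val doc = "maximum number of iterations (>= 0)"

// Build the shared-param definition as a tree instead of a string.
val paramDef =
  q"""final val $paramName: org.apache.spark.ml.param.IntParam =
        new org.apache.spark.ml.param.IntParam(this, ${paramName.toString}, $doc)"""

println(showCode(paramDef))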

value match {
  case m: KMeansModel => set(initialModel, m)
  case other =>
    logInfo(s"KMeansModel required but ${other.getClass.getSimpleName} found.")
Member:

Let's do warning or error.
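
For reference, the error variant would be a one-line change to the match quoted above (a sketch using the same names, not the committed code):

value match {
  case m: KMeansModel => set(initialModel, m)
  case other => throw new IllegalArgumentException(
    s"KMeansModel required but ${other.getClass.getSimpleName} found.")
}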

SparkQA commented Feb 11, 2016

Test build #51090 has finished for PR 11119 at commit 166a6ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented Feb 12, 2016

@yinxusen I'll be away for Spark summit east. Gonna work on this again when I'm back. Thanks.

@yinxusen (Contributor Author) replied via email on Feb 12, 2016

Never mind, take your time.


@yinxusen (Contributor Author):

Ping @dbtsai Coming back? :)

@dbtsai (Member) commented Feb 23, 2016

Yes, but busy with work. :( Will start on it in a couple of days.

@yinxusen (Contributor Author) commented Mar 7, 2016

Ping @dbtsai

SparkQA commented Mar 7, 2016

Test build #52578 has finished for PR 11119 at commit f56e443.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2016

Test build #67156 has finished for PR 11119 at commit e529972.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen (Contributor Author):

@MLnick @dbtsai @sethah Any thoughts on the new version?

@sethah (Contributor) commented Oct 21, 2016

Related thought: if the model holds a pointer to its initialModel, then it will be serialized and shipped along with the model at prediction time. This will be inefficient for large models, and even if we cut the lineage, we needlessly double the size of the closure.

It seems best not to have the initialModel in the model at all, but then it is an edge case for params, and it's still nice to know how the model was initialized at fit time. Thoughts?

@yinxusen (Contributor Author):

How about the following:

  1. Since the newly generated model is derived from an estimator, the model should have the same params as its parent estimator, so there is no need to handle the model's params separately. We only need to care about the estimator's params, as we already do in the code with the params k and initMode.
  2. We can treat initialModel as a "transient" variable so that it is dropped during serialization, since there is no need to keep initialModel in a newly generated model.
  3. We can introduce a dummy model as a placeholder for initialModel; as a consequence, users still know how the model was initialized.

@jkbradley (Member) left a comment

There are quite a few algorithms where the Model does not contain all of the Params of its Estimator. This has been inconsistent, but I do think it's fine for the KMeansModel not to store the initialModel (except through its parent). Users can identify the initialization method of the model by looking at Model.parent.initialModel.

As far as serializing and shipping the initialModel accidentally, I don't think that has to be an issue. Currently, predictUDF in transform() is likely capturing the whole Model class, but it doesn't have to. We could change it to:

val tmpParent: MLlibKMeansModel = parentModel
val predictUDF = udf((vector: Vector) => tmpParent.predict(vector))

This is an issue throughout spark.ml because of the Predictor abstraction...which should probably be corrected as we add more support for initial models.

As far as saving and loading Models, I agree with your previous statements about not needing to save the initialModel in general. I do want us to save/load Model.parent eventually, at which time we could revisit this issue. But not storing initialModel as a Model Param would avoid this issue.

Also, your discussion has been much longer than this, so it would be great to document the decision in a public design doc which others can refer to when adding initialModel to other algorithms.


// Check that the number of clusters in the initial model matches the k param
val kOfInitialModel = $(initialModel).parentModel.clusterCenters.length
require(kOfInitialModel == $(k),
Member:

I'd recommend that this log a warning instead of causing a failure. If we use CrossValidator to select amongst initial models, then

@sethah (Contributor) commented Oct 24, 2016

@jkbradley Thanks for your thoughts. I agree it's a good idea to change the KMeans prediction function to not use the entire model in its closure, but that we need a more thorough solution when we generalize this to predictors.

Would you mind pointing me to an example of an algorithm which only copies some, but not all, of the estimator params?

Users can identify the initialization method of the model by looking at Model.parent.initialModel.

Sure, but will they? How will they know that kMeansModel.getInitialModel is invalid, and that they should instead call kMeansModel.parent.getInitialModel? Also, there is some coupling between initMode and initialModel. It's misleading to have:

val km = new KMeans().setInitialModel(...)
km.getInitMode
> "k-means||"

It's especially misleading to have a model that says initialModel is unset (or that doesn't even have an initialModel param) when it really was set, AND whose init mode reports some other value as well. Maybe we should automatically set initMode to something like "initialModel" in the setInitialModel method. That would give the following behavior:

val km = new KMeans().setInitialModel(...)
km.getInitMode
> "initialModel"
val model = km.fit(df)
model.getInitMode
> "initialModel"

That solves the problem of users having to know to access the initialization modes via its parent, and having conflicting initMode and initialModel. This makes sense since setting an initial model is really just another option for initMode.
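
A sketch of what that coupling could look like in the setter (assuming the MLlibKMeans alias and the K_MEANS_INITIAL_MODEL constant discussed later in this thread; not the committed code):

/** @group setParam */
def setInitialModel(value: KMeansModel): this.type = {
  // Keep k and initMode consistent with the supplied model.
  set(k, value.parentModel.clusterCenters.length)
  set(initMode, MLlibKMeans.K_MEANS_INITIAL_MODEL)
  set(initialModel, value)
}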

@jkbradley (Member):

Would you mind pointing me to an example of an algorithm which only copies some, but not all, of the estimator params?

ALS is a good example: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L98

[users identifying initialization method]

I agree it's misleading to have mismatched Params initialModel and initMode, especially if Model.initialModel does not exist. I'd say this is an ideal solution:

  • (in this PR) Have setInitialModel also set k, initMode, etc. (where we create a new initMode called "initialModel").
    • Calling setInitMode("initialModel") would probably need to throw an error. This is a minor issue IMO.
  • (in a follow-up PR) The above bullet point has one bigger issue: Setting initialModel via km.set(km.initialModel, initialModel) would bypass the setter method and therefore not set k, initMode, etc. appropriately. This issue with tied Params has appeared elsewhere in MLlib as well. We could implement a fix by having the Params.set method use Scala reflection to call the corresponding setter method. We'd just have to take extra care to test this well.
    • There are some Params in Models without matching setter methods. Those were added with the intention of having Estimator Params easily accessible from Models. We'll just have to keep these in mind when writing unit tests.

@sethah (Contributor) commented Oct 24, 2016

Ok, unless anyone has strong objections, it seems our plan moving forward with this PR should be:

  • Change the setInitialModel method to also set initMode to "initialModel"
  • Change the initMode param to support additional value "initialModel"
  • setInitMode throws an error when called with setInitMode("initialModel") and instructs user to use setInitialModel instead
  • Separate KMeansParams to extend KMeansModelParams and have an additional param initialModel
  • Update the read/write logic accordingly
  • Update tests
  • Create a follow up JIRA to address the case of calling set and bypassing the specific setInitialModel method

Let me know if I have missed something. @yinxusen Does this seem reasonable?

@jkbradley (Member):

setInitMode throws an error when called with setInitMode("initialModel") and instructs user to use setInitialModel instead

On second thought, for this one, it could be good to have it work as long as initialModel is already set.

Otherwise, that plan matches what I have in mind. Thanks!

@sethah (Contributor) commented Nov 3, 2016

@yinxusen Status update?

@yinxusen (Contributor Author) commented Nov 7, 2016

@sethah Sorry, I got stuck on other things. I'll update this PR tonight.

SparkQA commented Nov 8, 2016

Test build #68325 has finished for PR 11119 at commit 8516a2c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Nov 8, 2016

This is probably going to miss 2.1 since we are officially in QA now, just as an FYI.

SparkQA commented Nov 9, 2016

Test build #68368 has finished for PR 11119 at commit 6f169eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -414,6 +414,8 @@ object KMeans {
val RANDOM = "random"
@Since("0.8.0")
val K_MEANS_PARALLEL = "k-means||"
@Since("2.1.0")
val K_MEANS_INITIAL_MODEL = "initialModel"
Contributor:

Does it need to be public? This only serves a purpose when used with ML I think.

logWarning(s"initialModel is set, so initMode will be ignored. Clear initialModel first.")
}
if (value == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
logWarning(s"initMode of $value is not supported here, please use setInitialModel.")
Contributor:

From the discussion, I think we decided to throw an error for setInitMode("initialModel") if initialModel wasn't already set. If initialModel has been set, then we'd just update the initMode as normal.
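
In other words, roughly (a sketch of the agreed behavior, not the committed code):

def setInitMode(value: String): this.type = {
  if (value == MLlibKMeans.K_MEANS_INITIAL_MODEL && !isSet(initialModel)) {
    throw new IllegalArgumentException(
      s"initMode '$value' cannot be set directly; use setInitialModel instead.")
  }
  if (isSet(initialModel) && value != MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    // initialModel takes precedence; ignore the requested mode.
    logWarning(s"initialModel is set, so initMode '$value' will be ignored. Clear initialModel first.")
    this
  } else {
    set(initMode, value)
  }
}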

def setInitMode(value: String): this.type = set(initMode, value)
def setInitMode(value: String): this.type = {
  if (isSet(initialModel)) {
    logWarning(s"initialModel is set, so initMode will be ignored. Clear initialModel first.")
Contributor:

We say it will be ignored, but then still set it below.

@@ -124,7 +147,8 @@ class KMeansModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
val predictUDF = udf((vector: Vector) => predict(vector))
val tmpParent: MLlibKMeansModel = parentModel
Contributor:

maybe a comment would be useful? // avoid encapsulating the entire model in the closure
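
For example (a sketch of the transform with such a comment; column wiring assumed to match the existing method):

@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  // Capture only the underlying MLlib model, not the whole ml.KMeansModel
  // (and any initialModel it references), in the udf closure.
  val tmpParent: MLlibKMeansModel = parentModel
  val predictUDF = udf((vector: Vector) => tmpParent.predict(vector))
  dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
}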

if ($(k) != kOfInitialModel) {
  val previousK = $(k)
  set(k, kOfInitialModel)
  logWarning(s"Param K is set to $kOfInitialModel by the initialModel." +
Contributor:

nit: Maybe s"Param k was changed from $previousK to $kOfInitialModel to match the initialModel"
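
i.e. something like:

if ($(k) != kOfInitialModel) {
  val previousK = $(k)
  set(k, kOfInitialModel)
  logWarning(s"Param k was changed from $previousK to $kOfInitialModel to match the initialModel.")
}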

assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
}

test("Infer K from an initial model") {
Contributor:

So, the behavior is getting a bit confusing because we now have three params which are intertwined. For that reason, we should be very thorough with the tests. With my understanding of the behavior we decided on, the following tests should all pass. Can you tell me if this looks right to you?

test("initialModel params") {
    val initialK = 3
    val initialEstimator = new KMeans()
      .setK(initialK)
    val initialModel = initialEstimator.fit(dataset)

    val km = new KMeans()
      .setK(initialK + 1)
      .setInitialModel(initialModel)

    // initialModel sets k and init mode
    assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
    assert(km.getK === initialK)
    assert(km.getInitialModel.getK === initialK)

    // setting k is ignored
    km.setK(initialK + 1)
    assert(km.getK === initialK)

    // this should work since we already set initialModel
    km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)

    // this is ignored because initialModel is set
    km.setInitMode(MLlibKMeans.RANDOM)
    assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)

    km.clear(km.initialModel)
    // kmeans now accepts init mode
    km.setInitMode(MLlibKMeans.RANDOM)
    assert(km.getInitMode === MLlibKMeans.RANDOM)
    // kmeans should throw an error since we shouldn't be allowed to set init mode to "initialModel"
    intercept[IllegalArgumentException] {
      km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)
    }
}

* Param for KMeansModel to use for warm start.
* Whenever initialModel is set:
* 1. the initialModel k will override the param k;
* 2. the param initMode is set to initialModel and manually set is ignored;
Contributor:

  1. the param initMode is set to "initialModel" and manually setting initMode will be ignored

nit: Let's just remove the punctuation from the numbered list

* Params for KMeans
*/

private[clustering] trait KMeansInitialModelParams extends HasInitialModel[KMeansModel] {
Contributor:

If we follow the convention in ALS, then we should have KMeansModelParams and KMeansParams extends KMeansModelParams with .... I think it would be good to do the same here.
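
Following the ALS convention, the split might look roughly like this (a sketch; the shared params listed are assumptions based on the current KMeansParams, and HasInitialModel is the trait introduced in this PR):

import org.apache.spark.ml.param.Params
import org.apache.spark.ml.param.shared._

// Params shared by the KMeans estimator and KMeansModel
// (k, initMode, initSteps, etc. would be defined here).
private[clustering] trait KMeansModelParams extends Params with HasMaxIter
  with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol

// Params used only by the estimator, including the warm-start model.
private[clustering] trait KMeansParams extends KMeansModelParams
  with HasInitialModel[KMeansModel]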

* 3. other params are untouched.
* @group param
*/
final val initialModel: Param[KMeansModel] =
Contributor:

override final val
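
i.e., declare it with override so the KMeans-specific doc applies to the inherited param (sketch):

override final val initialModel: Param[KMeansModel] =
  new Param[KMeansModel](this, "initialModel", "A KMeansModel to use for warm start.")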

@sethah (Contributor) commented Nov 18, 2016

@yinxusen I took a look at the updates. Will you be able to create the design doc that Joseph mentioned?

@sethah (Contributor) commented Dec 7, 2016

ping?

@sethah (Contributor) commented Jan 10, 2017

@yinxusen Do you think you'll have time to work on this?

@sethah (Contributor) commented Feb 1, 2017

ping! I could take this over if needed :)

SparkQA commented Mar 22, 2017

Test build #75004 has finished for PR 11119 at commit 6f169eb.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

Do you guys mind if I propose to close this PR?
