[SPARK-10780][ML] Add an initial model to kmeans #11119
Test build #50935 has finished for PR 11119 at commit
Re the first question: I don't think this necessarily needs to be a code-generated param (although if we do end up having more shared params with templated types, we should definitely do codegen). For now, a hand-written HasInitialModel seems fine (although I'd put it in a separate file rather than tacking it onto the end of the generated code), but that's just my personal opinion. Maybe @dbtsai can chime in too?
/** @group setParam */
@Since("2.0.0")
def setInitialModel(value: Model[_]): this.type = {
If this is something we intended to have be a general function, should probably go in the HasInitialModel trait.
Sure, I'll try to move the setter into HasInitialModel.
This can go into the trait, but the pattern matching will be different, though. Are we just overwriting it here?
We can leave it here for now.
Agreed. For code-gen, if we want to do it this way, we should put them in a separate place. But it would be nice to extend the code-gen framework so it can use one codebase to handle generic types. +@jkbradley @mengxr BTW, we still need to run the separate
value match {
  case m: KMeansModel => set(initialModel, m)
  case other =>
    logInfo(s"KMeansModel required but ${other.getClass.getSimpleName} found.")
Let's do warning or error.
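For illustration, here is a minimal sketch of the "error" variant of that setter. It uses plain Scala stand-ins rather than Spark's classes: `ModelStub`, `KMeansModelStub`, `OtherModelStub` and `KMeansSetterSketch` are hypothetical names, and the real setter would call `set(initialModel, m)` on a Spark `Param` instead of assigning a field.

```scala
// Hypothetical stand-ins; these are NOT Spark's Model / KMeansModel classes.
trait ModelStub
final case class KMeansModelStub(k: Int) extends ModelStub
final case class OtherModelStub(name: String) extends ModelStub

final class KMeansSetterSketch {
  // Stand-in for the initialModel Param.
  private var initialModel: Option[KMeansModelStub] = None

  def getInitialModel: Option[KMeansModelStub] = initialModel

  // Fail fast on the wrong model type instead of logging at info level
  // and silently leaving the param unset.
  def setInitialModel(value: ModelStub): this.type = {
    value match {
      case m: KMeansModelStub => initialModel = Some(m)
      case other =>
        throw new IllegalArgumentException(
          s"KMeansModel required but ${other.getClass.getName} found.")
    }
    this
  }
}
```

Throwing makes the misuse visible at the call site. A `logWarning` variant would keep the pipeline running but leave the param unset.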
Test build #51090 has finished for PR 11119 at commit
@yinxusen I'll be away for Spark Summit East. Gonna work on this again when I'm back. Thanks.
Never mind, take your time. On Friday, February 12, 2016, DB Tsai [email protected] wrote:
Cheers, Xusen Yin (尹绪森)
Ping @dbtsai Coming back? :)
Yes, but busy with work. :( Will start on it in a couple of days.
Ping @dbtsai
Test build #52578 has finished for PR 11119 at commit
Test build #67156 has finished for PR 11119 at commit
Related thought: if the model holds a pointer to its initialModel, then it will be serialized and shipped along with the model at prediction time. This will be inefficient for large models and even if we cut the lineage, we needlessly double the size of the closure. It seems best not to have the initialModel in the model at all, but then it is an edge case for params, and it's still nice to know how the model was initialized at fit time. Thoughts?
How about the following:
There are quite a few algorithms where the Model does not contain all of the Params of its Estimator. This has been inconsistent, but I do think it's fine for the KMeansModel not to store the initialModel (except through its parent). Users can identify the initialization method of the model by looking at Model.parent.initialModel.
As far as serializing and shipping the initialModel accidentally, I don't think that has to be an issue. Currently, predictUDF in transform() is likely capturing the whole Model class, but it doesn't have to. We could change it to:
val tmpParent: MLlibKMeansModel = parentModel
val predictUDF = udf((vector: Vector) => tmpParent.predict(vector))
This is an issue throughout spark.ml because of the Predictor abstraction...which should probably be corrected as we add more support for initial models.
As far as saving and loading Models, I agree with your previous statements about not needing to save the initialModel in general. I do want us to save/load Model.parent eventually, at which time we could revisit this issue. But not storing initialModel as a Model Param would avoid this issue.
Also, your discussions have been much longer than this, so it would be great to document the decision in a public design doc which others can refer to when adding initialModel to other algorithms.
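The closure-capture point can be sketched outside Spark. The example below uses hypothetical stand-ins (`CentersSketch` for the inner MLlib model, `OuterModelSketch` for the ML wrapper) and plain Java serialization in place of Spark's closure shipping. The outer class is deliberately not `Serializable`, so a function that captures `this` should fail to serialize, while one that captures only a local copy of the inner model should succeed.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the inner MLlib model: small and serializable.
final class CentersSketch(val centers: Array[Double]) extends Serializable {
  // Index of the nearest center (1-D for brevity).
  def predict(x: Double): Int =
    centers.indices.minBy(i => math.abs(centers(i) - x))
}

// Hypothetical stand-in for the outer ML model. It is deliberately NOT
// Serializable, so capturing `this` in a function is easy to detect.
final class OuterModelSketch(val parentModel: CentersSketch) {
  def predict(x: Double): Int = parentModel.predict(x)

  // Calls an instance method, so the function holds a reference to `this`.
  def capturingThis: Double => Int = x => predict(x)

  // Copies the needed field to a local val first; only that value is captured.
  def capturingLocal: Double => Int = {
    val tmpParent = parentModel
    x => tmpParent.predict(x)
  }
}

object ClosureCheck {
  // True if Java serialization of `f` succeeds.
  def serializes(f: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(f)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Calling `ClosureCheck.serializes` on both functions is a quick way to see which one drags the whole outer object along. In Spark the outer model is serializable, so the cost shows up as a larger closure rather than an exception.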
// Check that the number of clusters are equal
val kOfInitialModel = $(initialModel).parentModel.clusterCenters.length
require(kOfInitialModel == $(k),
I'd recommend that this log a warning instead of causing a failure. If we use CrossValidator to select amongst initial models, then
@jkbradley Thanks for your thoughts. I agree it's a good idea to change the KMeans prediction function to not use the entire model in its closure, but that we need a more thorough solution when we generalize this to predictors. Would you mind pointing me to an example of an algorithm which only copies some, but not all, of the estimator params?
Sure, but will they? How will they know that
val km = new KMeans().setInitialModel(...)
km.getInitMode
> "k-means||"
It's especially misleading to have a model that says
val km = new KMeans().setInitialModel(...)
km.getInitMode
> "initialModel"
val model = km.fit(df)
model.getInitMode
> "initialModel"
That solves the problem of users having to know to access the initialization modes via its parent, and having conflicting
ALS is a good example: [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L98]
I agree it's misleading to have mismatched Params initialModel and initMode, especially if Model.initialModel does not exist. I'd say this is an ideal solution:
Ok, unless anyone has strong objections, it seems our plan moving forward with this PR should be:
Let me know if I have missed something. @yinxusen Does this seem reasonable?
On second thought, for this one, it could be good to have it work as long as initialModel is already set. Otherwise, that plan matches what I have in mind. Thanks!
@yinxusen Status update?
@sethah Sorry, I got stuck on other things. I'll update this PR tonight.
Test build #68325 has finished for PR 11119 at commit
This is probably going to miss 2.1 since we are officially in QA now, just as an fyi.
Test build #68368 has finished for PR 11119 at commit
@@ -414,6 +414,8 @@ object KMeans {
val RANDOM = "random"
@Since("0.8.0")
val K_MEANS_PARALLEL = "k-means||"
@Since("2.1.0")
val K_MEANS_INITIAL_MODEL = "initialModel"
Does it need to be public? This only serves a purpose when used with ML I think.
logWarning(s"initialModel is set, so initMode will be ignored. Clear initialModel first.")
}
if (value == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
  logWarning(s"initMode of $value is not supported here, please use setInitialModel.")
From the discussion, I think we decided to throw an error for setInitMode("initialModel") if initialModel wasn't already set. If initialModel has been set, then we'd just update the initMode as normal.
def setInitMode(value: String): this.type = set(initMode, value)
def setInitMode(value: String): this.type = {
  if (isSet(initialModel)) {
    logWarning(s"initialModel is set, so initMode will be ignored. Clear initialModel first.")
We say it will be ignored, but then still set it below.
@@ -124,7 +147,8 @@ class KMeansModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  val predictUDF = udf((vector: Vector) => predict(vector))
  val tmpParent: MLlibKMeansModel = parentModel
maybe a comment would be useful? // avoid encapsulating the entire model in the closure
if ($(k) != kOfInitialModel) {
  val previousK = $(k)
  set(k, kOfInitialModel)
  logWarning(s"Param K is set to $kOfInitialModel by the initialModel." +
nit: Maybe s"Param k was changed from $previousK to $kOfInitialModel to match the initialModel"
assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
}

test("Infer K from an initial model") {
So, the behavior is getting a bit confusing because we now have three params which are intertwined. For that reason, we should be very thorough on the tests. With my understanding of the behavior we decided on, the following tests should all pass. Can you tell me if it looks right to you?:
test("initialModel params") {
val initialK = 3
val initialEstimator = new KMeans()
.setK(initialK)
val initialModel = initialEstimator.fit(dataset)
val km = new KMeans()
.setK(initialK + 1)
.setInitialModel(initialModel)
// initialModel sets k and init mode
assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
assert(km.getK === initialK)
assert(km.getInitialModel.getK === initialK)
// setting k is ignored
km.setK(initialK + 1)
assert(km.getK === initialK)
// this should work since we already set initialModel
km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)
// this is ignored because initialModel is set
km.setInitMode(MLlibKMeans.RANDOM)
assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
km.clear(km.initialModel)
// kmeans now accepts init mode
km.setInitMode(MLlibKMeans.RANDOM)
assert(km.getInitMode === MLlibKMeans.RANDOM)
// kmeans should throw an error since we shouldn't be allowed to set init mode to "initialModel"
intercept[IllegalArgumentException] {
km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)
}
}
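As a reading aid, the coupling that the test above expects can be sketched in plain Scala without Spark's Params API. Everything here is a hypothetical stand-in (`FittedSketch`, `InitModes`, `WarmStartParamsSketch`). Resetting initMode to "k-means||" when initialModel is cleared is an assumption of this sketch; the thread does not settle that point.

```scala
// Hypothetical stand-in for a fitted model; only k matters here.
final case class FittedSketch(k: Int)

object InitModes {
  val Random = "random"
  val KMeansParallel = "k-means||"
  val InitialModel = "initialModel"
}

// Plain-Scala sketch of how k, initMode and initialModel interact.
final class WarmStartParamsSketch {
  private var k: Int = 2
  private var initMode: String = InitModes.KMeansParallel
  private var initialModel: Option[FittedSketch] = None

  def getK: Int = k
  def getInitMode: String = initMode
  def getInitialModel: Option[FittedSketch] = initialModel

  // Setting the initial model overrides k and forces initMode.
  def setInitialModel(m: FittedSketch): this.type = {
    initialModel = Some(m)
    k = m.k
    initMode = InitModes.InitialModel
    this
  }

  // With an initial model set, k is dictated by that model and the call is ignored.
  def setK(value: Int): this.type = {
    if (initialModel.isEmpty) k = value
    this
  }

  def setInitMode(value: String): this.type = {
    initialModel match {
      case Some(_) =>
        () // ignored: initMode stays "initialModel" while an initial model is set
      case None if value == InitModes.InitialModel =>
        throw new IllegalArgumentException(
          "initMode \"initialModel\" requires setInitialModel to be called first.")
      case None =>
        initMode = value
    }
    this
  }

  // Assumption of this sketch: clearing the model restores the default initMode.
  def clearInitialModel(): this.type = {
    initialModel = None
    initMode = InitModes.KMeansParallel
    this
  }
}
```

A real implementation would log a warning on the ignored calls, as the review comments suggest, so that users can tell why a setter had no effect.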
* Param for KMeansModel to use for warm start.
* Whenever initialModel is set:
* 1. the initialModel k will override the param k;
* 2. the param initMode is set to initialModel and manually set is ignored;
- the param initMode is set to "initialModel" and manually setting initMode will be ignored
nit: Let's just remove the punctuation from the numbered list
* Params for KMeans
*/

private[clustering] trait KMeansInitialModelParams extends HasInitialModel[KMeansModel] {
If we follow the convention in ALS, then we should have KMeansModelParams and KMeansParams extends KMeansModelParams with ... . I think it would be good to do the same here.
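A rough, Spark-free sketch of that layering follows. The trait and class names carry a `Sketch` suffix because they are hypothetical, and the choice of which param lives in which trait is an assumption made for illustration. It follows the earlier point in this thread that the model need not store initialModel.

```scala
// Hypothetical stand-ins; the split of params between the traits is assumed.

// Params shared by the estimator and the fitted model.
trait KMeansModelParamsSketch {
  def k: Int
  def initMode: String
}

// Estimator-only params layered on top of the shared ones.
trait KMeansParamsSketch extends KMeansModelParamsSketch {
  def initialModel: Option[ModelSketch]
}

final case class ModelSketch(k: Int, initMode: String)
  extends KMeansModelParamsSketch

final case class EstimatorSketch(
    k: Int,
    initMode: String,
    initialModel: Option[ModelSketch]) extends KMeansParamsSketch {
  // The fitted model carries over only the shared params, not initialModel.
  def fit(): ModelSketch = ModelSketch(k, initMode)
}
```

With this split, the fitted model still reports how it was initialized through `initMode`, while the initial model itself stays reachable only from the estimator.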
* 3. other params are untouched.
* @group param
*/
final val initialModel: Param[KMeansModel] =
override final val
@yinxusen I took a look at the updates. Will you be able to create the design doc that Joseph mentioned?
ping?
@yinxusen Do you think you'll have time to work on this?
ping! I could take this over if needed :)
Test build #75004 has finished for PR 11119 at commit
Do you guys mind if I propose to close this PR?
https://issues.apache.org/jira/browse/SPARK-10780
This PR aims to add warm-start to the KMeans algorithm.