[SPARK-23528][ML] Add numIter to ClusteringSummary #20701
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,7 +17,7 @@ | |
|
||
package org.apache.spark.ml.clustering | ||
|
||
import org.apache.spark.annotation.Experimental | ||
import org.apache.spark.annotation.{Experimental, Since} | ||
import org.apache.spark.sql.{DataFrame, Row} | ||
|
||
/** | ||
|
@@ -34,7 +34,8 @@ class ClusteringSummary private[clustering] ( | |
@transient val predictions: DataFrame, | ||
val predictionCol: String, | ||
val featuresCol: String, | ||
val k: Int) extends Serializable { | ||
val k: Int, | ||
@Since("2.4.0") val numIter: Int) extends Serializable { | ||
Please add this param in the comment above. |
||
|
||
/** | ||
* Cluster centers of the transformed data. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -36,8 +36,9 @@ import org.apache.spark.sql.{Row, SparkSession} | |
* A clustering model for K-means. Each point belongs to the cluster with the closest center. | ||
*/ | ||
@Since("0.8.0") | ||
class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector], | ||
@Since("2.4.0") val distanceMeasure: String) | ||
class KMeansModel private[spark] (@Since("1.0.0") val clusterCenters: Array[Vector], | ||
So previously the main constructor was not private. Is there any particular reason we are making it private? If someone else is implementing something that extends the KMeans model, this might be a little frustrating.
I just didn't want the user to be able to create a KMeansModel while setting the number of iterations. I moved the other constructor, which is still available. I don't have strong reasons against making this public, so I will remove the private clause if you think it is best to leave it public. |
||
@Since("2.4.0") val distanceMeasure: String, | ||
private[spark] val numIter: Int) | ||
extends Saveable with Serializable with PMMLExportable { | ||
|
||
private val distanceMeasureInstance: DistanceMeasure = | ||
|
@@ -46,6 +47,10 @@ class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vec | |
private val clusterCentersWithNorm = | ||
if (clusterCenters == null) null else clusterCenters.map(new VectorWithNorm(_)) | ||
|
||
@Since("2.4.0") | ||
So I think the correct Since annotation here would be 0.8.0, since this is just a move of the previous constructor, right?
I think this is the right one. 0.8.0 is the annotation for the
Why does this constructor need to be public?
Yes, I will make it private, thanks. |
||
def this(clusterCenters: Array[Vector], distanceMeasure: String) = | ||
this(clusterCenters: Array[Vector], distanceMeasure, -1) | ||
So we're using -1 to indicate that we don't have the numIter information?
Yes, this can happen, for instance, when reloading a persisted model. Moreover, this is only for the mllib model, which as far as I know is no longer recommended in favor of the new ml API. Any concerns or suggestions about this?
Sounds reasonable. I personally don't enjoy using -1 to indicate a lack of information, but it seems to be what we have generally used in the past for mllib summary info passed into ml, so my personal feelings aren't important :) |
||
|
||
@Since("1.1.0") | ||
def this(clusterCenters: Array[Vector]) = | ||
this(clusterCenters: Array[Vector], DistanceMeasure.EUCLIDEAN) | ||
|
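The constructor arrangement discussed in the comments above can be sketched without Spark. This is a minimal illustration, not the code from this PR: `ConstructorSketch`, `SketchKMeansModel`, `numIterOption`, and `trained` are hypothetical names, cluster centers are plain `Array[Array[Double]]` instead of Spark vectors, and the distance measure is a bare string. It shows a restricted primary constructor that carries `numIter`, public auxiliary constructors that fall back to -1, and one possible way to hide the sentinel from callers.

```scala
object ConstructorSketch {

  // Hypothetical stand-in for the mllib KMeansModel. Only code inside
  // ConstructorSketch can call the three-argument primary constructor.
  final class SketchKMeansModel private[ConstructorSketch] (
      val clusterCenters: Array[Array[Double]],
      val distanceMeasure: String,
      val numIter: Int) extends Serializable {

    // Public constructor: the iteration count is unknown here (for example,
    // when a persisted model is reloaded), so it is set to the sentinel -1.
    def this(clusterCenters: Array[Array[Double]], distanceMeasure: String) =
      this(clusterCenters, distanceMeasure, -1)

    // Public constructor that also defaults the distance measure.
    def this(clusterCenters: Array[Array[Double]]) =
      this(clusterCenters, "euclidean")

    // Assumption for illustration: expose the sentinel as an Option so that
    // callers never have to compare against -1 themselves.
    def numIterOption: Option[Int] =
      if (numIter >= 0) Some(numIter) else None
  }

  // Plays the role of the training code, which is the only place that
  // knows the real iteration count.
  def trained(centers: Array[Array[Double]], iterations: Int): SketchKMeansModel =
    new SketchKMeansModel(centers, "euclidean", iterations)
}
```

With this layout, `new ConstructorSketch.SketchKMeansModel(centers, "euclidean")` yields a model whose `numIter` is -1, while `ConstructorSketch.trained(centers, 7)` yields one whose `numIterOption` is `Some(7)`. The `Option` accessor is one way to address the reviewer's unease about -1; the PR itself keeps the plain `Int`.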
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -36,6 +36,11 @@ object MimaExcludes { | |
|
||
// Exclude rules for 2.4.x | ||
lazy val v24excludes = v23excludes ++ Seq( | ||
// [SPARK-23528] Add numIter to ClusteringSummary | ||
Just a note for other reviewers and myself: these are all private Spark constructors. |
||
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.clustering.ClusteringSummary.this"), | ||
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.clustering.KMeansSummary.this"), | ||
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.clustering.BisectingKMeansSummary.this"), | ||
|
||
// [SPARK-23412][ML] Add cosine distance measure to BisectingKmeans | ||
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.org$apache$spark$ml$param$shared$HasDistanceMeasure$_setter_$distanceMeasure_="), | ||
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.getDistanceMeasure"), | ||
|
Here (and in the others), we should add this as a param in the comment above, as is done with the other params.
Thanks for pointing this out, I completely missed it. I am adding them.
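What the requested documentation change might look like is sketched below on a Spark-free stand-in. The class names `SketchClusteringSummary` and `SketchKMeansSummary` are hypothetical, the `predictions` DataFrame parameter is left out so the example compiles on its own, and the wording of each `@param` line is an assumption rather than the text that was eventually committed.

```scala
/**
 * Hypothetical, simplified stand-in for a clustering summary.
 *
 * @param predictionCol Name of the column holding the predicted cluster of each row.
 * @param featuresCol Name of the column holding the feature vector of each row.
 * @param k Number of clusters.
 * @param numIter Number of iterations the training run performed.
 */
class SketchClusteringSummary(
    val predictionCol: String,
    val featuresCol: String,
    val k: Int,
    val numIter: Int) extends Serializable

/**
 * Hypothetical subclass that forwards every parameter, including the new one,
 * to the parent, so its Scaladoc lists `numIter` as well.
 *
 * @param predictionCol Name of the column holding the predicted cluster of each row.
 * @param featuresCol Name of the column holding the feature vector of each row.
 * @param k Number of clusters.
 * @param numIter Number of iterations the training run performed.
 */
class SketchKMeansSummary(
    predictionCol: String,
    featuresCol: String,
    k: Int,
    numIter: Int)
  extends SketchClusteringSummary(predictionCol, featuresCol, k, numIter)
```

Each subclass repeats the `@param numIter` entry, which matches the reviewer's "here (and in the others)" remark: the new constructor argument is documented wherever it is threaded through.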