Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23528][ML] Add numIter to ClusteringSummary #20701

Closed
wants to merge 9 commits into from
Original file line number Diff line number Diff line change
Expand Up @@ -276,7 +276,7 @@ class BisectingKMeans @Since("2.0.0") (
val parentModel = bkm.run(rdd)
val model = copyValues(new BisectingKMeansModel(uid, parentModel).setParent(this))
val summary = new BisectingKMeansSummary(
model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
model.transform(dataset), $(predictionCol), $(featuresCol), $(k), $(maxIter))
model.setSummary(Some(summary))
instr.logSuccess(model)
model
Expand Down Expand Up @@ -312,4 +312,5 @@ class BisectingKMeansSummary private[clustering] (
predictions: DataFrame,
predictionCol: String,
featuresCol: String,
k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k)
k: Int,
numIter: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here (and in the others), we should add this as param in the comment above as done with the other params

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for pointing this out, I completely missed it. Thank you, I am adding them.

Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@

package org.apache.spark.ml.clustering

import org.apache.spark.annotation.Experimental
import org.apache.spark.annotation.{Experimental, Since}
import org.apache.spark.sql.{DataFrame, Row}

/**
Expand All @@ -34,7 +34,8 @@ class ClusteringSummary private[clustering] (
@transient val predictions: DataFrame,
val predictionCol: String,
val featuresCol: String,
val k: Int) extends Serializable {
val k: Int,
@Since("2.4.0") val numIter: Int) extends Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add this param in the comment above.


/**
* Cluster centers of the transformed data.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -421,7 +421,7 @@ class GaussianMixture @Since("2.0.0") (

val model = copyValues(new GaussianMixtureModel(uid, weights, gaussianDists)).setParent(this)
val summary = new GaussianMixtureSummary(model.transform(dataset),
$(predictionCol), $(probabilityCol), $(featuresCol), $(k), logLikelihood)
$(predictionCol), $(probabilityCol), $(featuresCol), $(k), logLikelihood, iter)
model.setSummary(Some(summary))
instr.logSuccess(model)
model
Expand Down Expand Up @@ -692,8 +692,9 @@ class GaussianMixtureSummary private[clustering] (
@Since("2.0.0") val probabilityCol: String,
featuresCol: String,
k: Int,
@Since("2.2.0") val logLikelihood: Double)
extends ClusteringSummary(predictions, predictionCol, featuresCol, k) {
@Since("2.2.0") val logLikelihood: Double,
numIter: Int)
extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter) {

/**
* Probability of each cluster.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -332,7 +332,7 @@ class KMeans @Since("1.5.0") (
val parentModel = algo.run(instances, Option(instr))
val model = copyValues(new KMeansModel(uid, parentModel).setParent(this))
val summary = new KMeansSummary(
model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
model.transform(dataset), $(predictionCol), $(featuresCol), $(k), parentModel.numIter)

model.setSummary(Some(summary))
instr.logSuccess(model)
Expand Down Expand Up @@ -370,4 +370,5 @@ class KMeansSummary private[clustering] (
predictions: DataFrame,
predictionCol: String,
featuresCol: String,
k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k)
k: Int,
numIter: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter)
Original file line number Diff line number Diff line change
Expand Up @@ -348,7 +348,7 @@ class KMeans private (

logInfo(s"The cost is $cost.")

new KMeansModel(centers.map(_.vector), distanceMeasure)
new KMeansModel(centers.map(_.vector), distanceMeasure, iteration)
}

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -36,8 +36,9 @@ import org.apache.spark.sql.{Row, SparkSession}
* A clustering model for K-means. Each point belongs to the cluster with the closest center.
*/
@Since("0.8.0")
class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
@Since("2.4.0") val distanceMeasure: String)
class KMeansModel private[spark] (@Since("1.0.0") val clusterCenters: Array[Vector],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So previously the main constructor was not private, any particular reason we are making in private? if someone else is implementing something which extends the kmeans model this might be a little frustrating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just didn't want the user to be able to create a KMeansModel setting the number of iterations. I moved the other constructor which is still available. I don't have strong reasons against making this public, so I am removing the private clause if you think we best let it to be public.

@Since("2.4.0") val distanceMeasure: String,
private[spark] val numIter: Int)
extends Saveable with Serializable with PMMLExportable {

private val distanceMeasureInstance: DistanceMeasure =
Expand All @@ -46,6 +47,10 @@ class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vec
private val clusterCentersWithNorm =
if (clusterCenters == null) null else clusterCenters.map(new VectorWithNorm(_))

@Since("2.4.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think the correct since annotation here would be 0.8.0 since this is just a move of the previous constructor right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right one. 0.8.0 is the annotation for the KMeansModel class, while the previous main constructor was added (by me) is a previous PR for 2.4.0 in order to add the distanceMeasure variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this constructor need to be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I will make it private, thanks.

def this(clusterCenters: Array[Vector], distanceMeasure: String) =
this(clusterCenters: Array[Vector], distanceMeasure, -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So were using -1 to indicate we don't have the numIter information

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this can happen for instance when reloading a persisted model. Moreover this is only for the mllib model, which as far as I know is suggested not to be used anymore in favor of the new ml api. Any concern/suggestion about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable, I personally don't enjoy -1 to indicate lack of information but it seems to be what we have generally used in the past for mllib summary info into ml so my personal feelings aren't important :)


@Since("1.1.0")
def this(clusterCenters: Array[Vector]) =
this(clusterCenters: Array[Vector], DistanceMeasure.EUCLIDEAN)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@ class BisectingKMeansSuite
test("fit, transform and summary") {
val predictionColName = "bisecting_kmeans_prediction"
val bkm = new BisectingKMeans().setK(k).setPredictionCol(predictionColName).setSeed(1)

val model = bkm.fit(dataset)
assert(model.clusterCenters.length === k)

Expand Down Expand Up @@ -129,6 +130,7 @@ class BisectingKMeansSuite
assert(clusterSizes.length === k)
assert(clusterSizes.sum === numRows)
assert(clusterSizes.forall(_ >= 0))
assert(summary.numIter == 20)

model.setSummary(None)
assert(!model.hasSummary)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,7 @@ class GaussianMixtureSuite extends SparkFunSuite with MLlibTestSparkContext
assert(clusterSizes.length === k)
assert(clusterSizes.sum === numRows)
assert(clusterSizes.forall(_ >= 0))
assert(summary.numIter == 2)

model.setSummary(None)
assert(!model.hasSummary)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -130,6 +130,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
assert(clusterSizes.length === k)
assert(clusterSizes.sum === numRows)
assert(clusterSizes.forall(_ >= 0))
assert(summary.numIter == 1)

model.setSummary(None)
assert(!model.hasSummary)
Expand Down
5 changes: 5 additions & 0 deletions project/MimaExcludes.scala
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,11 @@ object MimaExcludes {

// Exclude rules for 2.4.x
lazy val v24excludes = v23excludes ++ Seq(
// [SPARK-23528] Add numIter to ClusteringSummary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note for other reviewers/myself these are all private spark constructors

ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.clustering.ClusteringSummary.this"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.clustering.KMeansSummary.this"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.clustering.BisectingKMeansSummary.this"),

// [SPARK-23412][ML] Add cosine distance measure to BisectingKmeans
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.org$apache$spark$ml$param$shared$HasDistanceMeasure$_setter_$distanceMeasure_="),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.getDistanceMeasure"),
Expand Down