[SPARK-23528][ML] Add numIter to ClusteringSummary #20701

mgaido91 · 2018-03-01T11:11:49Z

What changes were proposed in this pull request?

Added the number of iterations to ClusteringSummary. This is helpful information when evaluating how to modify the parameters in order to get a better model.

How was this patch tested?

modified existing UTs

mgaido91 · 2018-03-01T11:12:26Z

cc @yanboliang @zhengruifeng since I saw you worked on this before, thanks.

SparkQA · 2018-03-01T11:24:45Z

Test build #87829 has finished for PR 20701 at commit e3a217c.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-01T15:33:50Z

Test build #87830 has finished for PR 20701 at commit d5c8af7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

I've done a quick pass and I'm going to see if @sethah has some comments.

holdenk · 2018-03-09T19:48:14Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala

@@ -46,6 +47,10 @@ class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vec
  private val clusterCentersWithNorm =
    if (clusterCenters == null) null else clusterCenters.map(new VectorWithNorm(_))

+  @Since("2.4.0")
+  def this(clusterCenters: Array[Vector], distanceMeasure: String) =
+    this(clusterCenters: Array[Vector], distanceMeasure, -1)


So we're using -1 to indicate we don't have the numIter information

Yes, this can happen, for instance, when reloading a persisted model. Moreover, this applies only to the mllib model, which, as far as I know, is no longer recommended in favor of the new ml API. Any concerns or suggestions about this?

Sounds reasonable, I personally don't enjoy -1 to indicate lack of information but it seems to be what we have generally used in the past for mllib summary info into ml so my personal feelings aren't important :)
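For readers who share the discomfort with a -1 sentinel, here is a minimal sketch of how calling code could wrap the value before using it. `NumIterSentinel`, `toOption`, and `describe` are hypothetical names invented for illustration and are not part of Spark; the only thing taken from this thread is the convention that -1 means the iteration count is unavailable.

```scala
// Hypothetical helper, not part of Spark: interprets the -1 sentinel
// discussed in this thread as "iteration count unknown".
object NumIterSentinel {
  // Assumed convention from this thread: -1 means the information is missing.
  val Unknown: Int = -1

  // Wrap a raw numIter value so callers cannot mistake -1 for a real count.
  def toOption(numIter: Int): Option[Int] =
    if (numIter == Unknown) None else Some(numIter)

  // Human-readable rendering that handles both cases explicitly.
  def describe(numIter: Int): String =
    toOption(numIter) match {
      case Some(n) => s"trained for $n iteration(s)"
      case None    => "iteration count not available"
    }
}
```

This keeps the stored field an Int, as in the patch, and only changes how a consumer reads it.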

holdenk · 2018-03-09T19:50:25Z

mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala

@@ -97,6 +97,7 @@ class BisectingKMeansSuite
  test("fit, transform and summary") {
    val predictionColName = "bisecting_kmeans_prediction"
    val bkm = new BisectingKMeans().setK(k).setPredictionCol(predictionColName).setSeed(1)
+      .setMaxIter(2)


So I'd be more comfortable having this in a separate test; 2 iterations is not a lot.

holdenk · 2018-03-09T19:50:47Z

mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala

@@ -127,6 +128,7 @@ class BisectingKMeansSuite
    assert(clusterSizes.length === k)
    assert(clusterSizes.sum === numRows)
    assert(clusterSizes.forall(_ >= 0))
+    assert(summary.numIter == 2)


Would be nice to see a test where it's not the maxIter value being copied over

In KMeansSuite the value is not maxIter (it performs only 1 iteration in that case). In BisectingKMeans, numIter is always maxIter, since we always perform maxIter iterations (see line 192 of spark/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala at b6f837c):

for (iter <- 0 until maxIterations) {

Does this answer your comment?
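To make the difference concrete, here is a toy sketch of the two loop shapes being contrasted. It is not Spark's code: `IterationCounts`, `fixedLoop`, and `convergingLoop` are hypothetical names, and the "change halves every iteration" rule is an assumed stand-in for a real convergence check.

```scala
// Toy illustration only; neither method is taken from Spark.
object IterationCounts {
  // Fixed-count loop, shaped like the `for (iter <- 0 until maxIterations)`
  // line quoted in this thread: it always performs maxIterations iterations.
  def fixedLoop(maxIterations: Int): Int = {
    var performed = 0
    for (_ <- 0 until maxIterations) performed += 1
    performed
  }

  // Early-stopping loop: an assumed convergence rule where the change halves
  // on every iteration and the loop stops once it is no longer above tol.
  def convergingLoop(maxIterations: Int, initialChange: Double, tol: Double): Int = {
    var change = initialChange
    var iter = 0
    while (iter < maxIterations && change > tol) {
      change /= 2.0
      iter += 1
    }
    iter
  }
}
```

With the fixed loop the reported count can only equal maxIterations, which is why a test on it can do no better than compare against the configured value; the early-stopping loop can report fewer.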

SparkQA · 2018-03-10T15:00:42Z

Test build #88150 has finished for PR 20701 at commit b3d0523.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-16T15:12:11Z

Test build #88306 has finished for PR 20701 at commit 8b16af6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

Thanks for working on this, good progress. I have a few improvements in mind, and maybe we can get @sethah to take a look as well, but if the rest of the ML committers are busy that's OK too.

holdenk · 2018-03-16T23:18:36Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala

@@ -46,6 +47,10 @@ class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vec
  private val clusterCentersWithNorm =
    if (clusterCenters == null) null else clusterCenters.map(new VectorWithNorm(_))

+  @Since("2.4.0")


So I think the correct since annotation here would be 0.8.0, since this is just a move of the previous constructor, right?

I think this is the right one. 0.8.0 is the annotation for the KMeansModel class, while the previous main constructor was added (by me) in a previous PR for 2.4.0 in order to add the distanceMeasure variable.

Why does this constructor need to be public?

yes, I will make it private, thanks.

holdenk · 2018-03-16T23:20:59Z

mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala

@@ -312,4 +312,5 @@ class BisectingKMeansSummary private[clustering] (
    predictions: DataFrame,
    predictionCol: String,
    featuresCol: String,
-    k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k)
+    k: Int,
+    numIter: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter)


Here (and in the others), we should add this as param in the comment above as done with the other params

thanks for pointing this out, I completely missed it. Thank you, I am adding them.

holdenk · 2018-03-16T23:21:30Z

mllib/src/main/scala/org/apache/spark/ml/clustering/ClusteringSummary.scala

@@ -34,7 +34,8 @@ class ClusteringSummary private[clustering] (
    @transient val predictions: DataFrame,
    val predictionCol: String,
    val featuresCol: String,
-    val k: Int) extends Serializable {
+    val k: Int,
+    @Since("2.4.0") val numIter: Int) extends Serializable {


Please add this param in the comment above.

holdenk · 2018-03-16T23:26:51Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala

@@ -46,6 +47,10 @@ class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vec
  private val clusterCentersWithNorm =
    if (clusterCenters == null) null else clusterCenters.map(new VectorWithNorm(_))

+  @Since("2.4.0")
+  def this(clusterCenters: Array[Vector], distanceMeasure: String) =
+    this(clusterCenters: Array[Vector], distanceMeasure, -1)


Sounds reasonable, I personally don't enjoy -1 to indicate lack of information but it seems to be what we have generally used in the past for mllib summary info into ml so my personal feelings aren't important :)

holdenk · 2018-03-16T23:29:17Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala

@@ -36,8 +36,9 @@ import org.apache.spark.sql.{Row, SparkSession}
 * A clustering model for K-means. Each point belongs to the cluster with the closest center.
 */
 @Since("0.8.0")
-class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
-  @Since("2.4.0") val distanceMeasure: String)
+class KMeansModel private[spark] (@Since("1.0.0") val clusterCenters: Array[Vector],


So previously the main constructor was not private; is there any particular reason we are making it private? If someone else is implementing something which extends the KMeans model, this might be a little frustrating.

I just didn't want the user to be able to create a KMeansModel while setting the number of iterations. I moved the other constructor, which is still available. I don't have strong reasons against making this public, so I will remove the private clause if you think it is best to leave it public.
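The layout described in this reply can be sketched in plain Scala as follows. `ToyModel` and `trained` are hypothetical stand-ins, not Spark's KMeansModel, and plain `private` is used here where the diff uses `private[spark]`, since a package-qualified modifier needs an enclosing package.

```scala
// Toy stand-in for the constructor layout under discussion; not Spark's KMeansModel.
// The primary constructor is private, so outside code cannot pass numIter.
class ToyModel private (val centers: Seq[Double], val numIter: Int) {
  // Public auxiliary constructor: callers cannot supply numIter,
  // so it falls back to the -1 "unknown" value used in the patch.
  def this(centers: Seq[Double]) = this(centers, -1)
}

object ToyModel {
  // The companion can reach the private constructor, playing the role of the
  // training code that knows the real iteration count.
  def trained(centers: Seq[Double], numIter: Int): ToyModel =
    new ToyModel(centers, numIter)
}
```

The trade-off raised in the review still applies to this sketch: a subclass defined elsewhere can only call the public auxiliary constructor.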

holdenk · 2018-03-16T23:29:45Z

project/MimaExcludes.scala

@@ -36,6 +36,11 @@ object MimaExcludes {

  // Exclude rules for 2.4.x
  lazy val v24excludes = v23excludes ++ Seq(
+    // [SPARK-23528] Add numIter to ClusteringSummary


Just a note for other reviewers and myself: these are all private Spark constructors

SparkQA · 2018-03-18T19:31:19Z

Test build #88355 has finished for PR 20701 at commit f6ee4a2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],

mgaido91 · 2018-03-19T09:05:54Z

retest this please

SparkQA · 2018-03-19T16:40:46Z

Test build #88374 has finished for PR 20701 at commit f6ee4a2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],

mgaido91 · 2018-03-19T18:32:58Z

retest this please

SparkQA · 2018-03-19T22:31:58Z

Test build #88381 has finished for PR 20701 at commit f6ee4a2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],

mgaido91 · 2018-03-24T11:01:15Z

any more comments @holdenk ?

sethah

General comment: things that are specific to training, like numIter, have been separated into training summary classes elsewhere, e.g. LinearRegressionTrainingSummary extends LinearRegressionSummary. Is there some reason to deviate from that here? numIter doesn't make sense when evaluating on a test set, for instance.

sethah · 2018-03-26T19:03:38Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala

@@ -46,6 +47,10 @@ class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vec
  private val clusterCentersWithNorm =
    if (clusterCenters == null) null else clusterCenters.map(new VectorWithNorm(_))

+  @Since("2.4.0")


Why does this constructor need to be public?

mgaido91 · 2018-03-28T10:53:29Z

@sethah I have not introduced training summary classes because it would have meant a considerably bigger change (they take a quite different approach, with a trait and an Impl class for each summary), and I have not seen that pattern used consistently.
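For context, the trait-plus-Impl shape mentioned here can be sketched roughly as below. All names (`ToySummary`, `ToyTrainingSummary`, and the two Impl classes) are hypothetical and the fields are simplified to plain Scala collections; this only illustrates why splitting out training-only information adds several types, and is not how any Spark class is actually written.

```scala
// Hypothetical names throughout; a simplified picture of the pattern, not Spark code.

// Evaluation-time summary: meaningful for any dataset the model is applied to.
trait ToySummary {
  def predictions: Seq[Int]
  def k: Int
  // Number of points assigned to each of the k clusters.
  def clusterSizes: Seq[Int] = (0 until k).map(c => predictions.count(_ == c))
}

// Training-only information is added in a separate trait.
trait ToyTrainingSummary extends ToySummary {
  def numIter: Int
}

// One Impl class per trait, which is what makes the change larger.
class ToySummaryImpl(val predictions: Seq[Int], val k: Int) extends ToySummary

class ToyTrainingSummaryImpl(predictions: Seq[Int], k: Int, val numIter: Int)
  extends ToySummaryImpl(predictions, k) with ToyTrainingSummary
```

In this shape a summary computed on a test set would be a `ToySummary` with no `numIter` at all, which is the distinction raised in the general comment above.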

mgaido91 · 2018-04-03T14:37:31Z

retest this please

SparkQA · 2018-04-03T19:37:48Z

Test build #88848 has finished for PR 20701 at commit 41f0371.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-04-06T07:28:23Z

any more comments @holdenk @sethah ?

holdenk · 2018-04-06T18:37:24Z

ping @sethah - what do you think about whether this needs a separate training summary trait?

holdenk · 2018-04-13T18:37:08Z

So we need to update for the changed MimaExcludes. I think it's OK to include this in the model directly if no one objects in the next week or so. Sklearn has this directly in the model return as well. Ping @sethah @MLnick .

SparkQA · 2018-04-23T16:56:19Z

Test build #89722 has finished for PR 20701 at commit 59fef4e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-04-27T18:13:51Z

kindly ping @holdenk

mgaido91 · 2018-05-14T11:59:00Z

kindly ping @holdenk

SparkQA · 2018-05-29T18:20:13Z

Test build #91256 has finished for PR 20701 at commit e2f68ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2018-06-22T16:27:51Z

LGTM pending Jenkins retest. Jenkins retest this please.

holdenk · 2018-06-22T16:28:10Z

The AppVeyor build failure looks spurious and I don't know how to retrigger it.

mgaido91 · 2018-06-23T13:07:54Z

retest this please

mgaido91 · 2018-06-23T13:08:38Z

thanks for your review @holdenk. I don't know how to retrigger AppVeyor either, unfortunately :(

SparkQA · 2018-06-23T17:31:44Z

Test build #92257 has finished for PR 20701 at commit e2f68ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-12T13:36:44Z

Test build #92924 has finished for PR 20701 at commit 4a6bd2d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

Looks good to me. I manually built the docs locally to double-check that the since annotation was inherited, because my memory was a bit fuzzy on how that was handled.

holdenk · 2018-07-13T18:25:02Z

Merged to master

[SPARK-23528][ML] Add numIter to ClusteringSummary

e3a217c

add MiMa excludes

d5c8af7

holdenk reviewed Mar 9, 2018

avoid setting maxIter to 2 in BisectingKMeansSuite

b3d0523

Merge branch 'master' into SPARK-23528

8b16af6

holdenk requested changes Mar 16, 2018

add comments

f6ee4a2

sethah reviewed Mar 26, 2018

make KMeansModel constructor private

41f0371

Merge branch 'master' into SPARK-23528

59fef4e

Merge branch 'master' into SPARK-23528

e2f68ac

Merge branch 'master' into SPARK-23528

4a6bd2d

holdenk approved these changes Jul 13, 2018

asfgit closed this in 3b6005b Jul 13, 2018
