[SPARK-18821][SparkR]: Bisecting k-means wrapper in SparkR #16566

wangmiao1981 · 2017-01-13T00:38:51Z

What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update title to link with corresponding JIRA later.

How was this patch tested?

Add new unit tests.

SparkQA · 2017-01-13T00:46:34Z

Test build #71280 has finished for PR 16566 at commit 4f88cce.

This patch fails R style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class BisectingKMeansWrapperWriter(instance: BisectingKMeansWrapper) extends MLWriter
class BisectingKMeansWrapperReader extends MLReader[BisectingKMeansWrapper]

SparkQA · 2017-01-13T01:55:37Z

Test build #71282 has finished for PR 16566 at commit e7ea299.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-13T04:48:02Z

* checking Rd \usage sections ... WARNING
Duplicated \argument entries in documentation object 'fitted':
  'object' 'method' '...'

SparkQA · 2017-01-13T07:18:40Z

Test build #71298 has started for PR 16566 at commit 2ad596e.

wangmiao1981 · 2017-01-13T18:02:36Z

Jenkins, retest this please.

SparkQA · 2017-01-13T19:13:55Z

Test build #71337 has finished for PR 16566 at commit 2ad596e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-14T01:24:47Z

R/pkg/R/mllib_clustering.R

+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' data(iris)


don't need data(iris)

felixcheung · 2017-01-14T01:31:02Z

R/pkg/R/mllib_clustering.R

+#' Note: A saved-loaded model does not support this method.
+#'
+#' @return \code{fitted} returns a SparkDataFrame containing fitted values.
+#' @rdname fitted


I think this should go to @rdname spark.bisectingKmeans

felixcheung · 2017-01-14T01:32:17Z

R/pkg/R/mllib_clustering.R

+            if (is.loaded) {
+              stop("Saved-loaded bisecting k-means model does not support 'fitted' method")
+            } else {
+              dataFrame(callJMethod(jobj, "fitted", method))


how much is returned from fitted? should this be a list (like in summary) instead of DataFrame?

fitted in bisectingKmeans is quite similar to fitted in Kmeans. I followed that style to return a dataframe.

felixcheung · 2017-01-14T01:33:16Z

mllib/src/main/scala/org/apache/spark/ml/r/BisectingKMeansWrapper.scala

+    val size: Array[Long],
+    val isLoaded: Boolean = false) extends MLWritable {
+  private val bisectingKmeansModel: BisectingKMeansModel =
+    pipeline.stages(1).asInstanceOf[BisectingKMeansModel]


instead of 1, find last?
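
For illustration, a minimal sketch of the `last` suggestion. It uses stand-in types rather than Spark's `PipelineModel`; `Stage`, `FormulaStage`, `ModelStage`, and `modelOf` are hypothetical names invented for this example. The assumption is that the fitted model is always the final pipeline stage.

```scala
// Stand-in types; in the PR the stages come from a fitted Spark pipeline.
sealed trait Stage
case class FormulaStage(formula: String) extends Stage
case class ModelStage(k: Int) extends Stage

object LastStageSketch {
  // Look the model up from the end of the array, so that adding another
  // preprocessing stage in front of it does not change the lookup.
  def modelOf(stages: Array[Stage]): ModelStage =
    stages.last.asInstanceOf[ModelStage]

  def main(args: Array[String]): Unit = {
    val stages: Array[Stage] = Array(FormulaStage("~ ."), ModelStage(4))
    println(modelOf(stages).k)
  }
}
```

With a fixed index such as `stages(1)`, the same lookup would silently depend on the pipeline having exactly one stage before the model.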

felixcheung · 2017-01-14T01:35:22Z

mllib/src/main/scala/org/apache/spark/ml/r/BisectingKMeansWrapper.scala

+  private val bisectingKmeansModel: BisectingKMeansModel =
+    pipeline.stages(1).asInstanceOf[BisectingKMeansModel]
+
+  lazy val coefficients: Array[Double] = bisectingKmeansModel.clusterCenters.flatMap(_.toArray)


clusterCenters is already an Array?

It is Array[Vector]. I need flatMap to transform it into Array[Double], which is similar to Kmeans.
In addition, the serialization bug about not supporting the Vector type is still open.
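
A minimal sketch of the flattening described in this reply, with plain `Array[Double]` rows standing in for Spark `Vector` values (each row plays the role of one `Vector.toArray` result). `FlattenCentersSketch`, `flatten`, and `regroup` are hypothetical names; the regrouping step is an assumption about how a consumer could recover the centers, not something shown in the PR.

```scala
object FlattenCentersSketch {
  // Concatenate all cluster centers into one flat array, center by center.
  def flatten(centers: Array[Array[Double]]): Array[Double] =
    centers.flatMap(_.toSeq)

  // Recover the centers from the flat array, given the feature dimension.
  def regroup(flat: Array[Double], dim: Int): Array[Array[Double]] =
    flat.grouped(dim).toArray

  def main(args: Array[String]): Unit = {
    val centers = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    val flat = flatten(centers)
    println(flat.mkString(", "))
    println(regroup(flat, 2).map(_.mkString("[", ", ", "]")).mkString(" "))
  }
}
```

Passing a flat `Array[Double]` avoids having to serialize a vector type across the R boundary, at the cost of the receiver needing to know the dimension.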

felixcheung · 2017-01-14T01:38:37Z

R/pkg/R/mllib_clustering.R

+#' @note spark.bisectingKmeans since 2.2.0
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.bisectingKmeans", signature(data = "SparkDataFrame", formula = "formula"),
+          function(data, formula, k = 4, maxIter = 20, minDivisibleClusterSize = 1.0, seed = NULL) {


I'd move minDivisibleClusterSize to the end since it's an expert parameter, and add a note in the param doc above (there should be examples in mllib-tree.R)

I will address comments soon. Now, debugging. Thanks!

felixcheung · 2017-01-15T18:38:41Z

R/pkg/R/mllib_clustering.R

+#' @examples
+#' \dontrun{
+#' model <- spark.bisectingKmeans(trainingData, ~ ., 2)
+#' fitted.model <- fitted(model)


it seems method parameter is not optional (there is no default value) - so the example would need to show that as well?

felixcheung · 2017-01-15T18:39:43Z

R/pkg/R/mllib_clustering.R

+#' showDF(fitted.model)
+#'}
+#' @note fitted since 2.2.0
+setMethod("fitted", signature(object = "BisectingKMeansModel"),


we should probably get some feedback on this - none of the current ML models has a fitted method - should we have this now? or should this be an option/parameter of the summary method?

spark.kmeans has the fitted method. As these two are similar, I added it to bisecting kmeans.

ah, I didn't recall that. I think that's ok then

wangmiao1981

Comments addressed.

SparkQA · 2017-01-19T23:37:18Z

Test build #71675 has finished for PR 16566 at commit e77cbaf.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-20T01:47:42Z

Test build #71683 has finished for PR 16566 at commit 83b2d6f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-20T05:22:45Z

R/pkg/R/mllib_clustering.R

+#' @param seed the random seed.
+#' @param minDivisibleClusterSize The minimum number of points (if greater than or equal to 1.0)
+#'                                or the minimum proportion of points (if less than 1.0) of a divisible cluster.
+#'                                Note that it is an advanced. The default value should be enough


"Note that it is an advanced."
do you mean to say "Note that it is an advanced option."?

as far as I recall the term used in spark.ml doc is "expert parameter" - you might want to check how it is explained there.

In Scala, the source uses @group expertParam and the API documentation shows these under "(expert-only) Parameters". I will change the wording to "it is an expert parameter".
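
A small sketch of how the two readings of `minDivisibleClusterSize` in the param doc above could translate into a concrete threshold. `MinClusterSizeSketch` and `effectiveMinSize` are hypothetical names, and rounding up with `math.ceil` is an assumption of this sketch; the param doc quoted in the review does not state the rounding rule.

```scala
object MinClusterSizeSketch {
  // n is the total number of points being clustered.
  def effectiveMinSize(minDivisibleClusterSize: Double, n: Long): Long = {
    require(minDivisibleClusterSize > 0.0, "minDivisibleClusterSize must be positive")
    if (minDivisibleClusterSize >= 1.0) {
      // Values >= 1.0 are read as an absolute number of points.
      math.ceil(minDivisibleClusterSize).toLong
    } else {
      // Values < 1.0 are read as a proportion of all points.
      math.ceil(minDivisibleClusterSize * n).toLong
    }
  }

  def main(args: Array[String]): Unit = {
    println(effectiveMinSize(5.0, 100L)) // absolute count
    println(effectiveMinSize(0.5, 100L)) // half of the points
  }
}
```

Under this reading the default of 1.0 means any cluster with at least one point may be split, which is why most users would not need to change it.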

SparkQA · 2017-01-20T06:18:41Z

Test build #71706 has started for PR 16566 at commit b25fc83.

wangmiao1981 · 2017-01-20T17:30:37Z

retest this please

SparkQA · 2017-01-20T18:37:20Z

Test build #71731 has finished for PR 16566 at commit b25fc83.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2017-01-20T19:57:42Z

Close to trigger windows test

wangmiao1981 · 2017-01-20T19:57:51Z

open to trigger windows test

felixcheung · 2017-01-21T06:34:25Z

R/pkg/R/mllib_clustering.R

+#' \dontrun{
+#' model <- spark.bisectingKmeans(trainingData, ~ ., 2)
+#' fitted.model <- fitted(model, "centers")
+#' showDF(fitted.model)


nit: if you end up doing another iteration, I'd suggest moving the example to before setMethod("spark.bisectingKmeans" - that's generally our guideline for examples (and params): keep them in the same place if they share the same rdname (i.e. they go to the same page)

felixcheung · 2017-01-21T06:43:45Z

R/pkg/R/mllib_clustering.R

+#'         The list includes the model's \code{k} (number of cluster centers),
+#'         \code{coefficients} (model cluster centers),
+#'         \code{size} (number of data points in each cluster), and \code{cluster}
+#'         (cluster centers of the transformed data).


let's add is.loaded here

also clarify cluster is NULL if is.loaded = T

felixcheung · 2017-01-21T06:49:08Z

mllib/src/main/scala/org/apache/spark/ml/r/BisectingKMeansWrapper.scala

+
+  lazy val k: Int = bisectingKmeansModel.getK
+
+  lazy val cluster: DataFrame = bisectingKmeansModel.summary.cluster


does this have valid values when the model is loaded?

ah this is checked on the R side. could you add a comment here
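
A minimal sketch of why the `lazy val` is safe only as long as the caller checks the loaded flag first. The types are stand-ins: `Wrapper`, `clusterOrNone`, and the `Option`-based summary are hypothetical and only model the situation where a loaded model has no training summary.

```scala
object LazySummarySketch {
  // Stand-in for a wrapper whose summary exists only on a freshly trained model.
  class Wrapper(summary: Option[Seq[Int]], val isLoaded: Boolean) {
    // Not evaluated at construction time. It throws only if someone reads it
    // on a model without a summary, so the caller must check isLoaded first.
    lazy val cluster: Seq[Int] =
      summary.getOrElse(throw new NoSuchElementException("no training summary"))
  }

  // Mirrors a caller-side guard: never touch `cluster` on a loaded model.
  def clusterOrNone(w: Wrapper): Option[Seq[Int]] =
    if (w.isLoaded) None else Some(w.cluster)

  def main(args: Array[String]): Unit = {
    println(clusterOrNone(new Wrapper(None, isLoaded = true)))
    println(clusterOrNone(new Wrapper(Some(Seq(0, 1, 1)), isLoaded = false)))
  }
}
```

Because the guard lives in the caller rather than in the wrapper, a comment next to the `lazy val` is the only hint for future readers that direct access can fail.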

felixcheung · 2017-01-21T06:50:45Z

mllib/src/main/scala/org/apache/spark/ml/r/BisectingKMeansWrapper.scala

+      .fit(data)
+
+    val bisectingKmeansModel: BisectingKMeansModel =
+      pipeline.stages(1).asInstanceOf[BisectingKMeansModel]


let's be consistent here with L38 - either (1) or last

felixcheung · 2017-01-21T06:53:02Z

couple of last comments.
@yanboliang do you have any comment?

SparkQA · 2017-01-23T07:22:45Z

Test build #71828 has started for PR 16566 at commit d36c23a.

wangmiao1981 · 2017-01-23T17:55:23Z

Jenkins, retest this please.

felixcheung · 2017-01-23T18:57:57Z

LGTM

SparkQA · 2017-01-23T19:00:15Z

Test build #71865 has finished for PR 16566 at commit d36c23a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-27T05:02:51Z

merged to master. Let's follow up with programming guide, example and vignettes - would you be able to pick these up too @wangmiao1981 ?

wangmiao1981 · 2017-01-27T17:06:21Z

@felixcheung I will take care of it very soon. Now I am working on the PR of vector serialization. Also, I started working on the SparkR serialization performance. Thanks!

## What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update title to link with corresponding JIRA later.

## How was this patch tested?

Add new unit tests.

Author: [email protected] <[email protected]>

Closes apache#16566 from wangmiao1981/bk.

wangmiao1981 changed the title ~~[SparkR]: add bisecting kmeans R wrapper~~ [SPARK-18821][SparkR]: Bisecting k-means wrapper in SparkR Jan 13, 2017


wangmiao1981 added 4 commits January 18, 2017 09:40

add bisecting kmeans R wrapper

ce2174e

fix example and R style

0b246f7

remove document entry

68e432e

address review comments

e77cbaf

wangmiao1981 force-pushed the bk branch from 2ad596e to e77cbaf Compare January 19, 2017 22:19

wangmiao1981 commented Jan 19, 2017


address document failure

83b2d6f


modify document

b25fc83

wangmiao1981 closed this Jan 20, 2017

wangmiao1981 reopened this Jan 20, 2017


address review comments

d36c23a

asfgit closed this in c0ba284 Jan 27, 2017

