Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18821][SparkR]: Bisecting k-means wrapper in SparkR #16566

Closed
wants to merge 7 commits into from

Conversation

wangmiao1981
Copy link
Contributor

What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update title to link with corresponding JIRA later.

How was this patch tested?

Add new unit tests.

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71280 has finished for PR 16566 at commit 4f88cce.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class BisectingKMeansWrapperWriter(instance: BisectingKMeansWrapper) extends MLWriter
  • class BisectingKMeansWrapperReader extends MLReader[BisectingKMeansWrapper]

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71282 has finished for PR 16566 at commit e7ea299.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

* checking Rd \usage sections ... WARNING
Duplicated \argument entries in documentation object 'fitted':
  'object' 'method' '...'

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71298 has started for PR 16566 at commit 2ad596e.

@wangmiao1981
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71337 has finished for PR 16566 at commit 2ad596e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981 wangmiao1981 changed the title [SparkR]: add bisecting kmeans R wrapper [SPARK-18821][SparkR]: Bisecting k-means wrapper in SparkR Jan 13, 2017
#' @examples
#' \dontrun{
#' sparkR.session()
#' data(iris)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need data(iris)

#' Note: A saved-loaded model does not support this method.
#'
#' @return \code{fitted} returns a SparkDataFrame containing fitted values.
#' @rdname fitted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should go to @rdname spark.bisectingKmeans

if (is.loaded) {
stop("Saved-loaded bisecting k-means model does not support 'fitted' method")
} else {
dataFrame(callJMethod(jobj, "fitted", method))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much is returned from fitted? should this be a list (like in summary) instead of DataFrame?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fitted in bisectingKmeans is quite similar to fitted in Kmeans. I followed that style to return a dataframe.

val size: Array[Long],
val isLoaded: Boolean = false) extends MLWritable {
private val bisectingKmeansModel: BisectingKMeansModel =
pipeline.stages(1).asInstanceOf[BisectingKMeansModel]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of 1, find last?

private val bisectingKmeansModel: BisectingKMeansModel =
pipeline.stages(1).asInstanceOf[BisectingKMeansModel]

lazy val coefficients: Array[Double] = bisectingKmeansModel.clusterCenters.flatMap(_.toArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusterCenters is already an Array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is Array[Vector]. I need flatmap to transform it into Array[Double], which is similar to Kmeans.
In addition, we have the serialization bug of not supporint Vector type open.

#' @note spark.bisectingKmeans since 2.2.0
#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
setMethod("spark.bisectingKmeans", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, k = 4, maxIter = 20, minDivisibleClusterSize = 1.0, seed = NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move minDivisibleClusterSize to the end since it's expert parameter and add note in param doc above (should be examples in mllib-tree.R)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will address comments soon. Now, debugging. Thanks!

#' @examples
#' \dontrun{
#' model <- spark.bisectingKmeans(trainingData, ~ ., 2)
#' fitted.model <- fitted(model)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems method parameter is not optional (there is no default value) - so the example would need to show that as well?

#' showDF(fitted.model)
#'}
#' @note fitted since 2.2.0
setMethod("fitted", signature(object = "BisectingKMeansModel"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably get some feedback on this - none of the current ML model has a fitted method - should we have this now? or should this be a option/parameter of the summary method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spark.kmeans has the fitted method. As these two are similar, I added it to bisecting kmeans.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I didn't recall that. I think that's ok then

Copy link
Contributor Author

@wangmiao1981 wangmiao1981 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments addressed.

@SparkQA
Copy link

SparkQA commented Jan 19, 2017

Test build #71675 has finished for PR 16566 at commit e77cbaf.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 20, 2017

Test build #71683 has finished for PR 16566 at commit 83b2d6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @param seed the random seed.
#' @param minDivisibleClusterSize The minimum number of points (if greater than or equal to 1.0)
#' or the minimum proportion of points (if less than 1.0) of a divisible cluster.
#' Note that it is an advanced. The default value should be enough
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it is an advanced.
do you mean to say Note that it is an advanced option.?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I recall the term used in spark.ml doc is "expert parameter" - you might want to check how it is explained there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In scala, it uses @group expertParam in the document and the API document shows (expert-only) Parameters. I will change it to it is an expert parameter.

@SparkQA
Copy link

SparkQA commented Jan 20, 2017

Test build #71706 has started for PR 16566 at commit b25fc83.

@wangmiao1981
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 20, 2017

Test build #71731 has finished for PR 16566 at commit b25fc83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

Close to trigger windows test

@wangmiao1981
Copy link
Contributor Author

open to trigger windows test

@wangmiao1981 wangmiao1981 reopened this Jan 20, 2017
#' \dontrun{
#' model <- spark.bisectingKmeans(trainingData, ~ ., 2)
#' fitted.model <- fitted(model, "centers")
#' showDF(fitted.model)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, if you might end up another iteration, I'd suggest moving the example to before setMethod("spark.bisectingKmeans" - that's generally our guideline (and param) to have them in the same place if they have the same rdname (ie. going to the same page)

#' The list includes the model's \code{k} (number of cluster centers),
#' \code{coefficients} (model cluster centers),
#' \code{size} (number of data points in each cluster), and \code{cluster}
#' (cluster centers of the transformed data).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add is.loaded here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also clarify cluster is NULL if is.loaded = T


lazy val k: Int = bisectingKmeansModel.getK

lazy val cluster: DataFrame = bisectingKmeansModel.summary.cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have valid values when the model is loaded?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah this is checked on the R side. could you add a comment here

.fit(data)

val bisectingKmeansModel: BisectingKMeansModel =
pipeline.stages(1).asInstanceOf[BisectingKMeansModel]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's be consistent here with L38 - either (1) or last

@felixcheung
Copy link
Member

couple of last comments.
@yanboliang do you have any comment?

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71828 has started for PR 16566 at commit d36c23a.

@wangmiao1981
Copy link
Contributor Author

Jenkins, retest this please.

@felixcheung
Copy link
Member

LGTM

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71865 has finished for PR 16566 at commit d36c23a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

merged to master. Let's follow up with programming guide, example and vignettes - would you be able to pick these up too @wangmiao1981 ?

@asfgit asfgit closed this in c0ba284 Jan 27, 2017
@wangmiao1981
Copy link
Contributor Author

@felixcheung I will take care of it very soon. Now I am working on the PR of vector serialization. Also, I started working on the SparkR serialization performance. Thanks!

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update title to link with corresponding JIRA later.

## How was this patch tested?

Add new unit tests.

Author: [email protected] <[email protected]>

Closes apache#16566 from wangmiao1981/bk.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update title to link with corresponding JIRA later.

## How was this patch tested?

Add new unit tests.

Author: [email protected] <[email protected]>

Closes apache#16566 from wangmiao1981/bk.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants