-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18821][SparkR]: Bisecting k-means wrapper in SparkR #16566
Conversation
Test build #71280 has finished for PR 16566 at commit
|
Test build #71282 has finished for PR 16566 at commit
|
|
Test build #71298 has started for PR 16566 at commit |
Jenkins, retest this please. |
Test build #71337 has finished for PR 16566 at commit
|
#' @examples | ||
#' \dontrun{ | ||
#' sparkR.session() | ||
#' data(iris) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need data(iris)
#' Note: A saved-loaded model does not support this method. | ||
#' | ||
#' @return \code{fitted} returns a SparkDataFrame containing fitted values. | ||
#' @rdname fitted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should go to @rdname spark.bisectingKmeans
if (is.loaded) { | ||
stop("Saved-loaded bisecting k-means model does not support 'fitted' method") | ||
} else { | ||
dataFrame(callJMethod(jobj, "fitted", method)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how much is returned from fitted
? should this be a list (like in summary) instead of DataFrame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fitted
in bisectingKmeans is quite similar to fitted
in Kmeans. I followed that style to return a dataframe.
val size: Array[Long], | ||
val isLoaded: Boolean = false) extends MLWritable { | ||
private val bisectingKmeansModel: BisectingKMeansModel = | ||
pipeline.stages(1).asInstanceOf[BisectingKMeansModel] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of 1, find last?
private val bisectingKmeansModel: BisectingKMeansModel = | ||
pipeline.stages(1).asInstanceOf[BisectingKMeansModel] | ||
|
||
lazy val coefficients: Array[Double] = bisectingKmeansModel.clusterCenters.flatMap(_.toArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clusterCenters is already an Array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is Array[Vector]
. I need flatmap to transform it into Array[Double]
, which is similar to Kmeans.
In addition, we have the serialization bug of not supporint Vector type open.
#' @note spark.bisectingKmeans since 2.2.0 | ||
#' @seealso \link{predict}, \link{read.ml}, \link{write.ml} | ||
setMethod("spark.bisectingKmeans", signature(data = "SparkDataFrame", formula = "formula"), | ||
function(data, formula, k = 4, maxIter = 20, minDivisibleClusterSize = 1.0, seed = NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd move minDivisibleClusterSize
to the end since it's expert parameter and add note in param doc above (should be examples in mllib-tree.R)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will address comments soon. Now, debugging. Thanks!
#' @examples | ||
#' \dontrun{ | ||
#' model <- spark.bisectingKmeans(trainingData, ~ ., 2) | ||
#' fitted.model <- fitted(model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems method
parameter is not optional (there is no default value) - so the example would need to show that as well?
#' showDF(fitted.model) | ||
#'} | ||
#' @note fitted since 2.2.0 | ||
setMethod("fitted", signature(object = "BisectingKMeansModel"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably get some feedback on this - none of the current ML model has a fitted
method - should we have this now? or should this be a option/parameter of the summary
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark.kmeans
has the fitted
method. As these two are similar, I added it to bisecting kmeans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I didn't recall that. I think that's ok then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments addressed.
Test build #71675 has finished for PR 16566 at commit
|
Test build #71683 has finished for PR 16566 at commit
|
#' @param seed the random seed. | ||
#' @param minDivisibleClusterSize The minimum number of points (if greater than or equal to 1.0) | ||
#' or the minimum proportion of points (if less than 1.0) of a divisible cluster. | ||
#' Note that it is an advanced. The default value should be enough |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that it is an advanced.
do you mean to say Note that it is an advanced option.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I recall the term used in spark.ml doc is "expert parameter" - you might want to check how it is explained there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In scala, it uses @group expertParam
in the document and the API document shows (expert-only) Parameters
. I will change it to it is an expert parameter
.
Test build #71706 has started for PR 16566 at commit |
retest this please |
Test build #71731 has finished for PR 16566 at commit
|
Close to trigger windows test |
open to trigger windows test |
#' \dontrun{ | ||
#' model <- spark.bisectingKmeans(trainingData, ~ ., 2) | ||
#' fitted.model <- fitted(model, "centers") | ||
#' showDF(fitted.model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, if you might end up another iteration, I'd suggest moving the example to before setMethod("spark.bisectingKmeans"
- that's generally our guideline (and param) to have them in the same place if they have the same rdname
(ie. going to the same page)
#' The list includes the model's \code{k} (number of cluster centers), | ||
#' \code{coefficients} (model cluster centers), | ||
#' \code{size} (number of data points in each cluster), and \code{cluster} | ||
#' (cluster centers of the transformed data). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add is.loaded
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also clarify cluster
is NULL if is.loaded = T
|
||
lazy val k: Int = bisectingKmeansModel.getK | ||
|
||
lazy val cluster: DataFrame = bisectingKmeansModel.summary.cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this have valid values when the model is loaded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah this is checked on the R side. could you add a comment here
.fit(data) | ||
|
||
val bisectingKmeansModel: BisectingKMeansModel = | ||
pipeline.stages(1).asInstanceOf[BisectingKMeansModel] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's be consistent here with L38 - either (1) or last
couple of last comments. |
Test build #71828 has started for PR 16566 at commit |
Jenkins, retest this please. |
LGTM |
Test build #71865 has finished for PR 16566 at commit
|
merged to master. Let's follow up with programming guide, example and vignettes - would you be able to pick these up too @wangmiao1981 ? |
@felixcheung I will take care of it very soon. Now I am working on the PR of vector serialization. Also, I started working on the SparkR serialization performance. Thanks! |
## What changes were proposed in this pull request? Add R wrapper for bisecting Kmeans. As JIRA is down, I will update title to link with corresponding JIRA later. ## How was this patch tested? Add new unit tests. Author: [email protected] <[email protected]> Closes apache#16566 from wangmiao1981/bk.
## What changes were proposed in this pull request? Add R wrapper for bisecting Kmeans. As JIRA is down, I will update title to link with corresponding JIRA later. ## How was this patch tested? Add new unit tests. Author: [email protected] <[email protected]> Closes apache#16566 from wangmiao1981/bk.
What changes were proposed in this pull request?
Add R wrapper for bisecting Kmeans.
As JIRA is down, I will update title to link with corresponding JIRA later.
How was this patch tested?
Add new unit tests.