[SPARK-19319][SparkR]:SparkR Kmeans summary returns error when the cluster size doesn't equal to k #16666
Test build #71750 has finished for PR 16666 at commit
ah - does bisecting kmeans have the same behavior?
R/pkg/R/mllib_clustering.R
Outdated
#' (cluster centers of the transformed data).
#' \code{size} (number of data points in each cluster), \code{cluster}
#' (cluster centers of the transformed data), and \code{clusterSize}
#' (the actual number of cluster centers. When using initMode = "random",
let's add is.loaded here
OK. I will add it. For bisecting kmeans, I haven't found a case like this. This case only occurs when initMode is random and this behavior was due to one fix to kmeans implementation.
Test build #71826 has started for PR 16666 at commit
Jenkins, retest this please.
Test build #71864 has finished for PR 16666 at commit
LGTM
ping @felixcheung
Test build #72207 has finished for PR 16666 at commit
Test build #72214 has finished for PR 16666 at commit
merged, thanks!
I will backport it soon. Thanks!
…or when the cluster size doesn't equal to k

## What changes were proposed in this pull request?

Backport fix of #16666

## How was this patch tested?

Backport unit tests

Author: [email protected] <[email protected]>

Closes #16761 from wangmiao1981/kmeansport.
…uster size doesn't equal to k

## What changes were proposed in this pull request

When Kmeans using initMode = "random" and some random seed, it is possible the actual cluster size doesn't equal to the configured `k`.

In this case, summary(model) returns error due to the number of cols of coefficient matrix doesn't equal to k.

Example:

> col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> cols <- as.data.frame(cbind(col1, col2, col3))
> df <- createDataFrame(cols)
>
> model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)
>
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
  data length [9] is not a sub-multiple or multiple of the number of rows [2]

Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.

## How was this patch tested?

Add unit tests.

Author: [email protected] <[email protected]>

Closes apache#16666 from wangmiao1981/kmeans.
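What changes were proposed in this pull request?
When KMeans is run with initMode = "random" and certain random seeds, the actual number of clusters can differ from the configured k. In that case, summary(model) returns an error because the coefficient matrix is built with k columns even though the model holds fewer cluster centers.
Example:
> col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> cols <- as.data.frame(cbind(col1, col2, col3))
> df <- createDataFrame(cols)
>
> model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)
>
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
  data length [9] is not a sub-multiple or multiple of the number of rows [2]
Fix: Get the actual number of clusters in the summary and use it to build the coefficient matrix.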
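The shape mismatch can be reproduced in base R without a Spark session. The following is a minimal sketch, not the code from the patch: `build_centers`, the placeholder `coefficients` values, and deriving `cluster_size` from the vector lengths are all assumptions made for illustration (the description above says the summary obtains the actual cluster count from the model itself).

```r
# Hypothetical stand-ins for what a fitted model would report.
features <- c("col1", "col2", "col3")
k <- 5                            # configured number of clusters
coefficients <- as.numeric(1:9)   # 3 actual centers x 3 features, flattened

# Hypothetical helper: reshape the flat vector into one row per center.
build_centers <- function(coefficients, features, n_clusters) {
  centers <- t(matrix(coefficients, ncol = n_clusters))
  colnames(centers) <- features
  rownames(centers) <- seq_len(n_clusters)
  centers
}

# Reshaping with the configured k yields a 5 x 2 matrix, so assigning
# three column names fails; the error message is captured as a string.
old <- tryCatch(
  suppressWarnings(build_centers(coefficients, features, k)),
  error = function(e) conditionMessage(e)
)

# Reshaping with the actual number of centers works. Here the count is
# derived from the lengths, which is an assumption of this sketch.
cluster_size <- length(coefficients) / length(features)
new <- build_centers(coefficients, features, cluster_size)
print(new)
```

With `k = 5`, the nine values cannot fill a five-column matrix evenly, which is the source of both the warning and the `dimnames` error in the example; with the actual count of 3, each row of `new` is one cluster center.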
How was this patch tested?
Add unit tests.