[SPARK-19319][SparkR]:SparkR Kmeans summary returns error when the cluster size doesn't equal to k #16666

wangmiao1981 · 2017-01-21T01:07:34Z

What changes were proposed in this pull request?

When Kmeans is run with initMode = "random" and certain random seeds, the actual number of clusters can differ from the configured k.

In this case, summary(model) returns an error because the number of columns of the coefficient matrix does not equal k.

Example:

col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
cols <- as.data.frame(cbind(col1, col2, col3))
df <- createDataFrame(cols)

model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)

summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
  data length [9] is not a sub-multiple or multiple of the number of rows [2]

Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.
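
The failure and the fix can be illustrated without Spark. The following base R sketch is not the SparkR source: `build_centers`, `n_centers`, `actual_centers`, and the placeholder coefficient values are hypothetical, and the transpose step is an assumption about how the centers are laid out. It only mirrors the `matrix(coefficients, ncol = k)` call named in the warning above, using 9 coefficients (3 centers of 3 features) against k = 5.

```r
# Hypothetical stand-ins for what the fitted model reports.
features <- c("col1", "col2", "col3")
k <- 5                 # configured number of clusters
actual_centers <- 3    # assumed number of centers actually produced
coefficients <- as.list(seq_len(actual_centers * length(features)))

# Build a centers-by-features matrix from a flat coefficient list.
build_centers <- function(coefficients, n_centers, features) {
  centers <- t(matrix(unlist(coefficients), ncol = n_centers))
  colnames(centers) <- features
  rownames(centers) <- seq_len(n_centers)
  centers
}

# Sizing by the configured k: 9 values do not fill 5 columns, so naming fails.
broken <- tryCatch(
  suppressWarnings(build_centers(coefficients, k, features)),
  error = function(e) conditionMessage(e)
)

# Sizing by the actual number of centers gives a 3 x 3 matrix.
fixed <- build_centers(coefficients, actual_centers, features)
print(broken)
print(dim(fixed))
```

Sizing the matrix by the number of centers actually returned, rather than by the requested k, keeps the column names aligned with the features whatever the initialization produced.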

How was this patch tested?

Add unit tests.

SparkQA · 2017-01-21T02:22:15Z

Test build #71750 has finished for PR 16666 at commit 2c1d02d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-21T06:56:08Z

ah - does bisecting kmeans have the same behavior?

felixcheung · 2017-01-21T06:58:48Z

R/pkg/R/mllib_clustering.R

-#'         (cluster centers of the transformed data).
+#'         \code{size} (number of data points in each cluster), \code{cluster}
+#'         (cluster centers of the transformed data), and \code{clusterSize}
+#'         (the actual number of cluster centers. When using initMode = "random",


let's add is.loaded here

wangmiao1981 · 2017-01-23

OK, I will add it. For bisecting kmeans, I haven't found a case like this. It only occurs when initMode is "random", and the behavior results from a fix to the kmeans implementation.
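
A caller could guard against this case by comparing the two counts. The sketch below uses a plain list as a stand-in for the object returned by summary(model). The field name `clusterSize` is taken from the doc change quoted above, while `k` as a field name, the helper `has_fewer_clusters`, and the values in `mock_summary` are assumptions made for illustration.

```r
# Hypothetical helper: TRUE when fewer centers were produced than requested.
has_fewer_clusters <- function(model_summary) {
  model_summary$clusterSize < model_summary$k
}

# Plain list standing in for a summary; the values are illustrative only.
mock_summary <- list(k = 5, clusterSize = 3)

if (has_fewer_clusters(mock_summary)) {
  message("Fewer clusters than requested: ", mock_summary$clusterSize,
          " of ", mock_summary$k)
}
```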

SparkQA · 2017-01-23T07:02:43Z

Test build #71826 has started for PR 16666 at commit d1a2d6c.

wangmiao1981 · 2017-01-23T17:54:38Z

Jenkins, retest this please.

SparkQA · 2017-01-23T18:58:10Z

Test build #71864 has finished for PR 16666 at commit d1a2d6c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-23T18:59:41Z

LGTM

wangmiao1981 · 2017-01-31T21:28:38Z

ping @felixcheung

SparkQA · 2017-01-31T22:37:51Z

Test build #72207 has finished for PR 16666 at commit 2110536.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-01T00:19:36Z

Test build #72214 has finished for PR 16666 at commit 72fb951.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-02-01T05:19:12Z

merged, thanks!
I think it'd be good to have this in branch-2.1 - @wangmiao1981 would you by any chance like to backport this fix?

wangmiao1981 · 2017-02-01T05:40:54Z

I will backport it soon. Thanks!

…or when the cluster size doesn't equal to k

## What changes were proposed in this pull request?

Backport fix of #16666

## How was this patch tested?

Backport unit tests

Author: [email protected] <[email protected]>

Closes #16761 from wangmiao1981/kmeansport.

…uster size doesn't equal to k

Author: [email protected] <[email protected]>

Closes apache#16666 from wangmiao1981/kmeans.

fix kmeans bug

2c1d02d

felixcheung reviewed Jan 21, 2017

add is.loaded in comment

d1a2d6c

wangmiao1981 force-pushed the kmeans branch from 2110536 to d1a2d6c, January 31, 2017 23:12

Merge github.com:apache/spark into kmeans

52c9eb1

wangmiao1981 force-pushed the kmeans branch from ddaafce to 52c9eb1, January 31, 2017 23:14

add fix

72fb951

asfgit closed this in 9ac0522 Feb 1, 2017

wangmiao1981 mentioned this pull request Feb 1, 2017

[BackPort-2.1][SPARK-19319][SparkR]:SparkR Kmeans summary returns error when the cluster size doesn't equal to k #16761

Closed
