[SPARK-19319][SparkR]:SparkR Kmeans summary returns error when the cluster size doesn't equal to k #16666

Closed
wants to merge 4 commits into from

@wangmiao1981
Contributor

wangmiao1981 commented Jan 21, 2017

What changes were proposed in this pull request?

When k-means is run with initMode = "random" and certain random seeds, the actual number of clusters may not equal the configured k.

In this case, summary(model) returns an error because the number of columns of the coefficient matrix does not equal k.

Example:

col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
cols <- as.data.frame(cbind(col1, col2, col3))
df <- createDataFrame(cols)

model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)

summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
  data length [9] is not a sub-multiple or multiple of the number of rows [2]

Fix: Get the actual number of clusters (clusterSize) in the summary and use it, instead of k, to build the coefficient matrix.
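
The failure mode can be reproduced in plain R without Spark. The following is a minimal sketch, not the actual SparkR code or the patch itself: the `centers` values are made up for illustration, and it assumes the flattened cluster centers hold 3 clusters x 3 features (9 values, matching the warning above) while `k` is 5.

```r
# Illustrative sketch in base R; the values in `centers` are hypothetical.
features <- c("col1", "col2", "col3")
k <- 5

# Flattened centers for 3 actual clusters x 3 features (9 values).
centers <- c(0, 0, 0, 1.5, 1.5, 1.5, 3.5, 3.5, 3.5)

# Shaping with ncol = k gives a 2 x 5 matrix (with a recycling warning),
# so the transposed matrix has 2 columns and cannot take 3 column names.
broken <- tryCatch({
  m <- suppressWarnings(t(matrix(centers, ncol = k)))
  colnames(m) <- features
  m
}, error = function(e) conditionMessage(e))
print(broken)

# Shaping with the actual number of clusters gives one row per cluster.
clusterSize <- length(centers) / length(features)
fixed <- t(matrix(centers, ncol = clusterSize))
colnames(fixed) <- features
print(fixed)
```

In this sketch `broken` holds the error message rather than a matrix, while `fixed` has one row per actual cluster and one column per feature.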

How was this patch tested?

Add unit tests.

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71750 has finished for PR 16666 at commit 2c1d02d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

ah - does bisecting kmeans have the same behavior?

#' (cluster centers of the transformed data).
#' \code{size} (number of data points in each cluster), \code{cluster}
#' (cluster centers of the transformed data), and \code{clusterSize}
#' (the actual number of cluster centers. When using initMode = "random",
Member

let's add is.loaded here

Contributor Author

OK, I will add it. For bisecting kmeans, I haven't found a case like this. It only occurs when initMode is random, and the behavior was introduced by a fix to the kmeans implementation.

@SparkQA

SparkQA commented Jan 23, 2017

Test build #71826 has started for PR 16666 at commit d1a2d6c.

@wangmiao1981
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 23, 2017

Test build #71864 has finished for PR 16666 at commit d1a2d6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

LGTM

@wangmiao1981
Contributor Author

ping @felixcheung

@SparkQA

SparkQA commented Jan 31, 2017

Test build #72207 has finished for PR 16666 at commit 2110536.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 1, 2017

Test build #72214 has finished for PR 16666 at commit 72fb951.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit closed this in 9ac0522 Feb 1, 2017

@felixcheung
Member

merged, thanks!
I think it'd be good to have this in branch-2.1 - @wangmiao1981 would you by any chance like to backport this fix?

@wangmiao1981
Contributor Author

I will backport it soon. Thanks!

asfgit pushed a commit that referenced this pull request Feb 12, 2017
…or when the cluster size doesn't equal to k

## What changes were proposed in this pull request?

Backport fix of #16666

## How was this patch tested?

Backport unit tests

Author: [email protected] <[email protected]>

Closes #16761 from wangmiao1981/kmeansport.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…uster size doesn't equal to k

Closes apache#16666 from wangmiao1981/kmeans.