
[SPARK-19456][SparkR]:Add LinearSVC R API #16800

Closed
wants to merge 12 commits

Conversation

wangmiao1981
Contributor

What changes were proposed in this pull request?

A Linear SVM classifier was newly added to ML, and a Python API has been added. This JIRA is to add the R-side API.

Marked as WIP, as I am designing unit tests.

How was this patch tested?

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72339 has finished for PR 16800 at commit a4dceec.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72340 has finished for PR 16800 at commit 5b4c406.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72361 has finished for PR 16800 at commit 5b4c406.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1376,6 +1376,10 @@ setGeneric("spark.kstest", function(data, ...) { standardGeneric("spark.kstest")
#' @export
setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") })

#' @rdname spark.linearSvc
#' @export
setGeneric("spark.linearSvc", function(data, formula, ...) { standardGeneric("spark.linearSvc") })
Member

Is there any name more familiar to R users that would be a better fit here?

Contributor Author

svm with a linear kernel in the glmnet package has the same functionality as this one, but they also support other kernels. Is linearSvm better than linearSvc (the c denotes classifier)?

Contributor Author

svmLinear looks fine. I will change the files tomorrow. Thanks!

@wangmiao1981 wangmiao1981 changed the title [SPARK-19456][SparkR][WIP]:Add LinearSVC R API [SPARK-19456][SparkR]:Add LinearSVC R API Feb 7, 2017
import LinearSVCWrapper._

private val svcModel: LinearSVCModel =
pipeline.stages(1).asInstanceOf[LinearSVCModel]
Contributor Author

The last stage is the index-to-label transformer, so I need to use stages(1) to get the fitted model.
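The stage-indexing point above can be sketched as follows (a hypothetical pipeline layout for illustration; stage names are assumptions, not taken from the actual wrapper code):

```python
# Sketch: in the fitted pipeline, the trained classifier sits between the
# label indexer and the index-to-label decoder, so it is fetched by
# position 1 rather than as the last stage.
fitted_stages = ["StringIndexerModel", "LinearSVCModel", "IndexToString"]

def svc_model(stages):
    # stages[-1] would return the label decoder, not the classifier.
    return stages[1]

print(svc_model(fitted_stages))  # the classifier stage
```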

expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character")
expected <- c("versicolor", "versicolor", "versicolor", "virginica", "virginica",
"virginica", "virginica", "virginica", "virginica", "virginica")
expect_equal(sort(as.list(take(select(prediction, "prediction"), 10))[[1]]), expected)
Contributor Author

Added sort to make the test stable.

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72492 has finished for PR 16800 at commit e6eea1d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72529 has finished for PR 16800 at commit 35e669e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72608 has finished for PR 16800 at commit a180126.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72613 has finished for PR 16800 at commit bbc72e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

Close to trigger the Windows test.

@wangmiao1981
Contributor Author

Open to trigger

@wangmiao1981 wangmiao1981 reopened this Feb 9, 2017
@wangmiao1981
Contributor Author

@felixcheung I have addressed the comments. cc @yanboliang @hhbyyh Thanks!

formula <- paste(deparse(formula), collapse = "")

if (is.null(weightCol)) {
weightCol <- ""
Member

Is "" valid as weightCol on the model? Would it be better to pass NULL and check for null in the wrapper, and if null, not call setWeightCol?

Member

I guess this is the same in spark.logit

Contributor Author

I think null is better. There are several places like this. Let me double check. Then, I will fix them all in another PR. Thanks!
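The NULL-versus-empty-string convention under discussion can be sketched like this (a minimal stand-in estimator, not the real LinearSVC API; the class and method names are mine):

```python
class StubEstimator:
    """Stand-in for an estimator with an optional weight column (assumed API)."""
    def __init__(self):
        self.weight_col = None

    def set_weight_col(self, name):
        self.weight_col = name
        return self

def fit_wrapper(estimator, weight_col=None):
    # Pass None for "no weights" and skip the setter entirely, instead of
    # pushing an empty string "" down to the model.
    if weight_col is not None:
        estimator.set_weight_col(weight_col)
    return estimator

unweighted = fit_wrapper(StubEstimator())
weighted = fit_wrapper(StubEstimator(), "w")
```

The point of the reviewer's suggestion is that the model never sees a sentinel value; the wrapper owns the "optional" semantics.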

#' @note spark.svmLinear since 2.2.0
setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, regParam = 0.0, maxIter = 100, tol = 1E-6, standardization = TRUE,
threshold = 0.5, weightCol = NULL) {
Member

threshold defaults to 0.0 on the Scala side?

Contributor Author

Yes
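A sketch of why 0.0 is the natural Scala-side default: for a linear SVM the threshold is compared against the raw margin w·x + b, which is signed around zero. The helper below is hypothetical, not Spark's code:

```python
def svm_predict(features, weights, bias, threshold=0.0):
    # Raw margin of a linear SVM; a margin above the threshold -> class 1.0.
    margin = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 if margin > threshold else 0.0

# margin = 0.5*2.0 + (-1.0)*1.0 + 0.2 = 0.2 -> class 1.0 at the default threshold
print(svm_predict([2.0, 1.0], [0.5, -1.0], 0.2))
```

This is unlike logistic regression, where the threshold cuts a probability in [0, 1] and 0.5 is the natural default.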

#' to the same solution when no regularization is applied. Default is TRUE, same as glmnet.
#' @param threshold The threshold in binary classification, in range [0, 1].
#' @param weightCol The weight column name.
#' @param ... additional arguments passed to the method.
Member

do you think we should add the expert param aggregationDepth?

Contributor Author

In other algorithms, we don't add aggregationDepth. Shall I add it in another PR?

Member

I don't think that would hurt. We have expert params in tree models.

private val svcModel: LinearSVCModel =
pipeline.stages(1).asInstanceOf[LinearSVCModel]

lazy val coefficients: Array[Double] = svcModel.coefficients.toArray
Member

should this handle coefficients like in SPARK-19395?

Contributor Author

I think this one is different. We just want to return it as an Array (a list in R).

Contributor Author

I have made the same change as SPARK-19395.
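The SPARK-19395-style change mentioned above (returning coefficients paired with feature names rather than as a bare array) can be sketched as follows; the helper name and shapes are illustrative assumptions:

```python
def named_coefficients(feature_names, coefs, intercept=None):
    # Pair each feature name with its coefficient, prepending "(Intercept)"
    # when one is present, mirroring R's summary-table convention.
    rows = []
    if intercept is not None:
        rows.append(("(Intercept)", intercept))
    rows.extend(zip(feature_names, coefs))
    return rows

print(named_coefficients(["x1", "x2"], [0.4, -1.3], intercept=0.1))
```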

#' @param standardization Whether to standardize the training features before fitting the model. The coefficients
#' of models will be always returned on the original scale, so it will be transparent for
#' users. Note that with/without standardization, the models should be always converged
#' to the same solution when no regularization is applied. Default is TRUE, same as glmnet.
Member

is this "same as glmnet" correct here?

Contributor Author

Let me check.

Contributor Author

glmnet help message: "standardize: Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is 'standardize=TRUE'."
I think they are the same.

Member

But my point is that glmnet is linear regression, whereas here we have a linear SVC. Doesn't that make it a poor reference?

Contributor Author

I got your point now. I removed the unnecessary documentation. Thanks!
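The "coefficients are always returned on the original scale" behavior discussed above amounts to a simple back-transform after fitting on standardized features. A sketch under that assumption (the helper name is mine, not Spark's):

```python
def to_original_scale(coef_std, intercept_std, means, sds):
    # A model fit on z = (x - mean) / sd has coefficients on the z scale;
    # dividing by sd recovers original-scale coefficients, and the intercept
    # is shifted accordingly, so standardization is transparent to users.
    coef = [c / s for c, s in zip(coef_std, sds)]
    intercept = intercept_std - sum(c * m for c, m in zip(coef, means))
    return coef, intercept

# One feature with mean 3.0 and sd 2.0: a z-scale coefficient of 2.0
# maps back to 1.0, and the intercept shifts from 1.0 to -2.0.
print(to_original_scale([2.0], 1.0, means=[3.0], sds=[2.0]))
```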

@@ -35,7 +35,8 @@
#' @seealso \link{spark.als}, \link{spark.bisectingKmeans}, \link{spark.gaussianMixture},
#' @seealso \link{spark.gbt}, \link{spark.glm}, \link{glm}, \link{spark.isoreg},
#' @seealso \link{spark.kmeans},
#' @seealso \link{spark.lda}, \link{spark.logit}, \link{spark.mlp}, \link{spark.naiveBayes},
#' @seealso \link{spark.lda}, \link{spark.logit}, \link{spark.svmLinear},
Member

sort?

Contributor Author

OK, I will sort it. It was previously named linearSvc, and I used the editor to replace the names automatically. Thanks!

@@ -50,7 +51,7 @@ NULL
#' @seealso \link{spark.als}, \link{spark.bisectingKmeans}, \link{spark.gaussianMixture},
#' @seealso \link{spark.gbt}, \link{spark.glm}, \link{glm}, \link{spark.isoreg},
#' @seealso \link{spark.kmeans},
#' @seealso \link{spark.logit}, \link{spark.mlp}, \link{spark.naiveBayes},
#' @seealso \link{spark.logit}, \link{spark.svmLinear}, \link{spark.mlp}, \link{spark.naiveBayes},
Member

sort?

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72838 has finished for PR 16800 at commit a00f3fd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72841 has finished for PR 16800 at commit 2fd5047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72860 has finished for PR 16800 at commit 2e7cec8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1380,6 +1380,10 @@ setGeneric("spark.kstest", function(data, ...) { standardGeneric("spark.kstest")
#' @export
setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") })

#' @rdname spark.svmLinear
#' @export
setGeneric("spark.svmLinear", function(data, formula, ...) { standardGeneric("spark.svmLinear") })
Member

oops, sorry I missed this - we should sort this too

Contributor Author

Fixed. Thanks!

#' @note spark.svmLinear since 2.2.0
setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, regParam = 0.0, maxIter = 100, tol = 1E-6, standardization = TRUE,
threshold = 0.5, weightCol = NULL, aggregationDepth = 2) {
Member

Shouldn't we change threshold to 0.0 to match Scala, as discussed here: #16800 (comment)?

Contributor Author

Sorry for missing this one. I'll fix it now.

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72905 has finished for PR 16800 at commit 9b2147f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

Merged to master. Could you please open a JIRA for the programming guide, example, and vignettes changes? Thanks.

@asfgit asfgit closed this in 3973403 Feb 15, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Author: [email protected] <[email protected]>

Closes apache#16800 from wangmiao1981/svc.
@wangmiao1981
Contributor Author

@felixcheung I will do the example and vignettes today. For the document, I will wait for @hhbyyh to merge his main documentation first. Thanks!

asfgit pushed a commit that referenced this pull request Feb 22, 2017
…ed for some SparkR APIs

## What changes were proposed in this pull request?

This is a follow-up PR of #16800

When doing SPARK-19456, we found that "" should be considered a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.

## How was this patch tested?
Existing tests.

Author: [email protected] <[email protected]>

Closes #16945 from wangmiao1981/svc.
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017