
[SPARK-19456][SparkR]:Add LinearSVC R API #16800

Closed
wants to merge 12 commits

Conversation

wangmiao1981
Contributor

What changes were proposed in this pull request?

A Linear SVM classifier was newly added to ML, and a Python API has been added. This JIRA is to add the R-side API.

Marked as WIP, as I am designing unit tests.

How was this patch tested?

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72339 has finished for PR 16800 at commit a4dceec.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72340 has finished for PR 16800 at commit 5b4c406.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72361 has finished for PR 16800 at commit 5b4c406.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1376,6 +1376,10 @@ setGeneric("spark.kstest", function(data, ...) { standardGeneric("spark.kstest")
#' @export
setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") })

#' @rdname spark.linearSvc
#' @export
setGeneric("spark.linearSvc", function(data, formula, ...) { standardGeneric("spark.linearSvc") })
Member

Is there any name more familiar to R users that would be a better fit here?

Contributor Author

svm with a linear kernel in the glmnet package has the same functionality as this one, but they also support other kernels. Is linearSvm better than linearSvc (the c denotes classifier)?

Contributor Author

svmLinear looks fine. I will change the files tomorrow. Thanks!

@wangmiao1981 wangmiao1981 changed the title [SPARK-19456][SparkR][WIP]:Add LinearSVC R API [SPARK-19456][SparkR]:Add LinearSVC R API Feb 7, 2017
import LinearSVCWrapper._

private val svcModel: LinearSVCModel =
pipeline.stages(1).asInstanceOf[LinearSVCModel]
Contributor Author

The last stage is the index-to-label transformer, so I need to use stages(1) to get the fitted model.
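The stage-indexing point above can be sketched as follows (a hypothetical pipeline layout for illustration; stage names are assumptions, not taken from the actual wrapper code):

```python
# Sketch: in the fitted pipeline, the trained classifier sits between the
# label indexer and the index-to-label decoder, so it is fetched by
# position 1 rather than as the last stage.
fitted_stages = ["StringIndexerModel", "LinearSVCModel", "IndexToString"]

def svc_model(stages):
    # stages[-1] would return the label decoder, not the classifier.
    return stages[1]

print(svc_model(fitted_stages))  # the classifier stage
```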

expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character")
expected <- c("versicolor", "versicolor", "versicolor", "virginica", "virginica",
"virginica", "virginica", "virginica", "virginica", "virginica")
expect_equal(sort(as.list(take(select(prediction, "prediction"), 10))[[1]]), expected)
Contributor Author

Added sort to make the test stable.

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72492 has finished for PR 16800 at commit e6eea1d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72529 has finished for PR 16800 at commit 35e669e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72608 has finished for PR 16800 at commit a180126.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72613 has finished for PR 16800 at commit bbc72e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

Close to trigger the Windows test.

@wangmiao1981
Contributor Author

Open to trigger

@wangmiao1981 wangmiao1981 reopened this Feb 9, 2017
@wangmiao1981
Contributor Author

@felixcheung I have addressed the comments. cc @yanboliang @hhbyyh Thanks!

formula <- paste(deparse(formula), collapse = "")

if (is.null(weightCol)) {
weightCol <- ""
Member

Is "" valid as weightCol on the model? Would it be better to pass NULL and check for null in the wrapper, and if null, not call setWeightCol?

Member

I guess this is the same in spark.logit

Contributor Author

I think null is better. There are several places like this. Let me double check. Then, I will fix them all in another PR. Thanks!
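The NULL-versus-empty-string convention under discussion can be sketched like this (a minimal stand-in estimator, not the real LinearSVC API; the class and method names are mine):

```python
class StubEstimator:
    """Stand-in for an estimator with an optional weight column (assumed API)."""
    def __init__(self):
        self.weight_col = None

    def set_weight_col(self, name):
        self.weight_col = name
        return self

def fit_wrapper(estimator, weight_col=None):
    # Pass None for "no weights" and skip the setter entirely, instead of
    # pushing an empty string "" down to the model.
    if weight_col is not None:
        estimator.set_weight_col(weight_col)
    return estimator

unweighted = fit_wrapper(StubEstimator())
weighted = fit_wrapper(StubEstimator(), "w")
```

The point of the reviewer's suggestion is that the model never sees a sentinel value; the wrapper owns the "optional" semantics.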

#' @note spark.svmLinear since 2.2.0
setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, regParam = 0.0, maxIter = 100, tol = 1E-6, standardization = TRUE,
threshold = 0.5, weightCol = NULL) {
Member

threshold defaults to 0.0 on the Scala side?

Contributor Author

Yes
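A sketch of why 0.0 is the natural Scala-side default: for a linear SVM the threshold is compared against the raw margin w·x + b, which is signed around zero. The helper below is hypothetical, not Spark's code:

```python
def svm_predict(features, weights, bias, threshold=0.0):
    # Raw margin of a linear SVM; a margin above the threshold -> class 1.0.
    margin = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 if margin > threshold else 0.0

# margin = 0.5*2.0 + (-1.0)*1.0 + 0.2 = 0.2 -> class 1.0 at the default threshold
print(svm_predict([2.0, 1.0], [0.5, -1.0], 0.2))
```

This is unlike logistic regression, where the threshold cuts a probability in [0, 1] and 0.5 is the natural default.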

#' to the same solution when no regularization is applied. Default is TRUE, same as glmnet.
#' @param threshold The threshold in binary classification, in range [0, 1].
#' @param weightCol The weight column name.
#' @param ... additional arguments passed to the method.
Member

do you think we should add the expert param aggregationDepth?

Contributor Author

In other algorithms, we don't add aggregationDepth. Shall I add it in another PR?

Member

I don't think that would hurt. We have expert params in tree models.

private val svcModel: LinearSVCModel =
pipeline.stages(1).asInstanceOf[LinearSVCModel]

lazy val coefficients: Array[Double] = svcModel.coefficients.toArray
Member

should this handle coefficients like in SPARK-19395?

Contributor Author

I think this one is different. We just want to return it as an Array (a list in R).

Contributor Author

I have made the same change as SPARK-19395.
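The SPARK-19395-style change mentioned above (returning coefficients paired with feature names rather than as a bare array) can be sketched as follows; the helper name and shapes are illustrative assumptions:

```python
def named_coefficients(feature_names, coefs, intercept=None):
    # Pair each feature name with its coefficient, prepending "(Intercept)"
    # when one is present, mirroring R's summary-table convention.
    rows = []
    if intercept is not None:
        rows.append(("(Intercept)", intercept))
    rows.extend(zip(feature_names, coefs))
    return rows

print(named_coefficients(["x1", "x2"], [0.4, -1.3], intercept=0.1))
```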

#' @param standardization Whether to standardize the training features before fitting the model. The coefficients
#' of models will be always returned on the original scale, so it will be transparent for
#' users. Note that with/without standardization, the models should be always converged
#' to the same solution when no regularization is applied. Default is TRUE, same as glmnet.
Member

is this "same as glmnet" correct here?

Contributor Author

Let me check.

Contributor Author

glmnet help message: "standardize: Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is 'standardize=TRUE'."
I think they are the same.

Member

But my point is that glmnet is linear regression, whereas here we have a linear SVC. Doesn't that make it a poor reference?

Contributor Author

I got your point now. I removed the unnecessary documentation. Thanks!
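The "coefficients are always returned on the original scale" behavior discussed above amounts to a simple back-transform after fitting on standardized features. A sketch under that assumption (the helper name is mine, not Spark's):

```python
def to_original_scale(coef_std, intercept_std, means, sds):
    # A model fit on z = (x - mean) / sd has coefficients on the z scale;
    # dividing by sd recovers original-scale coefficients, and the intercept
    # is shifted accordingly, so standardization is transparent to users.
    coef = [c / s for c, s in zip(coef_std, sds)]
    intercept = intercept_std - sum(c * m for c, m in zip(coef, means))
    return coef, intercept

# One feature with mean 3.0 and sd 2.0: a z-scale coefficient of 2.0
# maps back to 1.0, and the intercept shifts from 1.0 to -2.0.
print(to_original_scale([2.0], 1.0, means=[3.0], sds=[2.0]))
```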

@@ -35,7 +35,8 @@
#' @seealso \link{spark.als}, \link{spark.bisectingKmeans}, \link{spark.gaussianMixture},
#' @seealso \link{spark.gbt}, \link{spark.glm}, \link{glm}, \link{spark.isoreg},
#' @seealso \link{spark.kmeans},
#' @seealso \link{spark.lda}, \link{spark.logit}, \link{spark.mlp}, \link{spark.naiveBayes},
#' @seealso \link{spark.lda}, \link{spark.logit}, \link{spark.svmLinear},
Member

sort?

Contributor Author

OK, I will sort it. It was previously named linearSvc, and I used the editor to replace the names automatically. Thanks!

@@ -50,7 +51,7 @@ NULL
#' @seealso \link{spark.als}, \link{spark.bisectingKmeans}, \link{spark.gaussianMixture},
#' @seealso \link{spark.gbt}, \link{spark.glm}, \link{glm}, \link{spark.isoreg},
#' @seealso \link{spark.kmeans},
#' @seealso \link{spark.logit}, \link{spark.mlp}, \link{spark.naiveBayes},
#' @seealso \link{spark.logit}, \link{spark.svmLinear}, \link{spark.mlp}, \link{spark.naiveBayes},
Member

sort?

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72838 has finished for PR 16800 at commit a00f3fd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72841 has finished for PR 16800 at commit 2fd5047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72860 has finished for PR 16800 at commit 2e7cec8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1380,6 +1380,10 @@ setGeneric("spark.kstest", function(data, ...) { standardGeneric("spark.kstest")
#' @export
setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") })

#' @rdname spark.svmLinear
#' @export
setGeneric("spark.svmLinear", function(data, formula, ...) { standardGeneric("spark.svmLinear") })
Member

oops, sorry I missed this - we should sort this too

Contributor Author

Fixed. Thanks!

#' @note spark.svmLinear since 2.2.0
setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, regParam = 0.0, maxIter = 100, tol = 1E-6, standardization = TRUE,
threshold = 0.5, weightCol = NULL, aggregationDepth = 2) {
Member

Shouldn't we change threshold to 0.0 to match Scala, as discussed here: #16800 (comment)?

Contributor Author

Sorry for missing this one. I'll fix it now.

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72905 has finished for PR 16800 at commit 9b2147f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

Merged to master. Could you please open a JIRA for the programming guide, example, and vignettes changes? Thanks.

@asfgit asfgit closed this in 3973403 Feb 15, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Author: [email protected] <[email protected]>

Closes apache#16800 from wangmiao1981/svc.
@wangmiao1981
Contributor Author

@felixcheung I will do the example and vignettes today. For the document, I will wait for @hhbyyh to merge his main documentation first. Thanks!

asfgit pushed a commit that referenced this pull request Feb 22, 2017
…ed for some SparkR APIs

## What changes were proposed in this pull request?

This is a follow-up PR of #16800

When doing SPARK-19456, we found that "" should be considered a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.

## How was this patch tested?
Existing tests.

Author: [email protected] <[email protected]>

Closes #16945 from wangmiao1981/svc.
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017