-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19456][SparkR]:Add LinearSVC R API #16800
Conversation
Test build #72339 has finished for PR 16800 at commit
|
Test build #72340 has finished for PR 16800 at commit
|
retest this please |
Test build #72361 has finished for PR 16800 at commit
|
R/pkg/R/generics.R
Outdated
@@ -1376,6 +1376,10 @@ setGeneric("spark.kstest", function(data, ...) { standardGeneric("spark.kstest") | |||
#' @export | |||
setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") }) | |||
|
|||
#' @rdname spark.linearSvc | |||
#' @export | |||
setGeneric("spark.linearSvc", function(data, formula, ...) { standardGeneric("spark.linearSvc") }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any name more R familiar that could be more fitting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
svm with linear kernel in glmnet package has the same functionality as this one. But they support other kernels. linearSvm
is better than linearSvc
(c denotes classifier)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe svmLinear
? http://topepo.github.io/caret/available-models.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
svmLinear
looks fine. I will change the files tomorrow. Thanks!
import LinearSVCWrapper._ | ||
|
||
private val svcModel: LinearSVCModel = | ||
pipeline.stages(1).asInstanceOf[LinearSVCModel] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last state is id_to_index. So I need to use stages(1) to get the fitted model.
expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character") | ||
expected <- c("versicolor", "versicolor", "versicolor", "virginica", "virginica", | ||
"virginica", "virginica", "virginica", "virginica", "virginica") | ||
expect_equal(sort(as.list(take(select(prediction, "prediction"), 10))[[1]]), expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add sort
to make it stable.
Test build #72492 has finished for PR 16800 at commit
|
Test build #72529 has finished for PR 16800 at commit
|
Test build #72608 has finished for PR 16800 at commit
|
Test build #72613 has finished for PR 16800 at commit
|
close to trigger windows test |
Open to trigger |
@felixcheung I have addressed the comments. cc @yanboliang @hhbyyh Thanks! |
R/pkg/R/mllib_classification.R
Outdated
formula <- paste(deparse(formula), collapse = "") | ||
|
||
if (is.null(weightCol)) { | ||
weightCol <- "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is ""
valid as weightCol on the model? would it be better to pass NULL and check for null at the wrapper, if null don't call setWeightCol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is the same in spark.logit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think null is better. There are several places like this. Let me double check. Then, I will fix them all in another PR. Thanks!
R/pkg/R/mllib_classification.R
Outdated
#' @note spark.svmLinear since 2.2.0 | ||
setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formula"), | ||
function(data, formula, regParam = 0.0, maxIter = 100, tol = 1E-6, standardization = TRUE, | ||
threshold = 0.5, weightCol = NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
threshold defaults to 0.0 on the Scala side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
R/pkg/R/mllib_classification.R
Outdated
#' to the same solution when no regularization is applied. Default is TRUE, same as glmnet. | ||
#' @param threshold The threshold in binary classification, in range [0, 1]. | ||
#' @param weightCol The weight column name. | ||
#' @param ... additional arguments passed to the method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think we should add the expert param aggregationDepth
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other algorithms, we don't add aggregationDepth
. Shall I add them in another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that would hurt. We have expert params in tree models.
private val svcModel: LinearSVCModel = | ||
pipeline.stages(1).asInstanceOf[LinearSVCModel] | ||
|
||
lazy val coefficients: Array[Double] = svcModel.coefficients.toArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this handle coefficients like in SPARK-19395?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one different. We just want to return it as an Array (list in R).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made the same change as 19395.
R/pkg/R/mllib_classification.R
Outdated
#' @param standardization Whether to standardize the training features before fitting the model. The coefficients | ||
#' of models will be always returned on the original scale, so it will be transparent for | ||
#' users. Note that with/without standardization, the models should be always converged | ||
#' to the same solution when no regularization is applied. Default is TRUE, same as glmnet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this "same as glmnet" correct here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
glmnet help message: standardize: Logical flag for x variable standardization, prior to
fitting the model sequence. The coefficients are always
returned on the original scale. Default is
‘standardize=TRUE’.
I think they are the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but my point is glmnet is linear regression whereas here we are linear svc?
isn't it not a very good reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got your point now. I removed the unnecessary document. Thanks!
R/pkg/R/mllib_utils.R
Outdated
@@ -35,7 +35,8 @@ | |||
#' @seealso \link{spark.als}, \link{spark.bisectingKmeans}, \link{spark.gaussianMixture}, | |||
#' @seealso \link{spark.gbt}, \link{spark.glm}, \link{glm}, \link{spark.isoreg}, | |||
#' @seealso \link{spark.kmeans}, | |||
#' @seealso \link{spark.lda}, \link{spark.logit}, \link{spark.mlp}, \link{spark.naiveBayes}, | |||
#' @seealso \link{spark.lda}, \link{spark.logit}, \link{spark.svmLinear}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will sort it. Before, it is named as linearSvc
and I use editor to replace them automatically. Thanks!
R/pkg/R/mllib_utils.R
Outdated
@@ -50,7 +51,7 @@ NULL | |||
#' @seealso \link{spark.als}, \link{spark.bisectingKmeans}, \link{spark.gaussianMixture}, | |||
#' @seealso \link{spark.gbt}, \link{spark.glm}, \link{glm}, \link{spark.isoreg}, | |||
#' @seealso \link{spark.kmeans}, | |||
#' @seealso \link{spark.logit}, \link{spark.mlp}, \link{spark.naiveBayes}, | |||
#' @seealso \link{spark.logit}, \link{spark.svmLinear}, \link{spark.mlp}, \link{spark.naiveBayes}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort?
Test build #72838 has finished for PR 16800 at commit
|
Test build #72841 has finished for PR 16800 at commit
|
Test build #72860 has finished for PR 16800 at commit
|
R/pkg/R/generics.R
Outdated
@@ -1380,6 +1380,10 @@ setGeneric("spark.kstest", function(data, ...) { standardGeneric("spark.kstest") | |||
#' @export | |||
setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") }) | |||
|
|||
#' @rdname spark.svmLinear | |||
#' @export | |||
setGeneric("spark.svmLinear", function(data, formula, ...) { standardGeneric("spark.svmLinear") }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, sorry I missed this - we should sort this too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks!
R/pkg/R/mllib_classification.R
Outdated
#' @note spark.svmLinear since 2.2.0 | ||
setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formula"), | ||
function(data, formula, regParam = 0.0, maxIter = 100, tol = 1E-6, standardization = TRUE, | ||
threshold = 0.5, weightCol = NULL, aggregationDepth = 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we change threashold = 0.0 to match scala as discussed here #16800 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for miss this one. I fix it now.
Test build #72905 has finished for PR 16800 at commit
|
merged to master. could you open a JIRA on programming guide, example, vignettes changes please. Thanks |
## What changes were proposed in this pull request? Linear SVM classifier is newly added into ML and python API has been added. This JIRA is to add R side API. Marked as WIP, as I am designing unit tests. ## How was this patch tested? Please review http://spark.apache.org/contributing.html before opening a pull request. Author: [email protected] <[email protected]> Closes apache#16800 from wangmiao1981/svc.
@felixcheung I will do the example and vignettes today. For the document, I will wait for @hhbyyh to merge his main document first. Thanks! |
…ed for some SparkR APIs ## What changes were proposed in this pull request? This is a follow-up PR of #16800 When doing SPARK-19456, we found that "" should be consider a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter. ## How was this patch tested? Existing tests. Author: [email protected] <[email protected]> Closes #16945 from wangmiao1981/svc.
…ed for some SparkR APIs ## What changes were proposed in this pull request? This is a follow-up PR of apache#16800 When doing SPARK-19456, we found that "" should be consider a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter. ## How was this patch tested? Existing tests. Author: [email protected] <[email protected]> Closes apache#16945 from wangmiao1981/svc.
What changes were proposed in this pull request?
Linear SVM classifier is newly added into ML and python API has been added. This JIRA is to add R side API.
Marked as WIP, as I am designing unit tests.
How was this patch tested?
Please review http://spark.apache.org/contributing.html before opening a pull request.