[SPARK-11715][SPARKR] Add R support corr for Column Aggregration #9680
Test build #45789 has finished for PR 9680 at commit
I think #9366 is about computing a corr or cov matrix, whereas this computes corr between two columns. Both seem useful in their own ways. Also, this is already supported in Scala and Python.
In R the general formula for correlation is the following:
These are two different issues.
#' @family math_funcs
#' @export
#' @examples \dontrun{corr(df$c, df$d)}
setMethod("corr", signature(x = "Column", col1 = "Column", col2 = "missing", method = "missing"),
This signature looks confusing. It may be better to change the generic function definition of "corr":
setGeneric("corr", function(x, ...) { standardGeneric("corr") })
setMethod("corr",
          signature(x = "DataFrame"),
          function(x, col1, col2, method = "pearson") {
            ...
          })
setMethod("corr", signature(x = "characterOrColumn"),
          function(x, col2) {
            ...
          })
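To illustrate the suggestion above, here is a minimal sketch of a single generic declared as function(x, ...) with two methods that take different extra arguments. ToyDataFrame, ToyColumn and toyCorr are hypothetical stand-ins invented for this sketch; they are not the SparkR classes or the code in this PR, and the method bodies simply delegate to stats::cor on local data.

```r
library(methods)

# Hypothetical stand-in classes; NOT the real SparkR DataFrame/Column.
setClass("ToyDataFrame", representation(data = "data.frame"))
setClass("ToyColumn", representation(values = "numeric"))

# The generic names only `x`; everything else is passed through `...`,
# so each method is free to declare its own additional arguments.
setGeneric("toyCorr", function(x, ...) { standardGeneric("toyCorr") })

# Data-frame flavour: column names plus a method argument.
setMethod("toyCorr", signature(x = "ToyDataFrame"),
          function(x, col1, col2, method = "pearson") {
            stats::cor(x@data[[col1]], x@data[[col2]], method = method)
          })

# Column flavour: a second column object, no method argument.
setMethod("toyCorr", signature(x = "ToyColumn"),
          function(x, col2) {
            stats::cor(x@values, col2@values)
          })

df <- new("ToyDataFrame", data = data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 9)))
fromFrame <- toyCorr(df, "a", "b")
fromColumns <- toyCorr(new("ToyColumn", values = c(1, 2, 3, 4)),
                       new("ToyColumn", values = c(2, 4, 6, 9)))
```

Because dispatch happens on x alone, the two methods do not need "missing" entries in their signatures, which is the readability gain the comment is after.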
One concern: how is the documentation for these "corr" methods generated?
Right, I like the approach of changing the existing generic definition.
Perhaps we should align the method signature with stats::cor:
cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))
Do you know why we decided to name it corr (vs. cor) in other places?
As for the docs, the DataFrame corr in stats.R has @rdname statfunctions, while this one has @rdname corr, so they go to different HTML pages generated by roxygen2.
Maybe we can add cor() as an alias for corr(), as you did in #9489.
Thinking more about this, I think what's being added in #9366 matches https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html better. When we add that support in R, we could add it as cor, matching stats::cor.
Meanwhile I'll change corr to what you suggested, with function(x, ...).
#9366 only supports inter-column cov and cor within a single DataFrame, not between columns of two DataFrames. I actually think it is better to add the alias in this PR: corr() operating on two columns is similar to R's cor() on two vectors.
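For reference, the two-vector case mentioned above looks like this in base R. This is a small sketch with made-up sample values; it only uses stats::cor and checks its default Pearson result against the textbook definition. It does not involve SparkR.

```r
# Made-up sample data, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# stats::cor on two vectors; method defaults to "pearson".
rBuiltin <- stats::cor(x, y)

# Pearson correlation written out: covariance term over the
# product of the two standard-deviation terms.
dx <- x - mean(x)
dy <- y - mean(y)
rManual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The two computations should agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(rBuiltin, rManual)))
```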
@sun-rui I updated it. I think it's a bit not as strongly typed as I'd like but if I add
Test build #46055 has finished for PR 9680 at commit
Any more comments?
#' @family math_funcs
#' @export
#' @examples \dontrun{corr(df$c, df$d)}
setMethod("corr", signature(x = "Column"),
There are two versions of corr():
def corr(column1: Column, column2: Column): Column
def corr(columnName1: String, columnName2: String): Column
We'd better support both. Something like:
setMethod("corr", signature(x = "characterOrColumn"),
That's the same for count, max, mean and so on, so we would need to change every function here. Should we do that?
@shivaram Can we go ahead with this? I think we could consider adding character overloads for all DataFrame functions in a different JIRA.
Adding character overloads for all DataFrame functions in a different JIRA is OK. But as for the alias of corr(): #9366 only supports inter-column cov and cor within a single DataFrame, not between columns of two DataFrames. I actually think it is better to add the alias in this PR. corr() operating on two columns is similar to R's cor() on two vectors.
As per this https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html
Also, since here we are working with 2 columns, by adding an alias
@sun-rui ? I'm fine with adding
@felixcheung, sorry for the late response. Since there is no agreement yet, I am fine with not adding the "cor" alias in this PR. Let's get this PR merged. Could you submit a new JIRA covering both the "cor" alias and the existing "cov", which masks stats::cov?
LGTM
stats::cov name conflict: https://issues.apache.org/jira/browse/SPARK-11886
Thanks, rebased.
Test build #46911 has finished for PR 9680 at commit
The second is a git error (we seem to be getting a lot of these lately?)
Jenkins, retest this please
Test build #46922 has finished for PR 9680 at commit
LGTM. @felixcheung I think the current resolution of not adding Merging this to master, branch-1.6
Need to match existing method signature
Author: felixcheung <[email protected]>
Closes #9680 from felixcheung/rcorr.
(cherry picked from commit 895b6c4)
Signed-off-by: Shivaram Venkataraman <[email protected]>