[SPARK-11057] [SQL] Add correlation and covariance matrices #9366
cc @mengxr
Test build #44651 has finished for PR 9366 at commit
Hi guys, would you share your thoughts about this?
In general, I think StatFunctions.scala currently has some issues. All computations for both covariance and correlation are done in one place, which makes the code a little confusing and harder to extend in the future. The collectStatisticalData method is called for both correlation and covariance, even if I call something like this: Here is an example: I think we can actually separate the computations. Is there a reason why these computations are done in one place? @rxin, @mengxr
// fills the covariance matrix by computing column-by-column covariances
for (i <- 0 until fieldNames.length) {
  for (j <- 0 to i) {
    val cov = calculateCov(df, Seq(fieldNames(i), fieldNames(j)))
You can't assume all columns are of numeric type. Should we catch the exception here and use null as the value if an exception happens?
Hi @sun-rui,
what do you think?
Yes, since R throws an error in this case, we can leave the exception unhandled. There is no need to verify all column types. The user will get the exception message at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L81
Yes, there is even a test case that covers this.
Can someone from the Spark SQL committers or experts also look at this?
Test build #53308 has finished for PR 9366 at commit
Test build #56142 has finished for PR 9366 at commit
cc @mengxr
I have been trying to use correlation on a matrix with many columns. @NarineK mentioned R-like correlation. I wish we had something like what pandas offers: it handles missing data automatically. Take a look here. Even the corr() function from MLlib cannot handle missing data. These features are really missing from SparkSQL:
@NarineK Are you still working on this? cc @yanboliang
We are closing this due to inactivity. Please reopen it if you want to push it forward. Thanks!
Hi there,
As we know, R can calculate the correlation and covariance for all columns of a data frame, or between the columns of two data frames.
If we look at the Apache Commons Math package, we can see that it offers this too:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
When the input is a single DataFrame:
For correlation:
cor[i,j] = cor[j,i]
and the main diagonal can be filled with 1s.
For covariance:
cov[i,j] = cov[j,i]
and for the main diagonal we can compute the variance of that specific column.
See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
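To illustrate the symmetry argument above, here is a minimal sketch in plain Scala (no Spark), not the code from this PR. The names SymmetricStats, cov, covMatrix and corrMatrix are hypothetical, the columns are assumed to be in-memory arrays of doubles, and the sample (n - 1) normalization is an assumption. Only the lower triangle is computed and then mirrored.

```scala
// Hypothetical helper, for illustration only; not part of Spark or this PR.
object SymmetricStats {

  // Sample covariance of two equally long columns (assumes n - 1 normalization).
  def cov(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.length > 1,
      "columns must have the same length, greater than 1")
    val meanX = x.sum / x.length
    val meanY = y.sum / y.length
    var acc = 0.0
    for (k <- x.indices) acc += (x(k) - meanX) * (y(k) - meanY)
    acc / (x.length - 1)
  }

  // Covariance matrix: compute the lower triangle only and mirror it,
  // because cov[i,j] = cov[j,i]. For i == j the value is the column variance.
  def covMatrix(cols: Array[Array[Double]]): Array[Array[Double]] = {
    val n = cols.length
    val m = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 to i) {
      val c = cov(cols(i), cols(j))
      m(i)(j) = c
      m(j)(i) = c
    }
    m
  }

  // Correlation matrix: 1s on the main diagonal, and
  // cor[i,j] = cov[i,j] / sqrt(var[i] * var[j]) below it, mirrored above.
  // A constant column has zero variance, so its entries come out as NaN.
  def corrMatrix(cols: Array[Array[Double]]): Array[Array[Double]] = {
    val c = covMatrix(cols)
    val n = cols.length
    val r = Array.ofDim[Double](n, n)
    for (i <- 0 until n) {
      r(i)(i) = 1.0
      for (j <- 0 until i) {
        val v = c(i)(j) / math.sqrt(c(i)(i) * c(j)(j))
        r(i)(j) = v
        r(j)(i) = v
      }
    }
    r
  }
}
```

For n columns this evaluates n(n+1)/2 pairwise covariances instead of n², which is the saving the symmetry gives.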
Thanks,
Narine