[SPARK-19636][ML] Feature parity for correlation statistics in MLlib #17108
Conversation
Test build #73627 has finished for PR 17108 at commit
package org.apache.spark.ml.stat

/**
 *
oops, sorry, removing this file.
 * API for statistical functions in MLlib, compatible with Dataframes and Datasets.
 *
 * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]]
 * to MLlib's Vector types.
minor terminology comment: should this be ML instead of MLLib? I understand this is for the new ML vector types?
I will use `spark.ml`, which is the most correct terminology.
 * Compute the correlation matrix for the input RDD of Vectors using the specified method.
 * Methods currently supported: `pearson` (default), `spearman`.
 *
 * @param dataset a dataset or a dataframe
very minor: "Sentence case" params, as in "A dataset...", "The name..."
it seems there are inconsistencies in a lot of comments. I wish we had something like scalastyle for comments...
oh yes, thank you. I am correcting the other instances of course.
def corr(dataset: Dataset[_], column: String, method: String): DataFrame = {
  val rdd = dataset.select(column).rdd.map {
    case Row(v: Vector) => OldVectors.fromML(v)
    // case r: GenericRowWithSchema => OldVectors.fromML(r.getAs[Vector](0))
remove commented out code (?)
  }
  val oldM = OldStatistics.corr(rdd, method)
  val name = s"$method($column)"
  val schema = StructType(Array(StructField(name, SQLDataTypes.MatrixType, nullable = true)))
minor comment: ideally, shouldn't you check for collisions prior to creating the name? E.g., add a suffix such as "_2" or "_i" if the column name already exists.
ideally this would be an infrastructure-level method that just finds a new column name and would be reusable in other code. I don't believe something like this exists.
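A minimal sketch of the reusable helper described above (hypothetical; no such utility exists in the PR — the name and signature are illustrative only):

```scala
// Hypothetical helper: pick a column name that does not collide with any
// existing column, by appending a numeric suffix ("_2", "_3", ...).
def freshColumnName(base: String, existing: Set[String]): String = {
  if (!existing.contains(base)) base
  else Iterator.from(2).map(i => s"${base}_$i").find(n => !existing.contains(n)).get
}
```

For example, `freshColumnName("pearson(features)", df.columns.toSet)` would yield `pearson(features)_2` if the base name is already taken.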
 * which is fairly costly. Cache the input RDD before calling corr with `method = "spearman"` to
 * avoid recomputing the common lineage.
 */
// TODO: how do we handle missing values?
This is more of a comment on the internal implementation of the Pearson/Spearman calculation; I don't think it should be at this level (maybe it should be moved into the MLlib code?). I think they should just ignore the rows where one of the compared columns has a missing/NaN value, and log a warning (but only once) when they encounter this. If all values are missing, we just assign a 0 score.
Good point. I will remove the comment at this point, since this should be decided in JIRA instead of during the implementation.
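The missing-value policy proposed in the comment above could be sketched as follows (purely illustrative — this is not what the PR implements, and the final behavior was deferred to JIRA):

```scala
// Sketch of the proposed policy: drop rows containing NaNs before computing
// the correlation, and warn only once when such rows are encountered.
private var warnedAboutNaN = false

def filterValidRows(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
  val (valid, invalid) = rows.partition(r => !r.exists(_.isNaN))
  if (invalid.nonEmpty && !warnedAboutNaN) {
    // A real implementation would use the logger rather than println.
    println(s"Ignoring ${invalid.size} rows with missing/NaN values")
    warnedAboutNaN = true
  }
  valid
}
```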
object StatisticsSuite extends Logging {

  def approxEqual(v1: Double, v2: Double, threshold: Double = 1e-6): Boolean = {
these are very nice methods! would it be possible to move them to a place where every test suite could use them? Specifically the matrixApproxEqual.
Moved
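For context, the helpers discussed above amount to something like the following (a hedged sketch; `matrixApproxEqual` here takes row-major arrays, whereas the real utilities operate on ML matrix types):

```scala
// Scalar comparison with an absolute tolerance; NaNs compare equal to NaNs.
def approxEqual(v1: Double, v2: Double, threshold: Double = 1e-6): Boolean = {
  if (v1.isNaN) v2.isNaN else math.abs(v1 - v2) < threshold
}

// Element-wise comparison of two matrices given as row-major value arrays.
def matrixApproxEqual(a: Array[Double], b: Array[Double],
    threshold: Double = 1e-6): Boolean = {
  a.length == b.length && a.zip(b).forall { case (x, y) => approxEqual(x, y, threshold) }
}
```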
test("corr(X) default, pearson") {
  val defaultMat = Statistics.corr(X, "features")
  val pearsonMat = Statistics.corr(X, "features", "pearson")
  // scalastyle:off
what is the error that the scalastyle gives? I wish there was some way to avoid turning it off.
The problem is the alignment of the values, which we achieve by padding with 0's.
The changes look good to me. I just had a few minor comments. I wish we could just natively implement the correlations in Spark to avoid extra copying between the old and new implementations, but this seems like a move in the right direction.
 * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]]
 * to MLlib's Vector types.
 */
@Since("2.2.0")
shouldn't this have @experimental tag at the top? similar to:
https://github.com/apache/spark/pull/17110/files
Good point, thanks
Given further thought, I'd prefer we stick to the API specified in the design doc, with a Correlations object instead of a generic Statistics object. In the future, we may want optional Params such as weightCol, in which case we may switch to a builder pattern for Correlations and ChiSquare and move away from a shared Statistics object. I'm going to proceed with #17110 using a separate ChiSquare object.
I moved the code
Test build #74626 has finished for PR 17108 at commit
Test build #74627 has finished for PR 17108 at commit
Taking a look now
 */
@Since("2.2.0")
def corr(dataset: Dataset[_], column: String, method: String): DataFrame = {
  val rdd = dataset.select(column).rdd.map {
Not related to the code, but does this generate a new RDD or just reference the data in the input dataset? Also, in performance testing I noticed that a lot of operations on RDDs are more expensive than on DataFrames and Datasets (probably because the Catalyst optimizations are not used), so it seems we should try to avoid using RDDs when doing computations. Is this true?
@Since("2.2.0")
def corr(dataset: Dataset[_], column: String, method: String): DataFrame = {
  val rdd = dataset.select(column).rdd.map {
    case Row(v: Vector) => OldVectors.fromML(v)
If this is not a Row containing a Vector, should we throw a nice error message? Otherwise the map will fail.
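One way to address this comment (a hypothetical sketch; the PR's final code may differ) is to add a fallback case that fails fast with a descriptive message instead of an opaque `MatchError` inside the map:

```scala
// Convert the selected column to old-style Vectors, failing with a clear
// message when a row does not contain a Vector.
val rdd = dataset.select(column).rdd.map {
  case Row(v: Vector) => OldVectors.fromML(v)
  case row => throw new IllegalArgumentException(
    s"Column $column must contain Vector values, but found row: $row")
}
```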
import org.apache.spark.sql.{DataFrame, Row}

class CorrelationsSuite extends SparkFunSuite with MLlibTestSparkContext with Logging {
maybe a negative test case where we pass a single column instead of a vector in a column?
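The suggested negative test might look like this (hypothetical sketch; it assumes the suite's `testImplicits` are imported for `toDF`, and only checks that some exception is raised):

```scala
// Negative test: a plain Double column instead of a Vector column should fail.
test("corr on a non-Vector column fails") {
  val df = Seq(1.0, 2.0, 3.0).toDF("value")
  val e = intercept[Exception] {
    Correlations.corr(df, "value", "pearson").collect()
  }
  assert(e != null)
}
```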
}

/**
 * Compute the correlation matrix for the input Dataset of Vectors.
should this specify "pearson" correlation in the documentation to be precise?
/**
 * Utility test methods for linear algebra.
 */
object LinalgUtils extends Logging {
this is nice, thank you for refactoring the test code here!
The code looks good to me. I added some minor comments, thank you!
Done with review; just a few comments. Thanks!
import org.apache.spark.sql.types.{StructField, StructType}

/**
 * API for statistical functions in MLlib, compatible with Dataframes and Datasets.
This should be limited to correlations
done
import org.apache.spark.sql.types.{StructField, StructType}

/**
 * API for statistical functions in MLlib, compatible with Dataframes and Datasets.
Add :: Experimental ::
done
 */
@Since("2.2.0")
@Experimental
object Correlations {
How about calling it "Correlation" (singular)? Especially if we add a builder pattern, then I feel like `new Correlation().set...` seems more natural.
sure, I do not know if there is a convention for that.
Not really, but let's make one?
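The builder pattern floated in this thread might look like the following (purely illustrative — none of this API exists in the PR; the setter names and `weightCol` Param are assumptions borrowed from the earlier discussion):

```scala
// Hypothetical builder-style API motivating the singular name "Correlation".
class Correlation {
  private var method: String = "pearson"
  private var weightCol: Option[String] = None

  def setMethod(value: String): this.type = { method = value; this }
  def setWeightCol(value: String): this.type = { weightCol = Some(value); this }
}

// Usage sketch: new Correlation().setMethod("spearman").setWeightCol("w")
```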
  }
  val oldM = OldStatistics.corr(rdd, method)
  val name = s"$method($column)"
  val schema = StructType(Array(StructField(name, SQLDataTypes.MatrixType, nullable = true)))
nullable = false?
Good point. It seems that Spark can be quite liberal with the nullability.
}

/**
 * Compute the correlation matrix for the input Dataset of Vectors.
Just say that this is a version of corr which defaults to "pearson" for the method. Don't document params or return value.
/**
 * Utility test methods for linear algebra.
 */
object LinalgUtils extends Logging {
Can't you use org.apache.spark.ml.util.TestingUtils from mllib-local?
You are right, I had missed that file.
@@ -32,6 +32,10 @@ object TestingUtils {
 * the relative tolerance is meaningless, so the exception will be raised to warn users.
 */
private def RelativeErrorComparison(x: Double, y: Double, eps: Double): Boolean = {
  // Special case for NaNs
@jkbradley I do not think this change is going to be controversial, but I want to point out that from now on, matrix/vector checks will not always throw errors when comparing NaN: the previous code would throw whenever a NaN was found.
I agree with you that the update has the right semantics. SGTM
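The semantics agreed on above can be sketched as follows (a simplified illustration; the real `TestingUtils` comparator also raises an exception when relative tolerance is meaningless for tiny values):

```scala
// NaN special case: two NaNs compare as equal for test purposes, instead of
// the previous behavior of always failing whenever a NaN was found.
def relativeErrorComparison(x: Double, y: Double, eps: Double): Boolean = {
  if (x.isNaN && y.isNaN) {
    true
  } else if (x == y) {
    true
  } else {
    val diff = math.abs(x - y)
    diff / math.max(math.abs(x), math.abs(y)) < eps
  }
}
```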
Test build #75060 has finished for PR 17108 at commit
}

/**
 * Compute the correlation matrix for the input Dataset of Vectors.
Say "pearson" here explicitly.
LGTM except for the one doc nit.
LGTM, will merge after tests.
Test build #75118 has finished for PR 17108 at commit
Merging with master
What changes were proposed in this pull request?
This patch adds the DataFrames-based support for the correlation statistics found in `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket. The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
How was this patch tested?