[SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC #21119
Conversation
Test build #89672 has finished for PR 21119 at commit
Test build #89735 has finished for PR 21119 at commit
Test build #89737 has finished for PR 21119 at commit
@jkbradley Could you please review when you have time? Thank you very much in advance!
Thanks! I made a pass.
python/pyspark/ml/clustering.py
Outdated
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
        return self.getOrDefault(self.keepLastCheckpoint)


class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
Why not directly add params into class PowerIterationClustering?
@WeichenXu123 Thanks for your review. The params can live either inside class PowerIterationClustering or in a separate mixin. I will move them back inside class PowerIterationClustering, to be consistent with the params in the other classes in clustering.py.
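For illustration, a minimal sketch of declaring the params directly on the class rather than in a separate mixin; the param names and docs follow this PR, but the sketch is not the merged code:

from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasMaxIter, HasPredictionCol
from pyspark.ml.util import JavaMLReadable, JavaMLWritable
from pyspark.ml.wrapper import JavaTransformer


class PowerIterationClustering(JavaTransformer, HasMaxIter, HasPredictionCol,
                               JavaMLReadable, JavaMLWritable):
    # Params declared directly on the class instead of a _PowerIterationClusteringParams mixin.
    k = Param(Params._dummy(), "k",
              "The number of clusters to create. Must be > 1.",
              typeConverter=TypeConverters.toInt)
    initMode = Param(Params._dummy(), "initMode",
                     "The initialization algorithm: 'random' or 'degree'.",
                     typeConverter=TypeConverters.toString)
    idCol = Param(Params._dummy(), "idCol",
                  "Name of the input column for vertex IDs.",
                  typeConverter=TypeConverters.toString)
    # neighborsCol, similaritiesCol, __init__, setParams, and getters/setters omitted here.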
python/pyspark/ml/clustering.py
Outdated
class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,
                               JavaMLWritable):
    """
    Model produced by [[PowerIterationClustering]].
The doc is wrong. Copy the doc from the Scala side.
python/pyspark/ml/clustering.py
Outdated
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"): | ||
""" | ||
setParams(self, predictionCol="prediction", k=2, maxIter=20, initMode="random",\ | ||
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"): |
Remove the : at the end.
python/pyspark/ml/clustering.py
Outdated
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"): | ||
""" | ||
__init__(self, predictionCol="prediction", k=2, maxIter=20, initMode="random",\ | ||
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"): |
Remove the : at the end.
python/pyspark/ml/clustering.py
Outdated
@since("2.4.0") | ||
def getK(self): | ||
""" | ||
Gets the value of `k` |
Should use :py:attr:`k` and update everywhere else.
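For reference, a sketch of the getter style being asked for, with the :py:attr: cross-reference role in the docstring; it is illustrative and would sit inside the PowerIterationClustering class:

from pyspark import since

# Inside the PowerIterationClustering class:
@since("2.4.0")
def getK(self):
    """
    Gets the value of :py:attr:`k` or its default value.
    """
    return self.getOrDefault(self.k)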
python/pyspark/ml/clustering.py
Outdated
... for j in range (i):
...     neighbor.append((long)(j))
...     weight.append(sim(points[i], points[j]))
... similarities.append([(long)(i), neighbor, weight])
The doctest code looks too long; it may be more appropriate to put it in the examples. Could you replace the data generation code here with a simple hardcoded dataset?
@WeichenXu123 I will move this to tests, and add a simple example in the doctest.
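For example, a hardcoded dataset along these lines would keep the doctest short; the column names follow the API as it stands in this PR, and the exact values are illustrative only:

>>> data = [(0, [1], [0.9]),
...         (1, [0, 2], [0.9, 0.9]),
...         (2, [1], [0.9]),
...         (3, [4], [0.1]),
...         (4, [3], [0.1])]
>>> df = spark.createDataFrame(data, ["id", "neighbors", "similarities"])
>>> pic = PowerIterationClustering(k=2, maxIter=10)
>>> assignments = pic.transform(df)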
Test build #89943 has finished for PR 21119 at commit
Test build #89946 has finished for PR 21119 at commit
@@ -97,13 +97,15 @@ private[clustering] trait PowerIterationClusteringParams extends Params with Has
  def getNeighborsCol: String = $(neighborsCol)

  /**
   * Param for the name of the input column for neighbors in the adjacency list representation.
   * Param for the name of the input column for non-negative weights (similarities) of edges
   * between the vertex in `idCol` and each neighbor in `neighborsCol`.
Good catch!
python/pyspark/ml/clustering.py
Outdated
@since("2.4.0") | ||
def setNeighborsCol(self, value): | ||
""" | ||
Sets the value of :py:attr:`neighborsCol. |
The closing backquote is missing.
python/pyspark/ml/clustering.py
Outdated
""" | ||
Gets the value of :py:attr:`similaritiesCol`. | ||
""" | ||
return self.getOrDefault(self.binary) |
self.binary -> self.similaritiesCol?
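In other words, the getter should return the param it documents; a corrected sketch, assuming the similaritiesCol Param is defined on the class as discussed above:

# Inside the PowerIterationClustering class:
@since("2.4.0")
def getSimilaritiesCol(self):
    """
    Gets the value of :py:attr:`similaritiesCol` or its default value.
    """
    return self.getOrDefault(self.similaritiesCol)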
python/pyspark/ml/clustering.py
Outdated
PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each input row
includes:

- :py:class:`idCol`: vertex ID
:py:attr:`idCol`? And also the :py:class:`neighborsCol` below, etc.
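For illustration, the class docstring bullets could use the attribute role throughout; the wording below is adapted from the Scala docs quoted earlier in this conversation:

"""
PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each input row
includes:

- :py:attr:`idCol`: vertex ID
- :py:attr:`neighborsCol`: neighbors of the vertex in :py:attr:`idCol`
- :py:attr:`similaritiesCol`: non-negative weights (similarities) of edges between the
  vertex in :py:attr:`idCol` and each neighbor in :py:attr:`neighborsCol`
"""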
python/pyspark/ml/clustering.py
Outdated
- Input validation: This validates that similarities are non-negative but does NOT validate
  that the input matrix is symmetric.

@see <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
Use .. seealso::?
python/pyspark/ml/clustering.py
Outdated
:py:class:`predictionCol` containing the cluster assignment in :py:class:`[0,k)` for
each row (vertex).

Notes:
Use .. note::?
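A sketch of how the notes could be written with the reST note directive in the docstring; the text is taken from the lines quoted above:

"""
.. note:: Input validation: this validates that similarities are non-negative but does
    NOT validate that the input matrix is symmetric.
"""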
Test build #89965 has finished for PR 21119 at commit
python/pyspark/ml/clustering.py
Outdated
that the input matrix is symmetric.

.. seealso:: <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
    Spectral clustering (Wikipedia)</a>
You can check other places that use seealso:
.. seealso:: `Spectral clustering \
    <http://en.wikipedia.org/wiki/Spectral_clustering>`_
Test build #89970 has finished for PR 21119 at commit
I think we messed up the original PIC API. Could you please check out my comment here: https://issues.apache.org/jira/browse/SPARK-15784 ? If others agree, I'll revert the Scala API and we can work on adding a modified version.
@jkbradley
@huaxingao We updated the Scala/Java API in #21493. Could you update this PR for the Python API? It should be similar to the PrefixSpan Python API (90ae98d), which is neither a transformer nor an estimator. Let me know if you don't have time. @WeichenXu123 could update the Python API as well.
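A rough sketch of what a PrefixSpan-style wrapper (neither a Transformer nor an Estimator) could look like, assuming the redesigned Scala API exposes an assignClusters method over a DataFrame of edges; the method and column names here are illustrative, not the merged code:

from pyspark.ml.param.shared import HasMaxIter, HasWeightCol
from pyspark.ml.wrapper import JavaParams
from pyspark.sql import DataFrame


class PowerIterationClustering(JavaParams, HasMaxIter, HasWeightCol):
    # Not a Transformer or Estimator: a single method runs the algorithm on a
    # DataFrame of (src, dst, weight) edge rows and returns the assignments.
    def assignClusters(self, dataset):
        """
        Run the PIC algorithm and return a DataFrame of cluster assignments.
        """
        self._transfer_params_to_java()
        jdf = self._java_obj.assignClusters(dataset._jdf)
        return DataFrame(jdf, dataset.sql_ctx)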
@mengxr @WeichenXu123 I will update this. Thanks.
@huaxingao Any updates?
@mengxr Sorry for the delay. I will submit an update later today. Do you want me to close this PR and do a new one, or just update this PR?
@huaxingao Creating a new PR is better, I think.
@mengxr @WeichenXu123 I will close this one and submit a new PR soon. Thanks!
What changes were proposed in this pull request?
Add spark.ml Python API for PIC.
How was this patch tested?
Add a doctest.