Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC #21119

Closed
wants to merge 7 commits into from

Conversation

huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

add spark.ml Python API for PIC

How was this patch tested?

add doctest

@SparkQA
Copy link

SparkQA commented Apr 21, 2018

Test build #89672 has finished for PR 21119 at commit 53d7763.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
  • class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,

@SparkQA
Copy link

SparkQA commented Apr 23, 2018

Test build #89735 has finished for PR 21119 at commit 2d0e394.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 23, 2018

Test build #89737 has finished for PR 21119 at commit 387d6ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

@jkbradley Could you please review when you have time? Thank you very much in advance!

Copy link
Contributor

@WeichenXu123 WeichenXu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I made a pass.

@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
return self.getOrDefault(self.keepLastCheckpoint)


class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly add params into class PowerIterationClustering?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WeichenXu123 Thanks for your review. The params can be either inside class PowerIterationClustering or separate. I will move them back inside class PowerIterationClustering, to be consistent with the params in the other classes in clustering.

class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,
JavaMLWritable):
"""
Model produced by [[PowerIterationClustering]].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc is wrong. Copy doc from scala side.

idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"):
"""
setParams(self, predictionCol="prediction", k=2, maxIter=20, initMode="random",\
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove : at the end.

idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"):
"""
__init__(self, predictionCol="prediction", k=2, maxIter=20, initMode="random",\
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove : at the end.

@since("2.4.0")
def getK(self):
"""
Gets the value of `k`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use:
:py:attr:k
and update everywhere else.

... for j in range (i):
... neighbor.append((long)(j))
... weight.append(sim(points[i], points[j]))
... similarities.append([(long)(i), neighbor, weight])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doctest code looks like too long, maybe more proper to put it in examples.
Could you replace the data generation code here by using a simple hardcoded dataset ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WeichenXu123 I will move this to tests, and add a simple example in the doctest.

@SparkQA
Copy link

SparkQA commented Apr 27, 2018

Test build #89943 has finished for PR 21119 at commit 6d00f34.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,

@SparkQA
Copy link

SparkQA commented Apr 28, 2018

Test build #89946 has finished for PR 21119 at commit a6b1822.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -97,13 +97,15 @@ private[clustering] trait PowerIterationClusteringParams extends Params with Has
def getNeighborsCol: String = $(neighborsCol)

/**
* Param for the name of the input column for neighbors in the adjacency list representation.
* Param for the name of the input column for non-negative weights (similarities) of edges
* between the vertex in `idCol` and each neighbor in `neighborsCol`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@since("2.4.0")
def setNeighborsCol(self, value):
"""
Sets the value of :py:attr:`neighborsCol.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the left back quote.

"""
Gets the value of :py:attr:`similaritiesCol`.
"""
return self.getOrDefault(self.binary)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.binary -> self.similaritiesCol?

PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each input row
includes:

- :py:class:`idCol`: vertex ID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:py:attr:`idCol` ? And also the below :py:class:`neighborsCol` , etc...

- Input validation: This validates that similarities are non-negative but does NOT validate
that the input matrix is symmetric.

@see <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use .. seealso::?

:py:class:`predictionCol` containing the cluster assignment in :py:class:`[0,k)` for
each row (vertex).

Notes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use .. note::?

@SparkQA
Copy link

SparkQA commented Apr 29, 2018

Test build #89965 has finished for PR 21119 at commit c25d3dc.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

that the input matrix is symmetric.

.. seealso:: <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
Spectral clustering (Wikipedia)</a>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check other places using seealso:

.. seealso:: `Spectral clustering \
<http://en.wikipedia.org/wiki/Spectral_clustering>`_

@SparkQA
Copy link

SparkQA commented Apr 29, 2018

Test build #89970 has finished for PR 21119 at commit ae9f953.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

I think we messed up with the original PIC API. Could you please check out my comment here https://issues.apache.org/jira/browse/SPARK-15784 ? If others agree, I'll revert the Scala API and we can work on adding a modified version.

@huaxingao
Copy link
Contributor Author

@jkbradley
Thanks for letting me know. I will change the python API accordingly after the new scala version is in.

@mengxr
Copy link
Contributor

mengxr commented Jun 5, 2018

@huaxingao We updated the Scala/Java API in #21493. Could you update this PR for the Python API? It should be similar to the PrefixSpan Python API (90ae98d), which is neither a transformer nor an estimator. Let me know if you don't have time. @WeichenXu123 could update the Python API as well.

@huaxingao
Copy link
Contributor Author

@mengxr @WeichenXu123 I will update this. Thanks.

@mengxr
Copy link
Contributor

mengxr commented Jun 8, 2018

@huaxingao Any updates?

@huaxingao
Copy link
Contributor Author

@mengxr Sorry for the delay. I will submit an update later today. Do you want me to close this PR and do a new one? or just update this PR?

@WeichenXu123
Copy link
Contributor

@huaxingao Create a new PR is better I think.

@huaxingao
Copy link
Contributor Author

@mengxr @WeichenXu123 I will close this one and submit a new PR soon. Thanks!

@huaxingao huaxingao closed this Jun 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants