[SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC #21119
Conversation
Test build #89672 has finished for PR 21119 at commit
Test build #89735 has finished for PR 21119 at commit
Test build #89737 has finished for PR 21119 at commit
@jkbradley Could you please review when you have time? Thank you very much in advance!
Thanks! I made a pass.
python/pyspark/ml/clustering.py
Outdated
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
        return self.getOrDefault(self.keepLastCheckpoint)


class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
Why not directly add params into class PowerIterationClustering?
@WeichenXu123 Thanks for your review. The params can live either inside class PowerIterationClustering or in a separate mixin. I will move them back inside class PowerIterationClustering, to be consistent with the params in the other classes in clustering.py.
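For illustration, a minimal sketch of declaring the params directly on the class rather than in a separate mixin; the param names and docs follow this PR, but the sketch is not the merged code:

from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasMaxIter, HasPredictionCol
from pyspark.ml.util import JavaMLReadable, JavaMLWritable
from pyspark.ml.wrapper import JavaTransformer


class PowerIterationClustering(JavaTransformer, HasMaxIter, HasPredictionCol,
                               JavaMLReadable, JavaMLWritable):
    # Params declared directly on the class instead of a _PowerIterationClusteringParams mixin.
    k = Param(Params._dummy(), "k",
              "The number of clusters to create. Must be > 1.",
              typeConverter=TypeConverters.toInt)
    initMode = Param(Params._dummy(), "initMode",
                     "The initialization algorithm: 'random' or 'degree'.",
                     typeConverter=TypeConverters.toString)
    idCol = Param(Params._dummy(), "idCol",
                  "Name of the input column for vertex IDs.",
                  typeConverter=TypeConverters.toString)
    # neighborsCol, similaritiesCol, __init__, setParams, and getters/setters omitted here.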
python/pyspark/ml/clustering.py
Outdated
class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,
                               JavaMLWritable):
    """
    Model produced by [[PowerIterationClustering]].
The doc is wrong. Copy the doc from the Scala side.
python/pyspark/ml/clustering.py
Outdated
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"): | ||
""" | ||
setParams(self, predictionCol="prediction", k=2, maxIter=20, initMode="random",\ | ||
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"): |
Remove the : at the end.
python/pyspark/ml/clustering.py
Outdated
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"): | ||
""" | ||
__init__(self, predictionCol="prediction", k=2, maxIter=20, initMode="random",\ | ||
idCol="id", neighborsCol="neighbors", similaritiesCol="similarities"): |
Remove the : at the end.
python/pyspark/ml/clustering.py
Outdated
@since("2.4.0") | ||
def getK(self): | ||
""" | ||
Gets the value of `k` |
Should use :py:attr:`k` and update everywhere else.
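For reference, a sketch of the getter style being asked for, with the :py:attr: cross-reference role in the docstring; it is illustrative and would sit inside the PowerIterationClustering class:

from pyspark import since

# Inside the PowerIterationClustering class:
@since("2.4.0")
def getK(self):
    """
    Gets the value of :py:attr:`k` or its default value.
    """
    return self.getOrDefault(self.k)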
python/pyspark/ml/clustering.py
Outdated
... for j in range (i):
...     neighbor.append((long)(j))
...     weight.append(sim(points[i], points[j]))
... similarities.append([(long)(i), neighbor, weight])
The doctest code looks too long; it may be more appropriate to put it in the examples. Could you replace the data generation code here with a simple hardcoded dataset?
@WeichenXu123 I will move this to tests, and add a simple example in the doctest.
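For example, a hardcoded dataset along these lines would keep the doctest short; the column names follow the API as it stands in this PR, and the exact values are illustrative only:

>>> data = [(0, [1], [0.9]),
...         (1, [0, 2], [0.9, 0.9]),
...         (2, [1], [0.9]),
...         (3, [4], [0.1]),
...         (4, [3], [0.1])]
>>> df = spark.createDataFrame(data, ["id", "neighbors", "similarities"])
>>> pic = PowerIterationClustering(k=2, maxIter=10)
>>> assignments = pic.transform(df)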
Test build #89943 has finished for PR 21119 at commit
Test build #89946 has finished for PR 21119 at commit
@@ -97,13 +97,15 @@ private[clustering] trait PowerIterationClusteringParams extends Params with Has
  def getNeighborsCol: String = $(neighborsCol)

  /**
   * Param for the name of the input column for neighbors in the adjacency list representation.
   * Param for the name of the input column for non-negative weights (similarities) of edges
   * between the vertex in `idCol` and each neighbor in `neighborsCol`.
Good catch!
python/pyspark/ml/clustering.py
Outdated
@since("2.4.0") | ||
def setNeighborsCol(self, value): | ||
""" | ||
Sets the value of :py:attr:`neighborsCol. |
The closing backquote is missing.
python/pyspark/ml/clustering.py
Outdated
""" | ||
Gets the value of :py:attr:`similaritiesCol`. | ||
""" | ||
return self.getOrDefault(self.binary) |
self.binary -> self.similaritiesCol?
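In other words, the getter should return the param it documents; a corrected sketch, assuming the similaritiesCol Param is defined on the class as discussed above:

# Inside the PowerIterationClustering class:
@since("2.4.0")
def getSimilaritiesCol(self):
    """
    Gets the value of :py:attr:`similaritiesCol` or its default value.
    """
    return self.getOrDefault(self.similaritiesCol)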
python/pyspark/ml/clustering.py
Outdated
PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each input row
includes:

- :py:class:`idCol`: vertex ID
:py:attr:`idCol`? And also the :py:class:`neighborsCol` below, etc.
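For illustration, the class docstring bullets could use the attribute role throughout; the wording below is adapted from the Scala docs quoted earlier in this conversation:

"""
PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each input row
includes:

- :py:attr:`idCol`: vertex ID
- :py:attr:`neighborsCol`: neighbors of the vertex in :py:attr:`idCol`
- :py:attr:`similaritiesCol`: non-negative weights (similarities) of edges between the
  vertex in :py:attr:`idCol` and each neighbor in :py:attr:`neighborsCol`
"""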
python/pyspark/ml/clustering.py
Outdated
- Input validation: This validates that similarities are non-negative but does NOT validate
  that the input matrix is symmetric.

@see <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
Use .. seealso::?
python/pyspark/ml/clustering.py
Outdated
:py:class:`predictionCol` containing the cluster assignment in :py:class:`[0,k)` for
each row (vertex).

Notes:
Use .. note::?
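A sketch of how the notes could be written with the reST note directive in the docstring; the text is taken from the lines quoted above:

"""
.. note:: Input validation: this validates that similarities are non-negative but does
    NOT validate that the input matrix is symmetric.
"""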
Test build #89965 has finished for PR 21119 at commit
python/pyspark/ml/clustering.py
Outdated
that the input matrix is symmetric.

.. seealso:: <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
    Spectral clustering (Wikipedia)</a>
You can check other places that use seealso:
.. seealso:: `Spectral clustering \
    <http://en.wikipedia.org/wiki/Spectral_clustering>`_
Test build #89970 has finished for PR 21119 at commit
I think we messed up the original PIC API. Could you please check out my comment here: https://issues.apache.org/jira/browse/SPARK-15784 ? If others agree, I'll revert the Scala API and we can work on adding a modified version.
@jkbradley
@huaxingao We updated the Scala/Java API in #21493. Could you update this PR for the Python API? It should be similar to the PrefixSpan Python API (90ae98d), which is neither a transformer nor an estimator. Let me know if you don't have time. @WeichenXu123 could update the Python API as well.
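A rough sketch of what a PrefixSpan-style wrapper (neither a Transformer nor an Estimator) could look like, assuming the redesigned Scala API exposes an assignClusters method over a DataFrame of edges; the method and column names here are illustrative, not the merged code:

from pyspark.ml.param.shared import HasMaxIter, HasWeightCol
from pyspark.ml.wrapper import JavaParams
from pyspark.sql import DataFrame


class PowerIterationClustering(JavaParams, HasMaxIter, HasWeightCol):
    # Not a Transformer or Estimator: a single method runs the algorithm on a
    # DataFrame of (src, dst, weight) edge rows and returns the assignments.
    def assignClusters(self, dataset):
        """
        Run the PIC algorithm and return a DataFrame of cluster assignments.
        """
        self._transfer_params_to_java()
        jdf = self._java_obj.assignClusters(dataset._jdf)
        return DataFrame(jdf, dataset.sql_ctx)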
@mengxr @WeichenXu123 I will update this. Thanks.
@huaxingao Any updates?
@mengxr Sorry for the delay. I will submit an update later today. Do you want me to close this PR and do a new one, or just update this PR?
@huaxingao Creating a new PR is better, I think.
@mengxr @WeichenXu123 I will close this one and submit a new PR soon. Thanks!
What changes were proposed in this pull request?
Add spark.ml Python API for PIC.
How was this patch tested?
Add a doctest.