[SPARK-24217][ML]Power Iteration Clustering is not displaying cluster indices corresponding to some vertices #21277

Closed
wants to merge 2 commits into from


@shahidki31 shahidki31 commented May 9, 2018

What changes were proposed in this pull request?

  1. Currently, PIC in spark.ml displays cluster indices only for the nodes listed in the ID column. If a node appears in the neighbors column but not in the ID column, its cluster index is not displayed.

As per the definition of PIC clustering, given in the code,

PIC takes an affinity matrix between items (or vertices) as input. An affinity matrix
is a symmetric matrix whose entries are non-negative similarities between items.
PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each input row includes:

  • idCol: vertex ID
  • neighborsCol: neighbors of vertex in idCol
  • similaritiesCol: non-negative weights (similarities) of edges between the vertex
    in idCol and each neighbor in neighborsCol

"PIC returns a cluster assignment for each input vertex." It appends a new column predictionCol
containing the cluster assignment in [0,k) for each row (vertex).

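
To make the input format above concrete, here is a minimal plain-Python sketch (not Spark code) that expands rows of (id, neighbors, similarities) into weighted edges and collects every vertex the graph mentions. The helper names `rows_to_edges` and `all_vertices` and the sample rows are hypothetical, chosen only for illustration.

```python
from typing import List, Set, Tuple

# One input row: (vertex id, neighbor ids, similarity to each neighbor).
Row = Tuple[int, List[int], List[float]]
Edge = Tuple[int, int, float]


def rows_to_edges(rows: List[Row]) -> List[Edge]:
    """Expand each row into (src, dst, weight) edges, validating the input."""
    edges: List[Edge] = []
    for vertex_id, neighbors, similarities in rows:
        if len(neighbors) != len(similarities):
            raise ValueError("neighbors and similarities must have equal length")
        for neighbor, weight in zip(neighbors, similarities):
            if weight < 0:
                raise ValueError("similarities must be non-negative")
            edges.append((vertex_id, neighbor, weight))
    return edges


def all_vertices(edges: List[Edge]) -> Set[int]:
    """Every vertex that appears as either endpoint of an edge."""
    return {v for src, dst, _ in edges for v in (src, dst)}


# Hypothetical sample: vertices 2 and 4 appear only as neighbors.
sample_rows: List[Row] = [
    (0, [1, 2], [0.9, 0.8]),
    (1, [2], [0.7]),
    (3, [4], [0.9]),
]
sample_edges = rows_to_edges(sample_rows)
ids_in_id_column = {vertex_id for vertex_id, _, _ in sample_rows}
only_in_neighbors = all_vertices(sample_edges) - ids_in_id_column
```

In this sample, `only_in_neighbors` holds the vertices that are part of the graph but have no row of their own, which is the case this pull request is concerned with.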
  2. We should display the prediction and ID for all nodes. So, instead of joining the assignments to the existing DataFrame, we can directly return the prediction-ID DataFrame covering all nodes.
    For the same input, spark.mllib already returns a cluster ID for every vertex, whereas spark.ml does not.
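
The difference between the two output strategies can be sketched in plain Python (again, not Spark code). The cluster labels in `assignments` are made up for illustration rather than computed by PIC, and `join_with_input` and `all_assignments` are hypothetical helper names, not part of any Spark API.

```python
from typing import Dict, List, Tuple

# One input row: (vertex id, neighbor ids, similarity to each neighbor).
Row = Tuple[int, List[int], List[float]]


def join_with_input(rows: List[Row], assignments: Dict[int, int]) -> List[Tuple[int, int]]:
    """Mimic joining assignments onto the input on the id column:
    only vertices that have their own input row get a prediction."""
    return [(vid, assignments[vid]) for vid, _, _ in rows if vid in assignments]


def all_assignments(assignments: Dict[int, int]) -> List[Tuple[int, int]]:
    """Return (id, prediction) for every vertex in the graph, sorted by id."""
    return sorted(assignments.items())


# Hypothetical input: vertices 2 and 4 appear only as neighbors.
rows: List[Row] = [
    (0, [1, 2], [0.9, 0.8]),
    (1, [2], [0.7]),
    (3, [4], [0.9]),
]

# Made-up cluster labels for all five vertices (assumption, not PIC output).
assignments: Dict[int, int] = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}

joined = join_with_input(rows, assignments)   # drops vertices 2 and 4
complete = all_assignments(assignments)       # keeps all five vertices
```

The join-based result has one entry per input row, so vertices 2 and 4 are lost; returning the assignments directly keeps them.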

How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@shahidki31 shahidki31 changed the title [ML]Power Iteration Clustering is not displaying cluster indices corresponding to some nodes. [SPARK-24217][ML]Power Iteration Clustering is not displaying cluster indices corresponding to some nodes. May 9, 2018
@shahidki31 shahidki31 changed the title [SPARK-24217][ML]Power Iteration Clustering is not displaying cluster indices corresponding to some nodes. [SPARK-24217][ML]Power Iteration Clustering is not displaying cluster indices corresponding to some vertices May 9, 2018

shahidki31 commented May 10, 2018

Based on the comments in the JIRA (https://issues.apache.org/jira/browse/SPARK-24217), I am closing the PR.

@shahidki31 shahidki31 closed this May 10, 2018
@shahidki31 shahidki31 reopened this May 10, 2018
@AmplabJenkins

Can one of the admins verify this patch?

@shahidki31

Closing the PR based on the discussions in the JIRA (https://issues.apache.org/jira/browse/SPARK-15784) and in PR #21493.

@shahidki31 shahidki31 closed this Jun 8, 2018
@shahidki31 shahidki31 deleted the missingNode branch June 8, 2018 20:21