[SPARK-24217][ML]Power Iteration Clustering is not displaying cluster indices corresponding to some vertices #21277

shahidki31 · 2018-05-09T03:07:17Z

What changes were proposed in this pull request?

Currently, PIC in spark.ml displays cluster indices only for the nodes in the ID column. If a node appears in the neighbors column but not in the ID column, no cluster index is displayed for that node.

As per the definition of PIC clustering, given in the code,

PIC takes an affinity matrix between items (or vertices) as input. An affinity matrix
is a symmetric matrix whose entries are non-negative similarities between items.
PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each input row includes:

idCol: vertex ID
neighborsCol: neighbors of vertex in idCol
similaritiesCol: non-negative weights (similarities) of edges between the vertex
in idCol and each neighbor in neighborsCol

"PIC returns a cluster assignment for each input vertex." It appends a new column predictionCol
containing the cluster assignment in [0,k) for each row (vertex).

We should display the prediction and id for all the nodes. So, instead of joining the result onto the existing DataFrame, we can directly return the prediction-id DataFrame, which covers all the nodes.
For the same input, spark.mllib gives a cluster id for every vertex, whereas spark.ml does not.
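
To illustrate the gap described above, here is a minimal sketch in plain Python (not Spark code, and not the patch itself). The input rows, the `assignments` mapping, and the helper names are all hypothetical; the sketch only shows how vertices that occur solely in the neighbors column drop out when results are joined back onto the ID column.

```python
# Hypothetical input rows in the (id, neighbors, similarities) layout.
# Vertices 2 and 4 appear only as neighbors, never in the id column.
rows = [
    (0, [1, 2], [0.9, 0.8]),
    (1, [2], [0.7]),
    (3, [4], [0.6]),
]

def ids_in_id_column(rows):
    """Vertices that have their own input row."""
    return {vid for vid, _, _ in rows}

def all_vertices(rows):
    """Every vertex of the graph: id column plus neighbors column."""
    vertices = set()
    for vid, neighbors, _ in rows:
        vertices.add(vid)
        vertices.update(neighbors)
    return vertices

# Hypothetical cluster assignments covering every vertex in the graph.
assignments = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}

# Joining onto the input rows keeps only vertices from the id column.
joined = {vid: assignments[vid] for vid, _, _ in rows}

# Vertices that have an assignment but no row in the joined output.
missing = sorted(all_vertices(rows) - ids_in_id_column(rows))
print(missing)
```

In this sketch, returning `assignments` directly would cover all five vertices, while `joined` covers only the three that have their own input row.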

How was this patch tested?

Unit tests.


shahidki31 · 2018-05-10T20:12:47Z

Based on the comments in the JIRA (https://issues.apache.org/jira/browse/SPARK-24217), I am closing the PR.

AmplabJenkins · 2018-05-10T21:32:46Z

Can one of the admins verify this patch?

shahidki31 · 2018-06-08T18:56:39Z

Closing the PR due to the discussions in the JIRA (https://issues.apache.org/jira/browse/SPARK-15784) and in PR #21493.

shahidki31 added 2 commits May 9, 2018 08:18

Example code for Power Iteration Clustering

c5eadb6

Example code for Power Iteration Clustering

213ca9f

shahidki31 changed the title ~~[ML]Power Iteration Clustering is not displaying cluster indices corresponding to some nodes.~~ [SPARK-24217][ML]Power Iteration Clustering is not displaying cluster indices corresponding to some nodes. May 9, 2018

shahidki31 mentioned this pull request May 9, 2018

[SPARK-24213][ML]Power Iteration Clustering in SparkML throws exception, when the ID in IntType #21270

Closed

shahidki31 changed the title ~~[SPARK-24217][ML]Power Iteration Clustering is not displaying cluster indices corresponding to some nodes.~~ [SPARK-24217][ML]Power Iteration Clustering is not displaying cluster indices corresponding to some vertices May 9, 2018

shahidki31 closed this May 10, 2018

shahidki31 reopened this May 10, 2018

shahidki31 closed this Jun 8, 2018

shahidki31 deleted the missingNode branch June 8, 2018 20:21
