[SPARK-24213][ML]Power Iteration Clustering in SparkML throws exception, when the ID in IntType #21270

shahidki31 · 2018-05-08T20:34:32Z

Running the following code makes PIC throw an exception.

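import org.apache.spark.ml.clustering.PowerIterationClustering

// `spark` is the active SparkSession (e.g. in spark-shell); the "id" column is IntegerType here.
val data = spark.createDataFrame(Seq(
  (0, Array(1), Array(0.9)),
  (1, Array(2), Array(0.9)),
  (2, Array(3), Array(0.9)),
  (3, Array(4), Array(0.1)),
  (4, Array(5), Array(0.9))
)).toDF("id", "neighbors", "similarities")

// Selecting "prediction" from the transformed output triggers the exception below.
val result = new PowerIterationClustering()
  .setK(2)
  .setMaxIter(10)
  .setInitMode("random")
  .transform(data)
  .select("id", "prediction")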

Result
org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given input columns: [id, neighbors, similarities];;
'Project [id#215, 'prediction]
+- AnalysisBarrier
      +- Project [id#215, neighbors#216, similarities#217]
         +- Join Inner, (id#215 = id#234)
            :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS similarities#217]
            :  +- LocalRelation [_1#209, _2#210, _3#211]
            +- Project [cast(id#230L as int) AS id#234]
               +- LogicalRDD [id#230L, prediction#231], false

at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)

What changes were proposed in this pull request?

PIC needs to return only the "id" and "prediction" columns. Currently it returns the entire input data, including the neighbor array and the similarity array.
Joining the predictions back to the input dataset drops the cluster labels of any node IDs that appear only in the neighbor column and not in the "id" column. So, instead of joining, we can return the "id"/"prediction" DataFrame directly, so that no nodes are skipped. (This is the behavior of Spark MLlib.)
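
To make the skipped-node case concrete, here is a minimal sketch in plain Scala collections (not Spark code) that models an inner join on "id" using the rows from the reproduction above. The object name `JoinDropsNodes` and the cluster labels are hypothetical placeholders, not real PIC output; only the set of node IDs matters for the illustration.

```scala
object JoinDropsNodes {
  // Input rows from the reproduction above: (id, neighbors, similarities).
  val rows: Seq[(Int, Array[Int], Array[Double])] = Seq(
    (0, Array(1), Array(0.9)),
    (1, Array(2), Array(0.9)),
    (2, Array(3), Array(0.9)),
    (3, Array(4), Array(0.1)),
    (4, Array(5), Array(0.9))
  )

  // Every node in the graph: the "id" values plus every neighbor ID.
  val allNodes: Set[Int] =
    rows.flatMap { case (id, neighbors, _) => id +: neighbors.toSeq }.toSet

  // Placeholder labels (an assumption for illustration, NOT real PIC output):
  // one label per node, including nodes that appear only as neighbors.
  val predictions: Map[Int, Int] =
    allNodes.map(node => node -> (if (node <= 3) 0 else 1)).toMap

  // Model of an inner join on "id": only IDs present in the "id" column survive.
  val joinedIds: Set[Int] = rows.map(_._1).filter(predictions.contains).toSet

  // Nodes that have a label but disappear after the join.
  val dropped: Set[Int] = predictions.keySet -- joinedIds

  def main(args: Array[String]): Unit = {
    println(s"nodes with a label:   ${predictions.keys.toSeq.sorted.mkString(", ")}")
    println(s"nodes after the join: ${joinedIds.toSeq.sorted.mkString(", ")}")
    println(s"dropped by the join:  ${dropped.toSeq.sorted.mkString(", ")}")
  }
}
```

In this data, node 5 appears only in the neighbor column, so it is the one node that has a label but is lost by the join. Returning the id/prediction pairs directly keeps it.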

How was this patch tested?

Added a unit test.

AmplabJenkins · 2018-05-08T20:37:47Z

Can one of the admins verify this patch?

jkbradley · 2018-05-08T22:04:42Z

Thanks for the patch! I just commented on https://issues.apache.org/jira/browse/SPARK-24213, though, and would like to replace this with #21274.
Could you please close this issue and help with reviewing the other PR? Thanks!

shahidki31 · 2018-05-09T02:04:18Z

Thank you @jkbradley. There is actually one more issue. Currently we skip nodes that do not appear in the ID column but do appear in the neighbor column. Spark MLlib displays cluster indices for all the nodes.

So, is the join operation necessary? Shall I open a new PR addressing the issue? Kindly reply.

WeichenXu123 · 2018-05-09T08:54:21Z

@shahidki31 It seems what you described above is another issue? You can create another JIRA for that. :)

shahidki31 · 2018-05-09T09:48:46Z

@WeichenXu123 Thanks for the comment. I have created another Jira and I have raised a PR for that. That PR will fix this issue as well. Can you please review the PR?

Jira: https://issues.apache.org/jira/browse/SPARK-24217
PR: #21277

shahidki31 added 2 commits May 9, 2018 00:53

Example code for Power Iteration Clustering

f7bb93a

Example code for Power Iteration Clustering

ff9e079

shahidki31 changed the title ~~Power Iteration Clustering in SparkML throws exception, when the ID in IntType~~ [SPARK-24213][ML]Power Iteration Clustering in SparkML throws exception, when the ID in IntType May 8, 2018

shahidki31 closed this May 10, 2018

shahidki31 deleted the sparkSim branch June 8, 2018 20:21
