[SPARK-24213][ML]Power Iteration Clustering in SparkML throws exception, when the ID in IntType #21270

Closed
wants to merge 2 commits into from


shahidki31
Contributor

@shahidki31 shahidki31 commented May 8, 2018

Running the following code, where the "id" column is of integer type, makes PIC throw an exception.

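import org.apache.spark.ml.clustering.PowerIterationClustering

// The integer literals below make "id" an IntegerType column.
val data = spark.createDataFrame(Seq(
  (0, Array(1), Array(0.9)),
  (1, Array(2), Array(0.9)),
  (2, Array(3), Array(0.9)),
  (3, Array(4), Array(0.1)),
  (4, Array(5), Array(0.9))
)).toDF("id", "neighbors", "similarities")

// Selecting "prediction" from the transformed data triggers the exception.
val result = new PowerIterationClustering()
  .setK(2)
  .setMaxIter(10)
  .setInitMode("random")
  .transform(data)
  .select("id", "prediction")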

Result
org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given input columns: [id, neighbors, similarities];;
'Project [id#215, 'prediction]
+- AnalysisBarrier
   +- Project [id#215, neighbors#216, similarities#217]
      +- Join Inner, (id#215 = id#234)
         :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS similarities#217]
         :  +- LocalRelation [_1#209, _2#210, _3#211]
         +- Project [cast(id#230L as int) AS id#234]
            +- LogicalRDD [id#230L, prediction#231], false

at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)


What changes were proposed in this pull request?

  1. PIC should return only the "id" and "prediction" columns. Currently it returns the entire input data, including the neighbors array and the similarities array.

  2. Joining the result back to the input dataset drops the cluster labels of nodes whose IDs appear only in the neighbors column and not in the "id" column. Instead of joining, we can return the "id"/"prediction" DataFrame directly, so that no nodes are skipped. (This is the behavior of Spark MLlib.)
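
To illustrate point 2, here is a minimal sketch that uses plain Scala collections instead of Spark, so it only models the effect of the join. The object name `JoinDropsNodes` and the label assignment are hypothetical placeholders, not the PIC implementation or its real output.

```scala
// Hypothetical in-memory stand-in for the example data: (id, neighbors).
object JoinDropsNodes {
  val rows: Seq[(Long, Seq[Long])] = Seq(
    (0L, Seq(1L)), (1L, Seq(2L)), (2L, Seq(3L)), (3L, Seq(4L)), (4L, Seq(5L))
  )

  // Every node in the graph: IDs from the "id" column plus all neighbor IDs.
  val allNodes: Set[Long] = rows.flatMap { case (id, ns) => id +: ns }.toSet

  // Placeholder labels (NOT real PIC output): one cluster label per node.
  val assignments: Map[Long, Int] =
    allNodes.map(n => n -> (if (n < 3) 0 else 1)).toMap

  // Inner join on "id": only nodes present in the "id" column keep a label.
  val joined: Seq[(Long, Int)] = rows.flatMap { case (id, _) =>
    assignments.get(id).map(label => (id, label))
  }

  // Nodes that received a label but are lost by the join.
  val dropped: Set[Long] = allNodes -- joined.map(_._1).toSet

  def main(args: Array[String]): Unit = {
    println(s"labelled nodes: ${assignments.keys.toSeq.sorted.mkString(", ")}")
    println(s"after join:     ${joined.map(_._1).sorted.mkString(", ")}")
    println(s"dropped:        ${dropped.toSeq.sorted.mkString(", ")}")
  }
}
```

With the data from the description, node 5 occurs only as a neighbor of node 4, so it is the one node the join would drop; returning the assignments directly would keep it.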

How was this patch tested?

Added a unit test.

@AmplabJenkins

Can one of the admins verify this patch?

@shahidki31 shahidki31 changed the title from "Power Iteration Clustering in SparkML throws exception, when the ID in IntType" to "[SPARK-24213][ML]Power Iteration Clustering in SparkML throws exception, when the ID in IntType" on May 8, 2018
@jkbradley
Member

Thanks for the patch! I just commented on https://issues.apache.org/jira/browse/SPARK-24213 though and would like to replace this with #21274
Could you please close this issue and help with reviewing the other PR? Thanks!

@shahidki31
Contributor Author

shahidki31 commented May 9, 2018

Thank you @jkbradley. There is actually one more issue. Currently we skip the nodes that are not in the ID column but do appear in the neighbors column. Spark MLlib displays cluster indices for all the nodes.

So, is the join operation necessary? Shall I open a new PR addressing the issue? Kindly reply.

@WeichenXu123
Contributor

@shahidki31 It seems that what you described above is another issue. You can create another Jira for that. :)

@shahidki31
Contributor Author

@WeichenXu123 Thanks for the comment. I have created another Jira and I have raised a PR for that. That PR will fix this issue as well. Can you please review the PR?

Jira: https://issues.apache.org/jira/browse/SPARK-24217
PR: #21277

@shahidki31 shahidki31 closed this May 10, 2018
@shahidki31 shahidki31 deleted the sparkSim branch June 8, 2018 20:21
4 participants