-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24026][ML] Add Power Iteration Clustering to spark.ml #21090
Conversation
To review this PR: This was copied from #15770 with the following changes:
If you saw the previous PR, you should be able to review this one based on the last 3 commits, viewable in this diff: jkbradley/spark@5cb8ed6...wangmiao1981-pic |
@wangmiao1981 and @WeichenXu123 would you mind taking a look? Thanks! |
Test build #89472 has finished for PR 21090 at commit
|
model.transform(typedData) | ||
} | ||
intercept[IllegalArgumentException] { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove blank line or add blank line after line 139 for consistence?
Take a quick look. Despite of the style failure and a minor format issue, LGTM. |
nbrs.asInstanceOf[Seq[Long]].zip(sims.asInstanceOf[Seq[Double]]).map { | ||
case (nbr, similarity) => (id, nbr, similarity) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add Instrumentation
? Or I can add it in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we don't have any precedent for using Instrumentation in Models or Transformers, only Estimators. I'll hold off on this for now.
Test build #89536 has finished for PR 21090 at commit
|
Thanks for reviewing this and for the LGTM @wangmiao1981 ! I'll merge with master now, with you as the primary author. |
val similaritiesCol = new Param[String](this, "similaritiesCol", | ||
"Name of the input column for neighbors in the adjacency list representation.", | ||
(value: String) => value.nonEmpty) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems similaritiesCol is exactly the same as neighborsCol. Is this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's meant to be an adjacency list representation of the graph: neighborsCol has the set of neighbor vertex IDs, and similaritiesCol has the corresponding set of edge weights.
What changes were proposed in this pull request?
This PR adds PowerIterationClustering as a Transformer to spark.ml. In the transform method, it calls spark.mllib's PowerIterationClustering.run() method and transforms the return value assignments (the Kmeans output of the pseudo-eigenvector) as a DataFrame (id: LongType, cluster: IntegerType).
This PR is copied and modified from #15770 The primary author is @wangmiao1981
How was this patch tested?
This PR has 2 types of tests: