[SPARK-24026][ML] Add Power Iteration Clustering to spark.ml #21090

Closed
wants to merge 30 commits into from


jkbradley
Member

What changes were proposed in this pull request?

This PR adds PowerIterationClustering as a Transformer to spark.ml. In the transform method, it calls spark.mllib's PowerIterationClustering.run() method and converts the returned assignments (the k-means output on the pseudo-eigenvector) into a DataFrame with columns (id: LongType, cluster: IntegerType).

This PR is copied and modified from #15770. The primary author is @wangmiao1981.

How was this patch tested?

This PR has 2 types of tests:

  • Copies of tests from spark.mllib's PIC tests
  • New tests specific to the spark.ml APIs

@jkbradley
Member Author

To review this PR: This was copied from #15770 with the following changes:

  • Addressed comments in original PR (See my review comments there)
  • Added Param validators for required input columns
  • Renamed “weights” column to “similarities”
  • Made algorithm take more types of inputs: Long/Int and Double/Float
  • Removed test("set parameters") since setters are already tested in the read/write test.
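
As a rough illustration of the "more types of inputs" item above, the widening from Int to Long and from Float to Double can be sketched in plain Scala. This is not the PR's code: the object and method names (`NumericWideningSketch`, `toLongId`, `toDoubleSimilarity`) are hypothetical, and the sketch works on plain values rather than DataFrame columns.

```scala
// Hypothetical sketch, not taken from the PR: accept the narrower numeric
// types and widen them, rejecting anything else.
object NumericWideningSketch {
  // Accept Long or Int vertex ids and widen them to Long.
  def toLongId(value: Any): Long = value match {
    case l: Long => l
    case i: Int  => i.toLong
    case other   =>
      throw new IllegalArgumentException(s"Unsupported id value: $other")
  }

  // Accept Double or Float similarities and widen them to Double.
  def toDoubleSimilarity(value: Any): Double = value match {
    case d: Double => d
    case f: Float  => f.toDouble
    case other     =>
      throw new IllegalArgumentException(s"Unsupported similarity value: $other")
  }
}
```

Under these assumptions, `NumericWideningSketch.toLongId(7)` yields `7L`, while a String id is rejected with an `IllegalArgumentException`.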

If you saw the previous PR, you should be able to review this one based on the last 3 commits, viewable in this diff: jkbradley/spark@5cb8ed6...wangmiao1981-pic

@jkbradley
Member Author

@wangmiao1981 and @WeichenXu123 would you mind taking a look? Thanks!

@SparkQA

SparkQA commented Apr 17, 2018

Test build #89472 has finished for PR 21090 at commit d215748.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

model.transform(typedData)
}
intercept[IllegalArgumentException] {

Contributor

Remove this blank line, or add a blank line after line 139, for consistency?

@wangmiao1981
Contributor

Took a quick look. Apart from the style failure and a minor formatting issue, LGTM.

nbrs.asInstanceOf[Seq[Long]].zip(sims.asInstanceOf[Seq[Double]]).map {
case (nbr, similarity) => (id, nbr, similarity)
}
}
Contributor

Add Instrumentation? Or I can add it in a separate PR.

Member Author

@jkbradley commented Apr 19, 2018

Actually, we don't have any precedent for using Instrumentation in Models or Transformers, only Estimators. I'll hold off on this for now.

@SparkQA

SparkQA commented Apr 19, 2018

Test build #89536 has finished for PR 21090 at commit 375e150.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Thanks for reviewing this and for the LGTM @wangmiao1981 ! I'll merge with master now, with you as the primary author.

@jkbradley changed the title from "[SPARK-15784][ML] Add Power Iteration Clustering to spark.ml" to "[SPARK-24026][ML] Add Power Iteration Clustering to spark.ml" on Apr 19, 2018
@asfgit closed this in a471880 on Apr 19, 2018

val similaritiesCol = new Param[String](this, "similaritiesCol",
"Name of the input column for neighbors in the adjacency list representation.",
(value: String) => value.nonEmpty)

Contributor

Seems similaritiesCol is exactly the same as neighborsCol. Is this right?

Member Author

No, it's meant to be an adjacency list representation of the graph: neighborsCol has the set of neighbor vertex IDs, and similaritiesCol has the corresponding set of edge weights.
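
To make the adjacency-list representation concrete, here is a minimal plain-Scala sketch of how one row per vertex expands into (id, neighbor, similarity) triples, mirroring the zip in the diff fragment quoted earlier in this review. The `AdjacencyListSketch` object, the `AdjacencyRow` case class, and the length check are assumptions made for illustration; the PR itself operates on DataFrame rows.

```scala
// Hypothetical sketch, not taken from the PR.
object AdjacencyListSketch {
  // One vertex: its id, its neighbor ids, and the similarity (edge weight)
  // to each neighbor, matched position by position.
  final case class AdjacencyRow(
      id: Long,
      neighbors: Seq[Long],
      similarities: Seq[Double])

  // Expand every row into (id, neighbor, similarity) triples.
  def toTriples(rows: Seq[AdjacencyRow]): Seq[(Long, Long, Double)] =
    rows.flatMap { row =>
      // Assumption: the two lists must line up one-to-one.
      require(row.neighbors.length == row.similarities.length,
        s"Row ${row.id}: neighbors and similarities differ in length")
      row.neighbors.zip(row.similarities).map {
        case (nbr, similarity) => (row.id, nbr, similarity)
      }
    }
}
```

For example, a row with id 0, neighbors `Seq(1L, 2L)`, and similarities `Seq(0.9, 0.1)` expands to the triples `(0, 1, 0.9)` and `(0, 2, 0.1)`, so the two columns carry different data even though they have the same shape.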

@jkbradley deleted the wangmiao1981-pic branch on May 17, 2018 01:01