[SPARK-24026][ML] Add Power Iteration Clustering to spark.ml #21090

jkbradley · 2018-04-17T20:20:05Z

What changes were proposed in this pull request?

This PR adds PowerIterationClustering as a Transformer to spark.ml. In the transform method, it calls spark.mllib's PowerIterationClustering.run() method and converts the returned assignments (the k-means output on the pseudo-eigenvector) into a DataFrame (id: LongType, cluster: IntegerType).

This PR is copied and modified from #15770. The primary author is @wangmiao1981.

How was this patch tested?

This PR has 2 types of tests:

Copies of tests from spark.mllib's PIC tests
New tests specific to the spark.ml APIs

jkbradley · 2018-04-17T20:23:03Z

To review this PR: This was copied from #15770 with the following changes:

Addressed comments in original PR (See my review comments there)
Added Param validators for required input columns
Renamed “weights” column to “similarities”
Made algorithm take more types of inputs: Long/Int and Double/Float
Removed test("set parameters") since setters are already tested in the read/write test.

If you saw the previous PR, you should be able to review this one based on the last 3 commits, viewable in this diff: jkbradley/spark@5cb8ed6...wangmiao1981-pic
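The "more types of inputs" change listed above can be illustrated with a minimal sketch in plain Scala, without Spark. It is not the code from this PR: `InputTypeSketch`, `toVertexId`, and `toSimilarity` are hypothetical names, and the sketch only assumes that Int/Long ids are widened to Long, Float/Double similarities are widened to Double, and anything else is rejected with an IllegalArgumentException.

```scala
// Hypothetical helper, not part of Spark: widens the accepted input types
// to the Long/Double pair that the underlying algorithm works with.
object InputTypeSketch {

  // Accept an id given as Int or Long and return it as Long.
  def toVertexId(value: Any): Long = value match {
    case i: Int  => i.toLong
    case l: Long => l
    case other   =>
      throw new IllegalArgumentException(s"Unsupported id value: $other")
  }

  // Accept a similarity given as Float or Double and return it as Double.
  def toSimilarity(value: Any): Double = value match {
    case f: Float  => f.toDouble
    case d: Double => d
    case other     =>
      throw new IllegalArgumentException(s"Unsupported similarity value: $other")
  }
}
```

For example, `InputTypeSketch.toVertexId(3)` and `InputTypeSketch.toVertexId(3L)` both yield the Long value 3, while a String id fails fast instead of being silently coerced.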

jkbradley · 2018-04-17T20:23:49Z

@wangmiao1981 and @WeichenXu123 would you mind taking a look? Thanks!

SparkQA · 2018-04-17T20:24:46Z

Test build #89472 has finished for PR 21090 at commit d215748.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2018-04-17T21:46:00Z

mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala

+      model.transform(typedData)
+    }
+    intercept[IllegalArgumentException] {
+


Remove this blank line, or add a blank line after line 139 for consistency?

wangmiao1981 · 2018-04-17T21:49:54Z

Took a quick look. Aside from the style failure and a minor formatting issue, LGTM.

wangmiao1981 · 2018-04-17T21:56:53Z

mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala

+          nbrs.asInstanceOf[Seq[Long]].zip(sims.asInstanceOf[Seq[Double]]).map {
+            case (nbr, similarity) => (id, nbr, similarity)
+          }
+      }


Add Instrumentation? Or I can add it in a separate PR.

Actually, we don't have any precedent for using Instrumentation in Models or Transformers, only Estimators. I'll hold off on this for now.

SparkQA · 2018-04-19T02:12:46Z

Test build #89536 has finished for PR 21090 at commit 375e150.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2018-04-19T16:37:38Z

Thanks for reviewing this and for the LGTM @wangmiao1981 ! I'll merge with master now, with you as the primary author.

huaxingao · 2018-04-20T02:22:54Z

mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala

+  val similaritiesCol = new Param[String](this, "similaritiesCol",
+    "Name of the input column for neighbors in the adjacency list representation.",
+    (value: String) => value.nonEmpty)
+


Seems similaritiesCol is exactly the same as neighborsCol. Is this right?

No, it's meant to be an adjacency list representation of the graph: neighborsCol has the set of neighbor vertex IDs, and similaritiesCol has the corresponding set of edge weights.
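To make the adjacency list representation concrete, here is a minimal sketch in plain Scala collections rather than DataFrames. `AdjacencyListSketch`, `VertexRow`, and `toEdges` are hypothetical names chosen for illustration, and the length check is an assumption of this sketch, not something confirmed by the diff. Each row pairs a vertex id with its neighbor ids and the matching similarities, and the rows are flattened into (id, neighbor, similarity) triples in the same way as the zip/map shown in the reviewed diff.

```scala
// Hypothetical stand-in for one input row: `neighbors` plays the role of
// neighborsCol and `similarities` the role of similaritiesCol.
object AdjacencyListSketch {

  final case class VertexRow(
      id: Long,
      neighbors: Seq[Long],
      similarities: Seq[Double])

  // Flatten adjacency-list rows into (id, neighbor, similarity) edge triples.
  def toEdges(rows: Seq[VertexRow]): Seq[(Long, Long, Double)] =
    rows.flatMap { row =>
      // Assumption of this sketch: the two lists must line up one-to-one.
      require(row.neighbors.length == row.similarities.length,
        s"Vertex ${row.id}: neighbors and similarities differ in length")
      row.neighbors.zip(row.similarities).map {
        case (nbr, similarity) => (row.id, nbr, similarity)
      }
    }
}
```

A vertex with neighbors `Seq(1L, 2L)` and similarities `Seq(0.9, 0.1)` therefore contributes two edges, one per neighbor, and a vertex with empty lists contributes none.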

wangmiao1981 and others added 29 commits April 16, 2018 13:55

add pic framework (model, class etc)

e4492a6

change a comment

7086249

add missing functions fit predict load save etc.

b73d8a7

add unit test flie

022fe52

add test cases part 1

552cf54

add unit test part 2: test fit, parameters etc.

0b4954d

fix a type issue

f22b01e

add more unit tests

305b194

delete unused import and add comments

4b32cbf

change version to 2.1.0

f6eda88

change PIC as a Transformer

45c4b1c

add LabelCol

e8d7ed3

change col implementation

e4e1e05

address some of the comments

8384422

add additional test with dataset having more data

d6a199c

change input data format

b0c3aff

resolve warnings

091225d

add neighbor and weight cols

8bb9956

address review comments 1

8ba82e8

fix style

468a947

remove unused comments

ec10f24

add Since

5710cfc

fix missing >

88654b3

fix doc

804adc6

address review comments

4a6dd79

fix unit test

5cb8ed6

cleanups to docs

6abf602

typo

d927087

final updates for PIC PR

d215748

wangmiao1981 reviewed Apr 17, 2018


fixed scala style

375e150

jkbradley changed the title ~~[SPARK-15784][ML] Add Power Iteration Clustering to spark.ml~~ [SPARK-24026][ML] Add Power Iteration Clustering to spark.ml Apr 19, 2018

asfgit closed this in a471880 Apr 19, 2018

huaxingao reviewed Apr 20, 2018


jkbradley deleted the wangmiao1981-pic branch May 17, 2018 01:01
