[SPARK-11496][GRAPHX] Parallel implementation of personalized pagerank #14998

moustaki · 2016-09-07T17:30:09Z

(Updated version of PR-9457, rebased on latest Spark master, and using mllib-local).

This implements a parallel version of personalized PageRank, which runs the propagations for all vertices in a list of sources at the same time instead of one source at a time.

I ran a few benchmarks on the full DBpedia graph. When running personalized PageRank for a single source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). For 10 source nodes, however, the parallel implementation is four times as fast, and the gap widens further as the number of source nodes increases.
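To make the idea concrete, here is a minimal sketch in plain Scala (no Spark) of running all propagations together. It is not the code in this PR: the object name `ParallelPprSketch`, the `Ranks` type, and the helpers `add`, `scale`, and `run` are hypothetical, and it assumes every vertex has at least one outgoing edge. Each vertex keeps a sparse map from source index to rank, so one pass over the edges advances every source at once.

```scala
object ParallelPprSketch {
  type VertexId = Long
  // Sparse rank vector for one vertex: index of the source -> rank mass.
  type Ranks = Map[Int, Double]
  val NoRanks: Ranks = Map.empty

  // Element-wise sum of two sparse rank vectors.
  def add(a: Ranks, b: Ranks): Ranks =
    b.foldLeft(a) { case (acc, (i, x)) => acc.updated(i, acc.getOrElse(i, 0.0) + x) }

  // Multiply every entry of a sparse rank vector by a constant.
  def scale(a: Ranks, c: Double): Ranks =
    a.map { case (i, x) => i -> (x * c) }

  // Assumption: every vertex has out-degree >= 1 and every source appears in `edges`.
  def run(
      edges: Seq[(VertexId, VertexId)],
      sources: Seq[VertexId],
      numIter: Int,
      resetProb: Double = 0.15): Map[VertexId, Ranks] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDeg: Map[VertexId, Int] =
      edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    // One-hot restart vector for each source vertex.
    val reset: Map[VertexId, Ranks] =
      sources.zipWithIndex.map { case (v, i) => v -> Map(i -> 1.0) }.toMap
    var ranks: Map[VertexId, Ranks] =
      vertices.map(v => v -> reset.getOrElse(v, NoRanks)).toMap

    for (_ <- 0 until numIter) {
      // Each edge carries the source vertex's whole rank vector, split by out-degree.
      val incoming = edges.foldLeft(Map.empty[VertexId, Ranks]) { case (acc, (s, d)) =>
        acc.updated(d, add(acc.getOrElse(d, NoRanks), scale(ranks(s), 1.0 / outDeg(s))))
      }
      // Damp the propagated mass and add the restart mass for source vertices.
      ranks = vertices.map { v =>
        v -> add(
          scale(incoming.getOrElse(v, NoRanks), 1.0 - resetProb),
          scale(reset.getOrElse(v, NoRanks), resetProb))
      }.toMap
    }
    ranks
  }
}
```

For example, `ParallelPprSketch.run(Seq((1L, 2L), (2L, 1L)), Seq(1L, 2L), 20)` returns, for each vertex, a map whose entry `0` is its rank personalized to vertex 1 and whose entry `1` is its rank personalized to vertex 2. The per-vertex sparse vector is the price of the approach, which matches the single-source slowdown noted above.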

SparkQA · 2016-09-07T17:34:10Z

Test build #65050 has finished for PR 14998 at commit 1ec345f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-07T23:29:04Z

Test build #65060 has finished for PR 14998 at commit 2d00fc0.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-07T23:44:52Z

Test build #65061 has finished for PR 14998 at commit 46381c8.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-09-07T23:49:13Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

-import org.apache.spark.graphx._
+import breeze.linalg.{Vector => BV}
+
+import org.apache.spark.ml.linalg.{Vector, Vectors}


Move org.apache.spark.ml.linalg.{Vector, Vectors} down and org.apache.spark.graphx._ up. An automated style check verifies that imports are in alphabetical order, and that check is what is failing the build.

SparkQA · 2016-09-08T01:47:22Z

Test build #65064 has finished for PR 14998 at commit 4b2d564.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-09-08T04:07:29Z

Jenkins, retest this please

SparkQA · 2016-09-08T06:18:44Z

Test build #65076 has finished for PR 14998 at commit 4b2d564.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-09-08T21:10:50Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

+      rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices
+      prevRankGraph.vertices.unpersist(false)
+      prevRankGraph.edges.unpersist(false)
+


Add sourcesInitMapBC.destroy(false) here; otherwise the explicitly created broadcast variable will not be deleted.

dbtsai · 2016-09-08T21:20:45Z

A few minor issues; otherwise LGTM. Thanks.

SparkQA · 2016-09-10T00:54:10Z

Test build #65176 has finished for PR 14998 at commit 40f5780.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-09-10T00:56:32Z

Jenkins, retest this please

SparkQA · 2016-09-10T02:36:33Z

Test build #65180 has finished for PR 14998 at commit adc5fc3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-10T03:07:05Z

Test build #65183 has finished for PR 14998 at commit adc5fc3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-09-10T07:16:55Z

Merged into master. Great work! Thanks.

(Updated version of [PR-9457](apache#9457), rebased on latest Spark master, and using mllib-local).

This implements a parallel version of personalized pagerank, which runs all propagations for a list of source vertices in parallel.

I ran a few benchmarks on the full [DBpedia](http://dbpedia.org/) graph. When running personalized pagerank for only one source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However for 10 source nodes, the parallel implementation is four times as fast. When increasing the number of source nodes, this difference becomes even greater.

![image](https://cloud.githubusercontent.com/assets/2491/10927702/dd82e4fa-8256-11e5-89a8-4799b407f502.png)

Author: Yves Raimond <[email protected]>

Closes apache#14998 from moustaki/parallel-ppr.

Yves Raimond added 15 commits November 3, 2015 17:30

Parallel personalized pagerank implementation

f41975e

Scala style tweaks

3605e40

Removing breeze dependency from mllib (available through graphx)

8b34e5c

Renaming SparseVector to BSV

508ba45

Removing extra space, extra line

09d31c8

Code-style changes

8506353

Parallel personalized pagerank implementation

2d1dee7

Scala style tweaks

202acb2

Renaming SparseVector to BSV

69db385

Removing extra space, extra line

a42d272

Code-style changes

53ab670

Moving to mllib-local

31e2e98

Merge branch 'parallel-ppr' of github.com:moustaki/spark into paralle…

13fbf55

…l-ppr

Cleaning up pom dependencies

c7ca220

Removing unused import

1ec345f

Import style

2d00fc0

More import refactor

46381c8

dbtsai reviewed Sep 7, 2016

Alphabetical ordering of imports

4b2d564

dbtsai reviewed Sep 8, 2016

Destroying broadcast map

40f5780

Yves Raimond added 2 commits September 9, 2016 16:11

Minor styling

7dc2c23

Undoing destroy

adc5fc3

asfgit closed this in 1fec3ce Sep 10, 2016
