[SPARK-11496][GRAPHX] Parallel implementation of personalized pagerank #14998

Closed
wants to merge 21 commits into from

moustaki commented Sep 7, 2016

(Updated version of PR-9457, rebased on latest Spark master, and using mllib-local).

This implements a parallel version of personalized PageRank, which runs the propagations for all vertices in a list of sources in parallel.
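
As a rough illustration of the idea (not the code in this patch), here is a minimal single-machine sketch in plain Scala. Every vertex carries one rank slot per source, so a single pass over the edges advances all propagations at once. The object name `ParallelPprSketch`, the dense `Array[Double]` per vertex, and the edge-list input are assumptions made for this example; the patch itself works on a GraphX graph with sparse vectors.

```scala
object ParallelPprSketch {
  type VertexId = Long

  /**
   * Returns one rank vector per vertex; slot i holds the rank with respect to sources(i).
   * Assumes `sources` contains no duplicates and every source appears in some edge.
   */
  def run(
      edges: Seq[(VertexId, VertexId)],
      sources: Array[VertexId],
      numIter: Int,
      resetProb: Double = 0.15): Map[VertexId, Array[Double]] = {
    val k = sources.length
    val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    val sourceIndex = sources.zipWithIndex.toMap

    // Initially, all mass for source i sits on the vertex sources(i).
    var ranks: Map[VertexId, Array[Double]] = vertices.map { v =>
      val vec = new Array[Double](k)
      sourceIndex.get(v).foreach(i => vec(i) = 1.0)
      v -> vec
    }.toMap

    for (_ <- 0 until numIter) {
      // One pass over the edges propagates all k personalized walks together.
      val incoming = vertices.map(v => v -> new Array[Double](k)).toMap
      for ((src, dst) <- edges; i <- 0 until k) {
        incoming(dst)(i) += ranks(src)(i) / outDegree(src)
      }
      // Walk i teleports back only to its own source vertex.
      ranks = vertices.map { v =>
        val vec = Array.tabulate(k) { i =>
          val reset = if (sourceIndex.get(v).contains(i)) resetProb else 0.0
          reset + (1.0 - resetProb) * incoming(v)(i)
        }
        v -> vec
      }.toMap
    }
    ranks
  }
}
```

Slot i of each result vector should match what a single-source run for sources(i) would produce. The saving comes from sharing each pass over the edges across all sources, at the cost of carrying a vector per vertex, which is consistent with the overhead seen in the single-source case.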

I ran a few benchmarks on the full DBpedia graph. For a single source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However, for 10 source nodes, the parallel implementation is four times as fast, and the gap widens as the number of source nodes increases.

image

SparkQA commented Sep 7, 2016

Test build #65050 has finished for PR 14998 at commit 1ec345f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 7, 2016

Test build #65060 has finished for PR 14998 at commit 2d00fc0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 7, 2016

Test build #65061 has finished for PR 14998 at commit 46381c8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.graphx._
import breeze.linalg.{Vector => BV}

import org.apache.spark.ml.linalg.{Vector, Vectors}

Move org.apache.spark.ml.linalg.{Vector, Vectors} down, and move org.apache.spark.graphx._ up. An automated style check verifies that imports are in alphabetical order, and that check is failing the build.

SparkQA commented Sep 8, 2016

Test build #65064 has finished for PR 14998 at commit 4b2d564.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dbtsai commented Sep 8, 2016

Jenkins, retest this please

SparkQA commented Sep 8, 2016

Test build #65076 has finished for PR 14998 at commit 4b2d564.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices
prevRankGraph.vertices.unpersist(false)
prevRankGraph.edges.unpersist(false)

dbtsai commented Sep 8, 2016

Add sourcesInitMapBC.destroy(false) here; otherwise, the explicit broadcast variable will not be deleted.
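
For readers outside the Spark codebase, the cleanup pattern being asked for looks roughly like the sketch below. The object name `BroadcastCleanupSketch`, the helper `lookupAll`, and the map contents are made up for illustration. The sketch calls the public no-argument `destroy()`, because the Boolean-argument overload used in the suggestion above is, as far as I know, only accessible from inside Spark's own packages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastCleanupSketch {
  // Hypothetical helper: looks up each id in a broadcast map, then releases the broadcast.
  def lookupAll(sc: SparkContext, ids: Seq[Long]): Array[Int] = {
    val sourceIndexBC = sc.broadcast(Map(10L -> 0, 20L -> 1))
    try {
      sc.parallelize(ids)
        .map(id => sourceIndexBC.value.getOrElse(id, -1))
        .collect()
    } finally {
      // Without this call, the broadcast data stays around until Spark's
      // context cleaner eventually reclaims it.
      sourceIndexBC.destroy()
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-cleanup-sketch")
    val sc = new SparkContext(conf)
    try println(lookupAll(sc, Seq(10L, 20L, 30L)).mkString(", "))
    finally sc.stop()
  }
}
```

Once a broadcast has been destroyed it cannot be used again, so the call belongs after the last job that reads it.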

dbtsai commented Sep 8, 2016

Some minor issues, and LGTM. Thanks.

SparkQA commented Sep 10, 2016

Test build #65176 has finished for PR 14998 at commit 40f5780.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dbtsai commented Sep 10, 2016

Jenkins, retest this please

SparkQA commented Sep 10, 2016

Test build #65180 has finished for PR 14998 at commit adc5fc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 10, 2016

Test build #65183 has finished for PR 14998 at commit adc5fc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dbtsai commented Sep 10, 2016

Merged into master. Great work! Thanks.

asfgit closed this in 1fec3ce Sep 10, 2016
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
(Updated version of [PR-9457](apache#9457), rebased on latest Spark master, and using mllib-local).

This implements a parallel version of personalized pagerank, which runs all propagations for a list of source vertices in parallel.

I ran a few benchmarks on the full [DBpedia](http://dbpedia.org/) graph. When running personalized pagerank for only one source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However for 10 source nodes, the parallel implementation is four times as fast. When increasing the number of source nodes, this difference becomes even greater.

![image](https://cloud.githubusercontent.com/assets/2491/10927702/dd82e4fa-8256-11e5-89a8-4799b407f502.png)

Author: Yves Raimond

Closes apache#14998 from moustaki/parallel-ppr.