[SPARK-11496][GRAPHX] Parallel implementation of personalized pagerank #14998
Test build #65050 has finished for PR 14998 at commit
Test build #65060 has finished for PR 14998 at commit
Test build #65061 has finished for PR 14998 at commit
import org.apache.spark.graphx._
import breeze.linalg.{Vector => BV}

import org.apache.spark.ml.linalg.{Vector, Vectors}
Move `org.apache.spark.ml.linalg.{Vector, Vectors}` down, and move `org.apache.spark.graphx._` up. We have an automated check that verifies the imports are in alphabetical order, and it is failing the build.
Test build #65064 has finished for PR 14998 at commit
Jenkins, retest this please
Test build #65076 has finished for PR 14998 at commit
rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices
prevRankGraph.vertices.unpersist(false)
prevRankGraph.edges.unpersist(false)
Add `sourcesInitMapBC.destroy(false)` here; otherwise, the explicit broadcast variable will not be deleted.
Some minor issues, and LGTM. Thanks.
Test build #65176 has finished for PR 14998 at commit
Jenkins, retest this please
Test build #65180 has finished for PR 14998 at commit
Test build #65183 has finished for PR 14998 at commit
Merged into master. Great work! Thanks.
(Updated version of [PR-9457](apache#9457), rebased on latest Spark master, and using mllib-local.)

This implements a parallel version of personalized PageRank, which runs all propagations for a list of source vertices in parallel.

I ran a few benchmarks on the full [DBpedia](http://dbpedia.org/) graph. When running personalized PageRank for only one source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However, for 10 source nodes, the parallel implementation is four times as fast. As the number of source nodes increases, this difference becomes even greater.

![image](https://cloud.githubusercontent.com/assets/2491/10927702/dd82e4fa-8256-11e5-89a8-4799b407f502.png)

Author: Yves Raimond

Closes apache#14998 from moustaki/parallel-ppr.
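For readers who want to see what "running all propagations for a list of source vertices in parallel" amounts to, here is a minimal plain-Scala sketch of the idea. It is not the GraphX code from this PR: `ParallelPprSketch` and `run` are hypothetical names, it uses dense arrays instead of sparse vectors, and the exact update rule (initial mass of 1.0 at each source, `resetProb` added back to the source every step) is an assumption made for illustration. Each vertex carries one score per source, so a single pass over the edges advances every source's propagation at once.

```scala
object ParallelPprSketch {
  type VertexId = Long

  /** Runs `numIter` synchronous propagation steps for all sources at once.
    * Slot i of each vertex's array holds the score relative to sources(i).
    * Hypothetical helper for illustration; not part of GraphX. */
  def run(
      edges: Seq[(VertexId, VertexId)],
      sources: Seq[VertexId],
      numIter: Int,
      resetProb: Double = 0.15): Map[VertexId, Array[Double]] = {
    val k = sources.size
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    val sourceIndex = sources.zipWithIndex.toMap

    // Assumed initial state: source i holds all mass in slot i, the rest is zero.
    var ranks: Map[VertexId, Array[Double]] = vertices.map { v =>
      val vec = Array.fill(k)(0.0)
      sourceIndex.get(v).foreach(i => vec(i) = 1.0)
      v -> vec
    }.toMap

    for (_ <- 0 until numIter) {
      // One pass over the edges propagates the scores for every source.
      val incoming = vertices.map(v => v -> Array.fill(k)(0.0)).toMap
      for ((src, dst) <- edges; i <- 0 until k) {
        incoming(dst)(i) += ranks(src)(i) / outDegree(src)
      }
      // Damp the propagated mass and add the reset mass back at each source.
      ranks = vertices.map { v =>
        val vec = incoming(v).map(_ * (1.0 - resetProb))
        sourceIndex.get(v).foreach(i => vec(i) += resetProb)
        v -> vec
      }.toMap
    }
    ranks
  }
}
```

In this sketch, column i of a multi-source run matches a single-source run for `sources(i)`, which is the property that lets the per-source runs be batched. The trade-off mentioned in the description also shows up here: every vertex carries a vector of length `sources.size`, which is pure overhead when there is only one source.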