[SPARK-11496][GRAPHX] Parallel implementation of personalized pagerank #9457
Conversation
Jenkins, test this please
Test build #44990 has finished for PR 9457 at commit
Jenkins, test this please
Jenkins, add to whitelist
Test build #45033 has finished for PR 9457 at commit
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
</dependency>
Since mllib depends on graphx, please remove the breeze dependencies in mllib.
I'm working on refactoring the basic types in mllib out into a separate package, but this will not happen soon (it will likely happen in 1.7).
Thanks for all the comments @dbtsai. I'll go through your various comments soon. About your last one, what would you recommend for the time being?
@moustaki For the sparse vector issue, we can wait for the change in Spark 1.7 since this PR will not be in 1.6. Thanks.
@dbtsai Just went through all your comments. Thanks a lot for the feedback!
Test build #46574 has finished for PR 9457 at commit
@moustaki This PR looks good to me. I'll start working on refactoring the mllib vector out after the 1.6 release, so graphx can use those data structures. Then we can change BSV to the mllib sparse vector implementation. Thanks.
@dbtsai Sounds good - do you want to hold off merging in the meantime?
@moustaki Yes, we need to hold off now until the 1.7 window.
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.
Sounds good. @dbtsai Let me know when your mllib vector refactor is out and I'll re-open this PR.
Hello @moustaki, the work of having standalone vectors and matrices has been done in SPARK-13944, and what you need to do is add the mllib-local dependency. Thanks.
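For context, here is a minimal sketch of how a build might declare that dependency in sbt. The artifact coordinates and version below are assumptions based on Spark's usual module naming, not something stated in this thread; adjust them to your build.

```scala
// build.sbt sketch (hypothetical coordinates; match the Scala suffix and Spark version to your project)
libraryDependencies += "org.apache.spark" %% "spark-mllib-local" % "2.1.0"
```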
Thanks @dbtsai!
(Updated version of [PR-9457](#9457), rebased on latest Spark master, and using mllib-local.)

This implements a parallel version of personalized PageRank, which runs all propagations for a list of source vertices in parallel.

I ran a few benchmarks on the full [DBpedia](http://dbpedia.org/) graph. When running personalized PageRank for only one source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However, for 10 source nodes, the parallel implementation is four times as fast. When increasing the number of source nodes, this difference becomes even greater.

![image](https://cloud.githubusercontent.com/assets/2491/10927702/dd82e4fa-8256-11e5-89a8-4799b407f502.png)

Author: Yves Raimond <[email protected]>

Closes #14998 from moustaki/parallel-ppr.
@moustaki Hey, I am currently working on a project that requires running personalized PageRank for multiple sources in parallel. Has this been merged? I can't find official documentation on this.
@makalaaneesh Yes, it has been superseded by another PR (#14998) and merged. It's available in 2.1.
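To make that concrete, here is a minimal usage sketch. It assumes the `staticParallelPersonalizedPageRank` operator exposed on GraphX graphs by the merged PR (available since Spark 2.1); the input path, source vertex IDs, and parameter values are placeholders for illustration, not taken from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, VertexId}

object ParallelPPRExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ParallelPPRExample")
    val sc = new SparkContext(conf)

    // Load an edge-list graph; the path is a placeholder.
    val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

    // Run personalized PageRank for several source vertices in a single pass.
    val sources: Array[VertexId] = Array(1L, 2L, 3L)
    val ranks = graph.staticParallelPersonalizedPageRank(sources, numIter = 10, resetProb = 0.15)

    // Each vertex attribute is a vector holding one PPR score per source vertex.
    ranks.vertices.take(5).foreach { case (id, scores) =>
      println(s"vertex $id -> ${scores.toArray.mkString(", ")}")
    }

    sc.stop()
  }
}
```

Because every vertex accumulates a vector with one entry per source, a single run serves all personalization sources at once, which is what drives the speed-up over repeated single-source runs described in the benchmark above.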