
[SPARK-11496][GRAPHX] Parallel implementation of personalized pagerank #9457

Closed
wants to merge 6 commits

Conversation

moustaki

moustaki commented Nov 4, 2015

This implements a parallel version of personalized pagerank, which runs all propagations for a list of source vertices in parallel.

I ran a few benchmarks on the full DBpedia graph. When running personalized pagerank for only one source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However for 10 source nodes, the parallel implementation is four times as fast. When increasing the number of source nodes, this difference becomes even greater.

![image](https://cloud.githubusercontent.com/assets/2491/10927702/dd82e4fa-8256-11e5-89a8-4799b407f502.png)
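The propagation scheme described above can be sketched in plain Scala (no Spark), to make the idea concrete: each vertex carries a sparse vector of scores keyed by source vertex, and a single power-iteration step moves mass for all sources at once. All names and structure here are illustrative, not the GraphX API or this PR's code.

```scala
// Sketch of parallel personalized PageRank: one score vector per vertex,
// indexed by source, so one pass over the edges propagates all sources.
object ParallelPPRSketch {
  type V = Long

  def run(edges: Seq[(V, V)], sources: Seq[V],
          numIter: Int, resetProb: Double = 0.15): Map[V, Map[V, Double]] = {
    val vertices = (edges.flatMap { case (a, b) => Seq(a, b) } ++ sources).distinct
    val outDeg   = edges.groupBy(_._1).map { case (v, es) => v -> es.size.toDouble }

    // Each source starts with all of its mass on itself.
    var ranks: Map[V, Map[V, Double]] =
      vertices.map { v =>
        v -> (if (sources.contains(v)) Map(v -> 1.0) else Map.empty[V, Double])
      }.toMap

    for (_ <- 0 until numIter) {
      // One "message" per edge carries contributions for every source;
      // this is what makes the propagation parallel across sources.
      val contrib = collection.mutable.Map.empty[V, Map[V, Double]]
      for ((src, dst) <- edges; (s, r) <- ranks(src)) {
        val m = contrib.getOrElse(dst, Map.empty[V, Double])
        contrib(dst) = m + (s -> (m.getOrElse(s, 0.0) + r / outDeg(src)))
      }
      ranks = vertices.map { v =>
        val in = contrib.getOrElse(v, Map.empty[V, Double])
        val next = sources.flatMap { s =>
          val score = (if (v == s) resetProb else 0.0) +
            (1 - resetProb) * in.getOrElse(s, 0.0)
          if (score > 0) Some(s -> score) else None // keep the vector sparse
        }.toMap
        v -> next
      }.toMap
    }
    ranks
  }
}
```

Running a single propagation per source instead would repeat the edge pass once per source; batching the scores into one vector is where the speedup for many sources comes from, at the cost of the per-vertex sparse-vector overhead mentioned above.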

@dbtsai
Member

dbtsai commented Nov 4, 2015

Jenkins, test this please

@SparkQA

SparkQA commented Nov 4, 2015

Test build #44990 has finished for PR 9457 at commit f41975e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@moustaki
Author

moustaki commented Nov 4, 2015

Jenkins, test this please

@dbtsai
Member

dbtsai commented Nov 4, 2015

Jenkins, add to whitelist

@SparkQA

SparkQA commented Nov 4, 2015

Test build #45033 has finished for PR 9457 at commit 3605e40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
</dependency>
Member


Since mllib depends on graphx, please remove the breeze dependencies in mllib.

@dbtsai
Member

dbtsai commented Nov 4, 2015

Is the return type Graph[SparseVector[Double], Double] required in this algorithm? Typically, we don't return Breeze types in the public API. However, mllib's sparse vector lives in the mllib package, and graphx cannot depend on mllib without creating a circular dependency.

I'm working on refactoring the basic types out of mllib into a separate package, but this will not happen soon (it will likely land in 1.7).

@moustaki
Author

moustaki commented Nov 7, 2015

Thanks for all the comments @dbtsai. I'll go through your various comments soon. About your last one, what would you recommend for the time being?

@dbtsai
Member

dbtsai commented Nov 8, 2015

@moustaki For the sparse vector issue, we can wait for the change in spark 1.7 since this PR will not be in 1.6. Thanks.

@moustaki
Author

@dbtsai Just went through all your comments. Thanks a lot for the feedback!

@SparkQA

SparkQA commented Nov 24, 2015

Test build #46574 has finished for PR 9457 at commit 8506353.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Nov 24, 2015

@moustaki This PR looks good to me. I'll start working on refactoring the mllib vector out after the 1.6 release, so graphx can use those data structures. Then we can change BSV to the mllib sparse vector implementation. Thanks.

@moustaki
Author

@dbtsai Sounds good - do you want to hold off merging in the meantime?

@dbtsai
Member

dbtsai commented Nov 24, 2015

@moustaki Yes, we need to hold off for now, until the 1.7 window.

@rxin
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

asfgit closed this in 1a33f2e Jun 15, 2016
@moustaki
Author

Sounds good. @dbtsai Let me know when your mllib vector refactor is out and I'll re-open this PR.

@dbtsai
Member

dbtsai commented Jul 4, 2016

Hello @moustaki ,

The work of providing standalone vectors and matrices has been done in SPARK-13944. What you need to do is add a dependency on mllib-local to graphx, following this example: https://github.com/apache/spark/pull/12298/files. That will let you use those types in graphx.

Thanks.
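For illustration, the mllib-local dependency presumably takes a shape like the following in graphx's pom.xml; the exact artifact name and version properties are assumptions based on the linked example, not confirmed by this thread.

```xml
<!-- Assumed dependency fragment; verify artifact name against the linked PR. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
</dependency>
```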

@moustaki
Author

Thanks @dbtsai!

asfgit pushed a commit that referenced this pull request Sep 10, 2016
(Updated version of [PR-9457](#9457), rebased on latest Spark master, and using mllib-local).

This implements a parallel version of personalized pagerank, which runs all propagations for a list of source vertices in parallel.

I ran a few benchmarks on the full [DBpedia](http://dbpedia.org/) graph. When running personalized pagerank for only one source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However for 10 source nodes, the parallel implementation is four times as fast. When increasing the number of source nodes, this difference becomes even greater.

![image](https://cloud.githubusercontent.com/assets/2491/10927702/dd82e4fa-8256-11e5-89a8-4799b407f502.png)

Author: Yves Raimond <[email protected]>

Closes #14998 from moustaki/parallel-ppr.
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
@makalaaneesh

@moustaki Hey, I am currently working on a project that requires running personalized PageRank for several source vertices in parallel. Has this been merged? I can't find official documentation on it.

@moustaki
Author

moustaki commented Mar 3, 2017

@makalaaneesh Yes, it has been superseded by another PR, #14998, which was merged. It's available in Spark 2.1.
