Skip to content

jakemannix/giraph

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Giraph : Large-scale graph processing on Hadoop

Web and online social graphs have been rapidly growing in size and
scale during the past decade.  In 2008, Google estimated that the
number of web pages reached over a trillion.  Online social networking
and email sites, including Yahoo!, Google, Microsoft, Facebook,
LinkedIn, and Twitter, have hundreds of millions of users and are
expected to grow much more in the future.  Processing these graphs
plays a big role in relevant and personalized information for users,
such as results from a search engine or news in an online social
networking site.

Graph processing platforms to run large-scale algorithms (such as page
rank, shared connections, personalization-based popularity, etc.) have
become quite popular.  Some recent examples include Pregel and HaLoop.
For general-purpose big data computation, the map-reduce computing
model has been well adopted and the most deployed map-reduce
infrastructure is Apache Hadoop.  We have implemented a
graph-processing framework that is launched as a typical Hadoop job to
leverage existing Hadoop infrastructure, such as Amazon’s EC2.  Giraph
builds upon the graph-oriented nature of Pregel but additionally adds
fault-tolerance to the coordinator process with the use of ZooKeeper
as its centralized coordination service.

Giraph follows the bulk-synchronous parallel model relative to graphs
where vertices can send messages to other vertices during a given
superstep.  Checkpoints are initiated by the Giraph infrastructure at
user-defined intervals and are used for automatic application restarts
when any worker in the application fails.  Any worker in the
application can act as the application coordinator and one will
automatically take over if the current application coordinator fails.

-------------------------------

Hadoop versions for use with Giraph:

Secure Hadoop versions:
- Apache Hadoop 0.20.203, 0.20.204, other secure versions may work as well
-- Other versions reported working include:
---  Cloudera CDH3u0, CDH3u1

Unsecure Hadoop versions:
- Apache Hadoop 0.20.1, 0.20.2, 0.20.3

Facebook Hadoop release (https://github.com/facebook/hadoop-20-warehouse):
- GitHub master

While we provide support for the unsecure and Facebook versions of Hadoop
with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook',
respectively, we have been primarily focusing on secure Hadoop releases
at this time.

-------------------------------

Building and testing:

You will need the following:
- Java 1.6
- Maven 3 or higher. Giraph uses the munge plugin (http://sonatype.github.com/munge-maven-plugin/),
  which requires Maven 3, to support multiple versions of Hadoop. Also, the
  web site plugin requires Maven 3.

Use the maven commands with secure Hadoop to:
- compile (i.e mvn compile)
- package (i.e. mvn package)
- test (i.e. mvn test)
-- For testing, one can submit the test to a running Hadoop instance
   (i.e. mvn test -Dprop.mapred.job.tracker=localhost:50300)

For the non-secure versions of Hadoop, run the maven commands with the
additional argument '-Dhadoop=non_secure' to enable the maven profile
'hadoop_non_secure'.  An example compilation command is
'mvn -Dhadoop=non_secure compile'.

For the Facebook Hadoop release, run the maven commands with the
additional arguments '-Dhadoop=facebook' to enable the maven profile
'hadoop_facebook' as well as a location for the hadoop core jar file.  An
example compilation command is 'mvn -Dhadoop=facebook
-Dhadoop.jar.path=/tmp/hadoop-0.20.1-core.jar compile'.


Notes: 
Counter limit: In Hadoop 0.20.203.0 onwards, there is a limit on the number of
counters one can use, which is set to 120 by default. This limit restricts the
number of iterations/supersteps possible in Giraph. This limit can be increased
by setting a parameter "mapreduce.job.counters.limit" in job tracker's config
file mapred-site.xml.

About

Mirror of Apache Giraph

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 100.0%