Setting up a mirroring cluster
The Github API rate limit (currently 5,000 requests per hour) makes it impossible to retrieve all data linked from events on a single node. For that reason, GHTorrent was designed to work in multiple phases in a distributed fashion. Depending on the data you want to collect, a cluster setup may be necessary.
A full GHTorrent cluster consists of the following types of nodes:
- Event retrieval nodes: Nodes that query the public Github event API for new events. More than one instance is required, both to ensure that no events are lost due to spikes in event generation and to ensure that machine or network malfunctions on an event collection machine do not affect the service.
- Linked data retrieval nodes: Retrieval of data linked by events is where the Github API imposes the most significant restrictions.
- MongoDB shards: A MongoDB installation can be sharded (have its data spread over multiple nodes) on a per-collection basis. Sharding MongoDB helps with both distributing the storage requirements and faster querying. Sharding is transparent to the application, and therefore no modification is required for GHTorrent to work with a sharded MongoDB. See the MongoDB documentation for more on sharding.
- RabbitMQ active-active mirrors: RabbitMQ can work in cluster mode for high availability. GHTorrent will include basic support for this by reconnecting to a node from a list of active nodes.
- MySQL nodes: MySQL can be configured with master/slave replication. As failover in such clusters is automatic, GHTorrent does not include any special support, at the cost of a single lost transaction (it will be retried, as the transaction failure triggers a requeuing of the message that initiated it).
To set up a GHTorrent cluster, at least 2 nodes are required. With one node, chances are that events will be lost during periods of extreme Github load. The more nodes are available, the more data can be collected and the more resilient the cluster will be to external errors or mirroring script bugs. The following sections assume a Debian-based operating system.
The data retrieval program (ght-data-retrieval) is not too demanding on processing resources. If you have multiple IP addresses at your disposal, you can easily run it on a multihomed host, which will also make administration easier. See the instructions at the end of this guide.
The versions of RabbitMQ and MongoDB included in Debian stable are ancient. Fresh versions need to be installed from the repositories provided by VMware and 10gen, respectively. The MySQL version on Debian stable is adequate for the purposes of GHTorrent.
We describe briefly how each system can be setup. For more advanced scenarios (clustering etc) please refer to each software manual.
- Add MongoDB repository and install MongoDB v > 2.0
sudo echo "deb http://downloads-distro.mongodb.org/repo/debian-sysvinit dist 10gen" >/etc/apt/sources.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
sudo apt-get install mongodb-10gen
- Create a MongoDB ghtorrent account
$ mongo admin
> db.addUser('ghtorrent', 'ghtorrent')
> use github
> db.addUser('ghtorrent', 'ghtorrent')
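As an optional sanity check, the new account should now be able to connect directly to the github database created above:
$ mongo github -u ghtorrent -p ghtorrent
> db.stats()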
- Add RabbitMQ repository and install RabbitMQ v > 2.8
sudo echo "deb http://www.rabbitmq.com/debian/ testing main" > /etc/apt/sources.list
sudo wget http://www.rabbitmq.com/rabbitmq-signing-key-public.asc; apt-key add rabbitmq-signing-key-public.asc
apt-get install -t=testing rabbitmq-server
- Configure a RabbitMQ ghtorrent user
$ rabbitmqctl add_user ghtorrent ghtorrent
$ rabbitmqctl set_permissions -p / ghtorrent ".*" ".*" ".*"
# The following will enable the RabbitMQ web admin for the ghtorrent user
# Not necessary to have, but good to debug and diagnose problems
$ rabbitmq-plugins enable rabbitmq_management
$ rabbitmqctl set_user_tags ghtorrent administrator
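To verify the user and its permissions (optional), list them with rabbitmqctl. With the management plugin enabled, the web interface is also available on the broker host (on the 2.8 series it listens on port 55672; later releases use 15672):
$ rabbitmqctl list_users
$ rabbitmqctl list_permissions -p /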
- Create a MySQL ghtorrent account with localhost access
mysql -u root -p
mysql> create user 'ghtorrent'@'localhost' identified by 'ghtorrent';
mysql> create database ghtorrent;
mysql> GRANT ALL PRIVILEGES ON ghtorrent.* to ghtorrent@'localhost';
mysql> flush privileges;
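A quick way to verify the grants (optional) is to connect as the new user and list the visible databases; ghtorrent should appear in the output:
mysql -u ghtorrent -pghtorrent -e 'show databases;'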
The following steps will create a new GHTorrent cluster node. To ensure that the mirroring operations continue even if there are bugs in the code, we use the supervise program (from D.J. Bernstein's daemontools package), which monitors processes and restarts them in case of errors.
- Install the necessary dependencies
apt-get update
apt-get -y install ruby rubygems git daemontools screen sudo
- Add a user for the mirroring operations.
adduser ghtorrent
- Install ghtorrent
sudo gem install ghtorrent
- Select whether the node will be a data retrieval or event mirror node and set up process supervision:
mkdir ghtorrent
ln -s /var/lib/gems/1.8/bin/ght-mirror-events ghtorrent/run
# or
ln -s /var/lib/gems/1.8/bin/ght-data-retrieval ghtorrent/run
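Note that the /var/lib/gems/1.8/bin path depends on the Ruby and RubyGems versions installed on the node. If the symlink target does not exist, locate the gem executable directory first, for example:
gem environment | grep "EXECUTABLE DIRECTORY"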
- Copy the contents of the config.yaml.tmpl file to a new file (e.g. config.yaml). Configure the connection details for MongoDB and RabbitMQ. You can pass the configuration file as input to either script with the -c option.
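For example, assuming the file was saved as config.yaml in the current directory, either script can be pointed at it as follows (illustrative invocation):
ght-mirror-events -c config.yaml
# or
ght-data-retrieval -c config.yaml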
- If the MongoDB and/or RabbitMQ hosts run in a separate network from the host you are currently configuring (or even on the same network, since MongoDB does not provide SSL-encrypted sockets), you can use ssh port forwarding to set up a secure channel between the local machine and the MongoDB/RabbitMQ host(s). To do so, create a ghtorrent user on the machine where MongoDB/RabbitMQ runs and do the following:
ssh -fN -L 5672:rabbithost:5672 ghtorrent@rabbithost
ssh -fN -L 27017:mongohost:27017 ghtorrent@mongohost
The forwarded ports will remain open after the controlling terminal exits.
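Because the forwards are detached from the terminal, they have to be shut down explicitly when no longer needed, for example by killing the corresponding ssh processes (a sketch using pkill; adjust the patterns to your hosts):
pkill -f "ssh -fN -L 5672"
pkill -f "ssh -fN -L 27017"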
- Start mirroring
supervise ghtorrent &
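The supervised process can be controlled with the svc tool from the same daemontools package, for example to stop it temporarily and bring it back up without killing the supervisor (run from the directory containing the ghtorrent service directory):
svc -d ghtorrent   # send TERM to the process and do not restart it
svc -u ghtorrent   # start it again and restart it if it crashes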
Since version 0.5, the scripts included in the GHTorrent distribution support binding their data retrieval process to a specific IP address through the -a option.
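For example (hypothetical address; use one of the addresses configured on the host):
ght-data-retrieval -c config.yaml -a 192.168.1.11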
To run GHTorrent on a host with multiple IPs, you first need to configure multiple virtual interfaces on the same physical interface. This step is operating system specific and we therefore delegate this task to the reader. On Debian, the instructions are on the Debian administration wiki.
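As a rough sketch of what this looks like on Debian (the interface name, address and netmask are placeholders; consult the documentation mentioned above for your setup), an alias interface can be declared in /etc/network/interfaces and brought up with ifup eth0:0:
auto eth0:0
iface eth0:0 inet static
    address 192.168.1.11
    netmask 255.255.255.0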
You also need to configure external programs in the same way as described above. Then:
- Create a directory that will contain the run directories
- Create a top level config.yaml file
- Create a file named ips.txt listing the configured IP addresses, one per line. Be careful to list the addresses themselves, not the host names.
- Set the config file and top-level directory paths in the following script and run it. For each IP address, the script creates a directory in the top-level directory, creates a small script required for process supervision, and then starts the mirroring process.
#!/usr/bin/env bash
config=/home/gousiosg/config.yaml
mirror=/home/gousiosg/github-mirror

mkdir -p "$mirror"

while read -r IP; do
  # Create a working directory for this IP address
  mkdir -p "$mirror/$IP"
  cd "$mirror/$IP"

  # Create the script that supervise will run for this address
  echo '#!/bin/bash' > mirror.sh
  echo "ght-data-retrieval -c $config -a $IP" >> mirror.sh
  chmod +x mirror.sh
  ln -sf mirror.sh run

  # Start supervising the mirroring process, logging per address
  cd "$mirror"
  supervise "$IP" >> "$IP/log.txt" 2>&1 &
done < ips.txt
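After the script has run, you can check that a supervise instance and a retrieval process came up for each address, and follow the per-address logs (paths assume the directories from the script above):
ps ax | grep [s]upervise
tail -f /home/gousiosg/github-mirror/*/log.txt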
- To stop all processes, run:
killall supervise
killall ruby
To stop individual processes, use standard Unix process management. Killing a process without also killing its supervise process effectively restarts it, since supervise will respawn it. GHTorrent is (hopefully) designed to tolerate process failures.
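For example, to stop the mirror bound to a single address (hypothetical address below), first stop its supervise instance and then the retrieval process it was supervising:
pkill -f "supervise 192.168.1.11"
pkill -f "ght-data-retrieval.*192.168.1.11"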