Configuration
The LDBC data generator uses Apache Hadoop 3.2.1.
To install Hadoop, download hadoop-3.2.1.tar.gz and extract it in your home folder ~
(this example uses the home folder, but you can choose any folder that fits your needs):
cd ~
wget http://archive.apache.org/dist/hadoop/core/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
This will create a directory named hadoop-3.2.1 in your home folder.
Hadoop can be configured to run in three modes: Standalone, Pseudo-Distributed and Distributed. By default, Hadoop runs in Standalone mode, which allows at most one reducer per job. This works well for generating small data sets in a local environment. To configure and start Hadoop in Pseudo-Distributed or Distributed mode, please visit the Single Node Cluster and the Cluster Setup pages, respectively.
-
Fine-tune the logger. We found that setting log levels in mapred-conf.xml does not yield the expected result, but there are two ways that work:
- Use an environment variable:
# reduce clutter in the Hadoop output
export HADOOP_LOGLEVEL=WARN
- Overwrite the $HADOOP_HOME/etc/hadoop/log4j.properties file with the src/main/resources/log4j.properties file.
-
Set the number of threads. For example, to get 8 threads, run:
echo "ldbc.snb.datagen.generator.numThreads:8" >> params.ini
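The second logging option above overwrites a file inside the Hadoop installation, so it may be worth keeping a copy of the original. A minimal sketch, assuming the Datagen repository and the Hadoop folder are passed as arguments; the function name `install_datagen_log4j` and the `.orig` backup suffix are illustrative choices, not part of Datagen or Hadoop:

```shell
# install_datagen_log4j DATAGEN_DIR HADOOP_DIR
# Copies Datagen's log4j.properties over Hadoop's, keeping a one-time backup.
install_datagen_log4j() {
    il_src="$1/src/main/resources/log4j.properties"
    il_dst="$2/etc/hadoop/log4j.properties"
    if [ ! -f "$il_src" ]; then
        echo "missing $il_src" >&2
        return 1
    fi
    # Back up the original only once, so repeated runs do not clobber it.
    if [ -f "$il_dst" ] && [ ! -f "$il_dst.orig" ]; then
        cp "$il_dst" "$il_dst.orig"
    fi
    cp "$il_src" "$il_dst"
}
```

With the folders used on this page, it could be called from the ldbc_snb_datagen directory as `install_datagen_log4j . ~/hadoop-3.2.1`.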
Datagen is mainly configured through the params.ini file in the ldbc_snb_datagen directory.
You can set multiple options as listed in Advanced Configuration.
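Appending with `>>` more than once leaves several lines for the same key in params.ini, and this page does not state how Datagen resolves duplicates. The sketch below keeps exactly one line per key. The helper name `set_param` is hypothetical; the `key:value` line format is the one used throughout this page.

```shell
# set_param FILE KEY VALUE
# Replaces any existing "KEY:..." line in FILE, or adds one if none exists.
set_param() {
    sp_file=$1
    sp_key=$2
    sp_value=$3
    touch "$sp_file"
    # Keep every line that does not start with "KEY:" (literal match, no regex).
    awk -v k="$sp_key" 'index($0, k ":") != 1' "$sp_file" > "$sp_file.tmp"
    printf '%s:%s\n' "$sp_key" "$sp_value" >> "$sp_file.tmp"
    mv "$sp_file.tmp" "$sp_file"
}
```

For example, `set_param params.ini ldbc.snb.datagen.generator.numThreads 8` can be run repeatedly without growing the file.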
We provide a run.sh
script to ease the execution of Hadoop. The following variables are used to configure the script:
-
HADOOP_HOME: points to where Hadoop was installed. Following our example, this folder is ~/hadoop-3.2.1.
-
LDBC_SNB_DATAGEN_HOME: points to the LDBC data generator folder.
-
PARAM_GENERATION [deprecated]: whether the parameters for SNB queries are generated. Only use it with a standard scaleFactor (e.g., SF 1). Always disable PARAM_GENERATION when using the data generator with non-standard input parameters (e.g., when you set numYears instead of using scaleFactor).
Example configuration (you might want to save these in the .bashrc
file):
export HADOOP_HOME=~/hadoop-3.2.1
# optional configurations
export LDBC_SNB_DATAGEN_HOME=`pwd` # set to the ldbc_snb_datagen repo's location
export HADOOP_CLIENT_OPTS="-Xmx2G" # increase for sizes above SF1
Finally, open ~/hadoop-3.2.1/etc/hadoop/hadoop-env.sh
and set JAVA_HOME
to point to your JDK folder.
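Before running Datagen, the settings above can be sanity-checked. This is a rough sketch under a few assumptions: `check_datagen_env` is a hypothetical helper name, params.ini is expected directly in `LDBC_SNB_DATAGEN_HOME`, and JAVA_HOME counts as configured if hadoop-env.sh contains an uncommented `JAVA_HOME=` assignment.

```shell
# check_datagen_env: reports missing pieces of the setup, returns 1 if any.
check_datagen_env() {
    ce_status=0
    if [ ! -x "${HADOOP_HOME:-}/bin/hadoop" ]; then
        echo "HADOOP_HOME has no bin/hadoop: '${HADOOP_HOME:-}'" >&2
        ce_status=1
    fi
    if [ ! -f "${LDBC_SNB_DATAGEN_HOME:-}/params.ini" ]; then
        echo "no params.ini in LDBC_SNB_DATAGEN_HOME: '${LDBC_SNB_DATAGEN_HOME:-}'" >&2
        ce_status=1
    fi
    # Look for an uncommented JAVA_HOME assignment in hadoop-env.sh.
    ce_env="${HADOOP_HOME:-}/etc/hadoop/hadoop-env.sh"
    if ! grep -Eq '^[[:space:]]*(export[[:space:]]+)?JAVA_HOME=' "$ce_env" 2>/dev/null; then
        echo "JAVA_HOME is not set in $ce_env" >&2
        ce_status=1
    fi
    return "$ce_status"
}
```

Calling `check_datagen_env` after the `export` lines prints one message per problem it finds and stays silent when all three checks pass.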
To make sure the Hadoop job does not run out of memory, increase the heap size (-Xmx
) to a sufficient value (see the Troubleshooting page for details).
If you'd like to skip parameter generation, add the following line in the Datagen configuration (params.ini
):
ldbc.snb.datagen.parametergenerator.parameters:false
The LDBC data generator is configured through the params.ini file, located in the root folder of the LDBC data generator. Set the parameters to meet your needs. There are two ways to configure the size of the generated data: by setting the scale factor, or by setting the number of persons, the start year, and the number of years that the generated data spans. The params.ini file contains the following options:
-
ldbc.snb.datagen.generator.numPersons
- default: 10000
- description: The number of persons to generate.
-
ldbc.snb.datagen.generator.numYears
- default: 3
- description: The number of years of activity.
-
ldbc.snb.datagen.generator.startYear
- default: 2010
- description: The start year of the simulation.
-
ldbc.snb.datagen.serializer.personSerializer
- description: The class to serialize the persons and knows relationships of the network.
- other options:
-
ldbc.snb.datagen.serializer.invariantSerializer
- description: The class to serialize the invariant part of the network.
- options:
-
ldbc.snb.datagen.serializer.personActivitySerializer
- description: The class to serialize the persons' activity.
- options:
-
ldbc.snb.datagen.serializer.compressed
- default: false
- description: Specifies whether to compress the output data with gzip.
-
ldbc.snb.datagen.serializer.outputDir
- default: ./
- description: Specifies the folder to output the data.
-
ldbc.snb.datagen.serializer.updateStreams
- default: false
- description: Specifies whether to generate the update streams of the network. If set to false, the update portion of the network is output as static data.
-
ldbc.snb.datagen.serializer.activity
- default: true
- description: Specifies whether to generate persons' activity. The value is true by default and for the SNB scale factors, and false for the Graphalytics scale factors.
-
ldbc.snb.datagen.generator.numThreads
- default: 1
- description: Sets the number of threads to use. Only works in pseudo-distributed mode; see Section "Setup Hadoop".
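To see which value an option would take, params.ini can be read with a fallback to the defaults listed above. A minimal sketch, assuming one `key:value` pair per line and that the last uncommented occurrence of a key is the relevant one (an assumption, not documented behaviour); `get_param` is a hypothetical helper name.

```shell
# get_param FILE KEY DEFAULT
# Prints the value of KEY in FILE, or DEFAULT if KEY is absent or commented out.
get_param() {
    awk -v k="$2" -v d="$3" '
        # Lines starting with "#" never match, because they do not begin with KEY.
        index($0, k ":") == 1 { v = substr($0, length(k) + 2); found = 1 }
        END { print (found ? v : d) }
    ' "$1"
}
```

For example, `get_param params.ini ldbc.snb.datagen.generator.numYears 3` prints the configured number of years, or 3 if the option is not set.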
Besides these parameters, Datagen supports scale factors: predefined configurations (of numPersons, startYear, numYears, degreeDistribution, etc.) that generate data at different scales for specific benchmarks such as LDBC SNB or Graphalytics. The semantics of a scale factor depend on the benchmark it belongs to. Currently, the following scale factors are defined:
snb.interactive.0.1
snb.interactive.0.3
snb.interactive.1
snb.interactive.3
snb.interactive.10
snb.interactive.30
snb.interactive.100
snb.interactive.300
snb.interactive.1000
graphalytics.1
graphalytics.3
graphalytics.10
graphalytics.30
graphalytics.100
graphalytics.300
graphalytics.1000
graphalytics.3000
graphalytics.10000
graphalytics.30000
These scale factors are set through the ldbc.snb.datagen.generator.scaleFactor option. Scale factors are loaded at the beginning of params.ini parsing. To avoid conflicts, comment out (with "#") any other options that affect the amount of generated data. If the scale factor is set together with the number of persons, the start year or the number of years, the explicitly set options take priority over the scale factor.
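Because explicitly set size options take priority over the scale factor, a stray numPersons line can silently change the size of the output. The following sketch flags that situation; `check_scale_conflict` is a hypothetical helper name, and the check only covers the three size options named on this page.

```shell
# check_scale_conflict FILE
# Returns 1 and lists the offending lines if FILE sets a scale factor
# together with uncommented numPersons, numYears or startYear options.
check_scale_conflict() {
    cs_file=$1
    cs_prefix='ldbc\.snb\.datagen\.generator'
    if ! grep -q "^$cs_prefix\.scaleFactor:" "$cs_file"; then
        return 0   # no scale factor set, nothing to conflict with
    fi
    cs_hits=$(grep -E "^$cs_prefix\.(numPersons|numYears|startYear):" "$cs_file" || true)
    if [ -n "$cs_hits" ]; then
        echo "These options take priority over the scale factor; comment them out with '#':" >&2
        echo "$cs_hits" >&2
        return 1
    fi
    return 0
}
```

Running `check_scale_conflict params.ini` before starting a long generation job gives an early warning if the two configuration styles are mixed.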
An example of a configuration file (by number of persons, start year and number of years):
ldbc.snb.datagen.generator.numPersons:100000
ldbc.snb.datagen.generator.numYears:3
ldbc.snb.datagen.generator.startYear:2010
ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer
ldbc.snb.datagen.generator.numThreads:1
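For comparison, a configuration by scale factor might look like the sketch below. It reuses the serializer classes from the example above and one of the scale factor names listed earlier. The file name `params-sf1.ini` is a placeholder chosen here so that an existing params.ini is not overwritten.

```shell
# Write a scale-factor-based configuration to a scratch file.
# PARAMS_FILE and its default name are illustrative, not part of Datagen.
PARAMS_FILE="${PARAMS_FILE:-params-sf1.ini}"
cat > "$PARAMS_FILE" <<'EOF'
# numPersons, numYears and startYear are left out so they cannot override the scale factor
ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1
ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer
ldbc.snb.datagen.generator.numThreads:1
EOF
```

After reviewing the result, copy it over params.ini in the ldbc_snb_datagen directory.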
For the LDBC SNB Interactive and BI workloads, Datagen uses the same configuration parameters and classes (snb.interactive.*
for scale factors and ldbc.snb.datagen.serializer.snb.interactive.*
for serializers).