How to install Hadoop in pseudo-distributed mode
There are three supported modes in which to install Hadoop:
- Local (Standalone) Mode
- Pseudo-Distributed Mode
- Fully-Distributed Mode
We choose to install the pseudo-distributed mode because it is really close to the real distributed mode.
Choose a version: http://hadoop.apache.org/releases.html#Download
We suggest installing a Linux distribution on Oracle VM VirtualBox so you can work virtually in a safe way.
The following steps are just a few hints; this is more a guideline than a set of precise directives.
- Install the Java Development Kit.
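For example, on a Debian or Ubuntu guest (an assumption; the package name varies with the distribution and the Java release):

```bash
sudo apt-get install openjdk-7-jdk   # install a JDK (OpenJDK 7 shown as an example)
java -version                        # verify the installation
```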
- Create a new environment variable called `JAVA_HOME` and add it to your `PATH` variable, as in the sketch below.
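A minimal sketch for `~/.bashrc`, assuming OpenJDK 7 under `/usr/lib/jvm` (the exact path is an assumption; adjust it to your JDK):

```bash
# Point JAVA_HOME at the JDK install and expose its binaries on the PATH
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64   # example path, adjust for your system
export PATH=$PATH:$JAVA_HOME/bin
```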
- Install ssh and create new keys, then enable passwordless connection to `localhost`:
  - `ssh-keygen` (don't set up a passphrase)
  - `ssh-copy-id -i ~/.ssh/id_rsa.pub localhost`
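Concretely, the two commands above plus a login test might look like this (`-P ""` requests an empty passphrase):

```bash
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa    # new key pair, no passphrase
ssh-copy-id -i ~/.ssh/id_rsa.pub localhost  # authorize the key on localhost
ssh localhost                               # should log in without a password prompt
```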
- Download a stable version of Hadoop (preferably a `0.X` or `1.X` release).
- Unpack it in your file system and create a link to it, e.g. `ln -s hadoop-X.X.X hadoop` (see the sketch below).
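As a sketch, taking `1.2.1` as an example `1.X` release and `/opt` as the install location (both assumptions; the URL below points to the Apache release archive):

```bash
cd /opt                               # example install location (may need root)
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar -xzf hadoop-1.2.1.tar.gz          # unpack the release
ln -s hadoop-1.2.1 hadoop             # version-independent link, as suggested above
```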
- Create a `HADOOP_HOME` environment variable with the path of the link above, and add it to your `PATH` variable.
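Again in `~/.bashrc`, assuming the link was created as `/opt/hadoop` (an example path, not a requirement):

```bash
export HADOOP_HOME=/opt/hadoop        # path of the link created above
export PATH=$PATH:$HADOOP_HOME/bin
```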
- Uncomment and modify `export JAVA_HOME=...` in `conf/hadoop-env.sh`.
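The modified line in `conf/hadoop-env.sh` might look like this, reusing the example JDK path from above (an assumption; match it to your system):

```bash
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
```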
- Create a folder for temporary files, for instance `/opt/hadoop/tmp` (see the sketch below); this folder has to contain the subfolders `dfs/name/`. Then add these properties to `conf/core-site.xml` (inside the `<configuration>` element), with `hadoop.tmp.dir` set to the right path of the tmp folder:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme
    and authority determine the FileSystem implementation. The uri's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the host,
    port, etc. for a filesystem.</description>
</property>
```
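The folder and its subfolders can be created in one step, e.g.:

```bash
mkdir -p /opt/hadoop/tmp/dfs/name   # base tmp dir plus the required dfs/name/ subfolders
```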
- Add this property to `conf/hdfs-site.xml`:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if
    replication is not specified in create time.</description>
</property>
```
- Add this property to `conf/mapred-site.xml`:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce
    task.</description>
</property>
```
- Format HDFS with this command: `hadoop namenode -format`
- Now you can start HDFS and MapReduce (`start-dfs.sh` and `start-mapred.sh`).
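Once both scripts have run, `jps` (a small tool shipped with the JDK) is a quick way to confirm that the daemons are up; in pseudo-distributed mode all five should appear:

```bash
jps
# Expected (PIDs will differ): NameNode, DataNode, SecondaryNameNode,
# JobTracker, TaskTracker, plus Jps itself
```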
- A useful online guide: Running Hadoop on Ubuntu Linux (single-node cluster)
- How to set up and configure a single-node Hadoop installation: Apache Hadoop - Single Node Setup
- Instructions primarily for the 0.2x series of Hadoop: Hadoop Wiki - Getting Started With Hadoop