- Prerequisites: install Java, plus `ssh` and `pdsh` if needed.
- Download and extract a stable Hadoop release:

```
wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -xvf hadoop-2.7.7.tar.gz
```
- Add the following to `.bashrc`:

```
export JAVA_HOME=/usr
export HADOOP_HOME=$HOME/hadoop-2.7.7
export HADOOP_CONF_DIR=$HOME/hadoop-2.7.7/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.7.7
export HADOOP_COMMON_HOME=$HOME/hadoop-2.7.7
export HADOOP_HDFS_HOME=$HOME/hadoop-2.7.7
export YARN_HOME=$HOME/hadoop-2.7.7
export PATH=$PATH:$HOME/hadoop-2.7.7/bin
export HADOOP_CLASSPATH=/usr/lib/jvm/java-openjdk/lib/tools.jar
```
- Apply the changes:

```
source .bashrc
```
- Run `java -version` and `hadoop version` to check that both work.
- Copy the master node's ssh key (create one if there is none) to each slave's authorized keys:

```
ssh-copy-id -i $HOME/.ssh/id_rsa.pub username@slave
```
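If the master does not have a key yet, one can be generated first. A minimal sketch (the empty passphrase is an assumption, chosen so node-to-node logins can run unattended; `username@slave` is a placeholder):

```shell
# Generate an RSA key pair without a passphrase (assumption: passwordless
# logins between cluster nodes are acceptable in this setup).
ssh-keygen -t rsa -b 4096 -N "" -f "$HOME/.ssh/id_rsa"
```

After copying the key with `ssh-copy-id`, `ssh username@slave` should log in without a password prompt.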
- Create the file `etc/hadoop/masters` and add the master's IP to it.
- Add the slave IPs to `etc/hadoop/slaves`.
- Edit `core-site.xml` on both master and slave machines as follows (replace `master-ip` with the master's actual address):

```
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master-ip:9000</value>
  </property>
</configuration>
```
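Note that `fs.default.name` still works but is a deprecated alias in Hadoop 2.x; the equivalent modern key is `fs.defaultFS` (same `master-ip` placeholder):

```
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master-ip:9000</value>
</property>
```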
- Edit `hdfs-site.xml` on the master machine as follows:

```
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/haoranq4/hadoop-2.7.7/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/haoranq4/hadoop-2.7.7/datanode</value>
  </property>
</configuration>
```
- Edit `hdfs-site.xml` on the slave machines as follows:

```
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/haoranq4/hadoop-2.7.7/datanode</value>
  </property>
</configuration>
```
- Copy `mapred-site.xml` from the template `mapred-site.xml.template` in the configuration folder, then edit `mapred-site.xml` on both master and slave machines as follows:

```
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
- Edit `yarn-site.xml` on both master and slave machines as follows (note the shuffle class property name is `yarn.nodemanager.aux-services.mapreduce_shuffle.class`):

```
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```
- Format the namenode (only on the master machine). The `hadoop namenode` form is deprecated in Hadoop 2.x in favor of `hdfs namenode`:

```
hdfs namenode -format
```
- Start all daemons (only on the master machine):

```
[haoranq4@master hadoop-2.7.7]$ ./sbin/start-dfs.sh
[haoranq4@master hadoop-2.7.7]$ ./sbin/start-yarn.sh
```
- Check the running daemons on both master and slave machines with `jps`. On the master machine, you should see something like this:

```
20869 SecondaryNameNode
21206 NodeManager
20553 NameNode
22620 Jps
32093 Server
21069 ResourceManager
20703 DataNode
```

On the slave machine, you should see something like this:

```
1173 Jps
1500 Server
17134 DataNode
1054 NodeManager
```
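Beyond `jps`, the overall cluster state can be checked from the master node. A quick sketch using the standard HDFS and YARN CLIs (both require the daemons started above to be running):

```shell
# Summarize HDFS capacity and list the live datanodes.
hdfs dfsadmin -report

# List the NodeManagers registered with the ResourceManager.
yarn node -list
```

The web UI at `http://master:8088` (the `yarn.resourcemanager.webapp.address` configured above) shows the same information in a browser.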
- Stop all daemons (only on the master machine):

```
[haoranq4@master hadoop-2.7.7]$ ./sbin/stop-dfs.sh
[haoranq4@master hadoop-2.7.7]$ ./sbin/stop-yarn.sh
```
- Run the WordCount example:

```
hadoop fs -mkdir -p /test/input
hadoop fs -put test-files/input-folder /test/input
hadoop fs -ls /test/input
hadoop jar applications/wc-hadoop.jar wordcount /test/input /test/output
hadoop fs -ls /test/output
hadoop fs -get /test/output/part-r-00000 output-folder/output.txt
```
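The result can also be inspected in place, before copying it out of HDFS (assumes the job above completed successfully):

```shell
# Print the first lines of the reducer output directly from HDFS.
hadoop fs -cat /test/output/part-r-00000 | head
```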
- Run the ReverseWebLink example the same way (remove `/test/output` first with `hadoop fs -rm -r /test/output` if it still exists from the previous job, since MapReduce refuses to write to an existing output directory):

```
hadoop fs -mkdir -p /test/input
hadoop fs -put test-files/input-folder /test/input
hadoop fs -ls /test/input
hadoop jar applications/rwlg-hadoop.jar ReverseWebLink /test/input /test/output
hadoop fs -ls /test/output
hadoop fs -get /test/output/part-r-00000 output-folder/output.txt
```
- To build your own jar, compile the source and package the classes (this relies on `HADOOP_CLASSPATH` pointing at `tools.jar`, as set in `.bashrc` above):

```
bin/hadoop com.sun.tools.javac.Main ReverseWebLink.java   # produces ReverseWebLink*.class
jar cf rwlg.jar ReverseWebLink*.class                     # produces rwlg.jar
```

Follow the above instructions to run your own `.jar` applications on Hadoop!