# Installing RHadoop on RHEL
For a high-level description of each package, refer to the table here.
Officially supported platforms:
- Red Hat® Enterprise Linux® 6.5 (64-bit)
Supported Hadoop clusters:
- Cloudera CDH 5
- Hortonworks HDP 2.1
The following table specifies where each package should be installed in your Hadoop cluster.
| Package | Where to Install |
|---------|------------------|
| plyrmr  | On every node in the cluster |
| ravro   | Only on the node that runs the R client |
| rhbase  | Only on the node that runs the R client |
| rhdfs   | Only on the node that runs the R client |
| rmr2    | On every node in the cluster |
The RHadoop packages can be installed either manually or via a shell script. Both methods are described in this section. However, the commands listed in the shell script are for guidance only and should be adapted to the standards of your IT department.
The following instructions are for installing and configuring `rmr2`.

On every node in the cluster, do the following:

- Download the dependent R packages for `rmr2`. Check the values of the `Depends:` and `Imports:` lines in the package DESCRIPTION file for the most up-to-date list of dependencies. The suggested package `quickcheck` is needed only for testing; a link to it can be found on its repo.
- Install `rmr2` and its dependent R packages (see the sketch after this list).
- Update the environment variables needed by `rmr2`. Their values depend on your Hadoop distribution.

  **Important!** These environment variables only need to be set on the nodes that invoke the `rmr2` MapReduce jobs, such as an `Edge` node. If you don't know which nodes will be used, set these variables on each node. It is also recommended to add these environment variables to the file `/etc/profile` so that they will be available to all users.

  - `HADOOP_CMD`: The complete path to the "hadoop" executable. For example: `export HADOOP_CMD=/usr/bin/hadoop`
  - `HADOOP_STREAMING`: The complete path to the Hadoop Streaming jar file. For example: `export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar`
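The following shell sketch illustrates these steps on one node. It is guidance only: the dependency list should be confirmed against the `rmr2` DESCRIPTION file, and the tarball name and Hadoop paths are assumptions that must match your environment.

```
# Sketch only: confirm dependency names against rmr2's DESCRIPTION file,
# and adjust tarball names and Hadoop paths to your distribution.
Rscript -e 'install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional",
                               "reshape2", "stringr", "plyr", "caTools"),
                             repos = "https://cran.r-project.org")'

# Install the rmr2 source package (tarball name is an example).
R CMD INSTALL rmr2_3.2.0.tar.gz

# Make the Hadoop variables available to all users (requires root; paths depend on your distribution).
cat >> /etc/profile <<'EOF'
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
EOF
```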
The following instructions are for installing and configuring `plyrmr`.

On every node in the cluster, do the following:

- Download the dependent R packages for `plyrmr`. Check the values of the `Depends:` and `Imports:` lines in the package DESCRIPTION file for the most up-to-date list of dependencies.
- Install `plyrmr` and its dependent R packages (see the sketch after this list).
- Update the environment variables needed by `plyrmr`. Their values depend on your Hadoop distribution.

  **Important!** These environment variables only need to be set on the nodes that invoke the `rmr2` MapReduce jobs, such as an `Edge` node. If you don't know which nodes will be used, set these variables on each node. It is also recommended to add these environment variables to the file `/etc/profile` so that they will be available to all users.

  - `HADOOP_CMD`: The complete path to the "hadoop" executable. For example: `export HADOOP_CMD=/usr/bin/hadoop`
  - `HADOOP_STREAMING`: The complete path to the Hadoop Streaming jar file. For example: `export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar`
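A corresponding sketch for `plyrmr`. The dependency names shown here are assumptions and must be taken from the package's DESCRIPTION file; the tarball name uses the same `<version>` placeholder style as the rest of this page.

```
# Sketch only: the dependency list is an assumption; check plyrmr's DESCRIPTION file.
Rscript -e 'install.packages(c("dplyr", "Hmisc", "digest", "functional", "reshape2"),
                             repos = "https://cran.r-project.org")'
R CMD INSTALL plyrmr_<version>.tar.gz
# HADOOP_CMD and HADOOP_STREAMING are set as shown in the rmr2 section above.
```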
The following instructions are for installing and configuring `rhdfs`.

On the node that will run the R client, do the following:

- Download and install the rJava R package.

  **Important!** If the installation of rJava fails, you may need to configure R to run properly with Java. First, check that the Java JDK is installed and that the environment variable `JAVA_HOME` points to the Java JDK. To configure R to run with Java, type the command `R CMD javareconf`, and then try installing rJava again.
- Update the environment variable needed by `rhdfs`. Its value depends on your Hadoop distribution.

  `HADOOP_CMD`: The complete path to the "hadoop" executable. For example: `export HADOOP_CMD=/usr/bin/hadoop`

  **Important!** This environment variable only needs to be set on the nodes that use the `rhdfs` package, such as an `Edge` node. It is also recommended to add this environment variable to the file `/etc/profile` so that it will be available to all users.
- Install `rhdfs` only on the node that will run the R client (see the sketch after this list).
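A brief sketch of this section, assuming the rhdfs tarball name used in the test section below and a typical JDK location; adjust both to your system.

```
# Sketch only: JAVA_HOME and the tarball name are assumptions.
export JAVA_HOME=/usr/java/default   # path to your Java JDK
R CMD javareconf                     # configure R's Java settings

Rscript -e 'install.packages("rJava", repos = "https://cran.r-project.org")'
export HADOOP_CMD=/usr/bin/hadoop
R CMD INSTALL rhdfs_1.0.8.tar.gz
```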
The following instructions are for installing and configuring `rhbase`.

On the node that will run the R client, do the following:

- Build and install Apache Thrift. We recommend that you install it on the node containing the HBase Master. See http://thrift.apache.org/ for more details on building and installing Thrift.
- Install the dependencies for Thrift. At the prompt, type:

  `yum -y install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel openssl-devel`

  **Important!** If installing as NON-ROOT, you will need a system administrator to help install these dependencies.
- Unpack the Thrift archive. At the prompt, type:

  `tar -xzf thrift-0.8.0.tar.gz`
- Change directory to the versioned `thrift` directory. At the prompt, type:

  `cd thrift-0.8.0`
- Build the Thrift library. We only need the C++ interface of Thrift, so we build without Ruby or Python. At the prompt, type the following two commands:

  `./configure --without-ruby --without-python`

  `make`
- Install the Thrift library. At the prompt, type:

  `make install`

  **Important!** If installing as NON-ROOT, this command will most likely require root privileges and will have to be executed by your system administrator.
- Create a symbolic link to the Thrift library so it can be loaded by the `rhbase` package. Example of a symbolic link:

  `ln -s /usr/local/lib/libthrift-0.8.0.so /usr/lib64`

  **Important!** If installing as NON-ROOT, you may need a system administrator to execute this command for you.
- Set up the `PKG_CONFIG_PATH` environment variable. At the prompt, type:

  `export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig`
- Install `rhbase` only on the node that will run the R client (see the sketch after this list).
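A short sketch of the final step, using the rhbase tarball name that appears in the test section below; it also checks that pkg-config can find the Thrift library built above.

```
# Sketch only: verify Thrift is visible to pkg-config, then install rhbase.
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig
pkg-config --cflags --libs thrift    # should print Thrift compile and link flags

R CMD INSTALL rhbase_1.2.1.tar.gz    # tarball name taken from the test section below
```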
The following instructions are for installing and configuring `ravro`.

On the node that will run the R client, do the following:

- Download the dependent R packages for `ravro`. Check the values of the `Depends:` and `Imports:` lines in the package DESCRIPTION file for the most up-to-date list of dependencies.
- Install `ravro` and its dependent R packages only on the node that will run the R client (see the sketch after this list).
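A sketch for `ravro` following the same pattern. The dependency names here are assumptions, so confirm them against the DESCRIPTION file; the tarball name uses the `<version>` placeholder.

```
# Sketch only: dependency names are assumptions; check ravro's DESCRIPTION file.
Rscript -e 'install.packages(c("bit64", "rjson"), repos = "https://cran.r-project.org")'
R CMD INSTALL ravro_<version>.tar.gz
```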
There are two sets of tests you should do to verify that your configuration is working.
## First Tests
The first set of tests will check that the installed packages can be loaded and initialized.
- Invoke R. At the prompt, type: `R`
- Load and initialize the `rmr2` package, and execute some simple commands. At the R prompt, type the following commands:

Note: The ">" symbol in the following code is the R prompt and should not be typed.
> library(rmr2)
> from.dfs(to.dfs(1:100))
> from.dfs(mapreduce(to.dfs(1:100)))
If any errors occur:
1. Verify that Revolution R Open / Microsoft R Open is installed on each node in the cluster.
1. Check that `rmr2`, and its dependent packages are installed on each node in the cluster.
1. Make sure that a link to the Rscript executable is on the PATH on each node in the Hadoop cluster.
1. Verify that the user who invoked R has read and write permissions to HDFS.
1. Verify that the `HADOOP_CMD` environment variable is set, exported and its value is the complete path of the “hadoop” executable.
1. Verify that the `HADOOP_STREAMING` environment variable is set, exported and its value is the complete path to the Hadoop Streaming jar file.
1. If you encounter errors like the following, check the `stderr` log file for the job, and resolve any errors reported. The easiest way to find the log files is to use the tracking URL (e.g. `http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011`).
```
12/08/24 21:21:16 INFO streaming.StreamJob: Running job: job_201208162037_0011
12/08/24 21:21:16 INFO streaming.StreamJob: To kill this job, run:
12/08/24 21:21:16 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=<my_ip_address>:8021 -kill job_201208162037_0011
12/08/24 21:21:16 INFO streaming.StreamJob: Tracking URL: http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011
12/08/24 21:21:17 INFO streaming.StreamJob: map 0% reduce 0%
12/08/24 21:21:23 INFO streaming.StreamJob: map 50% reduce 0%
12/08/24 21:21:31 INFO streaming.StreamJob: map 50% reduce 17%
12/08/24 21:21:45 INFO streaming.StreamJob: map 100% reduce 100%
12/08/24 21:21:45 INFO streaming.StreamJob: To kill this job, run:
12/08/24 21:21:45 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=<my_ip_address>:8021 -kill job_201208162037_0011
12/08/24 21:21:45 INFO streaming.StreamJob: Tracking URL: http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011
12/08/24 21:21:45 ERROR streaming.StreamJob: Job not successful. Error: NA
12/08/24 21:21:45 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if (is.list(input)) { :
hadoop streaming failed with error code 1
```
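The checks in the list above can be run quickly from a shell on any node; a rough sketch, assuming the environment variables and paths shown earlier on this page:

```
# Quick sanity checks for the troubleshooting list above.
echo "HADOOP_CMD=$HADOOP_CMD"              # full path to the hadoop executable
echo "HADOOP_STREAMING=$HADOOP_STREAMING"  # full path to the streaming jar
which Rscript                              # Rscript must be on the PATH on every node
$HADOOP_CMD fs -ls /                       # confirms the current user can read HDFS
Rscript -e 'library(rmr2)'                 # confirms rmr2 and its dependencies load
```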
- Load and initialize the `rhdfs` package. At the R prompt, type the following commands:

Note: The ">" symbol in the following code is the R prompt and should not be typed.
> library(rhdfs)
> hdfs.init()
> hdfs.ls("/")
If any errors occur:
- Verify that the `rJava` package is installed, configured, and loaded.
- Verify that the `HADOOP_CMD` environment variable is set and exported, and that its value is the complete path of the "hadoop" executable.
- Load and initialize the `rhbase` package. At the R prompt, type the following commands:

Note: The ">" symbol in the following code is the R prompt and should not be typed.
> library(rhbase)
> hb.init()
> hb.list.tables()
If any errors occur:
- Verify that the Thrift server is running (refer to your Hadoop documentation for more details).
- Verify that the default port for the Thrift server is `9090`. Be sure there is no port conflict with other running processes.
- Check that you are not running the Thrift server in `hsha` or `nonblocking` mode. If necessary, use the `threadpool` command-line parameter to start the server (e.g. `/usr/bin/hbase thrift -threadpool start`); a shell sketch follows this list.
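A rough shell sketch of these checks, assuming the default port and the `hbase` launcher path shown above:

```
# Sketch only: check the default Thrift port, then restart in threadpool mode if needed.
netstat -an | grep 9090                    # is anything listening on the default port?
/usr/bin/hbase thrift -threadpool start &  # start the Thrift server in threadpool mode
```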
## Second Tests
The second set of tests will verify that your configuration is working properly using the standard R mechanism for checking packages.
**Important!** Be aware that running the tests for the `rmr2` package may take a significant amount of time (hours) to complete. If you run the tests for `rmr2`, you will also need the `quickcheck` R package on every node in the cluster.
- Go to the directory where the R package source files (`rmr2`, `rhdfs`, `rhbase`) exist.
- Check each package. An example of the commands for each package:
R CMD check rmr2_3.2.0.tar.gz
R CMD check rhdfs_1.0.8.tar.gz
R CMD check rhbase_1.2.1.tar.gz
If any errors occur, refer to the error verification information described above under First Tests.
Note: Errors referring to a missing `pdflatex` can be ignored, such as:

Error in texi2dvi("Rd2.tex", pdf = (out_ext == "pdf"), quiet = FALSE, : pdflatex is not available Error in running tools::texi2dvi