
Installing RHadoop on RHEL

j-martens edited this page Nov 24, 2015 · 1 revision

Introduction

For a high-level description of each package, refer to the table here.

System Requirements

Officially supported platforms:

  • Red Hat® Enterprise Linux® 6.5 (64-bit)

Supported Hadoop clusters:

  • Cloudera CDH 5
  • Hortonworks HDP 2.1

Where to Install Each RHadoop Package

The following table specifies where each package should be installed in your Hadoop cluster.

| Package | Where to Install |
|---------|------------------|
| plyrmr | On every node in the cluster |
| ravro | Only on the node that runs the R client |
| rhbase | Only on the node that runs the R client |
| rhdfs | Only on the node that runs the R client |
| rmr2 | On every node in the cluster |

Installing on Red Hat

The RHadoop packages can be installed either manually or via a shell script. Both methods are described in this section. However, the commands listed in the shell script are for guidance only and should be adapted to the standards of your IT department.

Installing rmr2

The following instructions describe how to install and configure rmr2.

On every node in the cluster, do the following:

  1. Download rmr2.

  2. Download the R package dependencies for rmr2. Check the Depends: and Imports: lines in the package DESCRIPTION file for the most up-to-date list of dependencies. The suggested quickcheck package is needed only for testing; a link to it can be found on its repo.

  3. Install rmr2 and its dependent R packages.

  4. Update the environment variables needed by rmr2. The values of these variables depend on your Hadoop distribution.

    Important! These environment variables only need to be set on the nodes that are invoking the rmr2 MapReduce jobs, such as an Edge node. If you don’t know which nodes will be used, then set these variables on each node. Also, it is recommended to add these environment variables to the file /etc/profile so that they will be available to all users.

    • HADOOP_CMD: The complete path to the “hadoop” executable. For example:
      export HADOOP_CMD=/usr/bin/hadoop
      
    • HADOOP_STREAMING: The complete path to the Hadoop Streaming jar file. For example:
      export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
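The location of the Hadoop Streaming jar differs between distributions, so the path in the example above may not exist on your cluster. The following is a minimal sketch for locating it; the function name `find_streaming_jar` and the default search root `/usr/lib` are assumptions made for illustration, not part of RHadoop or Hadoop.

```shell
# Hypothetical helper: print the first hadoop-streaming jar found under a directory.
# The default search root (/usr/lib) is an assumption; pass the directory
# that your Hadoop distribution actually uses.
find_streaming_jar() {
    search_root="${1:-/usr/lib}"
    # Errors (for example, unreadable or missing directories) are silenced
    # so that "nothing found" simply prints nothing.
    find "$search_root" -type f -name 'hadoop-streaming*.jar' 2>/dev/null | sort | head -n 1
}
```

If the function prints a path, that path can be used as the value of HADOOP_STREAMING, for example `export HADOOP_STREAMING="$(find_streaming_jar /usr/lib/hadoop)"`. If several versions are installed, check by hand that the selected jar is the one you want.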
      

Installing plyrmr

The following instructions describe how to install and configure plyrmr.

On every node in the cluster, do the following:

  1. Download plyrmr.

  2. Download the dependent R packages for plyrmr. Check the values for the Depends: and Imports: lines in the package DESCRIPTION file for the most up-to-date list of dependencies.

  3. Install plyrmr and its dependent R packages.

  4. Update the environment variables needed by plyrmr. The values of these variables depend on your Hadoop distribution.

    Important! These environment variables only need to be set on the nodes that are invoking the rmr2 MapReduce jobs, such as an Edge node. If you don’t know which nodes will be used, then set these variables on each node. Also, it is recommended to add these environment variables to the file /etc/profile so that they will be available to all users.

    • HADOOP_CMD: The complete path to the “hadoop” executable. For example:
      export HADOOP_CMD=/usr/bin/hadoop
      
    • HADOOP_STREAMING: The complete path to the Hadoop Streaming jar file. For example:
      export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
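To make the variables available to all users, the note above recommends adding them to /etc/profile. One possible way to do that without creating duplicate entries on repeated runs is sketched below; `append_export` is a hypothetical helper name, and writing to /etc/profile will normally require root privileges.

```shell
# Hypothetical helper: append "export NAME=VALUE" to a profile file
# unless exactly that line is already present.
append_export() {
    profile_file="$1"
    var_name="$2"
    var_value="$3"
    line="export ${var_name}=${var_value}"
    # -x matches whole lines only, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$profile_file" 2>/dev/null || printf '%s\n' "$line" >> "$profile_file"
}
```

A call such as `append_export /etc/profile HADOOP_CMD /usr/bin/hadoop` would then add the line once; users pick up the change at their next login.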
      

Installing rhdfs

The following instructions describe how to install and configure rhdfs.

On the node that will run the R client, do the following:

  1. Download and install the rJava R package.

    Important! If the installation of rJava fails, you may need to configure R to run properly with Java. First, confirm that the JDK is installed and that the JAVA_HOME environment variable points to it. Then run the command R CMD javareconf to configure R for Java, and try installing rJava again.

  2. Update the environment variable needed by rhdfs. The value of this variable depends on your Hadoop distribution.

    • HADOOP_CMD: The complete path to the “hadoop” executable. For example:
      export HADOOP_CMD=/usr/bin/hadoop

    Important! This environment variable only needs to be set on the nodes that use the rhdfs package, such as an Edge node. Also, it is recommended to add this environment variable to the file /etc/profile so that it will be available to all users.

  3. Download the rhdfs package.

  4. Install rhdfs only on the node that will run the R client.
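Because a failed rJava installation is often caused by JAVA_HOME pointing at a JRE or at nothing, it can help to check this before reconfiguring R. The sketch below assumes that a JDK is recognizable by the presence of `bin/javac` under JAVA_HOME; the function name `java_home_is_jdk` is hypothetical.

```shell
# Hypothetical helper: succeed only when JAVA_HOME is set and contains
# an executable bin/javac, which indicates a JDK rather than a JRE.
java_home_is_jdk() {
    [ -n "${JAVA_HOME:-}" ] && [ -x "${JAVA_HOME}/bin/javac" ]
}

# Possible usage (commented out because it needs R and usually root privileges):
# if java_home_is_jdk; then R CMD javareconf; else echo "JAVA_HOME does not point to a JDK" >&2; fi
```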

Installing rhbase

The following instructions describe how to install and configure rhbase.

On the node that will run the R client, do the following:

  1. Build and install Apache Thrift. We recommend that you install on the node containing the HBase Master. See http://thrift.apache.org/ for more details on building and installing Thrift.

  2. Install the dependencies for Thrift. At the prompt, type:

    yum -y install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel openssl-devel
    

    Important! If installing as NON-ROOT, then you will need a system administrator to help install these dependencies.

  3. Download the Thrift archive.

  4. Unpack the Thrift archive. At the prompt, type:

    tar -xzf thrift-0.8.0.tar.gz

  5. Change directory to the versioned thrift directory. At the prompt, type:

    cd thrift-0.8.0

  6. Build the Thrift library. We only need the C++ interface of Thrift, so we build without Ruby or Python. At the prompt, type the following two commands:

    ./configure --without-ruby --without-python
    make

  7. Install the Thrift library. At the prompt, type:

    make install

    Important! If installing as NON-ROOT, then this command will most likely require root privileges, and will have to be executed by your system administrator.

  8. Create a symbolic link to the Thrift library so it can be loaded by the rhbase package. Example of symbolic link:

    ln -s /usr/local/lib/libthrift-0.8.0.so /usr/lib64

    Important! If installing as NON-ROOT, then you may need a system administrator to execute this command for you.

  9. Set up the PKG_CONFIG_PATH environment variable. At the prompt, type:

    export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig

  10. Download the rhbase package.

  11. Install rhbase only on the node that will run the R client.
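Before installing rhbase, it may be worth confirming that the PKG_CONFIG_PATH set in the steps above actually leads to Thrift's pkg-config file. The following sketch assumes that the Thrift build installed a file named `thrift.pc` in one of the listed directories; the function name `thrift_pc_on_path` is hypothetical.

```shell
# Hypothetical helper: succeed if a thrift.pc file exists in any directory
# listed in PKG_CONFIG_PATH (entries are separated by ':').
thrift_pc_on_path() {
    printf '%s\n' "${PKG_CONFIG_PATH:-}" | tr ':' '\n' | (
        # The loop runs in an explicit subshell, so "exit" only ends the subshell.
        while IFS= read -r dir; do
            if [ -f "$dir/thrift.pc" ]; then exit 0; fi
        done
        exit 1
    )
}
```

For example, `thrift_pc_on_path || echo "thrift.pc not found on PKG_CONFIG_PATH" >&2` gives an early warning before the rhbase installation attempts to compile against Thrift.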

Installing ravro

The following instructions describe how to install and configure ravro.

On the node that will run the R client, do the following:

  1. Download the ravro package.

  2. Download the dependent R packages for ravro. Check the values for the Depends: and Imports: lines in the package DESCRIPTION file for the most up-to-date list of dependencies.

  3. Install ravro and its dependent R packages only on the node that will run the R client.
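Several steps in this guide ask you to read the Depends: and Imports: lines of a package DESCRIPTION file. A small sketch for printing just those fields is shown below; `list_r_dependencies` is a hypothetical helper, and it assumes the usual DESCRIPTION layout in which a field may continue on following lines that start with whitespace.

```shell
# Hypothetical helper: print the Depends: and Imports: fields of an R package
# DESCRIPTION file, including their whitespace-indented continuation lines.
list_r_dependencies() {
    awk '
        /^(Depends|Imports):/ { keep = 1; print; next }   # start of a wanted field
        /^[ \t]/              { if (keep) print; next }   # continuation of current field
                              { keep = 0 }                # any other field ends it
    ' "$1"
}
```

Given an unpacked package source, a call such as `list_r_dependencies ravro/DESCRIPTION` prints the dependency fields so that the listed packages can be downloaded before installation. The path in this call is only an example.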

Testing Package Configurations

There are two sets of tests you should do to verify that your configuration is working.

First Tests

The first set of tests will check that the installed packages can be loaded and initialized.

  1. Invoke R. At the prompt, type:

    R

  2. Load and initialize the rmr2 package, and execute some simple commands. At the R prompt, type the following commands:
    Note: The “>” symbol in the following code is the R prompt and should not be typed.

    > library(rmr2)
    > from.dfs(to.dfs(1:100))
    > from.dfs(mapreduce(to.dfs(1:100)))

If any errors occur:

  1. Verify that Revolution R Open / Microsoft R Open is installed on each node in the cluster.

  2. Check that `rmr2` and its dependent packages are installed on each node in the cluster.

  3. Make sure that a link to the Rscript executable is in the PATH on each node in the Hadoop cluster.

  4. Verify that the user who invoked R has read and write permissions to HDFS.

  5. Verify that the `HADOOP_CMD` environment variable is set and exported, and that its value is the complete path of the “hadoop” executable.

  6. Verify that the `HADOOP_STREAMING` environment variable is set and exported, and that its value is the complete path to the Hadoop Streaming jar file.

  7. If you encounter errors like the following, check the `stderr` log file for the job and resolve any errors reported. The easiest way to find the log files is to use the tracking URL (for example, `http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011`).

  ```
  12/08/24 21:21:16 INFO streaming.StreamJob: Running job: job_201208162037_0011
  12/08/24 21:21:16 INFO streaming.StreamJob: To kill this job, run:
  12/08/24 21:21:16 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=<my_ip_address>:8021 -kill job_201208162037_0011
  12/08/24 21:21:16 INFO streaming.StreamJob: Tracking URL: http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011
  12/08/24 21:21:17 INFO streaming.StreamJob:  map 0%  reduce 0%
  12/08/24 21:21:23 INFO streaming.StreamJob:  map 50%  reduce 0%
  12/08/24 21:21:31 INFO streaming.StreamJob:  map 50%  reduce 17%
  12/08/24 21:21:45 INFO streaming.StreamJob:  map 100%  reduce 100%
  12/08/24 21:21:45 INFO streaming.StreamJob: To kill this job, run:
  12/08/24 21:21:45 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=<my_ip_address>:8021 -kill job_201208162037_0011
  12/08/24 21:21:45 INFO streaming.StreamJob: Tracking URL: http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011
  12/08/24 21:21:45 ERROR streaming.StreamJob: Job not successful. Error: NA
  12/08/24 21:21:45 INFO streaming.StreamJob: killJob...
  Streaming Command Failed!
  Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if (is.list(input)) { :
  hadoop streaming failed with error code 1
  ```
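Some of the rmr2 checks listed above can be scripted and run on a node before starting R. The following is a sketch under stated assumptions: `check_rmr2_env` is a hypothetical function name, it only inspects the local node, and it does not verify HDFS permissions or whether the variables are exported to child processes.

```shell
# Hypothetical pre-flight check for the local node: reports problems with
# HADOOP_CMD, HADOOP_STREAMING and Rscript on stderr and returns non-zero
# if any check fails.
check_rmr2_env() {
    status=0
    if [ -z "${HADOOP_CMD:-}" ] || [ ! -f "$HADOOP_CMD" ] || [ ! -x "$HADOOP_CMD" ]; then
        echo "HADOOP_CMD is unset or is not an executable file" >&2
        status=1
    fi
    if [ -z "${HADOOP_STREAMING:-}" ] || [ ! -r "$HADOOP_STREAMING" ]; then
        echo "HADOOP_STREAMING is unset or is not a readable file" >&2
        status=1
    fi
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript was not found in PATH" >&2
        status=1
    fi
    return "$status"
}
```

Running `check_rmr2_env && echo "environment looks usable"` on each node that submits jobs narrows down configuration problems before a streaming job is launched.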
  3. Load and initialize the rhdfs package. At the R prompt, type the following commands:
    Note: The “>” symbol in the following code is the R prompt and should not be typed.

    > library(rhdfs)
    > hdfs.init()
    > hdfs.ls("/")

If any errors occur:

  1. Verify that the rJava package is installed, configured and loaded.

  2. Verify that the HADOOP_CMD environment variable is set and exported, and that its value is the complete path of the “hadoop” executable.

  4. Load and initialize the rhbase package. At the R prompt, type the following commands:
    Note: The “>” symbol in the following code is the R prompt and should not be typed.

    > library(rhbase)
    > hb.init()
    > hb.list.tables()

If any errors occur:

  1. Verify that the Thrift Server is running (refer to your Hadoop documentation for more details).

  2. Verify that the default port for the Thrift Server is 9090. Be sure there is not a port conflict with other running processes.

  3. Check that you are not running the Thrift Server in hsha or nonblocking mode. If necessary, use the threadpool command line parameter to start the server (for example, /usr/bin/hbase thrift -threadpool start).

Second Tests

The second set of tests will verify that your configuration is working properly using the standard R mechanism for checking packages.

Important! Be aware that running the tests for the rmr2 package may take a significant time (hours) to complete. If you run the tests for rmr2, then you will need the quickcheck R package on every node in the cluster as well.

  1. Go to the directory that contains the R package source archives (rmr2, rhdfs, rhbase).

  2. Check each package. An example of the commands for each package:

R CMD check rmr2_3.2.0.tar.gz
R CMD check rhdfs_1.0.8.tar.gz
R CMD check rhbase_1.2.1.tar.gz

If any errors occur, refer to the error verification information described above under First Tests.
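Since the checks can run for hours, it may be convenient to run them in one loop and collect the failures. The sketch below is one way to do this; `check_all_packages` and the `R_BIN` override variable are hypothetical names introduced here so that the loop can be dry-run with a substitute command.

```shell
# Hypothetical helper: run "R CMD check" on each tarball given as an argument.
# R_BIN is an assumed override (defaults to R) that allows a dry run,
# for example R_BIN=echo.
check_all_packages() {
    r_bin="${R_BIN:-R}"
    failed=0
    for tarball in "$@"; do
        if ! "$r_bin" CMD check "$tarball"; then
            echo "check failed: $tarball" >&2
            failed=1
        fi
    done
    # Non-zero if at least one package failed its check.
    return "$failed"
}
```

With the archive names from the example above, the call would be `check_all_packages rmr2_3.2.0.tar.gz rhdfs_1.0.8.tar.gz rhbase_1.2.1.tar.gz`. The loop continues after a failure so that every package is checked.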

Note: Errors referring to missing package pdflatex can be ignored, such as: Error in texi2dvi("Rd2.tex", pdf = (out_ext == "pdf"), quiet = FALSE, : pdflatex is not available Error in running tools::texi2dvi
