HDFS ALLUXIO SPARK ZEPPELIN

This repository contains the Docker images and the Openshift code to bring up a complete HASZ (HDFS, Alluxio, Spark, Zeppelin) environment, with a focus on data locality. Openshift 3.4 is supported, with more to come.

Please improve this document if you find errors or missing information. Just git push it! :)

Getting started

The general procedure to bring up the components is:

  • Start minishift as stated in its documentation here
  • If you have an Openshift cluster already installed, log in and select or create a project for the deployment of the components
  • If you're using Docker locally, build the images for each component using the provided Dockerfiles
  • On minishift/Openshift, enter the oc folder and execute oc-minishift.sh or oc-cluster.sh (a sketch of this flow follows the list)
  • If you're using Docker, enter the docker folder and use the scripts of each component to deploy it
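As a hedged sketch of the Openshift path (the project name hasz and the API URL are assumptions, not taken from the repository), the flow looks roughly like this:

    # Log in to the cluster (or start minishift first and let it log you in)
    oc login https://openshift.example.com:8443
    # Select or create a project for the deployment
    oc new-project hasz
    # Run the provided script from the oc folder
    cd oc
    ./oc-minishift.sh      # use ./oc-cluster.sh on a full Openshift cluster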

Also, please read each folder's README for more details on each component.

Local environment: minishift

Because this environment is used for development, image tweaking, etc., it's recommended to be generous with the minishift parameters.

minishift start --vm-driver=kvm --memory 10480 --cpus 4 --disk-size 100g

The Docker images are optimized for development rather than deployment: they build quickly, but require more disk space. The 20g disk of the default VM that minishift brings up becomes full after two or three builds.
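If the VM disk does fill up, it can be checked and old build layers reclaimed from the host. This is only a hedged example; the exact Docker commands available depend on the Docker version shipped with your minishift ISO:

    # Check free space inside the minishift VM
    minishift ssh "df -h"
    # Remove dangling images left over from previous image builds
    minishift ssh "docker rmi \$(docker images -q -f dangling=true)"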

Docker images

The folders called hdfs, alluxio, spark and zeppelin contain the Dockerfiles and boot.sh script for each container.

We tried not to make assumptions about the platform that will run these images, so they should work on a local Docker installation, Kubernetes, Openshift, or any other platform. Of course, the programs inside the images have certain communication and storage requirements, and you MUST tailor them to your needs.

You also need to be familiar with HDFS, Alluxio, Spark and Zeppelin. Do not expect everything to work without first reading how those programs operate in a general way.

The images are also prepared for a graceful shutdown of each component.

Finally, it's important to note that the same image can act as different node types, normally the master or a worker of its software component.

About boot.sh

In general, this script is composed of three sections:

  • environment set up and parameters parsing
  • handlers definition
  • starting code

All the scripts accept a similar general syntax:

boot.sh type action name

Where type is the node type (namenode, master, worker, etc.), action is start, stop, status, etc., and name is the master's name, used either to set up the master itself or to join its cluster.
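As a purely illustrative example (the node type datanode and the master name hdfs-namenode are assumptions; check each component's README for the exact values it accepts):

    # Start a datanode that joins the namenode called "hdfs-namenode"
    ./boot.sh datanode start hdfs-namenode
    # Query its state and shut it down gracefully
    ./boot.sh datanode status hdfs-namenode
    ./boot.sh datanode stop hdfs-namenode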

About dockerfiles

All Dockerfiles are based on the official ubuntu Docker image and contain the minimum commands required to run the software. Please be aware that some tools might be missing, like ip utils, dns utils, etc. If you need to customize an image for debugging purposes, either modify the Dockerfile or compose your own image from one of ours.
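For example, a throwaway debugging image can be composed on top of one of ours; the tag hdfs:latest and the chosen packages below are assumptions:

    # Dockerfile.debug (illustrative): compose a debug image from an existing one
    FROM hdfs:latest
    RUN apt-get update && \
        apt-get install -y --no-install-recommends iproute2 dnsutils curl && \
        rm -rf /var/lib/apt/lists/*
    # build it with: docker build -t hdfs-debug -f Dockerfile.debug .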

About configurations

All components are configured using their corresponding configuration files, which are included in the images at build time. There are no dynamic configuration tools or support for dynamic configuration stores like consul, etcd, S3, etc., so every time a configuration change is needed, a new version of the image must be built.

The Dockerfiles are layered with this in mind, so rebuilding an image after a configuration change is cheap in space and time.
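For instance, after editing a configuration file inside one of the component folders, the rebuild is just (the tag name is illustrative):

    # Layers before the configuration copy come from the build cache,
    # so only the final layers are rebuilt
    docker build -t hdfs:config-v2 hdfs/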

This approach ensures full compatibility with any system, as long as it supports Docker images.

About data locality and pods

Data locality is achieved by naming everything: every component should be reachable by name, and that name also needs to be the hostname of the component.

To achieve this in openshift, several assumptions are made:

  • workers of HDFS, Alluxio and Spark run together in the same pod, sharing hostname and IP address
  • every worker has an Openshift service of its own so it can communicate with the rest of the cluster
  • the Openshift service name and the hostname of the pod must be equal
  • every node that accepts its local name as a parameter should be set up with the FQDN, for example with the output of hostname -f

Data locality is only achieved across all the components when using pods, as all the workers must share a hostname in order to be aware of the data locality.
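A quick way to sanity-check this from inside a worker pod (purely illustrative; no such check ships with the repository):

    # The name the workers register with ...
    WORKER_FQDN="$(hostname -f)"
    echo "worker registers as: ${WORKER_FQDN}"
    # ... must resolve through the Openshift service of the same name
    getent hosts "${WORKER_FQDN}" || echo "hostname and service name do not match"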

TODO: testing in a Docker/Swarm environment needs to be done to clarify the data locality options there.

Dockerfiles

The files for each image are contained in its own folder.

HDFS

Contains the complete Hadoop 2.7.3 distribution.

Documentation specific to this component can be found here.

Alluxio

Contains the complete Alluxio 1.4.0 distribution, uncompressed in /opt/alluxio.

Documentation specific to this component can be found here.

Spark

Contains the complete Spark 2.1.0 distribution with the HDFS 2.7 build.

Documentation specific to this component can be found here.

Zeppelin

Contains the complete Zeppelin 0.7.1 binary distribution with all interpreters (~700MB).

Documentation specific to this component can be found here.

Spark Submitter

Contains the complete Spark 2.1.0 distribution with the HDFS 2.7 build, plus a default configuration and scripts to run Spark jobs on the cluster.

Documentation specific to this component can be found here.

Openshift

The oc folder contains all the Openshift code to bring up the deployments, routers, persistent volumes, etc. for all the components of the system.

Documentation specific to this component can be found here.

Data

This folder contains scripts to download and extend datasets for testing purposes. It also contains two scripts to manipulate the HDFS and Alluxio filesystems from the command line, making use of their HTTP APIs.
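For reference, the HDFS side of those scripts talks to the standard WebHDFS REST API; listing a directory looks roughly like this (host, port and path are placeholders, 50070 being the default namenode HTTP port in Hadoop 2.7):

    # List the HDFS root directory through WebHDFS
    curl -s "http://namenode:50070/webhdfs/v1/?op=LISTSTATUS"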

Documentation specific to this component can be found here.

Benchmarks

This folder contains a synthetic benchmark aimed at checking the performance of the cluster in several scenarios. It is based on the DFSIO benchmark for HDFS, adapted to work in this environment using Spark.

Documentation specific to this component can be found here.

Docker

This folder contains all the scripts used to bring up the whole stack on a Docker installation under Linux.

Documentation specific to this component can be found here.
