This repository contains the Docker images and the OpenShift code to bring up a complete HASZ environment.
Please complete this document if you find errors or missing information. Just git push it! :)
The general procedure to bring up the components is:
- Start minishift as stated in its documentation here
- If you have an OpenShift cluster already installed, log in and select or create a project for the deployment of the components
- If you're using docker locally, build the images for each component using the provided Dockerfiles
- On minishift/OpenShift, enter the oc folder and execute oc-minishift.sh / oc-cluster.sh (see the example after this list)
- If you're using docker, enter the docker folder and use each component's scripts to deploy it.
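For example, on an already installed OpenShift cluster the flow could look like the sketch below; the cluster URL and project name are just examples, and the oc folder README describes any arguments the scripts may expect:

oc login https://my-cluster:8443
oc new-project hasz
cd oc && ./oc-cluster.sh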
Also, please read each folder's README to get more details on each component.
Because this is used for development, tweaking images, etc., it's recommended to be generous with the minishift parameters:
minishift start --vm-driver=kvm --memory 10480 --cpus 4 --disk-size 100g
Docker images are optimized for development, not for deployment; that is, images build fast but require more disk space. The 20g disk of the default VM that minishift brings up becomes full in two or three builds.
The folders called hdfs, alluxio, spark and zeppelin contain the Dockerfiles and the boot.sh script for each container.
We tried not to make assumptions about the platform that will run these images, so they should work in a local docker installation, kubernetes, OpenShift or whatever. Of course, the programs inside the images have certain communication and storage requirements, and you MUST tailor them to your needs.
You also need to be familiar with HDFS, Alluxio, Spark and Zeppelin. Do not expect everything to work without first reading how those programs operate in a general way.
The images are also prepared for a graceful shutdown of each component.
Finally, it's important to note that an image can act as different node types of its software component, normally master or worker.
In general, each boot.sh script is composed of three sections (see the sketch after this list):
- environment set up and parameters parsing
- handlers definition
- starting code
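As a rough illustration only, and not the actual code of any of the scripts, a boot.sh following this layout could look like this; the handler bodies, node types and names are placeholders:

#!/bin/bash
# environment set up and parameters parsing
TYPE="$1"    # node type: namenode, master, worker, ...
ACTION="$2"  # start, stop, status, ...
NAME="$3"    # master's name, ideally an FQDN

# handlers definition
start_node() { echo "starting $TYPE against master $NAME"; }  # placeholder
stop_node()  { echo "stopping $TYPE"; }                       # placeholder
trap stop_node TERM INT  # enables the graceful shutdown mentioned above

# starting code
case "$ACTION" in
  start)  start_node ;;
  stop)   stop_node ;;
  *)      echo "usage: boot.sh type action name"; exit 1 ;;
esac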
All the scripts accept a similar general syntax:
boot.sh type action name
Where type is the node type (namenode, master, worker, etc.), action is start, stop, status, etc., and name is used as the master's name, either for set up or to join the cluster.
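For example, some hypothetical invocations (the master name is just an example FQDN; check each component's README for the exact node types it supports):

./boot.sh namenode start hdfs-master.my-project.svc
./boot.sh worker start hdfs-master.my-project.svc
./boot.sh worker status hdfs-master.my-project.svc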
All Dockerfiles are based on the official ubuntu docker image and contain the minimum commands required to run the software. Please be aware that some tools might be missing, like ip utils, dns utils, etc. If you need to customize an image for debugging purposes, either modify the Dockerfile or compose your own image from one of ours.
All components are configured using their corresponding configuration files, which are included in the images at build time. There are no dynamic configuration tools or support for dynamic configuration stores like consul, etcd, S3, etc., so every time a config change is needed, a new version of the image must be built.
Dockerfiles are layered with this in mind, so rebuilding an image with a config change is cheap in space and time.
This approach ensures full compatibility with any system, as long as it supports docker images.
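In practice the rebuild cycle after a config change is just an edit plus a docker build; the image name, tag and config file below are only examples, and the point is that docker's layer cache keeps the expensive base and install layers untouched:

vi hdfs/conf/core-site.xml            # edit a configuration file baked into the image (path is an example)
docker build -t hasz/hdfs:dev hdfs/   # only the layers from the config copy onwards are rebuilt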
Data locality is achieved by naming everything. All the components should be reachable by name, and that name also needs to be the hostname of the component.
To achieve this in OpenShift, several assumptions are made:
- workers of hdfs, alluxio and spark run together in the same pod, sharing hostname and IP address
- all workers have an OpenShift service of their own to be able to communicate with the rest of the cluster
- the OpenShift service name and the hostname of the pod must be equal
- all the nodes that accept their local name as a parameter should be set up using the FQDN, for example the output of hostname -f
Data locality between all the components is only achieved when using pods, as all the workers must share a hostname in order to be aware of the data locality.
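A quick way to verify the naming convention on OpenShift (the pod and service names below are just examples) is to compare the FQDN reported inside a worker pod with its service name:

oc rsh hasz-worker-1 hostname -f
oc get svc hasz-worker-1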
TODO: Testing in a docker/swarm environment needs to be done to clarify the data locality options there.
The files for each image are contained in their own folder.
Contains the complete hadoop 2.7.3 distribution.
Documentation specific to this component can be found here.
Contains the complete alluxio 1.4.0 distribution. It is uncompressed into /opt/alluxio.
Documentation specific to this component can be found here.
Contains the complete spark 2.1.0 distribution built for hdfs 2.7.
Documentation specific to this component can be found here.
Contains the complete zeppelin 0.7.1 binary distribution with all interpreters (~700MB).
Documentation specific to this component can be found here.
Contains the complete spark 2.1.0 distribution built for hdfs 2.7, with a default configuration and scripts to run spark jobs on the cluster.
Documentation specific to this component can be found here.
The folder oc contains all the OpenShift code to bring up the deployments, routers, persistent volumes, etc., for all the components of the system.
Documentation specific to this component can be found here.
This folder contains scripts to download and extend datasets for test purposes. It also contains two scripts to manipulate the HDFS and Alluxio filesystems from the command line, making use of their HTTP APIs.
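As a reference for the kind of call those scripts make, listing a directory through the WebHDFS REST API looks like this (the namenode host and the default hadoop 2.7 WebHDFS port are assumptions; adjust them to your deployment):

curl -s "http://namenode:50070/webhdfs/v1/tmp?op=LISTSTATUS"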
Documentation specific to this component can be found here.
This folder contains a synthetic benchmark aimed at checking the performance of the cluster in several scenarios. It's based on the DFSIO benchmark for HDFS, adapted to work in this environment using Spark.
Documentation specific to this component can be found here.
This folder contains all the scripts used to bring up the whole thing in a docker installation under linux.
Documentation specific to this component can be found here.