- License
- Objective
- Features
- Demo videos
- Data Ingestion
- Workflow Builder
- Bulk Data Manufacturing
- Web Crawler
- Architecture
- Installation
- Operational Metadata Management System
- How To Contribute
Released under the Apache License, Version 2.0. You can get a copy of the license at http://www.apache.org/licenses/LICENSE-2.0.
Big Data Ready Enterprise (BDRE) makes big data technology adoption simpler by optimizing and integrating various big data solutions and providing them in one integrated package. BDRE provides a unified framework for a Hadoop implementation that can substantially reduce development time and fast-track the implementation. It comprises a reusable framework that can be customized to fit the enterprise ecosystem. The components are loosely integrated and can easily be decoupled or replaced with alternatives.
Big Data implementations require specialized skills and significant development effort on data loading, semantic processing, data quality (DQ), code deployment across environments and so on. The primary goal of BDRE is to accelerate these implementations by supplying the essential frameworks that would otherwise most likely be written from scratch. It can eliminate hundreds of person-hours of operational framework development.
- Operational Metadata Management
  - Registry of all workflow processes/templates
  - Parameters/configuration (key/value) for processes
  - Dependency information (upstream/downstream)
  - Batch management/tracking. The batch concept in BDRE is for tracking the data flow between workflow processes.
  - Run control (for delta processing/dependency check)
  - Execution status for jobs (dynamic metadata, with step-level granularity)
  - File registry, which can be used to register, for example, ingested files or a raw file produced by an upstream process.
  - Execution statistics logging (key/value)
  - Executed Hive queries and data lineage information.
  - Java APIs that integrate with Big Data and non-Big Data applications alike.
  - Job monitoring and proactive/reactive alerting
- Data ingestion framework
  - Tabular data from RDBMS
  - Streaming data from 16 types of sources (including logs, message queues and Twitter)
  - Arbitrary file ingestion by directory monitoring
  - Web Crawler
- Distributed Data Manufacturing framework
  - Generate billions of records based on patterns and ranges
- Semantic Layer Building Framework
  - Build the semantic layer from the data you ingested, using the visual workflow creator.
  - Supports Hive, Pig, MapReduce, Spark, R etc.
  - Generates Oozie workflows
- Data Quality Framework
  - Validates your data against your rules in a distributed way
  - Integrated with the Drools rule engine
- HTML5 User Interface
  - Create ingestion, data generation and crawler jobs, or create Oozie workflows graphically, without writing any code
  - Deploy and execute jobs with one click, without SSH access to the edge node.
Demo video TBD
This section will help you build BDRE from source. The audience for this document is developers and architects who want to be part of BDRE framework development or who just want to evaluate it.
For testing and development, and to save time, use the fully loaded Hadoop VMs from Cloudera or Hortonworks, because all the required software is typically installed and configured.
- A Hadoop cluster
  - In this section we are using Hortonworks Sandbox 2.2.0
- Git 1.9 and up
- Maven 3 and up
- Oracle JDK 7 and up
- BDRE ships with an embedded database, which is adequate for running the UI and for evaluating and testing jobs in a single-node cluster. For production use, BDRE currently supports the following production-scale databases:
  - MySQL Server 5.1 and up
  - Oracle 11g Server or better
  - PostgreSQL
- Google Chrome browser
You should be able to do the same on Mac or Windows, but note that setting up a Hadoop cluster on Windows can be tricky and may require more effort. To deploy and run the jobs, however, we recommend a Linux system. BDRE is typically installed on the Hadoop edge node of a multi-node cluster.
- Download and install VirtualBox from https://www.virtualbox.org/
- Download and install the Hortonworks Sandbox 2.2 VirtualBox image from http://hortonworks.com/products/releases/hdp-2-2/#install
- Set up a 'Host-Only Adapter' for the network to enable communication between the host and guest OS.
- Now ssh into the sandbox using root@VM_IP (password hadoop)
  - The VM_IP is usually between 192.168.56.101 and 192.168.56.109
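If the sandbox address is not known, the range mentioned above can be probed from the host. The following is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils `timeout` command are available on the host; the function names `candidate_ips`, `ssh_port_open` and `find_sandbox_ip` are made up for this illustration.

```shell
# Print every candidate sandbox address in the host-only range
# 192.168.56.101 - 192.168.56.109, one per line.
candidate_ips() {
  local last
  for last in $(seq 101 109); do
    echo "192.168.56.${last}"
  done
}

# Succeed if TCP port 22 (SSH) on the given address accepts a
# connection within one second. Uses bash's /dev/tcp pseudo-device.
ssh_port_open() {
  timeout 1 bash -c "exec 3<>/dev/tcp/$1/22" 2>/dev/null
}

# Print the first candidate address that answers on the SSH port.
# Returns non-zero if none of them does.
find_sandbox_ip() {
  local ip
  for ip in $(candidate_ips); do
    if ssh_port_open "$ip"; then
      echo "$ip"
      return 0
    fi
  done
  return 1
}
```

Calling `find_sandbox_ip` on the host prints the first address that answers, which can then be used as VM_IP in `ssh root@VM_IP`.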
- Now create the openbdre user account.

      [root@sandbox ~]# adduser -m -s /bin/bash openbdre
      [root@sandbox ~]# passwd openbdre
      Changing password for user openbdre.
      New password:
      Retype new password:
      passwd: all authentication tokens updated successfully.

- As root, edit /etc/sudoers and allow openbdre to perform sudo. The command below will do it:

      echo "openbdre ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
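Appending to /etc/sudoers directly leaves no safety net if the line is mistyped. A more defensive variant is sketched here, under the assumption that the sandbox's sudoers file includes the /etc/sudoers.d directory (common on CentOS-based images, but not verified for this VM). It writes the same rule to a drop-in file only after `visudo -c` has accepted it. The helper names `sudoers_rule` and `install_sudoers_rule` are hypothetical.

```shell
# Build the same sudoers rule the step above appends, for a given user.
sudoers_rule() {
  echo "$1 ALL=(ALL) NOPASSWD:ALL"
}

# Write the rule to /etc/sudoers.d/<user> after a syntax check.
# Must be run as root. Assumes /etc/sudoers includes /etc/sudoers.d.
install_sudoers_rule() {
  local user="$1" target="/etc/sudoers.d/$1" tmp status=0
  tmp="$(mktemp)" || return 1
  sudoers_rule "$user" > "$tmp"
  # visudo -c -f only parses the given file; it does not modify anything.
  if visudo -c -f "$tmp" >/dev/null; then
    install -m 0440 -o root -g root "$tmp" "$target" || status=1
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}
```

As root, `install_sudoers_rule openbdre` would then take the place of the `echo ... >> /etc/sudoers` command.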
- Log in to the HDP Sandbox with the newly created openbdre user. You can run su openbdre to switch to this account. Make sure you are not the root user beyond this point.

      [root@sandbox ~]# su openbdre
      [openbdre@sandbox root]$ cd ~
      [openbdre@sandbox ~]$

- Download Maven from a mirror, unpack it and add it to the PATH.

      [openbdre@sandbox ~]$ wget http://www.us.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.zip
      [openbdre@sandbox ~]$ unzip apache-maven-3.3.9-bin.zip
      [openbdre@sandbox ~]$ export PATH=$PATH:/home/openbdre/apache-maven-3.3.9/bin
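The export above only lasts for the current shell session. One way to keep Maven on the PATH across logins is to add the same line to the user's shell startup file. This is a sketch that assumes the openbdre account uses bash and reads ~/.bashrc; `persist_path_entry` is a hypothetical helper name.

```shell
# Append "export PATH=$PATH:<dir>" to a startup file, unless that
# exact line is already there. The startup file defaults to ~/.bashrc.
persist_path_entry() {
  local dir="$1" rcfile="${2:-$HOME/.bashrc}"
  local line="export PATH=\$PATH:${dir}"
  touch "$rcfile"
  # -x matches whole lines, -F treats the line as a fixed string.
  grep -qxF -- "$line" "$rcfile" || echo "$line" >> "$rcfile"
}
```

For example, `persist_path_entry /home/openbdre/apache-maven-3.3.9/bin` adds the entry once; running it again leaves the file unchanged.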
- Obtain the source code
  - cd to the home directory of openbdre.

        [openbdre@sandbox ~]$ cd ~

  - Pull the BDRE source from this git repository. To find your repository link, navigate to the repository on this website and copy the HTTPS repo URL.

        [openbdre@sandbox ~]$ git clone -b predevelop https://github.com/WiproOpenSourcePractice/openbdre.git

  - cd to the cloned source directory (so that you are in /home/openbdre/openbdre)

        [openbdre@sandbox ~]$ cd openbdre
- Database Setup
  - Execute the dbsetup.sh script without any parameters as shown below. In this example we use MySQL as the BDRE backend because it is already available in the HDP Sandbox. If you would like to use another database, select it accordingly.

        [openbdre@sandbox openbdre]$ sh dbsetup.sh⏎
        Supported DB
        1) Embedded (Default - Good for running BDRE user interface only. )
        2) Oracle
        3) MySQL
        4) PostgreSQL
        Select Database Type(Enter 1, 2, 3 , 4 or leave empty and press empty to select the default DB):3⏎
        Enter DB username (Type username or leave it blank for default 'root'):⏎
        Enter DB password (Type password or leave it blank for default '<blank>'):⏎
        Enter DB hostname (Type db hostname or leave it blank for default 'localhost'):⏎
        Enter DB port (Type db port or leave it blank for default '3306'):⏎
        Enter DB name (Type db name or leave it blank for default 'bdre'):⏎
        Enter DB schema (Type schema or leave it blank for default 'bdre'):⏎
        Please confirm:
        Database Type: mysql
        JDBC Driver Class: com.mysql.jdbc.Driver
        JDBC Connection URL: jdbc:mysql://localhost:3306/bdre
        Database Username: root
        Database Password:
        Hibernate Dialect: org.hibernate.dialect.MySQLDialect
        Database Schema: bdre
        Are those correct? (type y or n - default y):y⏎
        Database configuration written to ./md-dao/src/main/resources/db.properties
        Will create DB and tables
        Tables created successfully in MySQL bdre DB
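The confirmation screen above shows how the answers map to a JDBC connection URL. The sketch below reproduces that mapping so the expected URL can be worked out before running the script. It is an illustration only and not taken from dbsetup.sh: the function name `jdbc_url` is hypothetical, only the MySQL form is confirmed by the transcript, and the PostgreSQL form follows the standard PostgreSQL JDBC URL syntax with its usual default port 5432.

```shell
# Compose a JDBC URL from database type, host, port and database name.
# Defaults mirror the prompts above: localhost, 3306 (MySQL), bdre.
jdbc_url() {
  local type="$1" host="${2:-localhost}" port="$3" db="${4:-bdre}"
  case "$type" in
    mysql)      echo "jdbc:mysql://${host}:${port:-3306}/${db}" ;;
    postgresql) echo "jdbc:postgresql://${host}:${port:-5432}/${db}" ;;
    *)          echo "unsupported database type: ${type}" >&2; return 1 ;;
  esac
}
```

With the answers from the transcript, `jdbc_url mysql localhost 3306 bdre` gives the same URL that the script reports. Running `mysql -u root bdre -e 'SHOW TABLES;'` afterwards is one way to confirm that the tables were created.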
- Building
  - Now build BDRE using the command below. Note that BDRE may not compile if settings.xml is not passed on the command line, so be sure to use the -s option. When building for the first time, it might take a while as Maven resolves and downloads the jar libraries from different repositories.

        mvn -s settings.xml clean install -P hdp22

  - Note: Selecting hdp22 compiles BDRE with HDP 2.2 libraries and automatically configures BDRE with properties from databases/setup/profile.hdp22.properties. These properties can later be altered from the BDRE Settings page under Administration.
databases/setup/profile.hdp22.properties looks like this.
bdre_user_name=openbdre
name_node_hostname=sandbox.hortonworks.com
name_node_port=8020
job_tracker_port=8050
flume_path=/usr/hdp/current/flume-server
oozie_host=sandbox.hortonworks.com
oozie_port=11000
thrift_hostname=sandbox.hortonworks.com
hive_server_hostname=sandbox.hortonworks.com
drools_hostname=sandbox.hortonworks.com
hive_jdbc_user=openbdre
hive_jdbc_password=openbdre
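To see which endpoints a profile points at, the key/value pairs can be read back with a few lines of shell. This is a sketch assuming the simple `key=value` layout shown above, with no comments or line continuations. The helper names `prop_get`, `namenode_uri` and `oozie_url` are hypothetical, the `hdfs://` scheme is the usual NameNode URI form, and the `/oozie` path is taken from the Oozie console address used later in this document.

```shell
# Print the value of the last "key=value" line for the given key.
# Returns non-zero if the key is not present in the file.
prop_get() {
  local file="$1" key="$2"
  awk -F= -v k="$key" '
    $1 == k { v = substr($0, length(k) + 2); found = 1 }
    END { if (!found) exit 1; print v }
  ' "$file"
}

# NameNode URI built from name_node_hostname and name_node_port.
namenode_uri() {
  echo "hdfs://$(prop_get "$1" name_node_hostname):$(prop_get "$1" name_node_port)"
}

# Oozie URL built from oozie_host and oozie_port.
oozie_url() {
  echo "http://$(prop_get "$1" oozie_host):$(prop_get "$1" oozie_port)/oozie"
}
```

For example, `namenode_uri databases/setup/profile.hdp22.properties` prints the NameNode address that the hdp22 profile would configure.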
Building BDRE for Cloudera QuickStart VM |
---|
Similarly one should be able to build this using -P cdh52 which will configure BDRE for CDH 5.2 QuickStart VM. During building it'll pick up the environment specific configurations from |
```shell
$ mvn -s settings.xml clean install -P hdp22
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
.......blah blah.........
.......blah blah.........
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3:39.479s
[INFO] Finished at: Wed Dec 30 01:50:02 PST 2015
[INFO] Final Memory: 127M/2296M
[INFO] ------------------------------------------------------------------------
```
- Installing BDRE
  - After building BDRE successfully, run

        sh install-scripts.sh local

  - It will install the BDRE scripts and artifacts in /home/openbdre/bdre
- After a successful build, start the BDRE UI service

      sudo service bdre start
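If the UI is not reachable immediately after the service starts, a small polling helper can tell when it is up. This is a sketch assuming `curl` is installed; the port and path come from the UI address used in the browser step of this guide, while the function names, the retry count and the two-second pause are arbitrary choices for illustration.

```shell
# Build the BDRE UI address for a given VM IP or hostname.
bdre_ui_url() {
  echo "http://$1:28850/mdui/pages/content.page"
}

# Poll a URL until it answers without an HTTP error, or give up.
# Arguments: URL, number of attempts (default 30).
wait_for_http() {
  local url="$1" attempts="${2:-30}" i
  for i in $(seq 1 "$attempts"); do
    # --fail makes curl return non-zero on HTTP status 400 and above.
    if curl --silent --fail --output /dev/null --max-time 5 "$url"; then
      return 0
    fi
    sleep 2
  done
  return 1
}
```

For example, `wait_for_http "$(bdre_ui_url 192.168.56.101)" && echo "BDRE UI is up"`, with the address replaced by your VM_IP.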
- Start Oozie as the oozie user in case Oozie isn't already started. The ps -ef | grep -i oozie command helps determine the status of Oozie.

      su - oozie -c "/usr/hdp/current/oozie-server/bin/oozie-start.sh"
      ps -ef | grep -i oozie
- Use the Google Chrome browser on the host machine to open http://VM_IP:28850/mdui/pages/content.page
- Log in using admin/zaq1xsw2
- Create an RDBMS data import job from Job Setup From Template > Import from RDBMS
- Change the JDBC URL/credentials to point to your local MySQL DB that contains some data.
- Click Test Connection
- Expand and select one table (be sure to expand the tables before selecting).
- Create the jobs and see the pipeline.
- Click XML, Diagram etc. to check the generated Oozie workflow XML and diagram.
- Search for 'Process' in the search window and open the 'Process' page
- Click the deploy button on the Process page for the process you want to deploy. (Hovering over the deploy button shows the deployment status of the process.)
- Wait about 2 minutes for the deployment to complete.
- Once the deployment is complete and the UI shows the process status as deployed (it turns green), click the execution button to execute the import job.
- Check the process in the Oozie console at http://VM_IP:11000/oozie
- When the import job is complete, start the data load job.
- Airflow installation
  - Download Anaconda 4.1 from http://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
  - Run the downloaded shell script.
- Running Airflow
  - Initialize the database

        airflow initdb

  - Start the web server; the default port is 8080

        airflow webserver -p 8080

  - To open the Airflow UI, browse to

        localhost:8080

- For more details about Airflow, read http://pythonhosted.org/airflow/
BDRE provides a complete job/operational metadata management solution for Hadoop. At its core, it acts as a registry and tracker for different types of jobs running in different Hadoop clusters or standalone. It provides APIs to integrate with virtually any job.
BDRE uses an RDBMS to store all job-related metadata. A set of stored procedures interfaces with the tables, and these are exposed via Java APIs to manage, create and update the static and run-time metadata. Below is the data model for the BDRE operational metadata database.
Contributions and enhancements to BDRE are welcome. To contribute, please navigate to our GitHub project page and fork the BDRE main repository under your own account. You can make changes in your own forked repository and then open a pull request to merge your changes into the main repo.
- Clone the main repo (if you haven't done so already)

      git clone "https://github.com/WiproOpenSourcePractice/openbdre.git"
      cd openbdre

- Add your forked repo, where you have write access, as a remote and create your own branch.

      git remote add myrepo https://<your id>:<your password>@github.com/<YOUR ACCT NAME>/openbdre.git
      git checkout -b mybranch

- Make and commit your changes to your own branch.

      git commit -am "My changes"

- Push to your own branch in your own remote repo (myrepo).

      git push myrepo mybranch

- Pull from the main repo (origin) every day and sync your repo with it.

      git checkout develop
      git pull origin develop

- Keep the develop branch only for the latest main repo content. Make changes while you are on your own branch.
- Sync your code with the main repo. Push the latest content pulled from the main repo to your own branch in your own repo.

      git checkout mybranch
      git merge develop
      git push myrepo mybranch
- When you are ready to submit your contribution to the main repo, please open a pull request.
- Please join the community at https://groups.google.com/forum/#!forum/bdre. If you have any questions or suggestions, please email [email protected].
- If you want to report a bug, see or request a feature, or work on something, please sign up at https://openbdre.atlassian.net
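The daily sync routine from the steps above (update develop from the main repo, merge it into your branch, push to your fork) can be wrapped in one helper. This is a minimal sketch using only the git commands already listed. The function name `sync_with_main` is hypothetical, and it assumes the remote names `origin` and `myrepo` and a clean working tree.

```shell
# Update develop from the main repo, merge it into the working branch
# and push the result to the fork. Stops at the first failing step.
# Arguments: branch name (default mybranch), fork remote (default myrepo).
sync_with_main() {
  local branch="${1:-mybranch}" fork="${2:-myrepo}"
  git checkout develop &&
  git pull origin develop &&
  git checkout "$branch" &&
  git merge develop &&
  git push "$fork" "$branch"
}
```

Run `sync_with_main` from inside the cloned openbdre directory. If the merge reports conflicts, the helper stops before pushing so they can be resolved by hand.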