Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.

Awesome Hadoop
Resources
Other Awesome Lists

Hadoop

Apache Hadoop - Apache Hadoop
Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop
SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specifically to work with spatial data.
GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig. ^{★ 701, pushed 128 days ago}
dumbo - Python module that allows you to easily write and run Hadoop programs. ^{★ 966, pushed 138 days ago}
hadoopy - Python MapReduce library written in Cython. ^{★ 224, pushed 240 days ago}
mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
pydoop - Pydoop is a package that provides a Python API for Hadoop.
hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system. ^{★ 139, pushed 1490 days ago}
White Elephant - Hadoop log aggregator and dashboard ^{★ 171, pushed 1041 days ago}
Kiji Project
Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them. ^{★ 462, pushed 130 days ago}
Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
Crunch - Go-based toolkit for ETL and feature extraction on Hadoop ^{★ 130, pushed 655 days ago}
Apache Ignite - Distributed in-memory platform

YARN

Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
mpich2-yarn - Running MPICH2 on Yarn ^{★ 77, pushed 481 days ago}

NoSQL

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.

Apache HBase - Apache HBase
Apache Phoenix - A SQL skin over HBase supporting secondary indices
happybase - A developer-friendly Python library to interact with Apache HBase. ^{★ 242, pushed 155 days ago}
Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting. ^{★ 133, pushed 406 days ago}
Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase. ^{★ 122, pushed 158 days ago}
hindex - Secondary Index for HBase ^{★ 335, pushed 609 days ago}
Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
OpenTSDB - The Scalable Time Series Database
Apache Cassandra

SQL on Hadoop

Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
Apache Phoenix - A SQL skin over HBase supporting secondary indices
Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
Lingual - SQL interface for Cascading (MR/Tez job generator)
Cloudera Impala
Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
Apache Tajo - Data warehouse system for Apache Hadoop
Apache Drill - Schema-free SQL Query Engine
Apache Trafodion

Data Management

Apache Calcite - A Dynamic Data Management Framework
Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies

Workflow, Lifecycle and Governance

Apache Oozie - Apache Oozie
Azkaban
Apache Falcon - Data management and processing platform
Apache NiFi - A dataflow system
Apache Airflow - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
Luigi - Python package that helps you build complex pipelines of batch jobs

Data Ingestion and Integration

Apache Flume - Apache Flume
Suro - Netflix's distributed Data Pipeline ^{★ 503, pushed 268 days ago}
Apache Sqoop - Apache Sqoop
Apache Kafka - Apache Kafka
Gobblin from LinkedIn - Universal data ingestion framework for Hadoop ^{★ 545, pushed 125 days ago}

DSL

Apache Pig - Apache Pig
Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
varaha - Machine learning and natural language processing with Apache Pig ^{★ 51, pushed 993 days ago}
packetpig - Open Source Big Data Security Analytics ^{★ 238, pushed 180 days ago}
akela - Mozilla's utility library for Hadoop, HBase, Pig, etc. ^{★ 72, pushed 889 days ago}
seqpig - Simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop
Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig ^{★ 359, pushed 268 days ago}
PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it. ^{★ 419, pushed 133 days ago}

Libraries and Tools

Kite Software Development Kit - A set of libraries, tools, examples, and documentation
gohadoop - Native Go clients for Apache Hadoop YARN. ^{★ 245, pushed 445 days ago}
Hue - A Web interface for analyzing data with Apache Hadoop.
Apache Zeppelin - A web-based notebook that enables interactive data analytics
Jumbune - Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs. ^{★ 44, pushed 173 days ago}
Apache Thrift
Apache Avro - Apache Avro is a data serialization system.
Elephant Bird - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code. ^{★ 962, pushed 153 days ago}
Spring for Apache Hadoop
hdfs - A native Go client for HDFS ^{★ 204, pushed 146 days ago}
Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.
snakebite

Realtime Data Processing

Apache Storm
Apache Samza
Apache Spark
Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.

Distributed Computing and Programming

Apache Spark
- Spark Packages - A community index of packages for Apache Spark
- SparkHub - A community site for Apache Spark
Apache Crunch
Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.

Packaging, Provisioning and Monitoring

Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
Apache Ambari - Apache Ambari
Ganglia Monitoring System
ankush - A big data cluster management tool that creates and manages clusters of different technologies. ^{★ 19, pushed 504 days ago}
Apache Zookeeper - Apache Zookeeper
Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
Buildoop - Hadoop Ecosystem Builder ^{★ 27, pushed 348 days ago}
Deploop - The Hadoop Deploy System
Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization. ^{★ 122, pushed 268 days ago}

Search

ElasticSearch
Apache Solr
SenseiDB - Open-source, distributed, realtime, semi-structured database
Banana - Kibana port for Apache Solr ^{★ 310, pushed 130 days ago}

Search Engine Framework

Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.

Security

Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
Apache Sentry - An authorization module for Hadoop
Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.

Benchmark

Big Data Benchmark
HiBench ^{★ 358, pushed 138 days ago}
Big-Bench ^{★ 30, pushed 186 days ago}
hive-benchmarks ^{★ 2, pushed 819 days ago}
hive-testbench - Testbench for experimenting with Apache Hive at any data scale. ^{★ 32, pushed 159 days ago}
YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems. ^{★ 1179, pushed 127 days ago}

Machine learning and Big Data analytics

Apache Mahout
Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning ^{★ 777, pushed 134 days ago}
MLlib - MLlib is Apache Spark's scalable machine learning library.
R - R is a free software environment for statistical computing and graphics.
RHadoop - Includes RHDFS, RHBase, RMR2 and plyrmr
RHive - For launching Hive queries from R ^{★ 108, pushed 399 days ago}
Apache Lens
Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets

Misc.

Hive Plugins
- UDF
  - http://nexr.github.io/hive-udf/
  - https://github.com/edwardcapriolo/hive_cassandra_udfs
  - https://github.com/livingsocial/HiveSwarm
  - https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
  - https://github.com/karthkk/udfs
  - https://github.com/twitter/elephant-bird - Twitter
  - https://github.com/lovelysystems/ls-hive
  - https://github.com/stewi2/hive-udfs
  - https://github.com/klout/brickhouse
  - https://github.com/markgrover/hive-translate (PostgreSQL translate())
  - https://github.com/deanwampler/HiveUDFs
  - https://github.com/myui/hivemall (Machine Learning UDF/UDAF/UDTF)
  - https://github.com/edwardcapriolo/hive-geoip (GeoIP UDF)
  - https://github.com/Netflix/Surus
- Storage Handler
- SerDe
- Libraries and tools
  - https://github.com/forward3d/rbhive
  - https://github.com/synctree/activerecord-hive-adapter
  - https://github.com/hrp/sequel-hive-adapter
  - https://github.com/forward/node-hive
  - https://github.com/recruitcojp/WebHive
  - shib - WebUI for query engines: Hive and Presto
  - clive - Clojure library for interacting with Hive via Thrift ^{★ 4, pushed 1620 days ago}
  - https://github.com/anjuke/hwi
  - https://code.google.com/a/apache-extras.org/p/hipy/
  - https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
  - PyHive - Python interface to Hive and Presto ^{★ 179, pushed 146 days ago}
  - https://github.com/recruitcojp/OdbcHive
  - Hive-Sharp
  - HiveRunner - An Open Source unit test framework for hadoop hive queries based on JUnit4 ^{★ 57, pushed 131 days ago}
  - Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers. ^{★ 46, pushed 311 days ago}
  - Hive_test - Unit test framework for hive and hive-service ^{★ 55, pushed 706 days ago}
Flume Plugins
- Flume MongoDB Sink
- Flume HornetQ Channel ^{★ 0, pushed 1458 days ago}
- Flume MessagePack Source ^{★ 0, pushed 1211 days ago}
- Flume RabbitMQ source and sink ^{★ 40, pushed 186 days ago}
- Flume UDP Source ^{★ 5, pushed 873 days ago}
- Stratio Ingestion - Custom sinks: Cassandra, MongoDB, Stratio Streaming and JDBC ^{★ 111, pushed 145 days ago}
- Flume Custom Serializers ^{★ 2, pushed 1260 days ago}
- Real-time analytics in Apache Flume ^{★ 39, pushed 216 days ago}
- .Net FlumeNG Clients ^{★ 14, pushed 780 days ago}

Resources

Various resources, such as books, websites and articles.

Websites

Useful websites and articles

Hadoop Weekly
The Hadoop Ecosystem Table
Hadoop 1.x vs 2
Apache Hadoop YARN: Yet Another Resource Negotiator
Introducing Apache Hadoop YARN
Apache Hadoop YARN - Background and an Overview
Apache Hadoop YARN - Concepts and Applications
Apache Hadoop YARN - ResourceManager
Apache Hadoop YARN - NodeManager
Migrating to MapReduce 2 on YARN (For Users)
Migrating to MapReduce 2 on YARN (For Operators)
Hadoop and Big Data: Use Cases at Salesforce.com
All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.
What is Bigtop, and Why Should You Care?
Hadoop - Distributions and Commercial Support
Ganglia configuration for a small Hadoop cluster and some troubleshooting
Hadoop illuminated - Open Source Hadoop Book
NoSQL Database
10 Best Practices for Apache Hive
Hadoop Operations at Scale
AWS BigData Blog
Hadoop360
How to monitor Hadoop metrics

Presentations

Hadoop Summit Presentations - Slide decks from Hadoop Summit
Hadoop 24/7
An example Apache Hadoop Yarn upgrade
Apache Hadoop In Theory And Practice
Hadoop Operations at LinkedIn
Hadoop Performance at LinkedIn
Docker based Hadoop provisioning

Books

Hadoop: The Definitive Guide
Hadoop Operations
Apache Hadoop Yarn
HBase: The Definitive Guide
Programming Pig
Programming Hive
Hadoop in Practice, Second Edition
Hadoop in Action, Second Edition

Hadoop and Big Data Events

ApacheCon
Strata + Hadoop World
Hadoop Summit

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness and awesome lists.
