A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.
- Awesome Hadoop
- Hadoop
- YARN
- NoSQL
- SQL on Hadoop
- Data Management
- Workflow, Lifecycle and Governance
- Data Ingestion and Integration
- DSL
- Libraries and Tools
- Realtime Data Processing
- Distributed Computing and Programming
- Packaging, Provisioning and Monitoring
- Monitoring
- Search
- Security
- Benchmark
- Machine learning and Big Data analytics
- Misc.
- Resources
- Other Awesome Lists
- Apache Hadoop - Apache Hadoop
- Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop
- SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig. ★ 701, pushed 128 days ago
- dumbo - Python module that allows you to easily write and run Hadoop programs. ★ 966, pushed 138 days ago
- hadoopy - Python MapReduce library written in Cython. ★ 224, pushed 240 days ago
- mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
- pydoop - Pydoop is a package that provides a Python API for Hadoop.
- hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system. ★ 139, pushed 1490 days ago
- White Elephant - Hadoop log aggregator and dashboard ★ 171, pushed 1041 days ago
- Kiji Project
- Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them. ★ 462, pushed 130 days ago
- Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets
- Crunch - Go-based toolkit for ETL and feature extraction on Hadoop ★ 130, pushed 655 days ago
- Apache Ignite - Distributed in-memory platform
- Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
- Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
- mpich2-yarn - Running MPICH2 on Yarn ★ 77, pushed 481 days ago
Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.
- Apache HBase - Apache HBase
- Apache Phoenix - A SQL skin over HBase supporting secondary indices
- happybase - A developer-friendly Python library to interact with Apache HBase. ★ 242, pushed 155 days ago
- Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting. ★ 133, pushed 406 days ago
- Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase ★ 122, pushed 158 days ago
- hindex - Secondary Index for HBase ★ 335, pushed 609 days ago
- Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
- OpenTSDB - The Scalable Time Series Database
- Apache Cassandra
SQL on Hadoop
- Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
- Apache Phoenix - A SQL skin over HBase supporting secondary indices
- Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop
- Lingual - SQL interface for Cascading (MR/Tez job generator)
- Cloudera Impala
- Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
- Apache Tajo - Data warehouse system for Apache Hadoop
- Apache Drill - Schema-free SQL Query Engine
- Apache Trafodion
- Apache Calcite - A Dynamic Data Management Framework
- Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies
- Apache Oozie - Apache Oozie
- Azkaban
- Apache Falcon - Data management and processing platform
- Apache NiFi - A dataflow system
- Apache Airflow - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
- Luigi - Python package that helps you build complex pipelines of batch jobs
- Apache Flume - Apache Flume
- Suro - Netflix's distributed Data Pipeline ★ 503, pushed 268 days ago
- Apache Sqoop - Apache Sqoop
- Apache Kafka - Apache Kafka
- Gobblin from LinkedIn - Universal data ingestion framework for Hadoop ★ 545, pushed 125 days ago
- Apache Pig - Apache Pig
- Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
- varaha - Machine learning and natural language processing with Apache Pig ★ 51, pushed 993 days ago
- packetpig - Open Source Big Data Security Analytics ★ 238, pushed 180 days ago
- akela - Mozilla's utility library for Hadoop, HBase, Pig, etc. ★ 72, pushed 889 days ago
- seqpig - Simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop
- Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig ★ 359, pushed 268 days ago
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it. ★ 419, pushed 133 days ago
- Kite Software Development Kit - A set of libraries, tools, examples, and documentation
- gohadoop - Native Go clients for Apache Hadoop YARN. ★ 245, pushed 445 days ago
- Hue - A Web interface for analyzing data with Apache Hadoop.
- Apache Zeppelin - A web-based notebook that enables interactive data analytics
- Jumbune - Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs. ★ 44, pushed 173 days ago
- Apache Thrift
- Apache Avro - Apache Avro is a data serialization system.
- Elephant Bird - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code. ★ 962, pushed 153 days ago
- Spring for Apache Hadoop
- hdfs - A native Go client for HDFS ★ 204, pushed 146 days ago
- Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.
- snakebite
- Apache Storm
- Apache Samza
- Apache Spark
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.
- Apache Spark
- Spark Packages - A community index of packages for Apache Spark
- SparkHub - A community site for Apache Spark
- Apache Crunch
- Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
- Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.
- Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
- Apache Ambari - Apache Ambari
- Ganglia Monitoring System
- ankush - A big data cluster management tool that creates and manages clusters of different technologies. ★ 19, pushed 504 days ago
- Apache Zookeeper - Apache Zookeeper
- Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
- Buildoop - Hadoop Ecosystem Builder ★ 27, pushed 348 days ago
- Deploop - The Hadoop Deploy System
- Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
- inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization. ★ 122, pushed 268 days ago
- Elasticsearch
- Apache Solr
- SenseiDB - Open-source, distributed, realtime, semi-structured database
- Banana - Kibana port for Apache Solr ★ 310, pushed 130 days ago
- Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.
- Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Apache Sentry - An authorization module for Hadoop
- Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
- Big Data Benchmark
- HiBench ★ 358, pushed 138 days ago
- Big-Bench ★ 30, pushed 186 days ago
- hive-benchmarks ★ 2, pushed 819 days ago
- hive-testbench - Testbench for experimenting with Apache Hive at any data scale. ★ 32, pushed 159 days ago
- YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems. ★ 1179, pushed 127 days ago
- Apache Mahout
- Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning ★ 777, pushed 134 days ago
- MLlib - MLlib is Apache Spark's scalable machine learning library.
- R - R is a free software environment for statistical computing and graphics.
- RHadoop including RHDFS, RHBase, RMR2, plyrmr
- RHive - For launching Hive queries from R ★ 108, pushed 399 days ago
- Apache Lens
- Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
- Hive Plugins
- UDF
- http://nexr.github.io/hive-udf/
- https://github.com/edwardcapriolo/hive_cassandra_udfs
- https://github.com/livingsocial/HiveSwarm
- https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
- https://github.com/karthkk/udfs
- https://github.com/twitter/elephant-bird - Twitter
- https://github.com/lovelysystems/ls-hive
- https://github.com/stewi2/hive-udfs
- https://github.com/klout/brickhouse
- https://github.com/markgrover/hive-translate (PostgreSQL translate())
- https://github.com/deanwampler/HiveUDFs
- https://github.com/myui/hivemall (Machine Learning UDF/UDAF/UDTF)
- https://github.com/edwardcapriolo/hive-geoip (GeoIP UDF)
- https://github.com/Netflix/Surus
- Storage Handler
- https://github.com/dvasilen/Hive-Cassandra
- https://github.com/yc-huang/Hive-mongo
- https://github.com/balshor/gdata-storagehandler
- https://github.com/karthkk/hive-hbase-json
- https://github.com/sunsuk7tp/hive-hbase-integration
- https://bitbucket.org/rodrigopr/redisstoragehandler
- https://github.com/zhuguangbin/HiveJDBCStorageHanlder
- https://github.com/chimpler/hive-solr
- https://github.com/bfemiano/accumulo-hive-storage-manager
- SerDe
- https://github.com/rcongiu/Hive-JSON-Serde
- https://github.com/mochi/hive-json-serde
- https://github.com/ogrodnek/csv-serde
- https://github.com/parag/HiveJsonSerde
- https://github.com/johanoskarsson/hive-json-serde
- https://github.com/electrum/hive-serde - JSON
- https://github.com/karthkk/hive-hbase-json
- Libraries and tools
- https://github.com/forward3d/rbhive
- https://github.com/synctree/activerecord-hive-adapter
- https://github.com/hrp/sequel-hive-adapter
- https://github.com/forward/node-hive
- https://github.com/recruitcojp/WebHive
- shib - WebUI for query engines: Hive and Presto
- clive - Clojure library for interacting with Hive via Thrift ★ 4, pushed 1620 days ago
- https://github.com/anjuke/hwi
- https://code.google.com/a/apache-extras.org/p/hipy/
- https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
- PyHive - Python interface to Hive and Presto ★ 179, pushed 146 days ago
- https://github.com/recruitcojp/OdbcHive
- Hive-Sharp
- HiveRunner - An Open Source unit test framework for hadoop hive queries based on JUnit4 ★ 57, pushed 131 days ago
- Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers. ★ 46, pushed 311 days ago
- Hive_test - Unit test framework for hive and hive-service ★ 55, pushed 706 days ago
- UDF
- Flume Plugins
- Flume MongoDB Sink
- Flume HornetQ Channel ★ 0, pushed 1458 days ago
- Flume MessagePack Source ★ 0, pushed 1211 days ago
- Flume RabbitMQ source and sink ★ 40, pushed 186 days ago
- Flume UDP Source ★ 5, pushed 873 days ago
- Stratio Ingestion - Custom sinks: Cassandra, MongoDB, Stratio Streaming and JDBC ★ 111, pushed 145 days ago
- Flume Custom Serializers ★ 2, pushed 1260 days ago
- Real-time analytics in Apache Flume ★ 39, pushed 216 days ago
- .Net FlumeNG Clients ★ 14, pushed 780 days ago
Various resources, such as books, websites and articles.
Useful websites and articles
- Hadoop Weekly
- The Hadoop Ecosystem Table
- Hadoop 1.x vs 2
- Apache Hadoop YARN: Yet Another Resource Negotiator
- Introducing Apache Hadoop YARN
- Apache Hadoop YARN - Background and an Overview
- Apache Hadoop YARN - Concepts and Applications
- Apache Hadoop YARN - ResourceManager
- Apache Hadoop YARN - NodeManager
- Migrating to MapReduce 2 on YARN (For Users)
- Migrating to MapReduce 2 on YARN (For Operators)
- Hadoop and Big Data: Use Cases at Salesforce.com
- All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.
- What is Bigtop, and Why Should You Care?
- Hadoop - Distributions and Commercial Support
- Ganglia configuration for a small Hadoop cluster and some troubleshooting
- Hadoop illuminated - Open Source Hadoop Book
- NoSQL Database
- 10 Best Practices for Apache Hive
- Hadoop Operations at Scale
- AWS BigData Blog
- Hadoop360
- How to monitor Hadoop metrics
- Hadoop Summit Presentations - Slide decks from Hadoop Summit
- Hadoop 24/7
- An example Apache Hadoop Yarn upgrade
- Apache Hadoop In Theory And Practice
- Hadoop Operations at LinkedIn
- Hadoop Performance at LinkedIn
- Docker based Hadoop provisioning
- Hadoop: The Definitive Guide
- Hadoop Operations
- Apache Hadoop Yarn
- HBase: The Definitive Guide
- Programming Pig
- Programming Hive
- Hadoop in Practice, Second Edition
- Hadoop in Action, Second Edition
Other amazingly awesome lists can be found in the awesome-awesomeness and awesome lists.