Skip to content

Based off the design of SparkOnHBase. This Repo will support Spark, Spark Streaming, and Spark SQL integration with Kudu.

License

Notifications You must be signed in to change notification settings

jdcryans/SparkOnKudu

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SparkOnKudu

Overview

This is a simple reusable lib for working with Kudu with Spark

##Functionality Current functionality supports the following functions

  • RDD foreachPartition with iterator and kudu client
  • RDD mapPartition with iterator and kudu client
  • DStream foreachPartition with iterator and kudu client
  • DStream mapPartition with iterator and kudu client
  • Spark SQL integration with Kudu (Basic no filter push down yet)

##Examples

  • Basic example ** Creating Kudu tables ** Connecting with SparkSQL ** Converting values from Kudu to SparkSQL to MlLib
  • Gamer example ** Creating Kudu Gamer table ** Generating Gamer data and pushing it to Kafka ** Reading Gamer data from Kafka with Spark Streaming ** Aggregating Gamer data in Spark Streaming then pushing mutations to Kudu ** Running Impala SQL on Kudu Gamer table ** Running SparkSQL on Kudu Gamer table ** Converting SparkSQL results to Vectors so we can do KMeans

##Near Future

  • Key SQL predict push down
  • Need to update POM file with public repo
  • Need to work with Kudu project to integrate into Kudu

##Build mvn clean package

##Setup for Gamer Example

###Setting up Kafka kafka-topics --zookeeper ZooKeeperNode:2181 --create --topic gamer --partitions 1 --replication-factor 1 kafka-topics --zookeeper ZooKeeperNode:2181 --list

###Basic Testing with Kafka kafka-console-producer --broker-list BrokerNode:9092 --topic test kafka-cocsole-consumer --zookeeper ZooKeeperNode:2181 --topic gamer --from-beginning

###Populating Kafka java -cp KuduSpark.jar org.kududb.spark.demo.gamer.KafkaProducerGenerator mriggs-strata-1.vpc.cloudera.com:9092 gamer 10000 300 1000

###create Table java -cp KuduSpark.jar org.kududb.spark.demo.gamer.CreateKuduTable mriggs-strata-1.vpc.cloudera.com gamer

###Run Spark Streaming spark-submit
--master yarn --deploy-mode client
--executor-memory 2G
--num-executors 2
--jars kudu-mapreduce-0.1.0-20150903.033037-21-jar-with-dependencies.jar
--class org.kududb.spark.demo.gamer.GamerAggergatesSparkStreaming KuduSpark.jar
BrokerNode:9092 gamer KuduMaster gamer C

About

Based off the design of SparkOnHBase. This Repo will support Spark, Spark Streaming, and Spark SQL integration with Kudu.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Scala 100.0%