GitHub - griddb/griddb_spark: GridDB connector for Apache Spark

GridDB connector for Apache Spark

Overview

The GridDB connector for Apache Spark is a module that connects GridDB to Apache Spark. It depends on the GridDB server, the GridDB Java client, and the GridDB connector for Apache Hadoop MapReduce. With it, you can create a DataFrame from an existing GridDB container and operate on it.

Operating environment

Building the library and running the example program have been verified in the following environment.

OS:             CentOS6.7(x64)
Java:           JDK 1.8.0_101
Apache Hadoop:  Version 2.6.5
Apache Spark:   Version 2.1.0
Scala:          Version 2.11.8

GridDB server and Java client:                3.0 CE
GridDB connector for Apache Hadoop MapReduce: 1.0

QuickStart

Preparations

Install Hadoop and Spark

 $ cd [INSTALL_FOLDER]
 $ wget http://archive.apache.org/dist/hadoop/core/hadoop-2.6.5/hadoop-2.6.5.tar.gz
 $ tar xvfz hadoop-2.6.5.tar.gz
 $ wget http://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.6.tgz
 $ tar xvfz spark-2.1.0-bin-hadoop2.6.tgz

Note: [INSTALL_FOLDER] is the folder in which Spark, Hadoop, and the GridDB connector for Spark are installed.

Please add the following environment variables to .bashrc

 $ vi ~/.bashrc
 export JAVA_HOME=/usr/lib/jvm/[JDK folder]
 export HADOOP_HOME=[INSTALL_FOLDER]/hadoop-2.6.5
 export SPARK_HOME=[INSTALL_FOLDER]/spark-2.1.0-bin-hadoop2.6
 export GRIDDB_SPARK=[INSTALL_FOLDER]/griddb_spark
 export GRIDDB_SPARK_PROPERTIES=$GRIDDB_SPARK/gd-config.xml
 
 export PATH=$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
 
 export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
 export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"

 $ source ~/.bashrc
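Before continuing, it can help to confirm that the variables above are set and point at real locations. The following is a minimal sketch and is not part of the connector; the function name check_griddb_spark_env is hypothetical. It only checks that JAVA_HOME, HADOOP_HOME, SPARK_HOME and GRIDDB_SPARK are non-empty and name existing directories.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the connector): verify that the
# environment variables from .bashrc are set and name existing directories.
check_griddb_spark_env() {
    status=0
    for name in JAVA_HOME HADOOP_HOME SPARK_HOME GRIDDB_SPARK; do
        # Indirectly read the value of the variable called $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name does not point to a directory: $value" >&2
            status=1
        fi
    done
    # 0 when every variable passed both checks, 1 otherwise.
    return "$status"
}
```

After sourcing ~/.bashrc, running `check_griddb_spark_env && echo "environment looks consistent"` prints the message only when all four variables pass; otherwise each problem is reported on standard error.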

Please modify file "gd-config.xml"

 $ cd [INSTALL_FOLDER]/griddb_spark
 $ vi gd-config.xml
 
 <!-- GridDB properties -->
 <property>
 	<name>gs.user</name>
 	<value>[GridDB user]</value>
 </property>
 <property>
 	<name>gs.password</name>
 	<value>[GridDB password]</value>
 </property>
 <property>
 	<name>gs.cluster.name</name>
 	<value>[GridDB cluster name]</value>
 </property>
 <!-- Define the address and port for the multicast method; leave them blank if using another method -->
 <property>
 	<name>gs.notification.address</name>
 	<value>[GridDB notification address(default is 239.0.0.1)]</value>
 </property>
 <property>
 	<name>gs.notification.port</name>
 	<value>[GridDB notification port(default is 31999)]</value>
 </property>

Please refer to Configuration (Configuration.md) for details on the GridDB properties.
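A common slip is to leave a bracketed placeholder such as [GridDB user] in the file. The sketch below is one way to spot that, assuming placeholders always appear as `<value>[...` on a single line, as in the template above; the function name find_unfilled_placeholders is hypothetical and not part of the connector.

```shell
#!/bin/sh
# Hypothetical helper: report placeholder values such as "[GridDB user]"
# that are still present in a gd-config.xml-style file.
find_unfilled_placeholders() {
    config_file=$1
    # -F matches the fixed string "<value>[" and -n prints line numbers.
    # Exit status 0 means at least one placeholder is still present.
    grep -n -F '<value>[' "$config_file"
}
```

For example, `if find_unfilled_placeholders "$GRIDDB_SPARK_PROPERTIES"; then echo "fill in the lines listed above"; fi` lists the offending lines. Values that were deliberately left blank are not reported.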

Build the GridDB Java client and the GridDB connector for Hadoop MapReduce, then
place the following files under the griddb_spark/gs-spark-datasource/lib directory:

gridstore.jar
gs-hadoop-mapreduce-client-1.0.0.jar

Add SPARK_CLASSPATH to "spark-env.sh"

 $ cd [INSTALL_FOLDER]/spark-2.1.0-bin-hadoop2.6
 $ vi conf/spark-env.sh
 SPARK_CLASSPATH=.:$GRIDDB_SPARK/gs-spark-datasource/target/gs-spark-datasource.jar:$GRIDDB_SPARK/gs-spark-datasource/lib/gridstore.jar:$GRIDDB_SPARK/gs-spark-datasource/lib/gs-hadoop-mapreduce-client-1.0.0.jar
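Because a misspelled variable or a jar that has not been built yet silently produces a classpath entry that does not exist, a quick check of the resulting value can save debugging time. The following is a minimal sketch; the function name missing_classpath_entries is hypothetical, and it assumes entries are separated by colons and contain no wildcards.

```shell
#!/bin/sh
# Hypothetical helper: list classpath entries that do not exist on disk.
# Takes a colon-separated classpath string as its only argument.
missing_classpath_entries() {
    old_ifs=$IFS
    IFS=:
    set -f                      # disable globbing while splitting
    # Unquoted expansion is intentional: split the argument on ':'.
    for entry in $1; do
        [ -n "$entry" ] || continue          # skip empty entries
        [ -e "$entry" ] || echo "$entry"     # print entries that are absent
    done
    set +f
    IFS=$old_ifs
}
```

Calling `missing_classpath_entries "$SPARK_CLASSPATH"` in a shell where that variable holds the value above prints one line per missing entry and nothing when every entry exists. Since gs-spark-datasource.jar is produced by the build step, the check is most useful after running mvn package.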

Build the connector and an example

Run the mvn command as follows:

 $ cd [INSTALL_FOLDER]/griddb_spark
 $ mvn package

This creates the following jar files:

gs-spark-datasource/target/gs-spark-datasource.jar
gs-spark-datasource-example/target/example.jar

Run the example program

The GridDB cluster must be started in advance.

Put data into the server with the GridDB Java client

 $ cd [INSTALL_FOLDER]/griddb_spark
 $ java -cp ./gs-spark-datasource-example/target/example.jar:gs-spark-datasource/lib/gridstore.jar \
 	Init <GridDB notification address> <GridDB notification port> \
 	<GridDB cluster name> <GridDB user> <GridDB password>

Run some queries with the GridDB connector for Spark

 $ spark-submit --class Query ./gs-spark-datasource-example/target/example.jar

API

With a SparkSession, applications can create DataFrames from an existing GridDB container in the form shown below.

val df = session.read.format("com.toshiba.mwcloud.gs.spark.datasource").load(containerName)

Community

Issues
Use the GitHub issue function if you have any requests, questions, or bug reports.
PullRequest
Use the GitHub pull request function if you want to contribute code. You'll need to agree to the GridDB Contributor License Agreement (CLA_rev1.1.pdf). By using the GitHub pull request function, you shall be deemed to have agreed to the GridDB Contributor License Agreement.

License

The GridDB connector source is licensed under the Apache License, version 2.0.

Trademarks

Apache Spark, Apache Hadoop, Spark, and Hadoop are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
