ArangoDB Spark Datasource Demo

This demo is composed of 3 parts:

  • WriteDemo: reads the input JSON files as Spark DataFrames, applies conversions to map the data to Spark data types, and writes the records into ArangoDB collections
  • ReadDemo: reads the ArangoDB collections created above as Spark DataFrames, specifying column selections and record filter predicates, or custom AQL queries
  • ReadWriteDemo: reads the ArangoDB collections created above as Spark DataFrames, applies projections and filtering, and writes the result to a new ArangoDB collection
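The read, project, filter, and write flow described above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the demo's actual code: the helper names (`connection_options`, `copy_filtered`) and all collection and column names are hypothetical, and it assumes the datasource is on the Spark classpath under the format name `com.arangodb.spark` with the `endpoints`, `user`, `password`, `ssl.enabled`, `ssl.cert.value` and `table` options.

```python
def connection_options(endpoints, password, user="root",
                       ssl_enabled=True, ssl_cert_value=None):
    """Build connection options shared by reads and writes (hypothetical helper)."""
    options = {
        "endpoints": endpoints,  # comma-separated host:port list
        "user": user,
        "password": password,
        "ssl.enabled": str(ssl_enabled).lower(),
    }
    if ssl_cert_value:
        # Base64-encoded certificate, only set when one is provided
        options["ssl.cert.value"] = ssl_cert_value
    return options


def copy_filtered(spark, options, source, target, columns, predicate):
    """Read `source`, keep `columns` of rows matching `predicate`, append to `target`."""
    df = (
        spark.read.format("com.arangodb.spark")
        .options(**options)
        .option("table", source)
        .load()
    )
    projected = df.select(*columns).filter(predicate)
    (
        projected.write.format("com.arangodb.spark")
        .mode("append")
        .options(**options)
        .option("table", target)
        .save()
    )
    return projected
```

Given an existing `SparkSession` named `spark`, a call might look like `copy_filtered(spark, connection_options("172.28.0.1:8529", "test"), "source_collection", "target_collection", ["name"], "name IS NOT NULL")`, where the collection and column names are placeholders.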

The demos are available in Scala and Python (PySpark), as outlined below.

Requirements

This demo requires:

  • JDK 8, 11 or 17
  • Maven
  • Docker

The Python demo additionally requires:

  • Python

Prepare the environment

Set environment variables:

export ARANGO_SPARK_VERSION=1.8.0

Start ArangoDB cluster with docker:

SSL=true STARTER_MODE=cluster ./docker/start_db.sh

The deployed cluster will be accessible at https://172.28.0.1:8529 with username root and password test.
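Before moving on, it can be useful to confirm that the cluster answers at that address. The following is a small sketch using only the Python standard library to query ArangoDB's `/_api/version` endpoint with HTTP basic authentication. The function names are hypothetical, and the sketch assumes the Docker cluster presents a certificate that the system trust store does not recognize, which is why certificate verification is disabled; that is only appropriate for a local test setup.

```python
import base64
import json
import ssl
import urllib.request


def version_request(base_url, user, password):
    """Build an authenticated request for the server version endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(f"{base_url}/_api/version")
    req.add_header("Authorization", f"Basic {token}")
    return req


def fetch_version(base_url, user, password):
    """Return the parsed JSON body of /_api/version."""
    # Assumption: the local cluster's certificate is not trusted by default,
    # so verification is turned off. Do not do this outside local testing.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = version_request(base_url, user, password)
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return json.load(resp)
```

Calling `fetch_version("https://172.28.0.1:8529", "root", "test")` should return a dictionary describing the server if the cluster is up.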

Start Spark cluster:

./docker/start_spark.sh 

Install locally

NB: this is only needed for SNAPSHOT versions.

mvn -f ../pom.xml install -Dmaven.test.skip=true -Dgpg.skip=true -Dmaven.javadoc.skip=true -Pscala-2.12 -Pspark-3.5

Run embedded

Test the Spark application in embedded mode:

mvn \
  -Pscala-2.12 -Pspark-3.5 \
  test

Test the Spark application against an ArangoDB Oasis deployment:

mvn \
  -Pscala-2.12 -Pspark-3.5 \
  -Dpassword=<root-password> \
  -Dendpoints=<endpoint> \
  -Dssl.cert.value=<base64-encoded-cert> \
  test
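The `<base64-encoded-cert>` placeholder can be produced from a certificate file with a few lines of Python. This sketch assumes the option expects the Base64 encoding of the certificate file's raw bytes; the function names and the `ca.pem` file name are hypothetical.

```python
import base64
from pathlib import Path


def encode_cert_bytes(data):
    """Return the Base64 text form of raw certificate bytes."""
    return base64.b64encode(data).decode("ascii")


def encode_cert_file(path):
    """Read a certificate file and return its Base64-encoded content."""
    return encode_cert_bytes(Path(path).read_bytes())
```

For example, `print(encode_cert_file("ca.pem"))` prints a single-line value that can be passed as `-Dssl.cert.value=...`.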

Submit to Spark cluster

Package the application:

mvn package -Dmaven.test.skip=true -Pscala-2.12 -Pspark-3.5

Submit demo program:

docker run -it --rm \
  -v $(pwd):/demo \
  -v $(pwd)/docker/.ivy2:/opt/bitnami/spark/.ivy2 \
  -v $HOME/.m2/repository:/opt/bitnami/spark/.m2/repository \
  --network arangodb \
  docker.io/bitnami/spark:3.5.2 \
  ./bin/spark-submit --master spark://spark-master:7077 \
    --packages="com.arangodb:arangodb-spark-datasource-3.5_2.12:$ARANGO_SPARK_VERSION" \
    --class Demo /demo/target/demo-$ARANGO_SPARK_VERSION.jar    

Python (PySpark) Demo

This demo requires the same environment setup as outlined above. Additionally, install the Python requirements:

pip install -r ./python-demo/requirements.txt

Run the PySpark demo:

python ./python-demo/demo.py \
  --ssl-enabled=true \
  --endpoints=172.28.0.1:8529,172.28.0.1:8539,172.28.0.1:8549

Run it against an Oasis deployment:

python ./python-demo/demo.py \
  --password=<root-password> \
  --endpoints=<endpoint> \
  --ssl-enabled=true \
  --ssl-cert-value=<base64-encoded-cert>