Example Python (3.5) Spark application. The code performs the following actions:
- Configure and connect to a local Spark instance.
- Load two JSON-format files into Spark RDDs.
- Define and apply a schema to the RDDs to create Spark DataFrames.
- Create temporary SQL views of the data.
- Perform a join between the two datasets and output the results to the console (a sketch of this flow follows the list).
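The following is a minimal sketch of that flow. The file names under `data/`, the column names, and the schemas are illustrative assumptions, not necessarily the contents of `main.py`, and the JSON files are assumed to be in JSON Lines format (one object per line).

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Configure and connect to a local Spark instance.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("pyspark_example")
         .getOrCreate())

# Load the two JSON files into RDDs (file names are assumptions).
employees_rdd = spark.sparkContext.textFile("data/employees.json").map(json.loads)
titles_rdd = spark.sparkContext.textFile("data/titles.json").map(json.loads)

# Define explicit schemas (columns are illustrative) and apply them
# to the RDDs to create DataFrames.
employee_schema = StructType([
    StructField("employee_id", IntegerType()),
    StructField("name", StringType()),
    StructField("title_id", IntegerType()),
])
title_schema = StructType([
    StructField("title_id", IntegerType()),
    StructField("title", StringType()),
])

employees_df = spark.createDataFrame(
    employees_rdd.map(lambda d: (d["employee_id"], d["name"], d["title_id"])),
    employee_schema)
titles_df = spark.createDataFrame(
    titles_rdd.map(lambda d: (d["title_id"], d["title"])),
    title_schema)

# Register temporary SQL views over the DataFrames.
employees_df.createOrReplaceTempView("employees")
titles_df.createOrReplaceTempView("titles")

# Join the two datasets with SQL and print the result to the console.
spark.sql("""
    SELECT e.name, t.title
    FROM employees e
    JOIN titles t ON e.title_id = t.title_id
""").show()
```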
The code is designed to run in a conda virtual environment. To set up and configure:
- Run `conda env create` from the project base directory to set up the conda virtual environment.
- The code needs a local Spark installation to run against; the environment variable `SPARK_HOME` should be set to point to this location.
- Run the code via `python main.py`.
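Putting those steps together, a typical shell session might look like the following; the `SPARK_HOME` path is illustrative, not a value taken from this repo.

```bash
# Create the conda environment from environment.yml in the repo root.
conda env create

# Point Spark at a local installation (example path, adjust as needed).
export SPARK_HOME=/opt/spark

# Run the example.
python main.py
```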
Name | Description |
---|---|
data | Directory containing the employee and titles JSON datasets |
environment.yml | conda virtual environment specification |
logging.json | Logging configuration |
main.py | Main Python code |
pyspark exmaple.ipynb | Jupyter notebook containing the same code example |
utils | Utility modules |
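`logging.json` presumably holds a `logging.config.dictConfig`-style configuration loaded at startup. A minimal sketch of such a file (an assumption, not the repo's actual contents):

```json
{
    "version": 1,
    "formatters": {
        "simple": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"}
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "level": "INFO"
        }
    },
    "root": {"level": "INFO", "handlers": ["console"]}
}
```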
- virtualenv activation and deactivation is controlled automatically via the `autoenv` command, which executes the `.env` and `.env.leave` scripts (a sketch of these follows the list). See the autoenv documentation.
- The repo also contains a Jupyter notebook with the same code example. Use `jupyter notebook` to start a notebook server.
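A minimal sketch of what the two autoenv scripts might contain; the environment name `pyspark_example` is an assumption for illustration, and `.env.leave` only runs if autoenv's leave feature is enabled.

```bash
# .env — executed by autoenv on entering the project directory.
source activate pyspark_example

# .env.leave — executed by autoenv on leaving the directory.
source deactivate
```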
- Needs a proper unittest suite.
Martin Robson 23/11/2017.