Python Spark Example

Summary

An example Python (3.5) Spark application. The code performs the following actions (a minimal sketch follows the list):

  1. Configure and connect to a local Spark instance.
  2. Load two JSON-format files into Spark RDDs.
  3. Define and apply a schema to the RDDs to create Spark DataFrames.
  4. Create temporary SQL views of the data.
  5. Perform a join between the two datasets and output the results to the console.
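
The sketch below walks through the same five steps. It is not the repository's exact code: the file names, column names, and schemas are assumptions based on the employee and titles datasets described in this README.

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# 1. Configure and connect to a local Spark instance.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("spark-python3-example")
         .getOrCreate())
sc = spark.sparkContext

# 2. Load the two JSON files into RDDs (one JSON object per line assumed).
employees_rdd = sc.textFile("data/employees.json").map(json.loads)
titles_rdd = sc.textFile("data/titles.json").map(json.loads)

# 3. Define schemas and apply them to turn the RDDs into DataFrames.
emp_schema = StructType([
    StructField("emp_no", IntegerType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
])
title_schema = StructType([
    StructField("emp_no", IntegerType()),
    StructField("title", StringType()),
])
employees = spark.createDataFrame(
    employees_rdd.map(lambda d: (d["emp_no"], d["first_name"], d["last_name"])),
    emp_schema)
titles = spark.createDataFrame(
    titles_rdd.map(lambda d: (d["emp_no"], d["title"])),
    title_schema)

# 4. Create temporary SQL views of the data.
employees.createOrReplaceTempView("employees")
titles.createOrReplaceTempView("titles")

# 5. Join the two datasets and print the result to the console.
spark.sql("""
    SELECT e.emp_no, e.first_name, e.last_name, t.title
    FROM   employees e
    JOIN   titles t ON e.emp_no = t.emp_no
""").show()
```

Going through an RDD plus an explicit schema (rather than spark.read.json) keeps column types under the application's control instead of relying on schema inference.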

Setup

The code is designed to run in a conda virtual environment. To set up and configure:

  1. Run conda env create from the project base directory to set up the conda virtual environment.
  2. The code needs a local Spark installation to run against; the SPARK_HOME environment variable should be set to point to this location (a hypothetical check is sketched after this list).
  3. Run the code via python main.py.
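
Since a missing SPARK_HOME is the most likely setup failure, a pre-flight check along these lines can fail fast. This is a hypothetical snippet, not necessarily part of main.py:

```python
import os

# Hypothetical pre-flight check: fail fast if SPARK_HOME is missing or wrong
# before attempting to start Spark.
spark_home = os.environ.get("SPARK_HOME")
if not spark_home or not os.path.isdir(spark_home):
    raise EnvironmentError(
        "SPARK_HOME must be set and point to a local Spark installation")
print("Using Spark installation at {}".format(spark_home))
```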

Files

Name                   Description
data                   Directory containing the employee and titles JSON datasets
environment.yml        conda virtual environment specification
logging.json           Logging configuration (loading sketched below)
main.py                Main Python code
pyspark example.ipynb  Jupyter notebook containing the same code example
utils                  Utility modules
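
The logging.json file suggests Python's dictionary-based logging configuration. A minimal sketch of how such a file is typically loaded (an assumption; the repository's utils may do this differently):

```python
import json
import logging
import logging.config

# Load the JSON logging configuration and apply it via dictConfig.
with open("logging.json") as f:
    logging.config.dictConfig(json.load(f))

logging.getLogger(__name__).info("Logging configured")
```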

Note

  • Virtualenv activation and deactivation is controlled automatically via the autoenv command, which executes the .env and .env.leave scripts. See autoenv.
  • The repo also contains a Jupyter notebook with the same code. Use jupyter notebook to start a notebook server.

To Do

  • Needs a proper unittest suite (a minimal sketch follows).
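
As a starting point, such a test might create a local SparkSession and assert on a small join. This is an assumed structure with placeholder data; the repository does not ship these tests yet:

```python
import unittest

from pyspark.sql import SparkSession

class JoinTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Share one local SparkSession across the tests in this class.
        cls.spark = (SparkSession.builder
                     .master("local[2]")
                     .appName("unit-tests")
                     .getOrCreate())

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_join(self):
        # Placeholder rows standing in for the employee and titles datasets.
        employees = self.spark.createDataFrame(
            [(1, "Georgi", "Facello")], ["emp_no", "first_name", "last_name"])
        titles = self.spark.createDataFrame(
            [(1, "Senior Engineer")], ["emp_no", "title"])
        joined = employees.join(titles, "emp_no")
        self.assertEqual(joined.count(), 1)

if __name__ == "__main__":
    unittest.main()
```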

Martin Robson 23/11/2017.