Commit

Create a databricks-iris starter that enables packaged deployment on Databricks (#129)

* Move all downstream datasets to FileStore

Signed-off-by: Jannic Holzer <[email protected]>

* Add databricks-specific logging settings

Signed-off-by: Jannic Holzer <[email protected]>

* Move databricks_run.py

Signed-off-by: Jannic Holzer <[email protected]>

* Modify databricks logging path

Signed-off-by: Jannic Holzer <[email protected]>

* Lint

Signed-off-by: Jannic Holzer <[email protected]>

* Lint

Signed-off-by: Jannic Holzer <[email protected]>

* Revert all Databricks-related changes to pyspark-iris

Signed-off-by: Jannic Holzer <[email protected]>

* Add databricks-iris starter

Signed-off-by: Jannic Holzer <[email protected]>

* Update readme

Signed-off-by: Jannic Holzer <[email protected]>

* Update databricks-iris/{{ cookiecutter.repo_name }}/conf/base/logging.yml

Co-authored-by: Nok Lam Chan <[email protected]>

* Remove __main__.py

Signed-off-by: Jannic Holzer <[email protected]>

* Fix links

Signed-off-by: Jannic Holzer <[email protected]>

---------

Signed-off-by: Jannic Holzer <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
jmholzer and noklam authored Jun 1, 2023
1 parent b212afe commit c6a1f30
Showing 41 changed files with 1,409 additions and 8 deletions.
155 changes: 155 additions & 0 deletions databricks-iris/.gitignore
@@ -0,0 +1,155 @@
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep
.telemetry

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**
logs/**

# except their sub-folders
!data/**/
!logs/**/

# also keep all .gitkeep files
!.gitkeep

# also keep the example dataset
!data/01_raw/iris.csv


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
59 changes: 59 additions & 0 deletions databricks-iris/README.md
@@ -0,0 +1,59 @@
# The `databricks-iris` Kedro starter

## Introduction

The code in this repository demonstrates best practice when working with Kedro and PySpark on Databricks. It contains a Kedro starter template with some initial configuration and an example pipeline, and it accompanies the documentation on [developing and deploying Kedro projects on Databricks](https://docs.kedro.org/en/stable/integrations/index.html#databricks-integration).

This repository is a fork of the `pyspark-iris` starter that has been modified to run natively on Databricks.

## Getting started

The starter template can be used to start a new project using the [`starter` option](https://docs.kedro.org/en/stable/kedro_project_setup/starters.html) in `kedro new`:

```bash
kedro new --starter=databricks-iris
```

## Features

### Configuration for Databricks in `conf/base`

This starter has a base configuration that allows it to run natively on Databricks. The directories used to store data and logs must still be created manually in the user's Databricks DBFS instance:

```bash
/dbfs/FileStore/iris-databricks/data
/dbfs/FileStore/iris-databricks/logs
```
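
These directories can be created from a Databricks notebook, for example with `dbutils` (a minimal sketch; `dbutils` is only available inside Databricks notebooks):

```python
# Run once in a Databricks notebook: create the DBFS directories
# that the starter expects for data and logs.
dbutils.fs.mkdirs("dbfs:/FileStore/iris-databricks/data")
dbutils.fs.mkdirs("dbfs:/FileStore/iris-databricks/logs")
```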

See the documentation on deploying a packaged Kedro project to Databricks for more information.

### Databricks entry point

The starter contains a script, `databricks_run.py`, that serves as the entry point enabling a packaged project created with this starter to run on Databricks. See the documentation on deploying a packaged Kedro project to Databricks for more information.
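
For illustration, a minimal sketch of such an entry-point script is shown below. The argument names are illustrative, and it assumes a Kedro version where `KedroSession.create` accepts `conf_source`; the starter's actual script may differ:

```python
import argparse

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def main():
    # Arguments that a Databricks job passes to the packaged project's entry point.
    parser = argparse.ArgumentParser()
    parser.add_argument("--package-name", dest="package_name", required=True)
    parser.add_argument("--env", dest="env", default=None)
    parser.add_argument("--conf-source", dest="conf_source", default=None)
    args = parser.parse_args()

    # Point Kedro at the packaged project, then run the default pipeline.
    configure_project(args.package_name)
    with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
        session.run()


if __name__ == "__main__":
    main()
```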

### Single configuration in `/conf/base/spark.yml`

While Spark allows you to specify many different [configuration options](https://spark.apache.org/docs/latest/configuration.html), this starter uses `/conf/base/spark.yml` as a single configuration location.
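
For illustration, entries in this file might look like the following (example values, not necessarily the starter's exact contents):

```yaml
# conf/base/spark.yml -- applied to the SparkConf at session creation
spark.driver.maxResultSize: 3g
spark.sql.execution.arrow.pyspark.enabled: true
spark.scheduler.mode: FAIR
```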

### `SparkSession` initialisation

This Kedro starter contains the initialisation code for `SparkSession` in the `ProjectContext` and takes its configuration from `/conf/base/spark.yml`. Modify this code if you want to further customise your `SparkSession`, e.g. to use [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).
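
A condensed sketch of this pattern follows; the function name is illustrative and the starter's exact wiring may differ:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession


def init_spark_session(context):
    # Load every entry matching spark.yml from the project's configuration.
    parameters = context.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())

    # Build (or reuse) a session configured from conf/base/spark.yml.
    spark = (
        SparkSession.builder.appName("iris-databricks")
        .config(conf=spark_conf)
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("WARN")
    return spark
```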

### Configures `MemoryDataSet` to work with Spark objects

Out of the box, Kedro's `MemoryDataSet` works with Spark's `DataFrame`. However, it doesn't work with other Spark objects, such as machine learning models, unless you add further configuration. This Kedro starter demonstrates how to configure `MemoryDataSet` for Spark's machine learning models in `catalog.yml`.

> Note: Using `MemoryDataSet` to propagate Spark's `DataFrame` between nodes in the pipeline is encouraged. A best practice is to delay triggering Spark actions for as long as possible to take advantage of Spark's lazy evaluation.
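
A catalog entry for a Spark ML model typically sets `copy_mode: assign`, so the object is stored by reference rather than deep-copied (a sketch; the dataset name is illustrative):

```yaml
# conf/base/catalog.yml
example_classifier:
  type: MemoryDataSet
  copy_mode: assign
```
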
### An example machine learning pipeline that uses only `PySpark` and `Kedro`

![Iris Pipeline Visualisation](./images/iris_pipeline.png)

This Kedro starter uses the simple and familiar [Iris dataset](https://www.kaggle.com/uciml/iris). It contains the code for an example machine learning pipeline that runs a 1-nearest neighbour classifier to classify an iris.
[Transcoding](https://kedro.readthedocs.io/en/stable/data/data_catalog.html#transcoding-datasets) is used to convert the Spark DataFrames into pandas DataFrames after splitting the data into training and testing sets.

The pipeline includes:

* A node to split the data into training and testing datasets using a configurable ratio (a sketch of this node follows the list)
* A node to run a simple 1-nearest neighbour classifier and make predictions
* A node to report the accuracy of the predictions performed by the model
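
For illustration, the splitting node might look like the following sketch (function and parameter names are illustrative):

```python
from typing import Any, Dict, Tuple

from pyspark.sql import DataFrame


def split_data(
    data: DataFrame, parameters: Dict[str, Any]
) -> Tuple[DataFrame, DataFrame]:
    # Randomly split the Spark DataFrame using a configurable train fraction.
    train_fraction = parameters["train_fraction"]
    data_train, data_test = data.randomSplit([train_fraction, 1 - train_fraction])
    return data_train, data_test
```
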
6 changes: 6 additions & 0 deletions databricks-iris/cookiecutter.json
@@ -0,0 +1,6 @@
{
"project_name": "Iris",
"repo_name": "{{ cookiecutter.project_name.strip().replace(' ', '-').replace('_', '-').lower() }}",
"python_package": "{{ cookiecutter.project_name.strip().replace(' ', '_').replace('-', '_').lower() }}",
"kedro_version": "{{ cookiecutter.kedro_version }}"
}
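
For example, a project named `Iris Example` yields the repository name `iris-example` and the Python package name `iris_example`.
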
18 changes: 18 additions & 0 deletions databricks-iris/credentials.yml
@@ -0,0 +1,18 @@
# Here you can define credentials for different datasets and environments.
#
# THIS FILE MUST BE PLACED IN `conf/local`. DO NOT PUSH THIS FILE TO GitHub.
#
# Example:
#
# dev_s3:
# client_kwargs:
# aws_access_key_id: token
# aws_secret_access_key: key
#
# prod_s3:
# aws_access_key_id: token
# aws_secret_access_key: key
#
# dev_sql:
# username: admin
# password: admin
Binary file added databricks-iris/images/iris_pipeline.png
9 changes: 9 additions & 0 deletions databricks-iris/prompts.yml
@@ -0,0 +1,9 @@
project_name:
title: "Project Name"
text: |
Please enter a human-readable name for your new project.
Spaces, hyphens, and underscores are allowed.
regex_validator: "^[\\w -]{2,}$"
error_message: |
It must contain only alphanumeric symbols, spaces, underscores and hyphens and
be at least 2 characters long.
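
For example, `Iris` and `my-new-project` pass this validation, while a single-character name such as `X` does not.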
