Commit: Create a databricks-iris starter that enables packaged deployment on Databricks (#129)

* Move all downstream datasets to FileStore
* Add Databricks-specific logging settings
* Move databricks_run.py
* Modify Databricks logging path
* Lint
* Lint
* Revert all Databricks-related changes to pyspark-iris
* Add databricks-iris starter
* Update readme
* Update databricks-iris/{{ cookiecutter.repo_name }}/conf/base/logging.yml
* Remove __main__.py
* Fix links

Signed-off-by: Jannic Holzer <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Showing 41 changed files with 1,409 additions and 8 deletions.
**`.gitignore`** (new file)

```
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep
.telemetry

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**
logs/**

# except their sub-folders
!data/**/
!logs/**/

# also keep all .gitkeep files
!.gitkeep

# also keep the example dataset
!data/01_raw/iris.csv


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
```
**`README.md`** (new file)
# The `databricks-iris` Kedro starter

## Introduction

The code in this repository demonstrates best practice when working with Kedro and PySpark on Databricks. It contains a Kedro starter template with some initial configuration and an example pipeline, and it accompanies the documentation on [developing and deploying Kedro projects on Databricks](https://docs.kedro.org/en/stable/integrations/index.html#databricks-integration).

This repository is a fork of the `pyspark-iris` starter, modified to run natively on Databricks.

## Getting started

The starter template can be used to start a new project using the [`starter` option](https://docs.kedro.org/en/stable/kedro_project_setup/starters.html) of `kedro new`:

```bash
kedro new --starter=databricks-iris
```

## Features

### Configuration for Databricks in `conf/base`

This starter has a base configuration that allows it to run natively on Databricks. The directories used to store data and logs still need to be created manually in the user's Databricks DBFS instance:

```bash
/dbfs/FileStore/iris-databricks/data
/dbfs/FileStore/iris-databricks/logs
```
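One way to create them is from a Databricks notebook; the following is a minimal sketch, assuming a notebook environment where `dbutils` is predefined, with paths matching the defaults above:

```python
# Sketch: create the DBFS directories the starter expects.
# Assumes a Databricks notebook, where `dbutils` is available by default.
for path in (
    "dbfs:/FileStore/iris-databricks/data",
    "dbfs:/FileStore/iris-databricks/logs",
):
    dbutils.fs.mkdirs(path)  # creates parents too; succeeds if the directory exists
```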
See the documentation on deploying a packaged Kedro project to Databricks for more information.

### Databricks entry point

The starter contains a script and an entry point (`databricks_run.py`) that enable a packaged project created with this starter to run on Databricks. See the documentation on deploying a packaged Kedro project to Databricks for more information.
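As a rough sketch of how such an entry point typically works (a hedged reconstruction, not necessarily the starter's exact file), it parses the run environment, configuration source, and package name from the command line, then runs the project through a `KedroSession`:

```python
# Hypothetical sketch of a databricks_run.py-style entry point for a
# packaged Kedro project; the starter's shipped file may differ in detail.
import argparse

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", dest="env", type=str)
    parser.add_argument("--conf-source", dest="conf_source", type=str)
    parser.add_argument("--package-name", dest="package_name", type=str)
    args = parser.parse_args()

    # Point Kedro at the installed package, then run the default pipeline.
    configure_project(args.package_name)
    with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
        session.run()


if __name__ == "__main__":
    main()
```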
### Single configuration in `/conf/base/spark.yml`

While Spark allows you to specify many different [configuration options](https://spark.apache.org/docs/latest/configuration.html), this starter uses `/conf/base/spark.yml` as a single configuration location.
### `SparkSession` initialisation

This Kedro starter contains the initialisation code for `SparkSession` in the `ProjectContext` and takes its configuration from `/conf/base/spark.yml`. Modify this code if you want to further customise your `SparkSession`, e.g. to use [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).

### Configures `MemoryDataSet` to work with Spark objects

Out of the box, Kedro's `MemoryDataSet` works with Spark's `DataFrame`. However, it doesn't work with other Spark objects, such as machine learning models, unless you add further configuration. This Kedro starter demonstrates how to configure `MemoryDataSet` for a Spark machine learning model in `catalog.yml`.

> Note: The use of `MemoryDataSet` is encouraged to propagate Spark's `DataFrame` between nodes in the pipeline. A best practice is to delay triggering Spark actions for as long as needed to take advantage of Spark's lazy evaluation.
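A minimal sketch of such a catalog entry (the dataset name `example_classifier` is hypothetical) uses `copy_mode: assign`, so the non-serialisable Spark model is passed between nodes by reference rather than deep-copied:

```yaml
# catalog.yml -- illustrative entry for keeping a Spark ML model in memory
example_classifier:
  type: MemoryDataSet
  copy_mode: assign
```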
### An example machine learning pipeline that uses only `PySpark` and `Kedro`

![Iris Pipeline Visualisation](./images/iris_pipeline.png)

This Kedro starter uses the simple and familiar [Iris dataset](https://www.kaggle.com/uciml/iris). It contains the code for an example machine learning pipeline that runs a 1-nearest-neighbour classifier to classify an iris. [Transcoding](https://kedro.readthedocs.io/en/stable/data/data_catalog.html#transcoding-datasets) is used to convert the Spark DataFrames into pandas DataFrames after splitting the data into training and testing sets.
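In the catalog, transcoding registers the same file under two suffixed names, one per framework. A hypothetical sketch (the dataset name and path are illustrative, not the starter's exact entries):

```yaml
# catalog.yml -- illustrative transcoded dataset, written by Spark, read by pandas
X_train@pyspark:
  type: spark.SparkDataSet
  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/X_train.parquet
  file_format: parquet

X_train@pandas:
  type: pandas.ParquetDataSet
  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/X_train.parquet
```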
The pipeline includes:

* A node to split the data into training and testing datasets using a configurable ratio
* A node to run a simple 1-nearest-neighbour classifier and make predictions
* A node to report the accuracy of the predictions made by the model
**`cookiecutter.json`** (new file)
```json
{
    "project_name": "Iris",
    "repo_name": "{{ cookiecutter.project_name.strip().replace(' ', '-').replace('_', '-').lower() }}",
    "python_package": "{{ cookiecutter.project_name.strip().replace(' ', '_').replace('-', '_').lower() }}",
    "kedro_version": "{{ cookiecutter.kedro_version }}"
}
```
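These Jinja expressions derive the repository and package names from whatever project name the user enters at `kedro new` time: a hypothetical project name of `Spark Iris`, for example, would yield a `repo_name` of `spark-iris` and a `python_package` of `spark_iris`.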
**`conf/local/credentials.yml`** (new file)
```yaml
# Here you can define credentials for different data sets and environment.
#
# THIS FILE MUST BE PLACED IN `conf/local`. DO NOT PUSH THIS FILE TO GitHub.
#
# Example:
#
# dev_s3:
#   client_kwargs:
#     aws_access_key_id: token
#     aws_secret_access_key: key
#
# prod_s3:
#   aws_access_key_id: token
#   aws_secret_access_key: key
#
# dev_sql:
#   username: admin
#   password: admin
```
**Binary file** (new, not rendered in this view; likely `images/iris_pipeline.png`, the pipeline visualisation referenced by the README)
**`prompts.yml`** (new file)
```yaml
project_name:
  title: "Project Name"
  text: |
    Please enter a human readable name for your new project.
    Spaces, hyphens, and underscores are allowed.
  regex_validator: "^[\\w -]{2,}$"
  error_message: |
    It must contain only alphanumeric symbols, spaces, underscores and hyphens and
    be at least 2 characters long.
```
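The validator accepts names of two or more characters drawn from word characters, spaces, and hyphens; a hypothetical `Spark Iris` passes, while a single-character name is rejected.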