Commit

Create a databricks-iris starter that enables packaged deployment on Databricks (#129)

* Move all downstream datasets to FileStore

Signed-off-by: Jannic Holzer <[email protected]>

* Add databricks-specific logging settings

Signed-off-by: Jannic Holzer <[email protected]>

* Move databricks_run.py

Signed-off-by: Jannic Holzer <[email protected]>

* Modify databricks logging path

Signed-off-by: Jannic Holzer <[email protected]>

* Lint

Signed-off-by: Jannic Holzer <[email protected]>

* Lint

Signed-off-by: Jannic Holzer <[email protected]>

* Revert all Databricks-related changes to pyspark-iris

Signed-off-by: Jannic Holzer <[email protected]>

* Add databricks-iris starter

Signed-off-by: Jannic Holzer <[email protected]>

* Update readme

Signed-off-by: Jannic Holzer <[email protected]>

* Update databricks-iris/{{ cookiecutter.repo_name }}/conf/base/logging.yml

Co-authored-by: Nok Lam Chan <[email protected]>

* Remove __main__.py

Signed-off-by: Jannic Holzer <[email protected]>

* Fix links

Signed-off-by: Jannic Holzer <[email protected]>

---------

Signed-off-by: Jannic Holzer <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
jmholzer and noklam authored Jun 1, 2023
1 parent b212afe commit c6a1f30
Showing 41 changed files with 1,409 additions and 8 deletions.
155 changes: 155 additions & 0 deletions databricks-iris/.gitignore
@@ -0,0 +1,155 @@
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep
.telemetry

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**
logs/**

# except their sub-folders
!data/**/
!logs/**/

# also keep all .gitkeep files
!.gitkeep

# also keep the example dataset
!data/01_raw/iris.csv


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
59 changes: 59 additions & 0 deletions databricks-iris/README.md
@@ -0,0 +1,59 @@
# The `databricks-iris` Kedro starter

## Introduction

The code in this repository demonstrates best practice when working with Kedro and PySpark on Databricks. It contains a Kedro starter template with some initial configuration and an example pipeline, and it accompanies the documentation on [developing and deploying Kedro projects on Databricks](https://docs.kedro.org/en/stable/integrations/index.html#databricks-integration).

This repository is a fork of the `pyspark-iris` starter that has been modified to run natively on Databricks.

## Getting started

The starter template can be used to start a new project using the [`starter` option](https://docs.kedro.org/en/stable/kedro_project_setup/starters.html) in `kedro new`:

```bash
kedro new --starter=databricks-iris
```

## Features

### Configuration for Databricks in `conf/base`

This starter has a base configuration that allows it to run natively on Databricks. The directories used to store data and logs must still be created manually in the user's Databricks DBFS instance:

```bash
/dbfs/FileStore/iris-databricks/data
/dbfs/FileStore/iris-databricks/logs
```
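
These directories can be created from a Databricks notebook, for example with `dbutils` (a minimal sketch; `dbutils` is only available inside Databricks notebooks):

```python
# Run once in a Databricks notebook: create the DBFS directories
# that the starter expects for data and logs.
dbutils.fs.mkdirs("dbfs:/FileStore/iris-databricks/data")
dbutils.fs.mkdirs("dbfs:/FileStore/iris-databricks/logs")
```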

See the documentation on deploying a packaged Kedro project to Databricks for more information.

### Databricks entry point

The starter contains a script, `databricks_run.py`, that serves as the entry point enabling a packaged project created with this starter to run on Databricks. See the documentation on deploying a packaged Kedro project to Databricks for more information.
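
For illustration, a minimal sketch of such an entry-point script is shown below. The argument names are illustrative, and it assumes a Kedro version where `KedroSession.create` accepts `conf_source`; the starter's actual script may differ:

```python
import argparse

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def main():
    # Arguments that a Databricks job passes to the packaged project's entry point.
    parser = argparse.ArgumentParser()
    parser.add_argument("--package-name", dest="package_name", required=True)
    parser.add_argument("--env", dest="env", default=None)
    parser.add_argument("--conf-source", dest="conf_source", default=None)
    args = parser.parse_args()

    # Point Kedro at the packaged project, then run the default pipeline.
    configure_project(args.package_name)
    with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
        session.run()


if __name__ == "__main__":
    main()
```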

### Single configuration in `/conf/base/spark.yml`

While Spark allows you to specify many different [configuration options](https://spark.apache.org/docs/latest/configuration.html), this starter uses `/conf/base/spark.yml` as a single configuration location.
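
For illustration, entries in this file might look like the following (example values, not necessarily the starter's exact contents):

```yaml
# conf/base/spark.yml -- applied to the SparkConf at session creation
spark.driver.maxResultSize: 3g
spark.sql.execution.arrow.pyspark.enabled: true
spark.scheduler.mode: FAIR
```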

### `SparkSession` initialisation

This Kedro starter contains the initialisation code for `SparkSession` in the `ProjectContext` and takes its configuration from `/conf/base/spark.yml`. Modify this code if you want to further customise your `SparkSession`, e.g. to use [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).
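
A condensed sketch of this pattern follows; the function name is illustrative and the starter's exact wiring may differ:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession


def init_spark_session(context):
    # Load every entry matching spark.yml from the project's configuration.
    parameters = context.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())

    # Build (or reuse) a session configured from conf/base/spark.yml.
    spark = (
        SparkSession.builder.appName("iris-databricks")
        .config(conf=spark_conf)
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("WARN")
    return spark
```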

### Configures `MemoryDataSet` to work with Spark objects

Out of the box, Kedro's `MemoryDataSet` works with Spark's `DataFrame`. However, it doesn't work with other Spark objects, such as machine learning models, unless you add further configuration. This Kedro starter demonstrates how to configure `MemoryDataSet` for Spark's machine learning models in `catalog.yml`.

> Note: Using `MemoryDataSet` to propagate Spark's `DataFrame` between nodes in the pipeline is encouraged. A best practice is to delay triggering Spark actions for as long as possible to take advantage of Spark's lazy evaluation.
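
A catalog entry for a Spark ML model typically sets `copy_mode: assign`, so the object is stored by reference rather than deep-copied (a sketch; the dataset name is illustrative):

```yaml
# conf/base/catalog.yml
example_classifier:
  type: MemoryDataSet
  copy_mode: assign
```
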
### An example machine learning pipeline that uses only `PySpark` and `Kedro`

![Iris Pipeline Visualisation](./images/iris_pipeline.png)

This Kedro starter uses the simple and familiar [Iris dataset](https://www.kaggle.com/uciml/iris). It contains the code for an example machine learning pipeline that runs a 1-nearest neighbour classifier to classify an iris.
[Transcoding](https://kedro.readthedocs.io/en/stable/data/data_catalog.html#transcoding-datasets) is used to convert the Spark DataFrames into pandas DataFrames after splitting the data into training and testing sets.

The pipeline includes:

* A node to split the data into training and testing datasets using a configurable ratio (a sketch of this node follows the list)
* A node to run a simple 1-nearest neighbour classifier and make predictions
* A node to report the accuracy of the predictions performed by the model
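
For illustration, the splitting node might look like the following sketch (function and parameter names are illustrative):

```python
from typing import Any, Dict, Tuple

from pyspark.sql import DataFrame


def split_data(
    data: DataFrame, parameters: Dict[str, Any]
) -> Tuple[DataFrame, DataFrame]:
    # Randomly split the Spark DataFrame using a configurable train fraction.
    train_fraction = parameters["train_fraction"]
    data_train, data_test = data.randomSplit([train_fraction, 1 - train_fraction])
    return data_train, data_test
```
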
6 changes: 6 additions & 0 deletions databricks-iris/cookiecutter.json
@@ -0,0 +1,6 @@
{
"project_name": "Iris",
"repo_name": "{{ cookiecutter.project_name.strip().replace(' ', '-').replace('_', '-').lower() }}",
"python_package": "{{ cookiecutter.project_name.strip().replace(' ', '_').replace('-', '_').lower() }}",
"kedro_version": "{{ cookiecutter.kedro_version }}"
}
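
For example, a project named `Iris Example` yields the repository name `iris-example` and the Python package name `iris_example`.
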
18 changes: 18 additions & 0 deletions databricks-iris/credentials.yml
@@ -0,0 +1,18 @@
# Here you can define credentials for different datasets and environments.
#
# THIS FILE MUST BE PLACED IN `conf/local`. DO NOT PUSH THIS FILE TO GitHub.
#
# Example:
#
# dev_s3:
# client_kwargs:
# aws_access_key_id: token
# aws_secret_access_key: key
#
# prod_s3:
# aws_access_key_id: token
# aws_secret_access_key: key
#
# dev_sql:
# username: admin
# password: admin
Binary file added databricks-iris/images/iris_pipeline.png
9 changes: 9 additions & 0 deletions databricks-iris/prompts.yml
@@ -0,0 +1,9 @@
project_name:
title: "Project Name"
text: |
Please enter a human-readable name for your new project.
Spaces, hyphens, and underscores are allowed.
regex_validator: "^[\\w -]{2,}$"
error_message: |
It must contain only alphanumeric symbols, spaces, underscores and hyphens and
be at least 2 characters long.
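
For example, `Iris` and `my-new-project` pass this validation, while a single-character name such as `X` does not.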
