Papermill Tutorial (#345)
* Papermill tutorial

Signed-off-by: Samhita Alla <[email protected]>

* github action, tests

Signed-off-by: Samhita Alla <[email protected]>

* docs-related change

Signed-off-by: Samhita Alla <[email protected]>

* dataclass

Signed-off-by: Samhita Alla <[email protected]>
samhita-alla authored Aug 13, 2021
1 parent 54656b8 commit 86e3e08
Showing 17 changed files with 4,398 additions and 38 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/ghcr_push.yml
@@ -43,6 +43,8 @@ jobs:
           path: case_studies/ml_training
         - name: mnist_classifier
           path: case_studies/ml_training
+        - name: eda
+          path: case_studies/feature_engineering
     steps:
       - uses: actions/checkout@v2
         with:
38 changes: 0 additions & 38 deletions .github/workflows/sphinx_build.yml

This file was deleted.

45 changes: 45 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/Dockerfile
@@ -0,0 +1,45 @@
FROM ubuntu:focal

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root

RUN : \
    && apt-get update \
    && apt install -y software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa

RUN : \
    && apt-get update \
    && apt-get install -y python3.8 python3-pip python3-venv make build-essential libssl-dev curl vim

# This is necessary for opencv to work
RUN apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev ffmpeg

# Install the AWS cli separately to prevent issues with boto being written over
RUN pip3 install awscli

# Virtual environment
RUN python3.8 -m venv ${VENV}
RUN ${VENV}/bin/pip install wheel

# Install Python dependencies
COPY eda/requirements.txt /root
RUN ${VENV}/bin/pip install -r /root/requirements.txt

# Copy the actual code
COPY eda/ /root/eda/

# Copy over the helper script that the SDK relies on
RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/
RUN chmod a+x /usr/local/bin/flytekit_venv

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag

# Enable the virtualenv for this image. Note this relies on the VENV variable we've set in this image.
ENTRYPOINT ["/usr/local/bin/flytekit_venv"]
3 changes: 3 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/Makefile
@@ -0,0 +1,3 @@
PREFIX=eda
include ../../../common/Makefile
include ../../../common/leaf.mk
58 changes: 58 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/README.rst
@@ -0,0 +1,58 @@
EDA, Feature Engineering, and Modeling With Papermill
=====================================================

Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data to discover patterns,
spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

EDA cannot be implemented solely within Flyte because it requires visual inspection of the data.
A Jupyter notebook is a natural fit for such scenarios, as it makes it easy to visualize the data and engineer features interactively.

**Now the question is, how do we leverage the power of Jupyter Notebook within Flyte to perform EDA on the data?**

Papermill
---------

`Papermill <https://papermill.readthedocs.io/en/latest/>`__ is a tool for parameterizing and executing Jupyter Notebooks.
Papermill lets you:

- parameterize notebooks
- execute notebooks

Flyte ships a pre-packaged Papermill plugin that lets you leverage the power of Jupyter notebooks within Flyte pipelines.

To install the plugin, run the following command:

.. prompt:: bash $

    pip install flytekitplugins-papermill
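
Once installed, wrapping a notebook as a Flyte task looks roughly like the following sketch
(``my_notebook.ipynb``, the task name, and the variable names are placeholders, not part of this tutorial's code):

.. code-block:: python

    from flytekit import kwtypes
    from flytekitplugins.papermill import NotebookTask

    nb_task = NotebookTask(
        name="my-notebook-task",
        notebook_path="my_notebook.ipynb",
        inputs=kwtypes(x=int),
        outputs=kwtypes(y=float),
    )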

Examples
--------

There are three code examples that you can refer to in this tutorial:

- Run the whole pipeline (EDA + feature engineering + modeling) in one notebook
- Run EDA and feature engineering in one notebook, fetch the result (the EDA'ed and feature-engineered dataset),
  and model the data in a Flyte task by sending the dataset as an argument
- Run EDA and feature engineering in one notebook, fetch the result (the EDA'ed and feature-engineered dataset),
  and model the data in another notebook by sending the dataset as an argument

Notebook Etiquette
^^^^^^^^^^^^^^^^^^

- If you want to send inputs and receive outputs, your Jupyter notebook must have cells tagged ``parameters`` and ``outputs``, respectively.
  To set up tags in a notebook, follow this `guide <https://jupyterbook.org/content/metadata.html#adding-tags-using-notebook-interfaces>`__.
- The ``parameters`` cell must contain only the input variables.
- The ``outputs`` cell looks like the following (a complete pairing is sketched after this list):

  .. code-block:: python

      from flytekitplugins.papermill import record_outputs

      record_outputs(variable_name=variable_name)

  You can record any number of variables this way.
- The ``inputs`` and ``outputs`` variable names in the ``NotebookTask`` must match the variable names in the notebook.
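
For instance, a minimal notebook following these conventions contains two tagged cells, roughly like this sketch
(the variable names and values here are illustrative placeholders):

.. code-block:: python

    # Cell tagged "parameters" -- Papermill overwrites these values at execution time.
    n_estimators = 100

.. code-block:: python

    # Cell tagged "outputs" -- records the values handed back to Flyte.
    from flytekitplugins.papermill import record_outputs

    mae_score = 1.23  # placeholder for the real metric computed in the notebook
    record_outputs(mae_score=mae_score)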

.. note::
   On running the Python code files, you will see three outputs even though only a single output is declared:
   the declared output itself, the executed notebook, and the rendered HTML of the notebook.
Empty file.
72 changes: 72 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/notebook.py
@@ -0,0 +1,72 @@
"""
Flyte Pipeline in One Jupyter Notebook
=======================================
In this example, we will implement a simple pipeline that takes hyperparameters, does EDA, feature engineering, and measures the Gradient
Boosting model's performace using mean absolute error (MAE), all in one notebook.
"""

# %%
# First, let's import the libraries we will use in this example.
import os
import pathlib

from flytekit import Resources, kwtypes, workflow
from flytekitplugins.papermill import NotebookTask

# %%
# We define a ``NotebookTask`` to run the Jupyter notebook.
#
# .. list-table:: ``NotebookTask`` Parameters
#    :widths: 25 25
#
#    * - ``notebook_path``
#      - Path to the Jupyter notebook file
#    * - ``inputs``
#      - Inputs to be sent to the notebook
#    * - ``outputs``
#      - Outputs to be returned from the notebook
#    * - ``requests``
#      - Compute resource requests for the task
#
# This notebook returns ``mae_score`` as the output.
nb = NotebookTask(
    name="pipeline-nb",
    notebook_path=os.path.join(
        pathlib.Path(__file__).parent.absolute(), "supermarket_regression.ipynb"
    ),
    inputs=kwtypes(
        n_estimators=int,
        max_depth=int,
        max_features=str,
        min_samples_split=int,
        random_state=int,
    ),
    outputs=kwtypes(mae_score=float),
    requests=Resources(mem="500Mi"),
)
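
# %%
# For reference, the tagged cells inside ``supermarket_regression.ipynb`` look roughly
# like the following sketch (the values and the MAE computation are illustrative; the
# variable names must match the ``inputs`` and ``outputs`` declared above):
#
# .. code-block:: python
#
#    # Cell tagged "parameters" -- Papermill overwrites these values at execution time.
#    n_estimators = 150
#    max_depth = 3
#    max_features = "sqrt"
#    min_samples_split = 4
#    random_state = 2
#
# .. code-block:: python
#
#    # Cell tagged "outputs" -- records the value handed back to Flyte.
#    from flytekitplugins.papermill import record_outputs
#
#    record_outputs(mae_score=mae_score)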

# %%
# Since the notebook itself serves as the task, no separate task needs to be defined;
# we create a ``workflow`` that runs it and returns the MAE score.
@workflow
def notebook_wf(
    n_estimators: int = 150,
    max_depth: int = 3,
    max_features: str = "sqrt",
    min_samples_split: int = 4,
    random_state: int = 2,
) -> float:
    output = nb(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        min_samples_split=min_samples_split,
        random_state=random_state,
    )
    return output.mae_score


# %%
# We can now run the notebook locally.
if __name__ == "__main__":
    print(notebook_wf())
106 changes: 106 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/notebook_and_task.py
@@ -0,0 +1,106 @@
"""
EDA and Feature Engineering in Jupyter Notebook and Modeling in a Flyte Task
============================================================================
In this example, we will implement a simple pipeline that takes hyperparameters, does EDA, feature engineering
(step 1: EDA and feature engineering in notebook), and measures the Gradient Boosting model's performace using mean absolute error (MAE)
(step 2: Modeling in a Flyte Task).
"""

# %%
# First, let's import the libraries we will use in this example.
import os
import pathlib
from dataclasses import dataclass

import numpy as np
import pandas as pd
from dataclasses_json import dataclass_json
from flytekit import Resources, kwtypes, task, workflow
from flytekitplugins.papermill import NotebookTask
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import RobustScaler


# %%
# We define a ``dataclass`` to store the hyperparameters of the Gradient Boosting Regressor.
@dataclass_json
@dataclass
class Hyperparameters(object):
    n_estimators: int = 150
    max_depth: int = 3
    max_features: str = "sqrt"
    min_samples_split: int = 4
    random_state: int = 2
    nfolds: int = 10
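
# %%
# Thanks to ``dataclass_json``, an instance round-trips through JSON, which is how Flyte
# serializes it between tasks -- a quick sketch:
#
# .. code-block:: python
#
#    params = Hyperparameters(n_estimators=200)
#    assert Hyperparameters.from_json(params.to_json()) == params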


# %%
# We define a ``NotebookTask`` to run the Jupyter notebook.
# This notebook returns ``dummified_data`` and ``dataset`` as the outputs.
#
# .. note::
# ``dummified_data`` is used in this example, and ``dataset`` is used in the upcoming example.
nb = NotebookTask(
    name="eda-feature-eng-nb",
    notebook_path=os.path.join(
        pathlib.Path(__file__).parent.absolute(), "supermarket_regression_1.ipynb"
    ),
    outputs=kwtypes(dummified_data=pd.DataFrame, dataset=str),
    requests=Resources(mem="500Mi"),
)
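
# %%
# Inside ``supermarket_regression_1.ipynb``, the cell tagged ``outputs`` records both
# declared outputs -- roughly like this sketch (the feature-engineering code that
# produces ``dummified_data`` and ``dataset`` is elided):
#
# .. code-block:: python
#
#    from flytekitplugins.papermill import record_outputs
#
#    record_outputs(dummified_data=dummified_data, dataset=dataset)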

# %%
# Next, we define a ``cross_validate`` function and a ``modeling`` task to compute
# the MAE score of the Gradient Boosting Regressor on the data.
def cross_validate(model, nfolds, feats, targets):
    # scikit-learn's "neg_mean_absolute_error" scoring returns negated MAE values,
    # so negate the scores again to obtain positive error values.
    score = -1 * (
        cross_val_score(
            model, feats, targets, cv=nfolds, scoring="neg_mean_absolute_error"
        )
    )
    return np.mean(score)


@task
def modeling(
    dataset: pd.DataFrame,
    hyperparams: Hyperparameters,
) -> float:
    y_target = dataset["Product_Supermarket_Sales"].tolist()
    dataset.drop(["Product_Supermarket_Sales"], axis=1, inplace=True)

    X_train, X_test, y_train, _ = train_test_split(dataset, y_target, test_size=0.3)

    scaler = RobustScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    gb_model = GradientBoostingRegressor(
        n_estimators=hyperparams.n_estimators,
        max_depth=hyperparams.max_depth,
        max_features=hyperparams.max_features,
        min_samples_split=hyperparams.min_samples_split,
        random_state=hyperparams.random_state,
    )

    return cross_validate(gb_model, hyperparams.nfolds, X_train, y_train)


# %%
# We define a ``workflow`` to run the notebook and the ``modeling`` task.
@workflow
def notebook_wf(hyperparams: Hyperparameters = Hyperparameters()) -> float:
    output = nb()
    mae_score = modeling(dataset=output.dummified_data, hyperparams=hyperparams)
    return mae_score


# %%
# We can now run the notebook and the modeling task locally.
if __name__ == "__main__":
    print(notebook_wf())
