Papermill Tutorial (#345)
* Papermill tutorial

Signed-off-by: Samhita Alla <[email protected]>

* github action, tests

Signed-off-by: Samhita Alla <[email protected]>

* docs-related change

Signed-off-by: Samhita Alla <[email protected]>

* dataclass

Signed-off-by: Samhita Alla <[email protected]>
samhita-alla authored Aug 13, 2021
1 parent 54656b8 commit 86e3e08
Showing 17 changed files with 4,398 additions and 38 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/ghcr_push.yml
@@ -43,6 +43,8 @@ jobs:
           path: case_studies/ml_training
         - name: mnist_classifier
           path: case_studies/ml_training
+        - name: eda
+          path: case_studies/feature_engineering
     steps:
       - uses: actions/checkout@v2
         with:
38 changes: 0 additions & 38 deletions .github/workflows/sphinx_build.yml

This file was deleted.

45 changes: 45 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/Dockerfile
@@ -0,0 +1,45 @@
FROM ubuntu:focal

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root

RUN : \
    && apt-get update \
    && apt install -y software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa

RUN : \
    && apt-get update \
    && apt-get install -y python3.8 python3-pip python3-venv make build-essential libssl-dev curl vim

# This is necessary for opencv to work
RUN apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev ffmpeg

# Install the AWS cli separately to prevent issues with boto being written over
RUN pip3 install awscli

# Virtual environment
RUN python3.8 -m venv ${VENV}
RUN ${VENV}/bin/pip install wheel

# Install Python dependencies
COPY eda/requirements.txt /root
RUN ${VENV}/bin/pip install -r /root/requirements.txt

# Copy the actual code
COPY eda/ /root/eda/

# Copy over the helper script that the SDK relies on
RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/
RUN chmod a+x /usr/local/bin/flytekit_venv

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag

# Enable the virtualenv for this image. Note this relies on the VENV variable we've set in this image.
ENTRYPOINT ["/usr/local/bin/flytekit_venv"]
3 changes: 3 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/Makefile
@@ -0,0 +1,3 @@
PREFIX=eda
include ../../../common/Makefile
include ../../../common/leaf.mk
58 changes: 58 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/README.rst
@@ -0,0 +1,58 @@
EDA, Feature Engineering, and Modeling With Papermill
=====================================================

Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data to discover patterns,
spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

EDA cannot be implemented solely within Flyte because it requires visual inspection of the data.
A Jupyter notebook is a natural fit for such scenarios, as it makes it easy to visualize the data and engineer features interactively.

**Now the question is, how do we leverage the power of Jupyter Notebook within Flyte to perform EDA on the data?**

Papermill
---------

`Papermill <https://papermill.readthedocs.io/en/latest/>`__ is a tool for parameterizing and executing Jupyter Notebooks.
Papermill lets you:

- parameterize notebooks
- execute notebooks

Flyte ships a pre-packaged Papermill plugin that lets you leverage the power of Jupyter notebooks within Flyte pipelines.

To install the plugin, run the following command:

.. prompt:: bash $

    pip install flytekitplugins-papermill
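
Once installed, wrapping a notebook as a Flyte task looks roughly like the following sketch
(``my_notebook.ipynb``, the task name, and the variable names are placeholders, not part of this tutorial's code):

.. code-block:: python

    from flytekit import kwtypes
    from flytekitplugins.papermill import NotebookTask

    nb_task = NotebookTask(
        name="my-notebook-task",
        notebook_path="my_notebook.ipynb",
        inputs=kwtypes(x=int),
        outputs=kwtypes(y=float),
    )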

Examples
--------

There are three code examples that you can refer to in this tutorial:

- Run the whole pipeline (EDA + feature engineering + modeling) in one notebook
- Run EDA and feature engineering in one notebook, fetch the result (the EDA'ed and feature-engineered dataset),
  and model the data in a Flyte task by sending the dataset as an argument
- Run EDA and feature engineering in one notebook, fetch the result (the EDA'ed and feature-engineered dataset),
  and model the data in another notebook by sending the dataset as an argument

Notebook Etiquette
^^^^^^^^^^^^^^^^^^

- If you want to send inputs and receive outputs, your Jupyter notebook must have cells tagged ``parameters`` and ``outputs``, respectively.
  To set up tags in a notebook, follow this `guide <https://jupyterbook.org/content/metadata.html#adding-tags-using-notebook-interfaces>`__.
- The ``parameters`` cell must contain only the input variables.
- The ``outputs`` cell looks like the following (a complete pairing is sketched after this list):

  .. code-block:: python

      from flytekitplugins.papermill import record_outputs

      record_outputs(variable_name=variable_name)

  You can record any number of variables this way.
- The ``inputs`` and ``outputs`` variable names in the ``NotebookTask`` must match the variable names in the notebook.
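
For instance, a minimal notebook following these conventions contains two tagged cells, roughly like this sketch
(the variable names and values here are illustrative placeholders):

.. code-block:: python

    # Cell tagged "parameters" -- Papermill overwrites these values at execution time.
    n_estimators = 100

.. code-block:: python

    # Cell tagged "outputs" -- records the values handed back to Flyte.
    from flytekitplugins.papermill import record_outputs

    mae_score = 1.23  # placeholder for the real metric computed in the notebook
    record_outputs(mae_score=mae_score)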

.. note::
   On running the Python code files, you will see three outputs even though only a single output is declared:
   the declared output itself, the executed notebook, and the rendered HTML of the notebook.
Empty file.
72 changes: 72 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/notebook.py
@@ -0,0 +1,72 @@
"""
Flyte Pipeline in One Jupyter Notebook
=======================================
In this example, we will implement a simple pipeline that takes hyperparameters, does EDA, feature engineering, and measures the Gradient
Boosting model's performace using mean absolute error (MAE), all in one notebook.
"""

# %%
# First, let's import the libraries we will use in this example.
import os
import pathlib

from flytekit import Resources, kwtypes, workflow
from flytekitplugins.papermill import NotebookTask

# %%
# We define a ``NotebookTask`` to run the Jupyter notebook.
#
# .. list-table:: ``NotebookTask`` Parameters
#    :widths: 25 25
#
#    * - ``notebook_path``
#      - Path to the Jupyter notebook file
#    * - ``inputs``
#      - Inputs to be sent to the notebook
#    * - ``outputs``
#      - Outputs to be returned from the notebook
#    * - ``requests``
#      - Compute resource requests for the task
#
# This notebook returns ``mae_score`` as the output.
nb = NotebookTask(
    name="pipeline-nb",
    notebook_path=os.path.join(
        pathlib.Path(__file__).parent.absolute(), "supermarket_regression.ipynb"
    ),
    inputs=kwtypes(
        n_estimators=int,
        max_depth=int,
        max_features=str,
        min_samples_split=int,
        random_state=int,
    ),
    outputs=kwtypes(mae_score=float),
    requests=Resources(mem="500Mi"),
)
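
# %%
# For reference, the tagged cells inside ``supermarket_regression.ipynb`` look roughly
# like the following sketch (the values and the MAE computation are illustrative; the
# variable names must match the ``inputs`` and ``outputs`` declared above):
#
# .. code-block:: python
#
#    # Cell tagged "parameters" -- Papermill overwrites these values at execution time.
#    n_estimators = 150
#    max_depth = 3
#    max_features = "sqrt"
#    min_samples_split = 4
#    random_state = 2
#
# .. code-block:: python
#
#    # Cell tagged "outputs" -- records the value handed back to Flyte.
#    from flytekitplugins.papermill import record_outputs
#
#    record_outputs(mae_score=mae_score)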

# %%
# Since the notebook itself serves as the task, no separate task needs to be defined;
# we create a ``workflow`` that runs it and returns the MAE score.
@workflow
def notebook_wf(
    n_estimators: int = 150,
    max_depth: int = 3,
    max_features: str = "sqrt",
    min_samples_split: int = 4,
    random_state: int = 2,
) -> float:
    output = nb(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        min_samples_split=min_samples_split,
        random_state=random_state,
    )
    return output.mae_score


# %%
# We can now run the notebook locally.
if __name__ == "__main__":
    print(notebook_wf())
106 changes: 106 additions & 0 deletions cookbook/case_studies/feature_engineering/eda/notebook_and_task.py
@@ -0,0 +1,106 @@
"""
EDA and Feature Engineering in Jupyter Notebook and Modeling in a Flyte Task
============================================================================
In this example, we will implement a simple pipeline that takes hyperparameters, does EDA, feature engineering
(step 1: EDA and feature engineering in notebook), and measures the Gradient Boosting model's performace using mean absolute error (MAE)
(step 2: Modeling in a Flyte Task).
"""

# %%
# First, let's import the libraries we will use in this example.
import os
import pathlib
from dataclasses import dataclass

import numpy as np
import pandas as pd
from dataclasses_json import dataclass_json
from flytekit import Resources, kwtypes, task, workflow
from flytekitplugins.papermill import NotebookTask
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import RobustScaler


# %%
# We define a ``dataclass`` to store the hyperparameters of the Gradient Boosting Regressor.
@dataclass_json
@dataclass
class Hyperparameters(object):
    n_estimators: int = 150
    max_depth: int = 3
    max_features: str = "sqrt"
    min_samples_split: int = 4
    random_state: int = 2
    nfolds: int = 10
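
# %%
# Thanks to ``dataclass_json``, an instance round-trips through JSON, which is how Flyte
# serializes it between tasks -- a quick sketch:
#
# .. code-block:: python
#
#    params = Hyperparameters(n_estimators=200)
#    assert Hyperparameters.from_json(params.to_json()) == params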


# %%
# We define a ``NotebookTask`` to run the Jupyter notebook.
# This notebook returns ``dummified_data`` and ``dataset`` as the outputs.
#
# .. note::
# ``dummified_data`` is used in this example, and ``dataset`` is used in the upcoming example.
nb = NotebookTask(
    name="eda-feature-eng-nb",
    notebook_path=os.path.join(
        pathlib.Path(__file__).parent.absolute(), "supermarket_regression_1.ipynb"
    ),
    outputs=kwtypes(dummified_data=pd.DataFrame, dataset=str),
    requests=Resources(mem="500Mi"),
)
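
# %%
# Inside ``supermarket_regression_1.ipynb``, the cell tagged ``outputs`` records both
# declared outputs -- roughly like this sketch (the feature-engineering code that
# produces ``dummified_data`` and ``dataset`` is elided):
#
# .. code-block:: python
#
#    from flytekitplugins.papermill import record_outputs
#
#    record_outputs(dummified_data=dummified_data, dataset=dataset)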

# %%
# Next, we define a ``cross_validate`` function and a ``modeling`` task to compute
# the MAE score of the Gradient Boosting Regressor on the data.
def cross_validate(model, nfolds, feats, targets):
    # scikit-learn's "neg_mean_absolute_error" scoring returns negated MAE values,
    # so negate the scores again to obtain positive error values.
    score = -1 * (
        cross_val_score(
            model, feats, targets, cv=nfolds, scoring="neg_mean_absolute_error"
        )
    )
    return np.mean(score)


@task
def modeling(
    dataset: pd.DataFrame,
    hyperparams: Hyperparameters,
) -> float:
    y_target = dataset["Product_Supermarket_Sales"].tolist()
    dataset.drop(["Product_Supermarket_Sales"], axis=1, inplace=True)

    X_train, X_test, y_train, _ = train_test_split(dataset, y_target, test_size=0.3)

    scaler = RobustScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    gb_model = GradientBoostingRegressor(
        n_estimators=hyperparams.n_estimators,
        max_depth=hyperparams.max_depth,
        max_features=hyperparams.max_features,
        min_samples_split=hyperparams.min_samples_split,
        random_state=hyperparams.random_state,
    )

    return cross_validate(gb_model, hyperparams.nfolds, X_train, y_train)


# %%
# We define a ``workflow`` to run the notebook and the ``modeling`` task.
@workflow
def notebook_wf(hyperparams: Hyperparameters = Hyperparameters()) -> float:
    output = nb()
    mae_score = modeling(dataset=output.dummified_data, hyperparams=hyperparams)
    return mae_score


# %%
# We can now run the notebook and the modeling task locally.
if __name__ == "__main__":
    print(notebook_wf())
