* Papermill tutorial
* github action, tests
* docs-related change
* dataclass

Signed-off-by: Samhita Alla <[email protected]>
1 parent 54656b8, commit 86e3e08
Showing 17 changed files with 4,398 additions and 38 deletions.
@@ -0,0 +1,45 @@
FROM ubuntu:focal

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root

RUN : \
    && apt-get update \
    && apt-get install -y software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa

RUN : \
    && apt-get update \
    && apt-get install -y python3.8 python3-pip python3-venv make build-essential libssl-dev curl vim

# This is necessary for opencv to work
RUN apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev ffmpeg

# Install the AWS CLI separately to prevent issues with boto being written over
RUN pip3 install awscli

# Virtual environment
RUN python3.8 -m venv ${VENV}
RUN ${VENV}/bin/pip install wheel

# Install Python dependencies
COPY eda/requirements.txt /root
RUN ${VENV}/bin/pip install -r /root/requirements.txt

# Copy the actual code
COPY eda/ /root/eda/

# Copy over the helper script that the SDK relies on
RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/
RUN chmod a+x /usr/local/bin/flytekit_venv

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag

# Enable the virtualenv for this image. Note this relies on the VENV variable we've set in this image.
ENTRYPOINT ["/usr/local/bin/flytekit_venv"]
@@ -0,0 +1,3 @@
PREFIX=eda
include ../../../common/Makefile
include ../../../common/leaf.mk
@@ -0,0 +1,58 @@
EDA, Feature Engineering, and Modeling With Papermill
=====================================================

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data to discover patterns,
spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

EDA cannot be implemented solely within Flyte because it requires visual analysis of the data.
In such scenarios, a Jupyter notebook is a natural fit, since it helps us visualize and feature engineer the data.

**Now the question is, how do we leverage the power of Jupyter Notebook within Flyte to perform EDA on the data?**

Papermill
---------

`Papermill <https://papermill.readthedocs.io/en/latest/>`__ is a tool for parameterizing and executing Jupyter Notebooks.
Papermill lets you:

- parameterize notebooks
- execute notebooks
We have a pre-packaged version of Papermill with Flyte that lets you leverage the power of Jupyter Notebook within Flyte pipelines.

To install the plugin, run the following command:

.. prompt:: bash $

   pip install flytekitplugins-papermill
Examples
--------

There are three code examples that you can refer to in this tutorial:

- Run the whole pipeline (EDA + feature engineering + modeling) in one notebook
- Run EDA and feature engineering in one notebook, fetch the result (the EDA'ed and feature-engineered dataset),
  and model the data in a Flyte task that receives the dataset as an argument
- Run EDA and feature engineering in one notebook, fetch the result (the EDA'ed and feature-engineered dataset),
  and model the data in another notebook that receives the dataset as an argument
Notebook Etiquette
^^^^^^^^^^^^^^^^^^

- If you want to send inputs and receive outputs, your Jupyter notebook has to have ``parameters`` and ``outputs`` tags, respectively.
  To set up tags in a notebook, follow this `guide <https://jupyterbook.org/content/metadata.html#adding-tags-using-notebook-interfaces>`__.
- The ``parameters`` cell must only contain the input variables (see the sketch after this list).
- The ``outputs`` cell looks like the following:

  .. code-block:: python

     from flytekitplugins.papermill import record_outputs

     record_outputs(variable_name=variable_name)

  Of course, you can have any number of variables!
- The ``inputs`` and ``outputs`` variable names in the ``NotebookTask`` must match the variable names in the notebook.
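For instance, a ``parameters`` cell matching the ``NotebookTask`` inputs used later in this tutorial would contain only the input variables; the values below are placeholder defaults that Papermill overrides at execution time:

.. code-block:: python

   # Cell tagged ``parameters``: input variables only, nothing else.
   n_estimators = 150
   max_depth = 3
   max_features = "sqrt"
   min_samples_split = 4
   random_state = 2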
.. note::
   On running the Python code files, you will see three outputs even though a single value is returned:
   the declared output itself, the executed notebook, and the rendered HTML of the notebook.
@@ -0,0 +1,72 @@
"""
Flyte Pipeline in One Jupyter Notebook
======================================
In this example, we will implement a simple pipeline that takes hyperparameters, performs EDA and feature engineering, and measures
the Gradient Boosting model's performance using mean absolute error (MAE), all in one notebook.
"""

# %%
# First, let's import the libraries we will use in this example.
import os
import pathlib

from flytekit import Resources, kwtypes, workflow
from flytekitplugins.papermill import NotebookTask

# %%
# We define a ``NotebookTask`` to run the Jupyter notebook.
#
# .. list-table:: ``NotebookTask`` Parameters
#    :widths: 25 25
#
#    * - ``notebook_path``
#      - Path to the Jupyter notebook file
#    * - ``inputs``
#      - Inputs to be sent to the notebook
#    * - ``outputs``
#      - Outputs to be returned from the notebook
#    * - ``requests``
#      - Compute resource requests for the task
#
# This notebook returns ``mae_score`` as the output.
nb = NotebookTask(
    name="pipeline-nb",
    notebook_path=os.path.join(
        pathlib.Path(__file__).parent.absolute(), "supermarket_regression.ipynb"
    ),
    inputs=kwtypes(
        n_estimators=int,
        max_depth=int,
        max_features=str,
        min_samples_split=int,
        random_state=int,
    ),
    outputs=kwtypes(mae_score=float),
    requests=Resources(mem="500Mi"),
)

# %%
# Since the notebook is the only step, no separate task needs to be defined; we create a ``workflow``
# that runs the ``NotebookTask`` and returns the MAE score.
@workflow
def notebook_wf(
    n_estimators: int = 150,
    max_depth: int = 3,
    max_features: str = "sqrt",
    min_samples_split: int = 4,
    random_state: int = 2,
) -> float:
    output = nb(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        min_samples_split=min_samples_split,
        random_state=random_state,
    )
    return output.mae_score

# %%
# We can now run the notebook locally.
if __name__ == "__main__":
    print(notebook_wf())
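Because ``notebook_wf`` is an ordinary Python callable, the defaults can also be overridden when running it locally; a small, illustrative sketch:

.. code-block:: python

   # Hyperparameter values here are illustrative, not tuned.
   print(notebook_wf(n_estimators=200, max_depth=5))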
cookbook/case_studies/feature_engineering/eda/notebook_and_task.py (106 additions, 0 deletions)
@@ -0,0 +1,106 @@
"""
EDA and Feature Engineering in Jupyter Notebook and Modeling in a Flyte Task
============================================================================
In this example, we will implement a simple pipeline that takes hyperparameters, performs EDA and feature engineering in a notebook
(step 1), and measures the Gradient Boosting model's performance using mean absolute error (MAE) in a Flyte task (step 2).
"""

# %%
# First, let's import the libraries we will use in this example.
import os
import pathlib
from dataclasses import dataclass

import numpy as np
import pandas as pd
from dataclasses_json import dataclass_json
from flytekit import Resources, kwtypes, task, workflow
from flytekitplugins.papermill import NotebookTask
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import RobustScaler


# %%
# We define a ``dataclass`` to store the hyperparameters of the Gradient Boosting Regressor.
@dataclass_json
@dataclass
class Hyperparameters(object):
    n_estimators: int = 150
    max_depth: int = 3
    max_features: str = "sqrt"
    min_samples_split: int = 4
    random_state: int = 2
    nfolds: int = 10


# %%
# We define a ``NotebookTask`` to run the Jupyter notebook.
# This notebook returns ``dummified_data`` and ``dataset`` as the outputs.
#
# .. note::
#    ``dummified_data`` is used in this example, and ``dataset`` is used in the upcoming example.
nb = NotebookTask(
    name="eda-feature-eng-nb",
    notebook_path=os.path.join(
        pathlib.Path(__file__).parent.absolute(), "supermarket_regression_1.ipynb"
    ),
    outputs=kwtypes(dummified_data=pd.DataFrame, dataset=str),
    requests=Resources(mem="500Mi"),
)


# %%
# Next, we define a ``cross_validate`` function and a ``modeling`` task to compute the MAE score of the data against
# the Gradient Boosting Regressor.
def cross_validate(model, nfolds, feats, targets):
    # ``neg_mean_absolute_error`` returns negated MAE values; flip the sign back and average across folds.
    score = -1 * (
        cross_val_score(
            model, feats, targets, cv=nfolds, scoring="neg_mean_absolute_error"
        )
    )
    return np.mean(score)


@task
def modeling(
    dataset: pd.DataFrame,
    hyperparams: Hyperparameters,
) -> float:
    # Separate the target column from the features.
    y_target = dataset["Product_Supermarket_Sales"].tolist()
    dataset.drop(["Product_Supermarket_Sales"], axis=1, inplace=True)

    X_train, X_test, y_train, _ = train_test_split(dataset, y_target, test_size=0.3)

    # Scale the features; RobustScaler is less sensitive to outliers.
    scaler = RobustScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    gb_model = GradientBoostingRegressor(
        n_estimators=hyperparams.n_estimators,
        max_depth=hyperparams.max_depth,
        max_features=hyperparams.max_features,
        min_samples_split=hyperparams.min_samples_split,
        random_state=hyperparams.random_state,
    )

    return cross_validate(gb_model, hyperparams.nfolds, X_train, y_train)


# %%
# We define a ``workflow`` to run the notebook and the ``modeling`` task.
@workflow
def notebook_wf(hyperparams: Hyperparameters = Hyperparameters()) -> float:
    output = nb()
    mae_score = modeling(dataset=output.dummified_data, hyperparams=hyperparams)
    return mae_score


# %%
# We can now run the notebook and the modeling task locally.
if __name__ == "__main__":
    print(notebook_wf())
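As before, the workflow can be run locally with a customized configuration; a small, illustrative sketch:

.. code-block:: python

   # Illustrative override of the default hyperparameters.
   print(notebook_wf(hyperparams=Hyperparameters(n_estimators=200, nfolds=5)))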