Feast Integration #322

Closed
wants to merge 7 commits
4 changes: 2 additions & 2 deletions .github/workflows/ghcr_push.yml
@@ -27,8 +27,6 @@ jobs:
path: integrations/kubernetes
- name: kfpytorch
path: integrations/kubernetes
-- name: sqlite_datacleaning
-  path: case_studies/feature_engineering
- name: sagemaker_training
path: integrations/aws
- name: sagemaker_pytorch
@@ -41,6 +39,8 @@
path: integrations/flytekit_plugins
- name: house_price_prediction
path: case_studies/ml_training
+- name: feast_integration
+  path: case_studies/feature_engineering
steps:
- uses: actions/checkout@v2
with:
@@ -26,22 +26,20 @@ RUN python3.8 -m venv ${VENV}
RUN ${VENV}/bin/pip install wheel

# Install Python dependencies
-COPY sqlite_datacleaning/requirements.txt /root
+COPY feast_integration/requirements.txt /root
RUN ${VENV}/bin/pip install -r /root/requirements.txt

# Copy the makefile targets to expose on the container. This makes it easier to register.
COPY in_container.mk /root/Makefile
-COPY sqlite_datacleaning/sandbox.config /root
+COPY feast_integration/sandbox.config /root

# Copy the actual code
-COPY sqlite_datacleaning/ /root/sqlite_datacleaning/
+COPY feast_integration/ /root/feast_integration/

# Copy over the helper script that the SDK relies on
RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/
RUN chmod a+x /usr/local/bin/flytekit_venv

RUN pip install -U https://github.com/flyteorg/flytekit/archive/62391eaff894188bb723f382af3de29a977233ce.zip#egg=flytekit

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
@@ -1,3 +1,3 @@
-PREFIX=sqlite_datacleaning
+PREFIX=feast_integration
include ../../../common/Makefile
include ../../../common/leaf.mk
@@ -1,77 +1,70 @@
-Data Cleaning
--------------
-Feature Engineering off-late has become one of the most prominent topics in Machine Learning.
-It is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

-This tutorial will implement data cleaning of SQLite3 data, which does both data imputation and univariate feature selection. These are so-called feature engineering techniques.

-Why SQLite3?
-============
-SQLite3 is written such that the task doesn't depend on the user's image. It basically:
+Feast Integration
++-----------------

-- Shifts the burden of writing the Dockerfile from the user using the task in workflows, to the author of the task type
-- Allows the author to optimize the image that the task runs
-- Works locally and remotely

-.. note::
-
-   SQLite3 container is special; the definition of the Python classes themselves is bundled in Flytekit, hence we just use the Flytekit image.

-.. tip::
+**Feature Engineering** has of late become one of the most prominent topics in Machine Learning.
+It is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

-   SQLite3 is being used to showcase the example of using a ``TaskTemplate``. This is the same for SQLAlchemy. As for Athena, BigQuery, Hive plugins, a container is not required. The queries are registered with FlyteAdmin and sent directly to the respective engines.
+**Feast (Feature Store) is an operational data system for managing and serving machine learning features to models in production.**

+Where does Flyte fit in?
+========================
+Flyte provides a way to train models and perform feature engineering as a single pipeline.
+However, it does not provide a way to serve these features in production once the model matures and is ready to be deployed.

-.. admonition:: What's so special about this example?
+This is where the integration between Flyte and Feast can help users take their models and features from prototyping all the way to production cost-effectively and efficiently. 🚀

-   The pipeline doesn't build a container as such; it re-uses the pre-built task containers to construct the workflow!
+In this tutorial, we'll walk through how Feast can be used to store and retrieve features to train and test the model curated using the Flyte pipeline.
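The core contract the new tutorial relies on from Feast — store feature values keyed by an entity and a timestamp, then retrieve them point-in-time correctly for training — can be sketched in plain Python. Feast's real API is far richer (feature views, offline/online stores), so `MiniFeatureStore` and its method names below are purely hypothetical illustrations of the idea, not Feast's interface.

```python
from datetime import datetime, timedelta


class MiniFeatureStore:
    """A toy, in-memory stand-in for a feature store such as Feast:
    features are keyed by entity id and timestamp, and retrieval is
    point-in-time correct (only values recorded at or before the
    requested time are visible)."""

    def __init__(self):
        # {feature_name: [(entity_id, timestamp, value), ...]}
        self._rows = {}

    def ingest(self, feature_name, entity_id, timestamp, value):
        self._rows.setdefault(feature_name, []).append((entity_id, timestamp, value))

    def get_historical_feature(self, feature_name, entity_id, as_of):
        # Return the latest value recorded at or before `as_of`.
        candidates = [
            (ts, v)
            for (eid, ts, v) in self._rows.get(feature_name, [])
            if eid == entity_id and ts <= as_of
        ]
        return max(candidates)[1] if candidates else None


store = MiniFeatureStore()
t0 = datetime(2021, 1, 1)
store.ingest("pulse", "horse-1", t0, 38)
store.ingest("pulse", "horse-1", t0 + timedelta(days=1), 42)

# Training-time retrieval is point-in-time correct: asking for the value
# as of t0 must not leak the later reading.
print(store.get_historical_feature("pulse", "horse-1", t0))  # -> 38
```

The point-in-time filter is what prevents training/serving skew: a model trained "as of" a timestamp never sees feature values that arrived later.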

Dataset
=======
-We'll be using the horse colic dataset wherein we'll determine if the lesion of the horse was surgical or not. This is a modified version of the original dataset.
+We'll be using the horse colic dataset to determine whether the lesion of the horse is surgical or not. This is a modified version of the original dataset.

The dataset will have the following columns:

.. list-table:: Horse Colic Features
-   :widths: 25 25 25
+   :widths: 25 25 25 25 25

   * - surgery
     - Age
     - Hospital Number
-   * - rectal temperature
+     - rectal temperature
     - pulse
-     - respiratory rate
-   * - temperature of extremities
+   * - respiratory rate
+     - temperature of extremities
     - peripheral pulse
     - mucous membranes
-   * - capillary refill time
-     - pain
+     - capillary refill time
+   * - pain
     - peristalsis
-   * - abdominal distension
+     - abdominal distension
     - nasogastric tube
     - nasogastric reflux
   * - nasogastric reflux PH
     - rectal examination
     - abdomen
-   * - packed cell volume
+     - packed cell volume
     - total protein
-     - abdominocentesis appearance
-   * - abdomcentesis total protein
+   * - abdominocentesis appearance
+     - abdomcentesis total protein
     - outcome
     - surgical lesion
+     - timestamp

The horse colic dataset will be a compressed zip file consisting of the SQLite DB.

Steps to Build the Pipeline
===========================
- Define two feature engineering tasks -- "data imputation" and "univariate feature selection"
- Reference the tasks in the actual file
- Define an SQLite3 Task and generate FlyteSchema
- Pass the inputs through an imperative workflow to validate the dataset
- Return the resultant DataFrame
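The two feature engineering techniques named in the steps above — data imputation and univariate feature selection — can be sketched without any dependencies. Real Flyte tasks would wrap logic like this (typically via pandas/scikit-learn); the function names here are illustrative, not the tutorial's actual task names.

```python
def impute_mean(column):
    """Data imputation: replace missing values (None) with the mean of
    the observed values in the column."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]


def select_top_k_by_variance(columns, k):
    """Univariate feature selection: score each column independently of
    the others (here, by variance) and keep the k highest-scoring ones."""
    def variance(col):
        m = sum(col) / len(col)
        return sum((x - m) ** 2 for x in col) / len(col)

    ranked = sorted(columns, key=lambda name: variance(columns[name]), reverse=True)
    return ranked[:k]


pulse = impute_mean([40, None, 44])  # -> [40, 42.0, 44]
features = {"pulse": [40, 42, 44], "age": [2, 2, 2]}
print(select_top_k_by_variance(features, 1))  # -> ['pulse']
```

"Univariate" is the key property: each feature is scored on its own, which keeps the selection step cheap and embarrassingly parallel across columns.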
+Why SQLite3?
+^^^^^^^^^^^^
+SQLite3 is written such that the task doesn't depend on the user's image. It basically:
+
+- Shifts the burden of writing the Dockerfile from the user using the task in workflows to the author of the task type
+- Allows the author to optimize the image that the task runs on
+- Works locally and remotely
+
+.. note::
+
+   The SQLite3 container is special; the definition of the Python classes themselves is bundled in Flytekit, hence we just use the Flytekit image.
+
+.. tip::
+
+   SQLite3 is being used to showcase an example of using a ``TaskTemplate``. The same holds for SQLAlchemy. As for the Athena, BigQuery, and Hive plugins, a container is not required. The queries are registered with FlyteAdmin and sent directly to the respective engines.
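Conceptually, a templated SQLite3 task fixes a query once and runs it against the database with per-execution inputs. Python's standard-library ``sqlite3`` module is enough to sketch that underlying operation; the table, query, and data below are made up for illustration and are not the tutorial's actual schema.

```python
import sqlite3

# Build a throwaway in-memory database standing in for the horse colic DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (surgery INTEGER, pulse REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [(1, 40.0), (2, 88.0)])

# A "task template" fixes the query up front; only the bound inputs vary
# per execution, which is what lets the task ship a prebuilt container.
QUERY = "SELECT * FROM data WHERE pulse > ?"
rows = conn.execute(QUERY, (50.0,)).fetchall()
print(rows)  # -> [(2, 88.0)]
```

In the real integration, Flytekit would return the query result as a FlyteSchema/DataFrame rather than raw tuples, but the query-templating idea is the same.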

Takeaways
=========
@@ -80,11 +73,11 @@ The example we're trying to demonstrate is a simple feature engineering job that
#. Source data is from SQL-like data sources
#. Pre-created feature transforms
#. Ability to create a low-code platform
-#. Feast integration
+#. Serve features to production using Feast
#. TaskTemplate within an imperative workflow

.. tip::

   If you're a data scientist, you needn't worry about the infrastructure overhead. Flyte provides an easy-to-use interface which looks just like a typical library.

Code Walkthrough
================