Feast Integration #322

Closed
wants to merge 7 commits
4 changes: 2 additions & 2 deletions .github/workflows/ghcr_push.yml
@@ -27,8 +27,6 @@ jobs:
path: integrations/kubernetes
- name: kfpytorch
path: integrations/kubernetes
-- name: sqlite_datacleaning
-  path: case_studies/feature_engineering
- name: sagemaker_training
path: integrations/aws
- name: sagemaker_pytorch
@@ -41,6 +39,8 @@
path: integrations/flytekit_plugins
- name: house_price_prediction
path: case_studies/ml_training
+- name: feast_integration
+  path: case_studies/feature_engineering
steps:
- uses: actions/checkout@v2
with:
@@ -26,22 +26,20 @@ RUN python3.8 -m venv ${VENV}
RUN ${VENV}/bin/pip install wheel

# Install Python dependencies
-COPY sqlite_datacleaning/requirements.txt /root
+COPY feast_integration/requirements.txt /root
RUN ${VENV}/bin/pip install -r /root/requirements.txt

# Copy the makefile targets to expose on the container. This makes it easier to register.
COPY in_container.mk /root/Makefile
-COPY sqlite_datacleaning/sandbox.config /root
+COPY feast_integration/sandbox.config /root

# Copy the actual code
-COPY sqlite_datacleaning/ /root/sqlite_datacleaning/
+COPY feast_integration/ /root/feast_integration/

# Copy over the helper script that the SDK relies on
RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/
RUN chmod a+x /usr/local/bin/flytekit_venv

RUN pip install -U https://github.com/flyteorg/flytekit/archive/62391eaff894188bb723f382af3de29a977233ce.zip#egg=flytekit

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
@@ -1,3 +1,3 @@
-PREFIX=sqlite_datacleaning
+PREFIX=feast_integration
include ../../../common/Makefile
include ../../../common/leaf.mk
@@ -1,77 +1,70 @@
-Data Cleaning
--------------
-Feature Engineering off-late has become one of the most prominent topics in Machine Learning.
-It is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

-This tutorial will implement data cleaning of SQLite3 data, which does both data imputation and univariate feature selection. These are so-called feature engineering techniques.

-Why SQLite3?
-============
-SQLite3 is written such that the task doesn't depend on the user's image. It basically:
+Feast Integration
++-----------------

-- Shifts the burden of writing the Dockerfile from the user using the task in workflows, to the author of the task type
-- Allows the author to optimize the image that the task runs
-- Works locally and remotely

-.. note::
-
-   SQLite3 container is special; the definition of the Python classes themselves is bundled in Flytekit, hence we just use the Flytekit image.

-.. tip::
+**Feature Engineering** has of late become one of the most prominent topics in Machine Learning.
+It is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

-   SQLite3 is being used to showcase the example of using a ``TaskTemplate``. This is the same for SQLAlchemy. As for Athena, BigQuery, Hive plugins, a container is not required. The queries are registered with FlyteAdmin and sent directly to the respective engines.
+**Feast (Feature Store) is an operational data system for managing and serving machine learning features to models in production.**

+Where does Flyte fit in?
+========================
+Flyte provides a way to train models and perform feature engineering as a single pipeline.
+However, it does not provide a way to serve these features in production once the model matures and is ready to be deployed.

-.. admonition:: What's so special about this example?
+This is where the integration between Flyte and Feast can help users take their models and features from prototyping all the way to production cost-effectively and efficiently. 🚀

-   The pipeline doesn't build a container as such; it re-uses the pre-built task containers to construct the workflow!
+In this tutorial, we'll walk through how Feast can be used to store and retrieve features to train and test the model curated using the Flyte pipeline.
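The core contract the new tutorial relies on from Feast — store feature values keyed by an entity and a timestamp, then retrieve them point-in-time correctly for training — can be sketched in plain Python. Feast's real API is far richer (feature views, offline/online stores), so `MiniFeatureStore` and its method names below are purely hypothetical illustrations of the idea, not Feast's interface.

```python
from datetime import datetime, timedelta


class MiniFeatureStore:
    """A toy, in-memory stand-in for a feature store such as Feast:
    features are keyed by entity id and timestamp, and retrieval is
    point-in-time correct (only values recorded at or before the
    requested time are visible)."""

    def __init__(self):
        # {feature_name: [(entity_id, timestamp, value), ...]}
        self._rows = {}

    def ingest(self, feature_name, entity_id, timestamp, value):
        self._rows.setdefault(feature_name, []).append((entity_id, timestamp, value))

    def get_historical_feature(self, feature_name, entity_id, as_of):
        # Return the latest value recorded at or before `as_of`.
        candidates = [
            (ts, v)
            for (eid, ts, v) in self._rows.get(feature_name, [])
            if eid == entity_id and ts <= as_of
        ]
        return max(candidates)[1] if candidates else None


store = MiniFeatureStore()
t0 = datetime(2021, 1, 1)
store.ingest("pulse", "horse-1", t0, 38)
store.ingest("pulse", "horse-1", t0 + timedelta(days=1), 42)

# Training-time retrieval is point-in-time correct: asking for the value
# as of t0 must not leak the later reading.
print(store.get_historical_feature("pulse", "horse-1", t0))  # -> 38
```

The point-in-time filter is what prevents training/serving skew: a model trained "as of" a timestamp never sees feature values that arrived later.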

Dataset
=======
-We'll be using the horse colic dataset wherein we'll determine if the lesion of the horse was surgical or not. This is a modified version of the original dataset.
+We'll be using the horse colic dataset to determine whether the lesion of the horse is surgical or not. This is a modified version of the original dataset.

The dataset will have the following columns:

.. list-table:: Horse Colic Features
-   :widths: 25 25 25
+   :widths: 25 25 25 25 25

   * - surgery
     - Age
     - Hospital Number
-   * - rectal temperature
+     - rectal temperature
     - pulse
-     - respiratory rate
-   * - temperature of extremities
+   * - respiratory rate
+     - temperature of extremities
     - peripheral pulse
     - mucous membranes
-   * - capillary refill time
-     - pain
+     - capillary refill time
+   * - pain
     - peristalsis
-   * - abdominal distension
+     - abdominal distension
     - nasogastric tube
     - nasogastric reflux
   * - nasogastric reflux PH
     - rectal examination
     - abdomen
-   * - packed cell volume
+     - packed cell volume
     - total protein
-     - abdominocentesis appearance
-   * - abdomcentesis total protein
+   * - abdominocentesis appearance
+     - abdomcentesis total protein
     - outcome
     - surgical lesion
+     - timestamp

The horse colic dataset will be a compressed zip file consisting of the SQLite DB.

Steps to Build the Pipeline
===========================
- Define two feature engineering tasks -- "data imputation" and "univariate feature selection"
- Reference the tasks in the actual file
- Define an SQLite3 Task and generate FlyteSchema
- Pass the inputs through an imperative workflow to validate the dataset
- Return the resultant DataFrame
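The two feature engineering techniques named in the steps above — data imputation and univariate feature selection — can be sketched without any dependencies. Real Flyte tasks would wrap logic like this (typically via pandas/scikit-learn); the function names here are illustrative, not the tutorial's actual task names.

```python
def impute_mean(column):
    """Data imputation: replace missing values (None) with the mean of
    the observed values in the column."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]


def select_top_k_by_variance(columns, k):
    """Univariate feature selection: score each column independently of
    the others (here, by variance) and keep the k highest-scoring ones."""
    def variance(col):
        m = sum(col) / len(col)
        return sum((x - m) ** 2 for x in col) / len(col)

    ranked = sorted(columns, key=lambda name: variance(columns[name]), reverse=True)
    return ranked[:k]


pulse = impute_mean([40, None, 44])  # -> [40, 42.0, 44]
features = {"pulse": [40, 42, 44], "age": [2, 2, 2]}
print(select_top_k_by_variance(features, 1))  # -> ['pulse']
```

"Univariate" is the key property: each feature is scored on its own, which keeps the selection step cheap and embarrassingly parallel across columns.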
+Why SQLite3?
+^^^^^^^^^^^^
+SQLite3 is written such that the task doesn't depend on the user's image. It basically:
+
+- Shifts the burden of writing the Dockerfile from the user using the task in workflows to the author of the task type
+- Allows the author to optimize the image that the task runs on
+- Works locally and remotely
+
+.. note::
+
+   The SQLite3 container is special; the definition of the Python classes themselves is bundled in Flytekit, hence we just use the Flytekit image.
+
+.. tip::
+
+   SQLite3 is being used to showcase an example of using a ``TaskTemplate``. The same holds for SQLAlchemy. As for the Athena, BigQuery, and Hive plugins, a container is not required. The queries are registered with FlyteAdmin and sent directly to the respective engines.
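Conceptually, a templated SQLite3 task fixes a query once and runs it against the database with per-execution inputs. Python's standard-library ``sqlite3`` module is enough to sketch that underlying operation; the table, query, and data below are made up for illustration and are not the tutorial's actual schema.

```python
import sqlite3

# Build a throwaway in-memory database standing in for the horse colic DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (surgery INTEGER, pulse REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [(1, 40.0), (2, 88.0)])

# A "task template" fixes the query up front; only the bound inputs vary
# per execution, which is what lets the task ship a prebuilt container.
QUERY = "SELECT * FROM data WHERE pulse > ?"
rows = conn.execute(QUERY, (50.0,)).fetchall()
print(rows)  # -> [(2, 88.0)]
```

In the real integration, Flytekit would return the query result as a FlyteSchema/DataFrame rather than raw tuples, but the query-templating idea is the same.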

Takeaways
=========
@@ -80,11 +73,11 @@ The example we're trying to demonstrate is a simple feature engineering job that
#. Source data is from SQL-like data sources
#. Pre-created feature transforms
#. Ability to create a low-code platform
-#. Feast integration
+#. Serve features to production using Feast
#. TaskTemplate within an imperative workflow

.. tip::

   If you're a data scientist, you needn't worry about the infrastructure overhead. Flyte provides an easy-to-use interface which looks just like a typical library.

Code Walkthrough
================