add time series tutorial

Signed-off-by: Niels Bantilan <[email protected]>
flyteorg · Oct 2, 2024 · b5005ac
1 parent a1dde19
commit b5005ac

Showing 7 changed files with 194 additions and 0 deletions.
diff --git a/docs/ml_training.md b/docs/ml_training.md
@@ -16,6 +16,8 @@ Understand how machine learning models can be trained from within Flyte, with an
   - Word embedding and topic modelling on lee background corpus with Gensim
 * - {doc}`Forecast Sales Using Rossmann Store Sales <auto_examples/forecasting_sales/index>`
   - Forecast sales data with data-parallel distributed training using Horovod on Spark.
+* - {doc}`Time Series Modeling <auto_examples/time_series_modeling/index>`
+  - Train models for making forecasts on time series data.
 ```
 
 ```{toctree}
@@ -28,4 +30,5 @@ auto_examples/house_price_prediction/index
 auto_examples/mnist_classifier/index
 auto_examples/nlp_processing/index
 auto_examples/forecasting_sales/index
+auto_examples/time_series_modeling/index
 ```
diff --git a/docs/tutorials.md b/docs/tutorials.md
@@ -38,6 +38,8 @@ Train machine learning models from using your framework of choice.
   - Word embedding and topic modelling on lee background corpus with Gensim
 * - {doc}`Sales Forecasting <auto_examples/forecasting_sales/index>`
   - Use the Rossmann Store data to forecast sales with distributed training using Horovod on Spark.
+* - {doc}`Time Series Modeling <auto_examples/time_series_modeling/index>`
+  - Train models for making forecasts on time series data.
 ```
 
 ## 🛠 Feature Engineering

diff --git a/examples/time_series_modeling/Dockerfile b/examples/time_series_modeling/Dockerfile
@@ -0,0 +1,31 @@
+FROM python:3.8-slim-buster
+LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks
+
+WORKDIR /root
+ENV VENV /opt/venv
+ENV LANG C.UTF-8
+ENV LC_ALL C.UTF-8
+ENV PYTHONPATH /root
+
+# This is necessary for opencv to work
+RUN apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev ffmpeg build-essential curl
+
+WORKDIR /root
+
+ENV VENV /opt/venv
+# Virtual environment
+RUN python3 -m venv ${VENV}
+ENV PATH="${VENV}/bin:$PATH"
+
+# Install Python dependencies
+COPY requirements.in /root
+RUN pip install -r /root/requirements.in
+RUN pip freeze
+
+# Copy the actual code
+COPY . /root
+
+# This tag is supplied by the build script and will be used to determine the version
+# when registering tasks, workflows, and launch plans
+ARG tag
+ENV FLYTE_INTERNAL_IMAGE $tag
diff --git a/examples/time_series_modeling/README.md b/examples/time_series_modeling/README.md
@@ -0,0 +1,45 @@
+(time_series_modeling)=
+
+# Time Series Modeling
+
+```{eval-rst}
+.. tags:: Advanced, MachineLearning
+```
+
+Time series data is fundamentally different from Independent and Identically
+Distributed (IID) data, which many machine learning methods assume by default.
+Here are a few key differences:
+
+1. **Temporal Dependency**: In time series data, observations are ordered
+   chronologically and exhibit temporal dependencies. Each data point is related
+   to its past and future values. This sequential nature is crucial for
+   forecasting and trend analysis. In contrast, IID data assumes that each
+   observation is independent of others.
+2. **Non-stationarity**: Time series often display trends, seasonality, or cyclic
+   patterns that evolve over time. This non-stationarity means that statistical
+   properties like mean and variance can change, making analysis more complex. IID
+   data, by definition, maintains constant statistical properties.
+3. **Autocorrelation**: Time series data frequently shows autocorrelation, where
+   an observation is correlated with its own past values. This feature is essential
+   for many time series models but is not the case for IID data.
+4. **Importance of Order**: The sequence of observations in time series data is
+   critical and cannot be shuffled without losing information. In IID data, the
+   order of observations is assumed to be irrelevant.
+5. **Inference is Focused on Forecasting**: Time series analysis often aims to
+   predict future values based on historical patterns, whereas many machine
+   learning tasks with IID data focus on classification or regression without
+   a temporal component.
+6. **Specific Modeling Techniques**: Time series data often calls for specialized
+   modeling techniques like ARIMA, Prophet, or RNNs that can capture temporal
+   dynamics. These models are not typically used with IID data.
+
+Understanding these differences is crucial for selecting appropriate analysis
+methods and interpreting results in time series modeling tasks.
+
+Below are examples demonstrating how to use Flyte to train time series models.
+
+## Examples
+
+```{auto-examples-toc}
+neural_prophet
+```
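The autocorrelation and ordering points in the README above can be illustrated with a minimal, standard-library sketch. The synthetic series and the `lag_autocorrelation` helper are assumptions made for illustration only; they are not part of this commit or of any library. Shuffling keeps the same values but removes the temporal structure, which shows up as a lag-1 autocorrelation near zero.

```python
import math
import random


def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of `values` at the given lag (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum(
        (values[i] - mean) * (values[i - lag] - mean) for i in range(lag, n)
    )
    return covariance / variance


# Synthetic (made-up) series: a 24-step cycle plus a little Gaussian noise.
rng = random.Random(0)
series = [math.sin(2 * math.pi * t / 24) + rng.gauss(0, 0.1) for t in range(24 * 30)]

# Shuffling keeps exactly the same values but destroys their order.
shuffled = series[:]
rng.shuffle(shuffled)

ordered_acf = lag_autocorrelation(series, lag=1)
shuffled_acf = lag_autocorrelation(shuffled, lag=1)
print(f"lag-1 autocorrelation, original order: {ordered_acf:.3f}")
print(f"lag-1 autocorrelation, shuffled:       {shuffled_acf:.3f}")
```

The ordered series is strongly correlated with its own previous value, while the shuffled copy is not, even though both contain identical observations.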
diff --git a/examples/time_series_modeling/requirements.in b/examples/time_series_modeling/requirements.in
@@ -0,0 +1,4 @@
+flytekit>=1.7.0
+wheel
+matplotlib
+flytekitplugins-deck-standard
diff --git a/examples/time_series_modeling/time_series_modeling/__init__.py b/examples/time_series_modeling/time_series_modeling/__init__.py
diff --git a/examples/time_series_modeling/time_series_modeling/neural_prophet.py b/examples/time_series_modeling/time_series_modeling/neural_prophet.py
@@ -0,0 +1,109 @@
+# %% [markdown]
+# # Train a NeuralProphet Model
+#
+# This script demonstrates how to train a model for time series forecasting
+# using the [NeuralProphet](https://neuralprophet.com/) library.
+
+# %% [markdown]
+# ## Imports and Setup
+#
+# First, we import necessary libraries to run the training workflow.
+
+import pandas as pd
+from flytekit import current_context, task, workflow, Deck, ImageSpec
+from flytekit.types.file import FlyteFile
+
+# %% [markdown]
+# ## Define an ImageSpec
+#
+# For reproducibility, we create an `ImageSpec` object with required packages
+# for our tasks.
+
+image = ImageSpec(
+    name="neuralprophet",
+    packages=[
+        "neuralprophet",
+        "matplotlib",
+        "ipython",
+        "pandas",
+        "pyarrow",
+    ],
+    # This registry is for a local flyte demo cluster. Replace this with your
+    # own registry, e.g. `docker.io/<username>/<imagename>`
+    registry="localhost:30000"
+)
+
+# %% [markdown]
+# ## Data Loading Task
+#
+# This task loads the time series data from the specified URL. In this case,
+# we use a hard-coded URL for a sample dataset from the NeuralProphet data repository.
+
+URL = "https://github.com/ourownstory/neuralprophet-data/raw/main/kaggle-energy/datasets/tutorial01.csv"
+
+@task(container_image=image)
+def load_data() -> pd.DataFrame:
+    return pd.read_csv(URL)
+
+# %% [markdown]
+# ## Model Training Task
+#
+# This task trains the NeuralProphet model on the loaded data.
+# We train the model at an hourly frequency (`freq="H"`) for ten epochs.
+
+@task(container_image=image)
+def train_model(df: pd.DataFrame) -> FlyteFile:
+    from neuralprophet import NeuralProphet, save
+
+    working_dir = current_context().working_directory
+    model = NeuralProphet()
+    model.fit(df, freq="H", epochs=10)
+    model_fp = f"{working_dir}/model.np"
+    save(model, model_fp)
+    return FlyteFile(model_fp)
+
+# %% [markdown]
+# ## Forecasting Task
+#
+# This task loads the trained model, makes predictions, and visualizes the
+# results using a Flyte Deck.
+
+@task(
+    container_image=image,
+    enable_deck=True,
+)
+def make_forecast(df: pd.DataFrame, model_file: FlyteFile) -> pd.DataFrame:
+    from neuralprophet import load
+
+    model_file.download()
+    model = load(model_file.path)
+
+    # Create a dataframe extending 365 periods into the future for the forecast;
+    # n_historic_predictions=True also includes predictions over the historic data
+    df_future = model.make_future_dataframe(
+        df,
+        n_historic_predictions=True,
+        periods=365,
+    )
+
+    # Predict the future
+    forecast = model.predict(df_future)
+
+    # Plot on a Flyte Deck
+    fig = model.plot(forecast)
+    Deck("Forecast", fig.to_html())
+
+    return forecast
+
+# %% [markdown]
+# ## Main Workflow
+#
+# Finally, this workflow orchestrates the entire process: loading data,
+# training the model, and making forecasts.
+
+@workflow
+def main() -> pd.DataFrame:
+    df = load_data()
+    model_file = train_model(df)
+    forecast = make_forecast(df, model_file)
+    return forecast
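
The `make_forecast` task requests `periods=365` after the model was fit with `freq="H"`. Assuming the forecast horizon is counted in steps of the fitted frequency, 365 periods cover roughly 15 days, not a year. The sketch below uses only the standard library to show the difference. The `future_timestamps` helper and the starting timestamp are hypothetical, and this is not how NeuralProphet builds its future dataframe internally.

```python
from datetime import datetime, timedelta


def future_timestamps(last_observed, periods, step):
    """Return `periods` timestamps after `last_observed`, spaced by `step`."""
    return [last_observed + step * i for i in range(1, periods + 1)]


# Hypothetical last timestamp of the training data (not taken from the dataset).
last_observed = datetime(2024, 1, 1, 0, 0)

# The same number of periods spans very different horizons per frequency.
hourly = future_timestamps(last_observed, periods=365, step=timedelta(hours=1))
daily = future_timestamps(last_observed, periods=365, step=timedelta(days=1))

print("hourly horizon ends:", hourly[-1])
print("daily horizon ends: ", daily[-1])
```

If a one-year horizon is intended for hourly data, the number of periods would need to scale accordingly (for example, 365 * 24).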