Merge pull request #26 from sharkutilities/featuring/timeseries/merge…

…-existing Timeseries Utility Functions for Data Frame Objects
sharkutilities · Aug 26, 2024 · c334e7a · c334e7a
2 parents 080efd6 + 61976e0
commit c334e7a
Show file tree

Hide file tree

Showing 7 changed files with 575 additions and 0 deletions.
diff --git a/docs/index.md b/docs/index.md
@@ -61,6 +61,7 @@ The above function calculates the 50th percentile, i.e., the median of the featu
 aggregate.md
 wrappers.md
 functions.md
+timeseries.md
 ```
 
 ---

diff --git a/docs/timeseries.md b/docs/timeseries.md
@@ -0,0 +1,59 @@
+# Time Series Utilities
+
+<div align = "justify">
+
+Feature engineering, time series stationarity checks are few of the use-cases that are compiled in this gists. Check
+individual module defination and functionalities as follows.
+
+## Stationarity & Unit Roots
+
+Stationarity is one of the fundamental concepts in time series analysis. The
+**time series data model works on the principle that the [_data is stationary_](https://www.analyticsvidhya.com/blog/2021/04/how-to-check-stationarity-of-data-in-python/)
+and [_data has no unit roots_](https://www.analyticsvidhya.com/blog/2018/09/non-stationary-time-series-python/)**, this means:
+  * the data must have a constant mean (across all periods),
+  * the data should have a constant variance, and
+  * auto-covariance should not be dependent on time.
+
+Let's understand the concept using the following example, for more information check [this link](https://www.analyticsvidhya.com/blog/2018/09/non-stationary-time-series-python/).
+
+![Non-Stationary Time Series](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/09/ns5-e1536673990684.png)
+
+<div align = "center">
+
+| ADF Test | KPSS Test | Series Type | Additional Steps |
+| :---: | :---: | :---: | --- |
+| ✅ | ✅ | _stationary_ | |
+| ❌ | ❌ | _non-stationary_ | |
+| ✅ | ❌ | _difference-stationary_ | Use differencing to make series stationary. |
+| ❌ | ✅ | _trend-stationary_ | Remove trend to make the series _strict stationary. |
+
+</div>
+
+```{eval-rst}
+.. automodule:: pandaswizard.timeseries.stationarity
+  :members:
+  :undoc-members:
+  :show-inheritance:
+```
+
+## Time Series Featuring
+
+Time series analysis is a special segment of AI/ML application development where a feature is dependent on time. The code here
+is desgined to create a *sequence* of `x` and `y` data needed in a time series problem. The function is defined with two input
+parameters (I) **Lootback Period (T) `n_lookback`**, and (II) **Forecast Period (H) `n_forecast`** which can be visually
+presented below.
+
+<div align = "center">
+
+![prediction-sequence](https://i.stack.imgur.com/YXwMJ.png)
+
+</div>
+
+```{eval-rst}
+.. automodule:: pandaswizard.timeseries.ts_featuring
+  :members:
+  :undoc-members:
+  :show-inheritance:
+```
+
+</div>
diff --git a/pandaswizard/__init__.py b/pandaswizard/__init__.py
@@ -28,3 +28,4 @@
 from pandaswizard import window
 from pandaswizard import wrappers
 from pandaswizard import functions
+from pandaswizard import timeseries
diff --git a/pandaswizard/timeseries/__init__.py b/pandaswizard/timeseries/__init__.py
@@ -0,0 +1,29 @@
+# -*- encoding: utf-8 -*-
+
+"""
+A Set of Utility Functions for a Time Series Data for ``pandas``
+
+A time series analysis is a way of analyzing a sequence of data which
+was collected over a period of time and the data is collected at a
+specific intervals. The :mod:`pandas` provides various functions to
+manipulate a time object which is a wrapper over the ``datetime``
+module and is a type of ``pd.Timestamp`` object.
+
+The module provides some utility functions for a time series data
+and tests like "stationarity" which is a must in EDA!
+
+.. caution::
+    The time series module consists of functions which was earlier
+    developed in `GH/sqlparser <https://gist.github.com/ZenithClown/f99d7e1e3f4b4304dd7d43603cef344d>`_.
+
+.. note::
+    More details on why the module was merged is
+    available `GH/#24 <https://github.com/sharkutilities/pandas-wizard/issues/24>`_.
+
+.. note::
+    Code migration details are mentioned
+    `GH/#27 <https://github.com/sharkutilities/pandas-wizard/issues/27>`_.
+"""
+
+from pandaswizard.timeseries.stationarity import *
+from pandaswizard.timeseries.ts_featuring import *
diff --git a/pandaswizard/timeseries/stationarity.py b/pandaswizard/timeseries/stationarity.py
@@ -0,0 +1,86 @@
+# -*- encoding: utf-8 -*-
+
+"""
+Stationarity Checking for Time Series Data
+
+A functional approach to check stationarity using different models
+and the function attrbutes are as defined below.
+(`More Information <http://ds-gringotts.readthedocs.io/>`_)
+"""
+
+from statsmodels.tsa.stattools import kpss # kpss test
+from statsmodels.tsa.stattools import adfuller # adfuller test
+
+def checkStationarity(frame : object, feature: str, method : str = "both", verbose : bool = True, **kwargs) -> bool:
+    """
+    Performs ADF Test to Determine Data Stationarity
+
+    Given an univariate series formatted as `frame.set_index("data")`
+    the series can be tested for stationarity using the Augmented
+    Dickey Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS)
+    test. The function also returns a `dataframe` of rolling window
+    for plotting the data using `frame.plot()`.
+
+    :type  frame: pd.DataFrame
+    :param frame: The dataframe that (ideally) contains a single
+                  univariate feature (`feature`), else for a
+                  dataframe containing multiple series only the
+                  `feature` series is worked upon.
+
+    :type  feature: str
+    :param feature: Name of the feature, i.e. the column name
+                    in the dataframe. The `rolling` dataframe returns
+                    a slice of `frame[[feature]]` along with rolling
+                    mean and standard deviation.
+
+    :type  method: str
+    :param method: Select any of the method ['ADF', 'KPSS', 'both'],
+                   using the `method` parameter, name is case
+                   insensitive. Defaults to `both`.
+    """
+
+    results = dict() # key is `ADF` and/or `KPSS`
+    stationary = dict()
+
+    if method.upper() in ["ADF", "BOTH"]:
+        results["ADF"] = adfuller(frame[feature].values) # should be send like `frame.col.values`
+        stationary["ADF"] = True if (results["ADF"][1] <= 0.05) & (results["ADF"][4]["5%"] > results["ADF"][0]) else False
+
+        if verbose:
+            print(f"Observations of ADF Test ({feature})")
+            print("===========================" + "=" * len(feature))
+            print(f"ADF Statistics  : {results['ADF'][0]:,.3f}")
+            print(f"p-value         : {results['ADF'][1]:,.3f}")
+
+            critical_values = {k : round(v, 3) for k, v in results["ADF"][4].items()}
+            print(f"Critical Values : {critical_values}")
+
+        # always print if data is stationary/not
+        print(f"[ADF] Data is   :", "\u001b[32mStationary\u001b[0m" if stationary else "\x1b[31mNon-stationary\x1b[0m")
+
+    if method.upper() in ["KPSS", "BOTH"]:
+        results["KPSS"] = kpss(frame[feature].values) # should be send like `frame.col.values`
+        stationary["KPSS"] = False if (results["KPSS"][1] <= 0.05) & (results["KPSS"][3]["5%"] > results["KPSS"][0]) else True
+
+        if verbose:
+            print(f"Observations of KPSS Test ({feature})")
+            print("============================" + "=" * len(feature))
+            print(f"KPSS Statistics : {results['KPSS'][0]:,.3f}")
+            print(f"p-value         : {results['KPSS'][1]:,.3f}")
+
+            critical_values = {k : round(v, 3) for k, v in results["KPSS"][3].items()}
+            print(f"Critical Values : {critical_values}")
+
+        # always print if data is stationary/not
+        print(f"[KPSS] Data is  :", "\x1b[31mNon-stationary\x1b[0m" if stationary else "\u001b[32mStationary\u001b[0m")
+
+    # rolling calculations for plotting
+    rolling = frame.copy() # enable deep copy
+    rolling = rolling[[feature]] # only keep single feature, works if multi-feature sent
+    rolling.rename(columns = {feature : "original"}, inplace = True)
+
+    rolling_ = rolling.rolling(window = kwargs.get("window", 12))
+    rolling["mean"] = rolling_.mean()["original"].values
+    rolling["std"] = rolling_.std()["original"].values
+
+    return results, stationary, rolling
diff --git a/pandaswizard/timeseries/ts_featuring.py b/pandaswizard/timeseries/ts_featuring.py
@@ -0,0 +1,202 @@
+# -*- encoding: utf-8 -*-
+
+"""
+A Set of Methodologies involved with Feature Engineering
+
+Feature engineering or feature extraction involves transforming the
+data by manipulation (addition, deletion, combination) or mutation of
+the data set in hand to improve the machine learning model. The
+project mainly deals with, but not limited to, time series data that
+requires special treatment - which are listed over here.
+
+Feature engineering time series data will incorporate the use case of
+both univariate and multivariate data series with additional
+parameters like lookback and forward tree. Check documentation of the
+function(s) for more information.
+"""
+
+import numpy as np
+import pandas as pd
+
+
+class DataObjectModel(object):
+    """
+    Data Object Model (`DOM`) for AI-ML Application Development
+
+    Data is the key to an artificial intelligence application
+    development, and often times real world data are gibrish and
+    incomprehensible. The DOM is developed to provide basic use case
+    like data formatting, seperating `x` and `y` variables etc. such
+    that a feature engineering function or a machine learning model
+    can easily get the required information w/o much needed code.
+
+    # Example Use Cases
+    The following small use cases are possible with the use of the
+    DOM in feature engineering:
+
+    1. Formatting a Data to a NumPy ND-Array - an iterable/pandas
+       object can be converted into `np.ndarray` which is the base
+       data type of the DOM.
+
+       ```python
+       np.random.seed(7) # set seed for duplication
+       data = pd.DataFrame(
+        data = np.random.random(size = (9, 26)),
+        columns = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
+       )
+
+       dom = DataObjectModel(data)
+       print(type(dom.data))
+       >> <class 'numpy.ndarray'>
+       ```
+
+    2. Breaking an Array of `Xy` into Individual Component - for
+       instance a dataframe/tabular data has `X` features along side
+       `y` in column. The function considers the data and breaks it
+       into individual components.
+
+       ```python
+       X, y = dom.create_xy(y_index = 1)
+
+       # or if `y` is group of elements then:
+       X, y = dom.create_xy(y_index = (1, 4))
+       ```
+    """
+
+    def __init__(self, data: np.ndarray) -> None:
+        self.data = self.__to_numpy__(data)  # also check integrity
+
+    def __to_numpy__(self, data: object) -> np.ndarray:
+        """Convert Meaningful Data into a N-Dimensional Array"""
+
+        if type(data) == np.ndarray:
+            pass # data is already in required type
+        elif type(data) in [list, tuple]:
+            data = np.array(data)
+        elif type(data) == pd.DataFrame:
+            # often times a full df can be passed, which is a ndarray
+            # thus, the df can be easily converted to an np ndarray:
+            data = data.values
+        else:
+            raise TypeError(
+                f"Data `type == {type(data)}` is not convertible.")
+
+        return data
+
+
+    def create_xy(self, y_index : object = -1) -> tuple:
+        """
+        Breaks the Data into Individual `X` and `y` Components
+
+        From a tabular or ndimensional structure, the code considers
+        `y` along a given axis (`y_index`) and returns two `ndarray`
+        which can be treated as `X` and `y` individually.
+
+        The function uses `np.delete` command to create `X` feature
+        from the data. (https://stackoverflow.com/a/5034558/6623589).
+
+        This function is meant for multivariate dataset, and is only
+        applicable when dealing with multivariate time series data.
+        The function can also be used for any machine learning model
+        consisting of multiple features (even if it is a time series
+        dataset).
+
+        :type  y_index: object
+        :param y_index: Index/axis of `y` variable. If the type is
+                        is `int` then the code assumes one feature,
+                        and `y_.shape == (-1, 1)` and if the type
+                        of `y_index` is `tuple` then
+                        `y_.shape == (-1, (end - start - 1))` since
+                        end index is exclusive as in `numpy` module.
+        """
+
+        if type(y_index) in [list, tuple]:
+            x_ = self.data
+            y_ = self.data[:, y_index[0]:y_index[1]]
+            for idx in range(*y_index)[::-1]:
+                x_ = np.delete(x_, obj = idx, axis = 1)
+        elif type(y_index) == int:
+            y_ = self.data[:, y_index]
+            x_ = np.delete(self.data, obj = y_index, axis = 1)
+        else:
+            raise TypeError("`type(y_index)` not in [int, tuple].")
+
+        return x_, y_
+
+
+class CreateSequence(DataObjectModel):
+    """
+    Create a Data Sequence Typically to be used in LSTM Model
+
+    LSTM Model, or rather any time series data, requires a specific
+    sequence of data consisting of `n_lookback` i.e. length of input
+    sequence (or lookup values) and `n_forecast` values, i.e., the
+    length of output sequence. The function tries to provide single
+    approach to break data into sequence of `x_train` and `y_train`
+    for training in neural network.
+    """
+
+    def __init__(self, data: np.ndarray) -> None:
+        super().__init__(data)
+
+
+    def create_series(
+        self,
+        n_lookback : int,
+        n_forecast : int,
+        univariate : bool = True,
+        **kwargs
+    ) -> tuple:
+        """
+        Create a Sequence of `x_train` and `y_train` for training a
+        neural network model with time series data. The basic
+        approach in building the function is taken from:
+        https://stackoverflow.com/a/69912334/6623589
+
+        UPDATE [22-02-2023] : The function is now modified such that
+        it now can also return a sequence for multivariate time-series
+        analysis. The following changes has been added:
+        * 💣 refactor function name to generalise between univariate
+          and multivariate methods.
+        * 🔧 univariate feature can be called directly as this is the
+          default code behaviour.
+        * 🛠 to get multivariate functionality, use `univariate = False`
+        * 🛠 by default the last column (-1) of `data` is considered
+          as `y` feature by slicing `arr[s:e, -1]` but this can be
+          configured using `kwargs["y_feat_"]`
+        """
+
+        x_, y_ = [], []
+        n_record = self.__check_univariate_get_len__(univariate) \
+            - n_forecast + 1
+
+        y_feat_ = kwargs.get("y_feat_", -1)
+
+        for idx in range(n_lookback, n_record):
+            x_.append(self.data[idx - n_lookback : idx])
+            y_.append(
+                self.data[idx : idx + n_forecast] if univariate
+                else self.data[idx : idx + n_forecast, y_feat_]
+            )
+
+        x_, y_ = map(np.array, [x_, y_])
+
+        if univariate:
+            # the for loop is so designed it returns the data like:
+            # (<records>, n_lookback, <features>) however,
+            # for univariate the `<features>` dimension is "squeezed"
+            x_, y_ = map(lambda arr : np.squeeze(arr), [x_, y_])
+
+        return [x_, y_]
+
+
+    def __check_univariate_get_len__(self, univariate : bool) -> int:
+        """
+        Check if the data is a univariate one, and if `True` then
+        return the length, i.e. `.shape[0]`, for further analysis.
+        """
+
+        if (self.data.ndim != 1) and (univariate):
+            raise TypeError("Wrong dimension for univariate series.")
+
+        return self.data.shape[0]