diff --git a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-bike-share/auto-ml-forecasting-bike-share.ipynb b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-bike-share/auto-ml-forecasting-bike-share.ipynb index b4d25b0076..a7f4eb319b 100644 --- a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-bike-share/auto-ml-forecasting-bike-share.ipynb +++ b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-bike-share/auto-ml-forecasting-bike-share.ipynb @@ -1,7 +1,6 @@ { "cells": [ { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -11,7 +10,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -19,7 +17,13 @@ ] }, { - "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "!Important!
This notebook is outdated and no longer supported by the AutoML Team. Please use the supported version instead ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-task-bike-share)).
" + ] + }, + { "cell_type": "markdown", "metadata": {}, "source": [ @@ -37,7 +41,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -56,7 +59,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -86,7 +88,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -103,7 +104,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -137,7 +137,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -177,7 +176,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -201,7 +199,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { @@ -218,6 +215,7 @@ "cell_type": "code", "execution_count": null, "metadata": { + "collapsed": false, "jupyter": { "outputs_hidden": false, "source_hidden": false @@ -237,7 +235,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { @@ -256,6 +253,7 @@ "cell_type": "code", "execution_count": null, "metadata": { + "collapsed": false, "gather": { "logged": 1680247376789 }, @@ -277,7 +275,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -290,6 +287,7 @@ "cell_type": "code", "execution_count": null, "metadata": { + "collapsed": false, "jupyter": { "outputs_hidden": false, "source_hidden": false @@ -316,7 +314,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -334,7 +331,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -359,7 +355,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -378,7 +373,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -398,7 +392,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -441,7 +434,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -467,7 +459,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -486,7 +477,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -512,7 +502,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -556,7 +545,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -564,7 +552,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -583,7 +570,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -606,7 +592,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -637,7 +622,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -656,7 +640,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -673,7 +656,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -705,7 +687,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -715,7 +696,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -747,7 +727,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -822,7 +801,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.9" + "version": "3.8.10" }, "microsoft": { "ms_spell_check": { diff --git 
a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb index 4f097661fa..03b4408679 100644 --- a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb +++ b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb @@ -2,22 +2,30 @@ "cells": [ { "cell_type": "markdown", + "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.png)" - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, + "source": [ + "!Important!
This notebook is outdated and no longer supported by the AutoML Team. Please use the supported version instead ([link](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-task-energy-demand/automl-forecasting-task-energy-demand-advanced-mlflow.ipynb)).
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, "source": [ "# Automated Machine Learning\n", "_**Forecasting using the Energy Demand Dataset**_\n", @@ -32,11 +40,11 @@ "Advanced Forecasting\n", "1. [Advanced Training](#advanced_training)\n", "1. [Advanced Results](#advanced_results)" - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "# Introduction\n", "\n", @@ -52,18 +60,20 @@ "1. Generate the forecast and compute the out-of-sample accuracy metrics\n", "1. Configuration and remote run of AutoML for a time-series model with lag and rolling window features\n", "1. Run and explore the forecast with lagging features" - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "# Setup" - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import json\n", "import logging\n", @@ -82,36 +92,36 @@ "from azureml.core import Experiment, Workspace, Dataset\n", "from azureml.train.automl import AutoMLConfig\n", "from datetime import datetime" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "This notebook is compatible with Azure ML SDK version 1.35.0 or later." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "ws = Workspace.from_config()\n", "\n", @@ -133,13 +143,11 @@ "pd.set_option(\"display.max_colwidth\", None)\n", "outputDf = pd.DataFrame(data=output, index=[\"\"])\n", "outputDf.T" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "## Create or Attach existing AmlCompute\n", "A compute target is required to execute a remote Automated ML run. \n", @@ -149,11 +157,13 @@ "#### Creation of AmlCompute takes approximately 5 minutes. \n", "If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n", "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota." 
- ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from azureml.core.compute import ComputeTarget, AmlCompute\n", "from azureml.core.compute_target import ComputeTargetException\n", @@ -172,24 +182,22 @@ " compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n", "\n", "compute_target.wait_for_completion(show_output=True)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "# Data\n", "\n", "We will use energy consumption [data from New York City](http://mis.nyiso.com/public/P-58Blist.htm) for model training. The data is stored in a tabular format and includes energy demand and basic weather data at an hourly frequency. \n", "\n", "With Azure Machine Learning datasets you can keep a single copy of data in your storage, easily access data during model training, share data and collaborate with other users. Below, we will upload the datatset and create a [tabular dataset](https://docs.microsoft.com/bs-latn-ba/azure/machine-learning/service/how-to-create-register-datasets#dataset-types) to be used training and prediction." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "Let's set up what we know about the dataset.\n", "\n", @@ -197,64 +205,66 @@ "Time column is the time axis along which to predict.\n", "\n", "The other columns, \"temp\" and \"precip\", are implicitly designated as features." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "target_column_name = \"demand\"\n", "time_column_name = \"timeStamp\"" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "dataset = Dataset.Tabular.from_delimited_files(\n", " path=\"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/nyc_energy.csv\"\n", ").with_timestamp_columns(fine_grain_timestamp=time_column_name)\n", "dataset.take(5).to_pandas_dataframe().reset_index(drop=True)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "The NYC Energy dataset is missing energy demand values for all datetimes later than August 10th, 2017 5AM. Below, we trim the rows containing these missing values from the end of the dataset." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Cut off the end of the dataset due to large number of nan values\n", "dataset = dataset.time_before(datetime(2017, 10, 10, 5))" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "## Split the data into train and test sets" - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "The first split we make is into train and test sets. Note that we are splitting on time. Data before and including August 8th, 2017 5AM will be used for training, and data after will be used for testing." 
- ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# split into train based on time\n", "train = (\n", @@ -263,13 +273,13 @@ " .reset_index(drop=True)\n", ")\n", "train.sort_values(time_column_name).tail(5)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# split into test based on time\n", "test = (\n", @@ -278,13 +288,23 @@ " .reset_index(drop=True)\n", ")\n", "test.head(5)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], "source": [ "# register the splitted train and test data in workspace storage\n", "from azureml.data.dataset_factory import TabularDatasetFactory\n", @@ -296,23 +316,11 @@ "test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n", " test, target=(datastore, \"dataset/\"), name=\"nyc_energy_test\"\n", ")" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - } + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "### Setting the maximum forecast horizon\n", "\n", @@ -321,20 +329,20 @@ "Learn more about forecast horizons in our [Auto-train a time-series forecast model](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-auto-train-forecast#configure-and-run-experiment) guide.\n", "\n", "In this example, we set the horizon to 48 hours." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "forecast_horizon = 48" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "## Forecasting Parameters\n", "To define forecasting parameters for your experiment training, you can leverage the ForecastingParameters class. The table below details the forecasting parameter we will be passing into our experiment.\n", @@ -345,11 +353,11 @@ "|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n", "|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n", "|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "# Train\n", "\n", @@ -367,18 +375,20 @@ "|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. 
The default value is \"auto\", in which case AutoML determines the number of cross-validations automatically, if a validation set is not provided. Alternatively, users can specify an integer value.\n", "|**enable_early_stopping**|Flag to enable early termination if the score is not improving in the short term.|\n", "|**forecasting_parameters**|A class that holds all the forecasting-related parameters.|\n" - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "This notebook uses the blocked_models parameter to exclude some models that take a longer time to train on this dataset. You can choose to remove models from the blocked_models list, but you may need to increase the experiment_timeout_hours parameter value to get results." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from azureml.automl.core.forecasting_parameters import ForecastingParameters\n", "\n", @@ -402,65 +412,65 @@ " verbosity=logging.INFO,\n", " forecasting_parameters=forecasting_parameters,\n", ")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "Call the `submit` method on the experiment object and pass the run configuration. Depending on the data and the number of iterations, this can run for a while.\n", "One may specify `show_output = True` to print currently running iterations to the console." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "remote_run = experiment.submit(automl_config, show_output=False)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "remote_run.wait_for_completion()" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "## Retrieve the Best Run details\n", "Below we retrieve the best Run object from among all the runs in the experiment." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "best_run = remote_run.get_best_child()\n", "best_run" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "## Featurization\n", "We can look at the engineered feature names generated in time-series featurization via the JSON file named 'engineered_feature_names.json' under the run outputs." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Download the JSON file locally\n", "best_run.download_file(\n", @@ -470,13 +480,11 @@ " records = json.load(f)\n", "\n", "records" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "### View featurization summary\n", "You can also see what featurization steps were performed on different raw features in the user data. 
For each raw feature in the user data, the following information is displayed:\n", "\n", @@ -486,11 +494,13 @@ "+ Type detected\n", "+ If feature was dropped\n", "+ List of feature transformations for the raw feature" - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Download the featurization summary JSON file locally\n", "best_run.download_file(\n", @@ -512,41 +522,41 @@ " \"Transformations\",\n", " ]\n", "]" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "# Forecasting\n", "\n", "Now that we have retrieved the best pipeline/model, it can be used to make predictions on test data. We will do batch scoring on the test dataset, which should have the same schema as the training dataset.\n", "\n", "The inference will run on a remote compute. In this example, it will re-use the training compute." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "test_experiment = Experiment(ws, experiment_name + \"_inference\")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "### Retrieving forecasts from the model\n", "We have created a function called `run_forecast` that submits the test data to the best model determined during the training run and retrieves forecasts. This function uses a helper script `forecasting_script`, which is uploaded and executed on the remote compute." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from run_forecast import run_remote_inference\n", "\n", "remote_run_infer = run_remote_inference(\n", " test_experiment=test_experiment,\n", " compute_target=compute_target,\n", " train_run=best_run,\n", " test_dataset=test_dataset,\n", " target_column_name=target_column_name,\n", ")\n", "remote_run_infer.wait_for_completion(show_output=False)\n", "\n", "# download the inference output file to the local machine\n", "remote_run_infer.download_file(\"outputs/predictions.csv\", \"predictions.csv\")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "### Evaluate\n", "To evaluate the accuracy of the forecast, we'll compare against the actual demand values for some select metrics, including the mean absolute percentage error (MAPE). For more metrics that can be used for evaluation after training, please see [supported metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#regressionforecasting-metrics), and [how to calculate residuals](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#residuals)." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# load forecast data frame\n", "fcst_df = pd.read_csv(\"predictions.csv\", parse_dates=[time_column_name])\n", "fcst_df.head()" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from azureml.automl.core.shared import constants\n", "from azureml.automl.runtime.shared.score import scoring\n", @@ -613,31 +623,31 @@ " (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n", ")\n", "plt.show()" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "# Advanced Training \n", "We did not use lags in the previous model specification. 
In effect, the prediction was the result of a simple regression on date, time series identifier columns and any additional features. This is often a very good prediction as common time series patterns like seasonality and trends can be captured in this manner. Such simple regression is horizon-less: it doesn't matter how far into the future we are predicting, because we are not using past data. In the previous example, the horizon was only used to split the data for cross-validation." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "### Using lags and rolling window features\n", "Now we will configure the target lags, that is the previous values of the target variables, meaning the prediction is no longer horizon-less. We therefore must still specify the `forecast_horizon` that the model will learn to forecast. The `target_lags` keyword specifies how far back we will construct the lags of the target variable, and the `target_rolling_window_size` specifies the size of the rolling window over which we will generate the `max`, `min` and `sum` features.\n", "\n", "This notebook uses the blocked_models parameter to exclude some models that take a longer time to train on this dataset. You can choose to remove models from the blocked_models list but you may need to increase the iteration_timeout_minutes parameter value to get results." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "advanced_forecasting_parameters = ForecastingParameters(\n", " time_column_name=time_column_name,\n", @@ -668,63 +678,63 @@ " verbosity=logging.INFO,\n", " forecasting_parameters=advanced_forecasting_parameters,\n", ")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "We now start a new remote run, this time with lag and rolling window featurization. AutoML applies featurizations in the setup stage, prior to iterating over ML models. The full training set is featurized first, followed by featurization of each of the CV splits. Lag and rolling window features introduce additional complexity, so the run will take longer than in the previous example that lacked these featurizations." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "advanced_remote_run = experiment.submit(automl_config, show_output=False)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "advanced_remote_run.wait_for_completion()" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "### Retrieve the Best Run details" - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "best_run_lags = remote_run.get_best_child()\n", "best_run_lags" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": {}, "source": [ "# Advanced Results\n", "We did not use lags in the previous model specification. In effect, the prediction was the result of a simple regression on date, time series identifier columns and any additional features. This is often a very good prediction as common time series patterns like seasonality and trends can be captured in this manner. 
Such simple regression is horizon-less: it doesn't matter how far into the future we are predicting, because we are not using past data. In the previous example, the horizon was only used to split the data for cross-validation." - ], - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "test_experiment_advanced = Experiment(ws, experiment_name + \"_inference_advanced\")\n", "advanced_remote_run_infer = run_remote_inference(\n", @@ -741,23 +751,23 @@ "advanced_remote_run_infer.download_file(\n", " \"outputs/predictions.csv\", \"predictions_advanced.csv\"\n", ")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "fcst_adv_df = pd.read_csv(\"predictions_advanced.csv\", parse_dates=[time_column_name])\n", "fcst_adv_df.head()" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from azureml.automl.core.shared import constants\n", "from azureml.automl.runtime.shared.score import scoring\n", @@ -786,10 +796,7 @@ " (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n", ")\n", "plt.show()" - ], - "outputs": [], - "execution_count": null, - "metadata": {} + ] } ], "metadata": { @@ -802,40 +809,40 @@ "how-to-use-azureml", "automated-machine-learning" ], + "kernel_info": { + "name": "python3" + }, "kernelspec": { - "name": "python3", + "display_name": "Python 3.8 - AzureML", "language": "python", - "display_name": "Python 3 (ipykernel)" + "name": "python38-azureml" }, "language_info": { - "name": "python", - "version": "3.8.5", - "mimetype": "text/x-python", "codemirror_mode": { "name": "ipython", "version": 3 }, - "pygments_lexer": "ipython3", + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", "nbconvert_exporter": "python", - "file_extension": ".py" - }, - "vscode": { - "interpreter": { - "hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca" - } + "pygments_lexer": "ipython3", + "version": "3.8.10" }, "microsoft": { "ms_spell_check": { "ms_spell_check_language": "en" } }, - "kernel_info": { - "name": "python3" - }, "nteract": { "version": "nteract-front-end@1.0.0" + }, + "vscode": { + "interpreter": { + "hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca" + } } }, "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file + "nbformat_minor": 4 +} diff --git a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-github-dau/auto-ml-forecasting-github-dau.ipynb b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-github-dau/auto-ml-forecasting-github-dau.ipynb index 5d858f0729..7bc17c7f5e 100644 --- a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-github-dau/auto-ml-forecasting-github-dau.ipynb +++ b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-github-dau/auto-ml-forecasting-github-dau.ipynb @@ -22,6 +22,13 @@ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-beer-remote/auto-ml-forecasting-beer-remote.png)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "!Important!
This notebook is outdated and no longer supported by the AutoML Team. Please use the supported version instead ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-github-dau)).
" + ] + }, { "cell_type": "markdown", "metadata": { @@ -695,7 +702,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.9" + "version": "3.8.10" } }, "nbformat": 4, diff --git a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.ipynb b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.ipynb index 1e65c10331..6bace379bf 100644 --- a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.ipynb +++ b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.ipynb @@ -16,6 +16,13 @@ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.png)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "!Important!
This notebook is outdated and no longer supported by the AutoML Team. Please use the supported version instead ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1k_demand_forecasting_with_pipeline_components/automl-forecasting-demand-hierarchical-timeseries-in-pipeline)).
" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -666,7 +673,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.8" + "version": "3.8.10" } }, "nbformat": 4, diff --git a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-many-models/auto-ml-forecasting-many-models.ipynb b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-many-models/auto-ml-forecasting-many-models.ipynb index ef122603a7..aab5043cb9 100644 --- a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-many-models/auto-ml-forecasting-many-models.ipynb +++ b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-many-models/auto-ml-forecasting-many-models.ipynb @@ -16,6 +16,13 @@ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.png)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "!Important!
This notebook is outdated and no longer supported by the AutoML Team. Please use the supported version instead ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1k_demand_forecasting_with_pipeline_components/automl-forecasting-demand-many-models-in-pipeline)).
" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -306,7 +313,7 @@ "from azureml.core.compute import ComputeTarget, AmlCompute\n", "\n", "# Name your cluster\n", - "compute_name = \"mm-compute\"\n", + "compute_name = \"mm-compute-v1\"\n", "\n", "\n", "if compute_name in ws.compute_targets:\n", @@ -316,7 +323,7 @@ "else:\n", " print(\"Creating a new compute target...\")\n", " provisioning_config = AmlCompute.provisioning_configuration(\n", - " vm_size=\"STANDARD_D16S_V3\", max_nodes=20\n", + " vm_size=\"STANDARD_D14_V2\", max_nodes=20\n", " )\n", " # Create the compute target\n", " compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)\n", @@ -864,9 +871,9 @@ "automated-machine-learning" ], "kernelspec": { - "display_name": "Python 3.8.5 ('base')", + "display_name": "Python 3.8 - AzureML", "language": "python", - "name": "python3" + "name": "python38-azureml" }, "language_info": { "codemirror_mode": { @@ -878,7 +885,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.5" + "version": "3.8.10" }, "vscode": { "interpreter": { diff --git a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb index d770e2434c..0b1a7242cf 100644 --- a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb +++ b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb @@ -1,7 +1,6 @@ { "cells": [ { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -11,7 +10,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -19,7 +17,13 @@ ] }, { - "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "!Important!
This notebook is outdated and no longer supported by the AutoML Team. Please use the supported version instead ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-orange-juice-sales)).
" + ] + }, + { "cell_type": "markdown", "metadata": {}, "source": [ @@ -37,7 +41,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -50,7 +53,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -75,7 +77,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -92,7 +93,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -126,7 +126,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -166,7 +165,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -190,7 +188,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -211,7 +208,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -231,7 +227,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -264,7 +259,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -290,7 +284,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -307,7 +300,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -335,7 +327,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -374,7 +365,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -392,7 +382,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -466,7 +455,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -493,7 +481,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -513,7 +500,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -551,7 +537,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -572,7 +557,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -581,7 +565,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -610,7 +593,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -666,7 +648,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -674,7 +655,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -697,7 +677,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -717,7 +696,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -763,7 +741,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -812,7 +789,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -854,9 +830,9 @@ "friendly_name": "Forecasting orange juice sales with deployment", "index_order": 1, "kernelspec": { - "display_name": "Python 3.8.5 ('base')", + "display_name": "Python 3.8 - AzureML", "language": "python", - "name": "python3" + "name": "python38-azureml" }, "language_info": { "codemirror_mode": { @@ -868,7 +844,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.5" + "version": "3.8.10" }, "tags": [ "None" diff --git a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-pipelines/auto-ml-forecasting-pipelines.ipynb 
b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-pipelines/auto-ml-forecasting-pipelines.ipynb index 6277b7f013..90ff8bc552 100644 --- a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-pipelines/auto-ml-forecasting-pipelines.ipynb +++ b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-pipelines/auto-ml-forecasting-pipelines.ipynb @@ -1,5 +1,21 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "!Important!
This notebook is outdated and no longer supported by the AutoML Team. Please use the supported version instead ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-forecasting-in-pipeline)).
\n", + "
\n", + "
\n", + "\n", + "For examples illustrating how to build pipelines with components, please use the following links:\n", + "" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -774,7 +790,7 @@ "friendly_name": "Forecasting orange juice sales with deployment", "index_order": 1, "kernelspec": { - "display_name": "Python 3.8.5 ('base')", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-recipes-univariate/auto-ml-forecasting-univariate-recipe-experiment-settings.ipynb b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-recipes-univariate/auto-ml-forecasting-univariate-recipe-experiment-settings.ipynb index ca9e4b900f..edc21b8590 100644 --- a/v1/python-sdk/tutorials/automl-with-azureml/forecasting-recipes-univariate/auto-ml-forecasting-univariate-recipe-experiment-settings.ipynb +++ b/v1/python-sdk/tutorials/automl-with-azureml/forecasting-recipes-univariate/auto-ml-forecasting-univariate-recipe-experiment-settings.ipynb @@ -1,505 +1,512 @@ { - "cells": [ - { - "cell_type": "markdown", - "source": [ - "Copyright (c) Microsoft Corporation. All rights reserved.\n", - "\n", - "Licensed under the MIT License." - ], - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-recipes-univariate/1_determine_experiment_settings.png)" - ], - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "In this notebook we will explore the univariate time-series data to determine the settings for an automated ML experiment. We will follow the thought process depicted in the following diagram:
\n", - "![Forecasting after training](figures/univariate_settings_map_20210408.jpg)\n", - "\n", - "The objective is to answer the following questions:\n", - "\n", - "
1. Is there a seasonal pattern in the data?
2. Is the data stationary?
3. Is there a detectable auto-regressive pattern in the stationary data?
\n", - "\n", - "The answers to these questions will help determine the appropriate settings for the automated ML experiment.\n" - ], - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "import os\n", - "import warnings\n", - "import pandas as pd\n", - "\n", - "from statsmodels.graphics.tsaplots import plot_acf, plot_pacf\n", - "import matplotlib.pyplot as plt\n", - "from pandas.plotting import register_matplotlib_converters\n", - "\n", - "register_matplotlib_converters() # fixes the future warning issue\n", - "\n", - "from helper_functions import unit_root_test_wrapper\n", - "from statsmodels.tools.sm_exceptions import InterpolationWarning\n", - "\n", - "warnings.simplefilter(\"ignore\", InterpolationWarning)\n", - "\n", - "\n", - "# set printing options\n", - "pd.set_option(\"display.max_columns\", 500)\n", - "pd.set_option(\"display.width\", 1000)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "# load data\n", - "main_data_loc = \"data\"\n", - "train_file_name = \"S4248SM144SCEN.csv\"\n", - "\n", - "TARGET_COLNAME = \"S4248SM144SCEN\"\n", - "TIME_COLNAME = \"observation_date\"\n", - "COVID_PERIOD_START = \"2020-03-01\"\n", - "\n", - "df = pd.read_csv(os.path.join(main_data_loc, train_file_name))\n", - "df[TIME_COLNAME] = pd.to_datetime(df[TIME_COLNAME], format=\"%Y-%m-%d\")\n", - "df.sort_values(by=TIME_COLNAME, inplace=True)\n", - "df.set_index(TIME_COLNAME, inplace=True)\n", - "df.head(2)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "# plot the entire dataset\n", - "fig, ax = plt.subplots(figsize=(6, 2), dpi=180)\n", - "ax.plot(df)\n", - "ax.title.set_text(\"Original Data Series\")\n", - "locs, labels = plt.xticks()\n", - "plt.xticks(rotation=45)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "The graph plots the alcohol sales in the United States. Because the data is trending, it can be difficult to see cycles, seasonality or other interesting behaviors due to the scaling issues. For example, if there is a seasonal pattern, which we will discuss later, we cannot see them on the trending data. In such case, it is worth plotting the same data in first differences." - ], - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "# plot the entire dataset in first differences\n", - "fig, ax = plt.subplots(figsize=(6, 2), dpi=180)\n", - "ax.plot(df.diff().dropna())\n", - "ax.title.set_text(\"Data in first differences\")\n", - "locs, labels = plt.xticks()\n", - "plt.xticks(rotation=45)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "In the previous plot we observe that the data is more volatile towards the end of the series. This period coincides with the Covid-19 period, so we will exclude it from our experiment. Since in this example there are no user-provided features it is hard to make an argument that a model trained on the less volatile pre-covid data will be able to accurately predict the covid period." - ], - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "# 1. Seasonality\n", - "\n", - "#### Questions that need to be answered in this section:\n", - "1. Is there a seasonality?\n", - "2. If it's seasonal, does the data exhibit a trend (up or down)?\n", - "\n", - "It is hard to visually detect seasonality when the data is trending. 
The reason is that the scale of the seasonal fluctuations is dwarfed by the range of the trend in the data. One way to deal with this is to de-trend the data by taking the first differences. We will discuss this in more detail in the next section." - ], - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "# plot the entire dataset in first differences\n", - "fig, ax = plt.subplots(figsize=(6, 2), dpi=180)\n", - "ax.plot(df.diff().dropna())\n", - "ax.title.set_text(\"Data in first differences\")\n", - "locs, labels = plt.xticks()\n", - "plt.xticks(rotation=45)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "For the next plot, we will exclude the Covid period again. We will also shorten the length of the data, because plotting a very long time series may prevent us from seeing seasonal patterns, if there are any, since the plot may look like a random walk." - ], - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "# remove COVID period\n", - "df = df[:COVID_PERIOD_START]\n", - "\n", - "# plot the entire dataset in first differences\n", - "fig, ax = plt.subplots(figsize=(6, 2), dpi=180)\n", - "ax.plot(df[\"2015-01-01\":].diff().dropna())\n", - "ax.title.set_text(\"Data in first differences\")\n", - "locs, labels = plt.xticks()\n", - "plt.xticks(rotation=45)" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "
**Conclusion**
\n", - "\n", - "Visual examination does not suggest clear seasonal patterns. We will set STL_TYPE = None, and we will move to the next section, which examines stationarity. \n", - "\n", - "\n", - "Say we are working with a different data set that shows clear patterns of seasonality; then we have several options for the settings (a hedged settings sketch follows the list below). It is hard to say which option will work best in your case, hence you will need to run both options to see which one results in more accurate forecasts. \n", - "
1. If the data does not appear to be trending, set DIFFERENCE_SERIES=False, TARGET_LAGS=None and STL_TYPE = \"season\"\n", - "
2. If the data appears to be trending, consider one of the following two settings:\n", - " \n", - "
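As promised above, here is a minimal sketch of how the three settings discussed in this section might be recorded, assuming the variable names used in the option list. The values reflect the conclusions this notebook reaches (no clear seasonality in section 1, a unit root in section 2); the lag choice is deferred to section 3 and the value below is a placeholder, not a recommendation:

```python
# Illustrative experiment settings following the decision map above.
STL_TYPE = None           # section 1: no clear seasonal pattern was detected
DIFFERENCE_SERIES = True  # section 2: unit root tests find the series non-stationary
TARGET_LAGS = None        # placeholder; section 3 examines ACF/PACF before choosing lags
```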
" - ], - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "# 2. Stationarity\n", - "If the data does not exhibit seasonal patterns, we would like to see if the data is non-stationary. Particularly, we want to see if there is a clear trending behavior. If such behavior is observed, we would like to first difference the data and examine the plot of an auto-correlation function (ACF) known as correlogram. If the data is seasonal, differencing it will not get rid off the seasonality and this will be shown on the correlogram as well.\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "#### Questions that need to be answered in this section:\n", - "
1. Is the data stationary?
2. Does the stationarized data (either the original or the differenced series) exhibit a clear auto-regressive pattern?
\n", - "\n", - "To answer the first question, we run a series of tests (we call them unit root tests)." - ], - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "# unit root tests\n", - "test = unit_root_test_wrapper(df[TARGET_COLNAME])\n", - "print(\"---------------\", \"\\n\")\n", - "print(\"Summary table\", \"\\n\", test[\"summary\"], \"\\n\")\n", - "print(\"Is the {} series stationary?: {}\".format(TARGET_COLNAME, test[\"stationary\"]))\n", - "print(\"---------------\", \"\\n\")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "In the previous cell, we ran a series of unit root tests. The summary table contains the following columns:\n", - "\n", - "\n", - "Each of the tests shows that the original time series is non-stationary. The final decision is based on the majority rule. If there is a split decision, the algorithm will claim it is stationary. We run a series of tests because each test by itself may not be accurate. In many cases when there are conflicting test results, the user needs to make a determination of whether the series is stationary.\n", - "\n", - "Since we found the series to be non-stationary, we will difference it and then test whether the differenced series is stationary." - ], - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "# unit root tests\n", - "test = unit_root_test_wrapper(df[TARGET_COLNAME].diff().dropna())\n", - "print(\"---------------\", \"\\n\")\n", - "print(\"Summary table\", \"\\n\", test[\"summary\"], \"\\n\")\n", - "print(\"Is the {} series stationary?: {}\".format(TARGET_COLNAME, test[\"stationary\"]))\n", - "print(\"---------------\", \"\\n\")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "Four out of five tests show that the series in first differences is stationary. Notice that this decision is not unanimous. Next, let's plot the original series in first differences to illustrate the difference between a non-stationary (unit root) process and a stationary one." - ], - "metadata": {} - }, - { - "cell_type": "code", - "source": [ - "# plot original and stationary data\n", - "fig = plt.figure(figsize=(10, 10))\n", - "ax1 = fig.add_subplot(211)\n", - "ax1.plot(df[TARGET_COLNAME], \"-b\")\n", - "ax2 = fig.add_subplot(212)\n", - "ax2.plot(df[TARGET_COLNAME].diff().dropna(), \"-b\")\n", - "ax1.title.set_text(\"Original data\")\n", - "ax2.title.set_text(\"Data in first differences\")" - ], - "outputs": [], - "execution_count": null, - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "If you were asked the question \"What is the mean of the series before and after 2008?\", for the series titled \"Original data\" the mean values would be significantly different. This implies that the first moment of the series (in this case, the mean) is time dependent, i.e., the mean changes depending on the interval one is looking at. Thus, the series is deemed to be non-stationary. On the other hand, for the series titled \"Data in first differences\" the means for both periods are roughly the same. Hence, the first moment is time invariant, meaning it does not depend on the interval of time one is looking at. In this example it is easy to visually distinguish between stationary and non-stationary data. Often this distinction is not easy to make; therefore, we rely on the statistical tests described above to help us make an informed decision. 
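The helper `unit_root_test_wrapper` itself is not shown in this diff. As a rough, hedged sketch of what such a wrapper typically does — running several stationarity tests and voting, with a split decision resolved toward "stationary" as described above — the function below uses only the ADF and KPSS tests from statsmodels; the notebook's actual helper runs more tests (five, per the text) and builds a richer summary table:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

def unit_root_test_wrapper_sketch(series: pd.Series, alpha: float = 0.05) -> dict:
    """Illustrative only: vote across two stationarity tests."""
    series = series.dropna()
    adf_p = adfuller(series)[1]                             # H0: unit root
    kpss_p = kpss(series, regression="c", nlags="auto")[1]  # H0: stationary
    votes = [
        adf_p < alpha,    # ADF rejects the unit-root null -> stationary
        kpss_p >= alpha,  # KPSS fails to reject the stationary null -> stationary
    ]
    summary = pd.DataFrame({"test": ["ADF", "KPSS"], "p_value": [adf_p, kpss_p]})
    # A split decision is resolved in favor of stationarity, as the text explains.
    return {"summary": summary, "stationary": sum(votes) >= 1}
```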
" - ], - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "
**Conclusion**
\n", - "Since we found the original process to be non-stationary (it contains a unit root), we will have to model the data in first differences. As a result, we will set the DIFFERENCE_SERIES parameter to True." - ], - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "# 3. Check if there is a clear auto-regressive pattern\n", - "We need to determine if we should include lags of the target variable as features in order to improve forecast accuracy. To do this, we will examine the ACF and partial ACF (PACF) plots of the stationary series. In our case, it is a series in first differences.\n", - "\n", - "
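The diff truncates here, just before the ACF/PACF inspection. As a hedged sketch of the kind of code this step typically involves — using `plot_acf`/`plot_pacf`, which are already imported at the top of this notebook; the lag count of 40 is an arbitrary illustrative choice:

```python
# Inspect the autocorrelation structure of the stationarized (differenced) series.
stationary_series = df[TARGET_COLNAME].diff().dropna()

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6), dpi=120)
plot_acf(stationary_series, lags=40, ax=ax1)   # slow decay would hint at a remaining trend
plot_pacf(stationary_series, lags=40, ax=ax2)  # a sharp cutoff suggests an AR order / target lags
plt.tight_layout()
plt.show()
```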