From 1ea88dc676cd60e6a65292f394e63de63dbce5b2 Mon Sep 17 00:00:00 2001 From: Herman Obst Demaestri Date: Tue, 2 Mar 2021 14:24:44 -0300 Subject: [PATCH 1/6] Chapter 13 summary added, html updated --- 13_time_series/13_time_series.jl | 11 +- docs/13_time_series.jl.html | 1136 +++++++++++++++--------------- 2 files changed, 578 insertions(+), 569 deletions(-) diff --git a/13_time_series/13_time_series.jl b/13_time_series/13_time_series.jl index a100c7e0..e82a28fb 100644 --- a/13_time_series/13_time_series.jl +++ b/13_time_series/13_time_series.jl @@ -1,5 +1,5 @@ ### A Pluto.jl notebook ### -# v0.12.20 +# v0.12.21 using Markdown using InteractiveUtils @@ -795,7 +795,14 @@ md"Well, good! We started this chapter not knowing how to tackle time series for As a final summary, when dealing with a time series it is very important to be able to define if it has any latent variables such as trend or seasonality. Once we can find that underlying information, we will be able to generate forecasts with confidence. We just need to look deeper. -### Bibliography +### Summary + +In this chapter we have learned the basic foundations of time series analysis. +We have defined what a time serie is and delve into a particular method: The Exponential Smoothing. +After building up an intuition of how the simple exponential smoothing works, we continued to introduce more complex versions of the method as the various problems we set ourselves required it. +The simple, trended and seasonality methods were presented and coded, generating in that way a much greater understanding of how they work. + +### References - [Forecasting: Principles and Practice, Chap 7](https://otexts.com/fpp2/expsmooth.html) diff --git a/docs/13_time_series.jl.html b/docs/13_time_series.jl.html index d11458a3..63536ec9 100644 --- a/docs/13_time_series.jl.html +++ b/docs/13_time_series.jl.html @@ -5,9 +5,9 @@ ⚡ Pluto.jl ⚡ - - - + + + -

Predicting the future
Let's imagine this situation for a moment. We are sitting quietly in our house thinking about the beauty of data analysis, when an old friend calls us: Terry. We haven't seen him for a long time, but he tells us that he is forming an important investment fund for which he needs to get some experts in statistics.

The idea is simple: he needs to generate predictive models of the supply and demand of certain commodities so that later, with this precious information, he can determine whether the prices of those products are going to go down or up. Then he simply buys long or short positions (you don't need to know much about that; basically, you are betting that prices are going to go up or down, respectively) on the stock market, hoping to make a lot of money.

With respect to the data, what we have available are long series with all the values that the supply and demand took over time. In other words, we have time series.

This type of data has the particularity that its values are correlated since they are values of the same variable that changes over time. For example, the total amount of energy demanded by a nation today is closely related to yesterday's demand, and so on.

For a moment we are paralyzed and realize that we have never encountered this type of problem before. We take a breath for a few seconds and remember everything we learned about Bayesianism, about neural networks, about dynamical systems. It may be complex, but we will certainly succeed.

We tell Terry we're in. He smiles and gives us the first series: Peanut Chocolate


As we just said, notice that the variables we are being asked to work with are values that evolve over time, so they take the name of time series. This is the kind of variable we are going to deal with in this chapter, and it has the particularity that, as the series evolves in time, the values it takes are related to the previous ones. So, can you think of a way to solve the problem of predicting the next value the series is going to take?

Many ideas must be coming to your mind. Surely some of you thought of taking the average of all the previous values as the forecast, and others thought of directly taking the last value of the series, arguing that very old values do not affect the next ones. That is to say:

$y_{T+1|T} = \frac{1}{T}\sum_{t=1}^{T} y_t$

Where T is the number of periods for which we have data. Or:

$y_{T+1|T} = y_T$
Where we would always take the last value in the series to predict the next.

If we observe this carefully, we might realize that these are two extreme cases of the same methodology: assigning "weights" to the previous values to predict the next one. Basically this is the way we have to indicate how much we are interested in the old observations and how much in the new ones.

In the case of simple averaging, we would be saying that we care exactly the same about all the observations, since from the first to the last they are all multiplied by the same weights:

$y_{T+1|T} = \frac{1}{T}y_1 + \frac{1}{T}y_2 + ... + \frac{1}{T}y_T$

@@ -291,560 +291,560 @@

Exponential Smoothing

A very interesting way to implement our idea of a method that takes the most distant values into account, while assigning them a lesser weight, is Exponential Smoothing.

The name may sound very complicated or crazy to you. The reality is that it is very simple. Basically, what we propose is to assign weights that decrease exponentially as the observations get older, giving a preponderant weight to the most recent values, but without giving up the valuable information that the older values offer us:

$y_{T+1|T} = \alpha y_T + \alpha(1-\alpha) y_{T-1} + \alpha(1-\alpha)^2 y_{T-2} + ... + \alpha(1-\alpha)^{T-1} y_1$

That is:

$y_{T+1|T} = \sum_{i=0}^{T-1} \alpha(1-\alpha)^{i} y_{T-i}$

Watch these two formulas for a while until you make sure you understand that they are the same :)

This way of writing the method is especially useful because it allows us to regulate how much weight we want to assign to past values. What does this mean? That we can control how quickly the value of the weights will decline over time. As a general rule, with alpha values close to 1, a lot of weight will be given to the most recent values and the weights will decay very quickly, while with alpha getting close to 0 the decay will be smoother. Let's see it:
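Before looking at the plot, it can help to compute a few of these weights by hand. The snippet below is just an illustrative sketch (the helper `weights` is not part of the notebook):

```julia
# Exponential smoothing weights α(1-α)^i for observations that are i steps old.
weights(α, n) = [α * (1 - α)^i for i in 0:n-1]

weights(0.8, 5)  # ≈ [0.8, 0.16, 0.032, 0.0064, 0.00128]  -> fast decay
weights(0.2, 5)  # ≈ [0.2, 0.16, 0.128, 0.1024, 0.08192]  -> smoother decay
```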

[figure: weights assigned to past observations for several values of alpha]
As we said, as alpha gets closer to one, the importance of the closer values is greater and greater and when it gets closer to zero the opposite happens, that is, the importance is "more distributed" among all the observations. For example, in the case of choosing alpha = 0.5 we would have:

$y_{T+1|T} = 0.5y_T + 0.25y_{T-1} + 0.125y_{T-2} + 0.0625y_{T-3} + ...$

Well, then it seems we have a method that is good for estimating this kind of time series. But you're probably wondering... How do we choose the optimal alpha?

The basic strategy is to go through the entire time series, predicting each of its values from the previous ones, and then look for the alpha that minimizes the difference between the predictions and the data. Let's see it. For this, we have to introduce two new ways of writing the same method.

Weighted average and Component form

Another way to write the same method and that will help us later with the choice of the alpha that best describes our data series is the Weighted average form. It simply proposes that the next value is a weighted average between the last value in the time series and the last prediction made:

$y_{t+1|t} = \alpha y_t + (1-\alpha) y_{t|t-1}$

Notice that now the sub-index is changed from T to t, denoting that we are referring to any point of the time series and that, to calculate the prediction we are going to use the previous points of the series.

This is really very useful, since we can in this way go through all the time series and generate predictions for each point of it:

First, defining the prediction for the first value of the series as

$y_{1|0} = l_0$

$y_{2|1} = \alpha y_1 + (1-\alpha) l_0$

$y_{3|2} = \alpha y_2 + (1-\alpha) y_{2|1}$

...

$y_{T+1|T} = \alpha y_T + (1-\alpha) y_{T|T-1}$

And if we substitute each equation into the next one, we get back the equation for the prediction:

$y_{3|2} = \alpha y_2 + (1-\alpha) y_{2|1}$

$y_{3|2} = \alpha y_2 + (1-\alpha)(\alpha y_1 + (1-\alpha) l_0)$

$y_{3|2} = \alpha y_2 + (1-\alpha)\alpha y_1 + (1-\alpha)^2 l_0$

...

$y_{T+1|T} = \sum_{i=0}^{T-1} \alpha(1-\alpha)^{i} y_{T-i} + (1-\alpha)^T l_0$

And as $(1-\alpha)^T$ decays exponentially, even for moderately large values of T this term becomes practically zero. So we obtain the same prediction formula as before.

Well, this is very good! We already have a way to go through the whole time series and make the predictions. This will be very useful since we can make these predictions for different alpha values and observe which one best approximates the whole series. Once this optimal alpha is obtained in this fitting process, we will be able to make the prediction for the future period.

Finally, there is one last way to define our models called Component form

$y_{t+h|t} = l_t$

$l_t = \alpha y_t + (1-\alpha) l_{t-1}$

In the case of the Simple Exponential Smoothing it is identical to the Weighted average form, but it will make our work easier later when we want to make the analysis more complex.

Optimization (or Fitting) Process

Let's start by seeing how different alpha values produce different predicted-value curves, defining a function that computes the prediction for each value in the time series given a starting point ($l_0$) and an alpha:

SES_weight (generic function with 1 method)
Let's read the above algorithm together to make sure we understand that it applies the formula explained above.

The function SES_weight receives three parameters: the alpha used in the calculation, the value of the first prediction $l_0$, and the time series in question.

The algorithm begins by obtaining the number of points in the time series and defining a vector in which we will store the predicted value for each of its points. Then it iterates over the time series applying the formula mentioned above:

For the first predicted value, $y_{1|0} = y_{pred} = l_0$, and for the following ones we apply the formula $y_{t+1|t} = \alpha y_t + (1-\alpha) y_{t|t-1}$. Makes sense, right?
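Since the cell's source is collapsed in this rendered notebook, here is a rough sketch of what a function like SES_weight could look like (an illustration based on the description above, not necessarily the notebook's exact code):

```julia
# Fitted values of simple exponential smoothing.
# α : smoothing parameter, l0 : initial prediction, y : observed time series.
function SES_weight(α, l0, y)
    N = length(y)
    pred = Vector{Float64}(undef, N)
    pred[1] = l0                                    # y_{1|0} = l0
    for t in 2:N
        pred[t] = α * y[t-1] + (1 - α) * pred[t-1]  # y_{t|t-1}
    end
    return pred
end
```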

Then let's get to work and see how our prediction curve fits the real data.

[figure: the peanut chocolate series together with the SES-fitted curves for several values of alpha]
This is a very cool graphic to look at. It illustrates very clearly how small alphas look mostly like an average of the time series values and, as it starts to get closer to 1, it looks more like taking the last value in the series as a future prediction. As we said at the beginning of the chapter, this method allows us to find intermediate points between the two extremes.

It's very nice to see how the graphs change as the alpha does... but how do we find the best alpha and l0 so that the fit is the best?

Loss functions

Somehow we have to be able to quantify how close the predicted curve is to the actual data curve. A very elegant way to do it is defining an error function or loss function which will return, for a certain value of the parameters to be optimized, a global number that tells us just how similar both curves are.

@@ -856,93 +856,93 @@

Loss functions

$SSE = \sum_{t=1}^{T}(y_t - y_{t|t-1})^2 = \sum_{t=1}^{T} e_t^2$

$SAE = \sum_{t=1}^{T}|y_t - y_{t|t-1}| = \sum_{t=1}^{T} |e_t|$

Great! Now that we have a logical and quantitative methodology to determine how well our model is fitting the data, all that remains is to implement it. Let's go for it!

SES_weight_loss (generic function with 1 method)
As you can see in the code, we take the sum of squared errors as the error function. This algorithm is very similar to the previous one (with which we obtained each predicted point), only that now, instead of saving each predicted value, we compute its residual against the real data and add it to the variable "loss".
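The cell's code is hidden in this render; a minimal sketch of such a loss function, under the same assumptions as the SES_weight sketch above, could be:

```julia
# Sum of squared one-step-ahead errors for simple exponential smoothing.
function SES_weight_loss(α, l0, y)
    loss = 0.0
    pred = l0                                  # y_{1|0} = l0
    loss += (y[1] - pred)^2
    for t in 2:length(y)
        pred = α * y[t-1] + (1 - α) * pred     # y_{t|t-1}
        loss += (y[t] - pred)^2                # accumulate the squared residual
    end
    return loss
end
```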

As a result, we have a function that takes a time series, alpha and $l_0$ as parameters, and returns a number that tells us the overall error. Obviously, that is the number we want to minimize.

Let's see how the function behaves if we keep $l_0$ fixed (actually it is a value that we have to find with the minimization, but we know that it has to be near the first value of the series) and we change alpha.

[figure: the SSE loss as a function of alpha, with l0 fixed]
It is really very subtle, but the error function is not strictly decreasing. In fact, somewhere between 0.8 and 0.9 the function starts to grow again, so the minimum is there.

For this kind of problem, Julia has a package to optimize functions:

SES_loss_ (generic function with 2 methods)
 * Status: success
 
  * Candidate solution
     Final objective value:     1.423677e+04
@@ -967,274 +967,274 @@ 

Loss functions

Algorithm: Fminbox with L-BFGS * Convergence measures - |x - x'| = 1.38e-10 ≰ 0.0e+00 - |x - x'|/|x'| = 3.09e-13 ≰ 0.0e+00 + |x - x'| = 0.00e+00 ≤ 0.0e+00 + |x - x'|/|x'| = 0.00e+00 ≤ 0.0e+00 |f(x) - f(x')| = 0.00e+00 ≤ 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00 - |g(x)| = 6.73e-10 ≤ 1.0e-08 + |g(x)| = 6.01e-07 ≰ 1.0e-08 * Work counters - Seconds run: 1 (vs limit Inf) - Iterations: 6 - f(x) calls: 347 - ∇f(x) calls: 347 -
+ Seconds run: 3 (vs limit Inf) + Iterations: 4 + f(x) calls: 171 + ∇f(x) calls: 171 +
2.8 s

To use this function more efficiently it is necessary to define a range for the parameters in which the algorithm will perform the search and also a starting point (obviously within that range).

Also, one trick to keep in mind is that this package accepts "univariate" functions, that is, the function you enter only has to have one parameter to optimize. This is not entirely true since, although only one parameter has to be passed, it can be a vector, so that several parameters can be optimized. This is why we define a wrapper function SES_loss_ that facilitates the calculation.

With everything ready, let's look for the values of alpha and lo that minimize our error function:
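The optimization cell itself is collapsed in this render. As a hedged sketch of how this could be set up with Optim.jl (the search box, the starting point and the variable name `data` are illustrative assumptions, not the notebook's actual values):

```julia
using Optim

# Wrapper so that Optim sees a single vector argument p = [α, l0].
SES_loss_(p) = SES_weight_loss(p[1], p[2], data)

lower = [0.0, minimum(data)]                  # box constraints for α and l0
upper = [1.0, maximum(data)]
p0    = [0.5, (lower[2] + upper[2]) / 2]      # starting point inside the box

optim = optimize(SES_loss_, lower, upper, p0, Fminbox(LBFGS()))
Optim.minimizer(optim)                        # fitted [α, l0]
```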

-
6.8 μs
optim
1.5 ms

And this is how we came to fit our model, obtaining the best parameters to try to predict the next sales of our beloved peanut chocolate

SES_weight_forecast (generic function with 1 method)

As you can see, the simple exponential smoothing method only gives us one value going forward. In other words, it predicts a constant value into the future. This happens because these types of time series have no latent variables defined, such as trend or seasonality. These are variables that add information to the model and allow us to make a different prediction for each future period we want to forecast.
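As a rough sketch of what a forecasting helper like SES_weight_forecast could do (again an illustration, not the notebook's exact code): run the smoothing over the observed data and append the final level as a flat forecast.

```julia
# Flat SES forecast: every one of the n_pred future periods gets the last level.
function SES_weight_forecast(α, l0, y, n_pred)
    level = l0
    for t in 1:length(y)
        level = α * y[t] + (1 - α) * level     # level after seeing y[t]
    end
    return vcat(y, fill(level, n_pred))        # series with the constant forecast appended
end
```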

But do not worry about that for now, we will study it in detail shortly. For now, let's see what the prediction would look like.

forecast

[figure: the peanut chocolate series with the flat SES forecast appended]
Perfect! We already have an initial model to attack problems where there is a lot of variability. But this is not always the case.

That same night we talked to Terry, showed him the progress and he loves the direction we are going. He tells us that he has another time series to analyze and that he has a different behavior than the previous one, apparently this one shows a trend in the values...

Let's see:

Trend Methods

Now that we have built up an intuition of how Simple Exponential Smoothing logic works, wouldn't it be good to have some additional method that allows us to learn if our time series has a certain tendency?

But what does this mean in the first place? It means that there are some processes that inherently, by their nature, follow a marked tendency beyond random noise

For example, if we take the number of air passengers in Australia from 1990 to 2016, we find this graph:

[figure: air passengers in Australia, 1990–2016]
In this type of problem, it would not make any sense for all the predictions to be constant. Here is more latent information that we can get from the data: The trend.

This will be key since, once obtained, it will allow us to generate new values as we want to make more distant forecasts in time.

But how do we include this in our exponential smoothing model?

Holt’s linear trend method

This method for making predictions with time series that have a trend consists of a prediction equation and two smoothing equations, one to determine the level and another for the slope (or trend):

$y_{T+h|T} = l_t + h b_t$

$l_t = \alpha y_t + (1-\alpha)(l_{t-1} + b_{t-1})$

$b_t = \beta(l_t - l_{t-1}) + (1-\beta) b_{t-1}$

In this way, the equation for making predictions is simply to take the last predicted level (up to here it is equal to the Simple Exponential Smoothing) and add as many "slopes" as periods ahead we want to predict. The value of the slope to make the forecast is also the last predicted one.

As you can see, the values of alpha and beta are going to weigh each of the Smoothing equations.

On the one hand, alpha weights the real observation $y_t$ against the value predicted with the previous level and slope, $l_{t-1} + b_{t-1}$, to obtain the current value of the level $l_t$.

On the other hand, the beta value tells us how much we are going to let the value of the slope be modified. It weights the slope currently observed, $l_t - l_{t-1}$, against the slope estimated at the previous point, to calculate the estimate of the slope for the current period. In this way, small beta values indicate that the slope is unlikely to change over time, while high values allow the slope to change freely (the value of the "current" slope $l_t - l_{t-1}$ becomes preponderant in the estimation).

With this method, then, forecasts stop being flat to become trended. With this idea in mind, let's translate the math into code again:
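The corresponding cell is collapsed in this render; a hedged sketch of a loss function for Holt's method, following the equations above (names and initialization are illustrative):

```julia
# Sum of squared one-step-ahead errors for Holt's linear trend method.
# α, β : smoothing parameters, l0, b0 : initial level and slope, y : the series.
function HLT_loss(α, β, l0, b0, y)
    level, slope = l0, b0
    loss = 0.0
    for t in 1:length(y)
        pred = level + slope                              # forecast of y[t]
        loss += (y[t] - pred)^2
        new_level = α * y[t] + (1 - α) * (level + slope)  # level equation
        slope     = β * (new_level - level) + (1 - β) * slope
        level     = new_level
    end
    return loss
end
```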

HLT_loss (generic function with 1 method)
This function is doing exactly the same as SES_weight_loss, which we defined earlier. It is good to clarify that, like it, the method needs an initial slope to make the estimation of the first value of the time series. Let's see which parameters optimize the model with the data we have!

HLT_loss_ (generic function with 2 methods)
 * Status: success
 
  * Candidate solution
     Final objective value:     1.284222e+02
@@ -1243,141 +1243,141 @@ 

Holt’s linear trend method

Algorithm: Fminbox with L-BFGS * Convergence measures - |x - x'| = 1.88e-09 ≰ 0.0e+00 - |x - x'|/|x'| = 1.17e-10 ≰ 0.0e+00 + |x - x'| = 0.00e+00 ≤ 0.0e+00 + |x - x'|/|x'| = 0.00e+00 ≤ 0.0e+00 |f(x) - f(x')| = 0.00e+00 ≤ 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00 - |g(x)| = 2.35e-09 ≤ 1.0e-08 + |g(x)| = 1.28e+02 ≰ 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) - Iterations: 4 - f(x) calls: 456 - ∇f(x) calls: 456 -
+ Iterations: 6 + f(x) calls: 936 + ∇f(x) calls: 936 +
208 ms

As with the "SES" we define a wrapper function to be able to perform the optimization. The minimum and maximum values of lo are obtained with the same criterion: The optimal value has to be close to the first value of the time series, since it is going to be its estimation. With respect to the minimum and maximum value of the slope the criterion is again looking into the data: If you see the graph of the time series you can clearly see that the slope has to be between 1 and 3 (you can see that every 5 years that pass, the number of passengers increases by 10, more or less).

Let's see the optimal values for the parameters:

optim1

Perfect! Now that we have the optimal parameters to perform the forecast, we just need to define a function that performs it. For example:

HLT_forecast (generic function with 1 method)

As you can see in the function, the first part of it (the first for) goes through the entire time series using the parameters already optimized and making the best predictions for each point.

Then, when we reach the end of the time series, the second "for" begins (it iterates over the number of periods we want to predict, a value that we enter as "n_pred") to make forecasts of periods that have not yet happened. To do this, it simply uses the last "level" that was estimated for the last value of the time series and adds up as many slopes as periods we want: $y_{pred} = l_t + b_t \cdot i$

Finally, it returns a concatenation of the time series plus the values we ask it to predict.
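Since the cell is collapsed here, a hedged sketch of a forecast function with that shape (illustrative, not the notebook's exact code):

```julia
# Holt's linear trend forecast: fit over the observed series, then project the trend.
function HLT_forecast(α, β, l0, b0, y, n_pred)
    level, slope = l0, b0
    for t in 1:length(y)                                  # first loop: smoothing pass
        new_level = α * y[t] + (1 - α) * (level + slope)
        slope     = β * (new_level - level) + (1 - β) * slope
        level     = new_level
    end
    forecast = [level + slope * i for i in 1:n_pred]      # second loop: l_t + b_t * i
    return vcat(y, forecast)                              # series plus forecasted values
end
```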

data_forecasted

[figure: the air passengers series with the trend forecast appended]
And so it is. We already built a tool that allows us to make predictions for variables that show a trend.

But surely you are thinking that assuming the trend will be maintained during all the years we are forecasting is a bit excessive? And it's true, it is.

It is known that this type of method usually overestimates the values of the variable being predicted, precisely because it assumes that the tendency continues.

An improvement of this method that helps deal with this problem is the damped trend method. Basically, what it does is add a coefficient that flattens the curve as we make more distant predictions in time. This improvement produces better predictions than the plain trend method, leaving the formulas as:

$y_{T+h|T} = l_t + (\phi + \phi^2 + ... + \phi^h) b_t$

$l_t = \alpha y_t + (1-\alpha)(l_{t-1} + \phi b_{t-1})$

$b_t = \beta(l_t - l_{t-1}) + (1-\beta)\phi b_{t-1}$

In this way, at the time of making the predictions and as we want to make them over more distant periods, instead of adding a unit of b_t to each one, we add a smaller fraction each time until the sum is practically constant.

For example, before, with the simple trend method, if we wanted to estimate the value of the next period, the calculation was (assuming, for example, that $l_t$ and $b_t$ in the last period of the time series were worth 300 and 1.5):

$y_{T+1|T} = 300 + 1.5 = 301.5$

@@ -1387,218 +1387,218 @@

Holt’s linear trend method

On the other hand, when we use the damped method, the fraction of $b_t$ that we add falls exponentially, since we multiply it by a number less than 1 raised to an increasing power. Following the same example, and adding a damping parameter equal to 0.9, we would obtain:

$y_{T+1|T} = 300 + 0.9 \cdot 1.5 = 300 + 1.35 = 301.35$

$y_{T+2|T} = 300 + 0.9 \cdot 1.5 + 0.9^2 \cdot 1.5 = 300 + 0.9 \cdot 1.5 + 0.81 \cdot 1.5 = 302.565$

The process continues in this way until the value of $\phi^h$ becomes negligible; that is to say, even though we keep going further into the future, no significant term is added to the forecast anymore. This is why it is said that the damped method tends to make flat forecasts in the long term.

Finally, let's note that the damping parameter takes values between 1 and 0, being completely identical to the simple trend method for a value of 1 and completely flat for a value of 0. Let's see:
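To make the arithmetic above concrete, here is a small illustrative sketch (the name `damped_trend_forecast` is ours, not the notebook's Damped_HLT_forecast):

```julia
# Damped-trend forecasts from an already-fitted last level and slope.
# ϕ = 1 reproduces Holt's linear trend; ϕ close to 0 gives an almost flat forecast.
function damped_trend_forecast(level, slope, ϕ, n_pred)
    forecast = Float64[]
    damped_sum = 0.0
    for h in 1:n_pred
        damped_sum += ϕ^h                      # ϕ + ϕ² + ... + ϕ^h
        push!(forecast, level + damped_sum * slope)
    end
    return forecast
end

damped_trend_forecast(300.0, 1.5, 0.9, 2)      # ≈ [301.35, 302.565], as in the example
```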

Damped_HLT_forecast (generic function with 1 method)

damped_forecast

[figure: forecasts of the air passengers series with the damped trend method]
Incredible! Terry calls us and tells us that using the "damped" model was key to a trade he made. He is very excited about our work and gives us one last challenge.

He tells us that he is trying to enter more complex markets, particularly tourism, but that he doesn't understand how to approach this type of series since they show a lot of variability. For example, we start analysing the visitor nights in Australia spent by international tourists:

[figure: quarterly visitor nights in Australia spent by international tourists]
As you can see, the series shows a lot of variability. The values go up and down constantly.

After being a long time with Terry looking for ideas to address this type of data we realize that these ups and downs are not random, indeed, the form is repeated year after year! We realize that we are facing a problem with seasonality.

Seasonality Methods

For this type of model, in addition to the level and trend parameters, it is necessary to add another component that captures the point of the year we are in (actually, it can be any other period in which seasonality occurs) and that somehow influences the predicted outcome.

@@ -1616,268 +1616,270 @@

Holt-Winters’ seasonal additive method

For this method, we will need to add a smoothing equation for the seasonality parameters. In addition, we will have to indicate the frequency with which the seasonality occurs in the analyzed period and denote it with the letter m. For example, for a monthly seasonality m=12 and for a semester one m=2.

So, the model becomes like this:

$y_{T+h|T} = l_t + h b_t + s_{t+h-m(k+1)}$

$l_t = \alpha(y_t - s_{t-m}) + (1-\alpha)(l_{t-1} + b_{t-1})$

$b_t = \beta(l_t - l_{t-1}) + (1-\beta) b_{t-1}$

$s_t = \gamma(y_t - l_{t-1} - b_{t-1}) + (1-\gamma) s_{t-m}$

As we had already anticipated, the seasonality term is added to the forecast equation. The term k is the integer part of $(h-1)/m$ and its function is to ensure that we always make predictions with the seasonal values estimated in the last year of the time series we have as data.

The level equation is still a weighted average, only now the observation is seasonally adjusted, $(y_t - s_{t-m})$. The other part of the average is the non-seasonal forecast $(l_{t-1} + b_{t-1})$.

The trend equation remains the same, and the seasonality equation also represents a weighted average, between the current seasonal index $(y_t - l_{t-1} - b_{t-1})$ and the previous year's index for the same season, $s_{t-m}$. This average is weighted by the $\gamma$ parameter.

So now, let's put these equations into code:
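The HW_Seasonal cell is collapsed in this render; here is a hedged sketch of how the one-step-ahead fitted values could be computed (argument names and the seasonal bookkeeping are assumptions made for illustration):

```julia
# One-step-ahead fitted values for Holt-Winters' additive seasonal method.
# α, β, γ : smoothing parameters, l0, b0 : initial level and slope,
# s0 : vector with the m initial seasonal indices ("year 0"), m : season length.
function HW_Seasonal(α, β, γ, l0, b0, s0, y, m)
    level, slope = l0, b0
    season = copy(s0)
    pred = Float64[]
    for t in 1:length(y)
        s_prev = season[mod1(t, m)]                       # s_{t-m}
        push!(pred, level + slope + s_prev)               # forecast with h = 1
        new_level = α * (y[t] - s_prev) + (1 - α) * (level + slope)
        new_slope = β * (new_level - level) + (1 - β) * slope
        season[mod1(t, m)] = γ * (y[t] - level - slope) + (1 - γ) * s_prev
        level, slope = new_level, new_slope
    end
    return pred
end
```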

HW_Seasonal (generic function with 1 method)
What this code does is, given the optimal values of each parameter, it makes the prediction of the next point (h = 1) and stores them in the "pred" array. Staying a while analyzing how the algorithm is working is an excellent way to ensure full understanding of the method :)

To obtain these parameters it is necessary to write the loss function as we have been doing in the previous ones. As a challenge for the reader, we propose to write this loss function using this one as a basis. To help you, you can look at the intimate relationship between the functions that are storing the predictions and the loss functions already written in the previous methods.

In this particular case, in which our data is quarterly, m = 4. Doing the same procedure as always to optimize the function, we obtain:

It is interesting to stop and look at these values that we obtained.
First of all, it is remarkable how, for the first time, the parameter $\alpha$ takes a relatively low value. This makes perfect sense, as each value is now much more connected to values further away than to the immediately previous one. They are connected precisely because of their seasonality.

It is also interesting to note that for the initial values of the seasonality not just one value is needed, but 4. In general, as many values as m, the frequency of the seasonality, will be needed. This can be seen as now needing a whole "year 0" to make the estimates for the first year of the time series.

Now, let's see the function in action:

season_fitted

[figure: the visitor nights data together with the Holt-Winters fitted values]
As you can see, the fit is very good. It is also interesting how you can appreciate the exponential smoothing in the step between the valley and the peak: at first the data doesn't show it, and therefore neither does the fitted series, but as it appears, the model learns it and starts to forecast it as well.

Excellent! Now that we have our model, it's time to use it and call Terry to tell him what actions to take in his trading strategy:

HW_Seasonal_forecast (generic function with 1 method)

season_forecast

[figure: the visitor nights series with the Holt-Winters forecast appended]
Well, good! We started this chapter not knowing how to tackle time series forecasting problems and ended up building a wide variety of models for different types of data, all while making our friend Terry a lot of money!

As a final summary, when dealing with a time series it is very important to be able to define if it has any latent variables such as trend or seasonality. Once we can find that underlying information, we will be able to generate forecasts with confidence. We just need to look deeper.

-

Bibliography

+

Summary

+

In this chapter we have learned the basic foundations of time series analysis. We have defined what a time serie is and delve into a particular method: The Exponential Smoothing. After building up an intuition of how the simple exponential smoothing works, we continued to introduce more complex versions of the method as the various problems we set ourselves required it. The simple, trended and seasonality methods were presented and coded, generating in that way a much greater understanding of how they work.

+

References

-
2.3 ms
+14.6 μs From 23f50df880d8057832d53294c016c26456c89b78 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Wed, 10 Mar 2021 18:10:33 -0300 Subject: [PATCH 2/6] corrections --- 13_time_series/13_time_series.jl | 2 +- docs/13_time_series.jl.html | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/13_time_series/13_time_series.jl b/13_time_series/13_time_series.jl index e82a28fb..25ec0f1e 100644 --- a/13_time_series/13_time_series.jl +++ b/13_time_series/13_time_series.jl @@ -798,7 +798,7 @@ As a final summary, when dealing with a time series it is very important to be a ### Summary In this chapter we have learned the basic foundations of time series analysis. -We have defined what a time serie is and delve into a particular method: The Exponential Smoothing. +We have defined what a time series is and delve into a particular method: The exponential smoothing. After building up an intuition of how the simple exponential smoothing works, we continued to introduce more complex versions of the method as the various problems we set ourselves required it. The simple, trended and seasonality methods were presented and coded, generating in that way a much greater understanding of how they work. diff --git a/docs/13_time_series.jl.html b/docs/13_time_series.jl.html index 63536ec9..379246aa 100644 --- a/docs/13_time_series.jl.html +++ b/docs/13_time_series.jl.html @@ -1873,7 +1873,7 @@

Holt-Winters’ seasonal additive method

277 ms

Well, good! We started this chapter not knowing how to tackle time series forecasting problems and ended up building a wide variety of models for different types of data, all while making our friend Terry a lot of money!

As a final summary, when dealing with a time series it is very important to be able to define if it has any latent variables such as trend or seasonality. Once we can find that underlying information, we will be able to generate forecasts with confidence. We just need to look deeper.

Summary

-

In this chapter we have learned the basic foundations of time series analysis. We have defined what a time serie is and delve into a particular method: The Exponential Smoothing. After building up an intuition of how the simple exponential smoothing works, we continued to introduce more complex versions of the method as the various problems we set ourselves required it. The simple, trended and seasonality methods were presented and coded, generating in that way a much greater understanding of how they work.

+

In this chapter we have learned the basic foundations of time series analysis. We have defined what a time series is and delve into a particular method: The exponential smoothing. After building up an intuition of how the simple exponential smoothing works, we continued to introduce more complex versions of the method as the various problems we set ourselves required it. The simple, trended and seasonality methods were presented and coded, generating in that way a much greater understanding of how they work.

References

  • Forecasting: Principles and Practice, Chap 7

    From e9c7e2728719471cafee2f750b3055df092a2778 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Mon, 15 Mar 2021 11:17:47 -0300 Subject: [PATCH 3/6] corrections --- 13_time_series/13_time_series.jl | 9 +- docs/13_time_series.jl.html | 1090 +++++++++++++++--------------- 2 files changed, 550 insertions(+), 549 deletions(-) diff --git a/13_time_series/13_time_series.jl b/13_time_series/13_time_series.jl index 25ec0f1e..01b737b4 100644 --- a/13_time_series/13_time_series.jl +++ b/13_time_series/13_time_series.jl @@ -797,10 +797,11 @@ As a final summary, when dealing with a time series it is very important to be a ### Summary -In this chapter we have learned the basic foundations of time series analysis. -We have defined what a time series is and delve into a particular method: The exponential smoothing. -After building up an intuition of how the simple exponential smoothing works, we continued to introduce more complex versions of the method as the various problems we set ourselves required it. -The simple, trended and seasonality methods were presented and coded, generating in that way a much greater understanding of how they work. +In this chapter, we have learned the basic foundations of time series analysis. +We have defined what a time series is and delved into a particular method, the exponential smoothing, that allows us to take into account the most distant values of our data. +Finally, we explained more complex versions of the method and used them to make predictions in different kinds of scenarios. +When the processes followed a marked tendency, we used the trend method and the damped trend method to make long term predictions. +When the process was highly correlated with the seasonality of the year, like the quantity of air passengers in Australia, we utilized the Holt-Winters’ seasonal method. ### References diff --git a/docs/13_time_series.jl.html b/docs/13_time_series.jl.html index 379246aa..1d0d16df 100644 --- a/docs/13_time_series.jl.html +++ b/docs/13_time_series.jl.html @@ -149,137 +149,137 @@ -

    Predicting the future

    -
    5.7 μs

    Let's imagine this situation for a moment. We are sitting quietly in our house thinking about the beauty of data analysis, when an old friend calls us: Terry. We haven't seen him for a long time, but he tells us that he is forming an important investment fund for which he needs to get some experts in statistics.

    +

    Predicting the future

    +
    6.2 μs

    Let's imagine this situation for a moment. We are sitting quietly in our house thinking about the beauty of data analysis, when an old friend calls us: Terry. We haven't seen him for a long time, but he tells us that he is forming an important investment fund for which he needs to get some experts in statistics.

    The idea is simple, he needs to generate predictive models of the supply and demand of certain commodities so that later, with this precious information, he can determine if the prices of those products are going to go down or up. Then you simply buy long or short positions (you don't need to know much about that, basically you are betting on that prices are going to go up or down, respectively) on the stock market, hoping to make a lot of money.

    With respect to the data, what you have available are long series with all the values that were taken by the supply and demand over time. In other words, we have time series.

    This type of data has the particularity that its values are correlated since they are values of the same variable that changes over time. For example, the total amount of energy demanded by a nation today is closely related to yesterday's demand, and so on.

    For a moment we are paralyzed and realize that we never encounter this type of problem. We take a breath for a few seconds and remember everything we learned about Bayesianism, about neural networks, about dynamic systems. It may be complex, but we will certainly succeed.

    We tell Terry we're in. He smiles and gives us the first series: Peanut Chocolate

    -
    27.8 μs
    12.5 s
    +
    17.5 μs
    10.5 s
    - + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    9.3 s

    As we just said: Notice that the variables they are asking us to work with are values that are evolving over time, so they take the name of Time Series. This kind of variables are the ones we are going to deal with in this chapter and have the particularity that, as they evolve in time, the values they take are related to the previous ones. So can you think of something to solve the problem of predicting the next value the series is going to have?

    +
    10.3 s

    As we just said: Notice that the variables they are asking us to work with are values that are evolving over time, so they take the name of Time Series. This kind of variables are the ones we are going to deal with in this chapter and have the particularity that, as they evolve in time, the values they take are related to the previous ones. So can you think of something to solve the problem of predicting the next value the series is going to have?

    Many ideas must be coming to your mind. Surely some of them thought of taking as a forecast value the average of all the previous ones and some others have thought of taking directly the last value of the series, justifying that the very old values do not affect the next ones. That is to say:

    -
    1.5 ms

$y_{T+1|T} = \frac{1}{T}\sum_{t=1}^{T} y_t$

    +
    1.4 ms

$y_{T+1|T} = \frac{1}{T}\sum_{t=1}^{T} y_t$

    Where T is the number of periods for which we have data. Or:

    -
    7.1 μs

$y_{T+1|T} = y_T$

    -
    3.3 μs

    Where we would always take the last value in the series to predict the next.

    +
    7.0 μs

$y_{T+1|T} = y_T$

    +
    4.5 μs

    Where we would always take the last value in the series to predict the next.

    If we observe this carefully, we might realize that these are two extreme cases of the same methodology: assigning "weights" to the previous values to predict the next one. Basically this is the way we have to indicate how much we are interested in the old observations and how much in the new ones.

    In the case of simple averaging, we would be saying that we care exactly the same about all the observations, since from the first to the last they are all multiplied by the same weights:

$y_{T+1|T} = \frac{1}{T}y_1 + \frac{1}{T}y_2 + ... + \frac{1}{T}y_T$

    @@ -293,154 +293,154 @@

    Exponential Smoothing

    The name may sound very complicated or crazy to you. The reality is that it is very simple. Basically, what we propose is to assign weights that are decreasing exponentially as the observations are getting older, getting to give a preponderant value to the closest values, but without resigning the valuable information that the previous values offer us:

$y_{T+1|T} = \alpha y_T + \alpha(1-\alpha) y_{T-1} + \alpha(1-\alpha)^2 y_{T-2} + ... + \alpha(1-\alpha)^{T-1} y_1$

    That is:

    -

$y_{T+1|T} = \sum_{i=0}^{T-1} \alpha(1-\alpha)^{i} y_{T-i}$

    +

$y_{T+1|T} = \sum_{i=0}^{T-1} \alpha(1-\alpha)^{i} y_{T-i}$

    Watch these two formulas for a while until you make sure you understand that they are the same :)

This way of writing the method is especially useful because it allows us to regulate how much weight we want to assign to past values. What does this mean? That we can control how quickly the value of the weights will decline over time. As a general rule, with alpha values close to 1, a lot of weight will be given to the most recent values and the weights will decay very quickly, while with alpha getting close to 0 the decay will be smoother. Let's see it:

    -
    15.9 μs
    5.8 μs
    +
    13.7 μs
    5.3 μs
    - + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + - - - - - - - - - - - - + + + + + + + + + + + - - - - - - - - - - - - + + + + + + + + + + + - - - - - - -
    3.5 s

    As we said, as alpha gets closer to one, the importance of the closer values is greater and greater and when it gets closer to zero the opposite happens, that is, the importance is "more distributed" among all the observations. For example, in the case of choosing alpha = 0.5 we would have:

    + + + + +
    3.5 s

    As we said, as alpha gets closer to one, the importance of the closer values is greater and greater and when it gets closer to zero the opposite happens, that is, the importance is "more distributed" among all the observations. For example, in the case of choosing alpha = 0.5 we would have:

$y_{T+1|T} = 0.5y_T + 0.25y_{T-1} + 0.125y_{T-2} + 0.0625y_{T-3} + ...$

    Well, then it seems we have a method that is good for estimating this kind of time series. But you're probably wondering... How do we choose the optimal alpha?

The basic strategy is to go through the entire time series, predicting each of its values from the previous ones, and then look for the alpha that minimizes the difference between the predictions and the data. Let's see it. For this, we have to introduce two new ways of writing the same method.

    @@ -460,7 +460,7 @@

    Weighted average and Component form

$y_{3|2} = \alpha y_2 + (1-\alpha)(\alpha y_1 + (1-\alpha) l_0)$

$y_{3|2} = \alpha y_2 + (1-\alpha)\alpha y_1 + (1-\alpha)^2 l_0$

    ...

    -

$y_{T+1|T} = \sum_{i=0}^{T-1} \alpha(1-\alpha)^{i} y_{T-i} + (1-\alpha)^T l_0$

    +

$y_{T+1|T} = \sum_{i=0}^{T-1} \alpha(1-\alpha)^{i} y_{T-i} + (1-\alpha)^T l_0$

And as $(1-\alpha)^T$ decays exponentially, even for moderately large values of T this term becomes practically zero. So we obtain the same prediction formula as before.

    Well, this is very good! We already have a way to go through the whole time series and make the predictions. This will be very useful since we can make these predictions for different alpha values and observe which one best approximates the whole series. Once this optimal alpha is obtained in this fitting process, we will be able to make the prediction for the future period.

    Finally, there is one last way to define our models called Component form

    @@ -469,382 +469,382 @@

    Weighted average and Component form

    In the case of the Simple Exponential Smoothing it is identical to the Weighted average form, but it will make our work easier later when we want to make the analysis more complex.

    Optimization (or Fitting) Process

    Let's start by seeing how using different alpha values we obtain different predicted value curves, defining a function that will make us the predictions of each value in the time series given a starting point (l0) and an alpha:

    -
    24.2 μs
    SES_weight (generic function with 1 method)
    38.7 μs

    Let's read the above algorithm together to make sure we understand that you are applying the formula explained above.

    +
    18.7 μs
    SES_weight (generic function with 1 method)
    35.8 μs

    Let's read the above algorithm together to make sure we understand that you are applying the formula explained above.

    The function SES_weight receives three parameters: The alpha to make the calculation, the value of the first lo prediction, and the time series in question.

    The algorithm begins by obtaining the number of points that the time series has and defining a vector in which we will be depositing all the predicted values of each point of it. Then it begins to iterate on the time series applying the formula mentioned above:

For the first predicted value, $y_{1|0} = y_{pred} = l_0$, and for the following ones we apply the formula $y_{t+1|t} = \alpha y_t + (1-\alpha) y_{t|t-1}$. Makes sense, right?

    Then let's get to work and see how our prediction curve fits the real data.

    -
    8.8 μs
    3.8 μs
    +
    11.8 μs
    3.5 μs
    - + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    275 ms

    This is a very cool graphic to look at. It illustrates very clearly how small alphas look mostly like an average of the time series values and, as it starts to get closer to 1, it looks more like taking the last value in the series as a future prediction. As we said at the beginning of the chapter, this method allows us to find intermediate points between the two extremes.

    + +
    244 ms

    This is a very cool graphic to look at. It illustrates very clearly how small alphas look mostly like an average of the time series values and, as it starts to get closer to 1, it looks more like taking the last value in the series as a future prediction. As we said at the beginning of the chapter, this method allows us to find intermediate points between the two extremes.

    It's very nice to see how the graphs change as the alpha does... but how do we find the best alpha and l0 so that the fit is the best?

    Loss functions

    Somehow we have to be able to quantify how close the predicted curve is to the actual data curve. A very elegant way to do it is defining an error function or loss function which will return, for a certain value of the parameters to be optimized, a global number that tells us just how similar both curves are.

    @@ -856,93 +856,93 @@

    Loss functions

$SSE = \sum_{t=1}^{T}(y_t - y_{t|t-1})^2 = \sum_{t=1}^{T} e_t^2$

$SAE = \sum_{t=1}^{T}|y_t - y_{t|t-1}| = \sum_{t=1}^{T} |e_t|$

    Great! Now that we have a logical and quantitative methodology to determine how well our model is fitting the data, all that remains is to implement it. Let's go for it!

    -
    48.5 μs
    SES_weight_loss (generic function with 1 method)
    39.7 μs

    As you can see in the code we take the sum of errors squared as an error function. This algorithm is very similar to the previous one (with which we obtained each predicted point) only that now instead of saving the value, we compute its residual with the real data and we add it to the variable "loss".

    +
    14.4 μs
    SES_weight_loss (generic function with 1 method)
    38.5 μs

    As you can see in the code we take the sum of errors squared as an error function. This algorithm is very similar to the previous one (with which we obtained each predicted point) only that now instead of saving the value, we compute its residual with the real data and we add it to the variable "loss".

    As a result we have a function that has as a parameter a time series, alpha and lo; and returns a number that is telling us the general error. Obviously that is the number we want to minimize.

    Let's see how the function behaves if we leave fixed lo (actually it is a value that we have to find with the minimization, but we know that it has to be near to the first value of the series) and we change alpha.

    -
    7.5 μs
    +
    7.6 μs
    - + - - + - - + - - - - - - - - - - - - - - - - - - - - - -
    232 ms

    It is really very subtle, but the error function is not strictly decreasing. In fact, somewhere between 0.8 and 0.9 the function starts to grow again, so the minimum is there.

    +
    78.8 ms

    It is really very subtle, but the error function is not strictly decreasing. In fact, somewhere between 0.8 and 0.9 the function starts to grow again, so the minimum is there.

For this kind of problem, Julia has a package to optimize functions:

    -
    3.5 μs
    12.5 s
    SES_loss_ (generic function with 2 methods)
    24.5 μs
     * Status: success
    +
    4.4 μs
    4.7 s
    SES_loss_ (generic function with 2 methods)
    25.2 μs
     * Status: success
     
      * Candidate solution
         Final objective value:     1.423677e+04
    @@ -967,260 +967,260 @@ 

    Loss functions

    Algorithm: Fminbox with L-BFGS * Convergence measures - |x - x'| = 0.00e+00 ≤ 0.0e+00 - |x - x'|/|x'| = 0.00e+00 ≤ 0.0e+00 + |x - x'| = 1.38e-10 ≰ 0.0e+00 + |x - x'|/|x'| = 3.09e-13 ≰ 0.0e+00 |f(x) - f(x')| = 0.00e+00 ≤ 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00 - |g(x)| = 6.01e-07 ≰ 1.0e-08 + |g(x)| = 6.73e-10 ≤ 1.0e-08 * Work counters - Seconds run: 3 (vs limit Inf) - Iterations: 4 - f(x) calls: 171 - ∇f(x) calls: 171 -
    2.8 s

    To use this function more efficiently it is necessary to define a range for the parameters in which the algorithm will perform the search and also a starting point (obviously within that range).

    + Seconds run: 1 (vs limit Inf) + Iterations: 6 + f(x) calls: 347 + ∇f(x) calls: 347 +
    1.7 s

    To use this function more efficiently it is necessary to define a range for the parameters in which the algorithm will perform the search and also a starting point (obviously within that range).

    Also, one trick to keep in mind is that this package accepts "univariate" functions, that is, the function you enter only has to have one parameter to optimize. This is not entirely true since, although only one parameter has to be passed, it can be a vector, so that several parameters can be optimized. This is why we define a wrapper function SES_loss_ that facilitates the calculation.

    With everything ready, let's look for the values of alpha and lo that minimize our error function:

    -
    5.7 μs
    optim
    2.9 ms

    And this is how we came to fit our model, obtaining the best parameters to try to predict the next sales of our beloved peanut chocolate

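One possible sketch of the forecasting step (the names chocolate_sales, best_alpha, and best_l0 carry over from the snippets above and are assumptions):

function SES_weight_forecast(time_serie, alpha, l0, n_pred)
    level = l0
    for y in time_serie
        level = alpha * y + (1 - alpha) * level   # smooth the level over the whole series
    end
    return vcat(time_serie, fill(level, n_pred))  # the forecast is flat: the last level repeated
end

forecast = SES_weight_forecast(chocolate_sales, best_alpha, best_l0, 10)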

    As you can see, the simple exponential smoothing method only gives us a forward value. In other words, it predicts a constant value into the future. This happens because these types of time series have no latent variables defined, such as trend or seasonality. These are variables that add information to the model and allow us to make different predictions for each time we want to predict.


But do not worry about that for now, we will study it in depth shortly. For now, let's see what the prediction would look like.


    Perfect! We already have an initial model to attack problems where there is a lot of variability. But this is not always the case.


That same night we talked to Terry and showed him the progress, and he loves the direction we are going. He tells us that he has another time series to analyze and that it has a different behavior than the previous one; apparently this one shows a trend in the values...

    Let's see:

    Trend Methods

    Now that we have built up an intuition of how Simple Exponential Smoothing logic works, wouldn't it be good to have some additional method that allows us to learn if our time series has a certain tendency?

But what does this mean in the first place? It means that there are some processes that inherently, by their nature, follow a marked tendency beyond the random noise.

For example, if we take the number of air passengers (the AirPassengers data) in Australia from 1990 to 2016, we find this graph:


In this type of problem, it would not make any sense for all the predictions to be constant. Here there is more latent information that we can extract from the data: the trend.


    This will be key since, once obtained, it will allow us to generate new values as we want to make more distant forecasts in time.

    But how do we include this in our exponential smoothing model?

    Holt’s linear trend method


On the one hand, alpha will weight the value of the previous real observation $y_{t-1}$ and the one predicted using the prediction equation with the previous values of $l$ and $b$, to obtain the actual value of the level $l_t$.

On the other hand, the beta value tells us how much we are going to let the value of the slope be modified. This value weights the current slope found, $l_t - l_{t-1}$, against the slope estimated at the previous point, to calculate the estimation of the slope in the current period. In this way, small beta values indicate that the slope is unlikely to change over time, and high values allow the slope to change freely (the value of the "current" slope $l_t - l_{t-1}$ becomes preponderant in the estimation).

    With this method, then, forecasts stop being flat to become trended. With this idea in mind, let's translate the math into code again:

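A hedged sketch of such a loss function, using the standard Holt recursion (the notebook's HLT_loss may differ in its exact indexing and argument order):

function HLT_loss(time_serie, alpha, beta, l0, b0)
    loss  = 0.0
    level = l0          # initial level
    trend = b0          # initial slope, needed to predict the very first value
    for y in time_serie
        pred  = level + trend                    # one-step-ahead forecast
        loss += (y - pred)^2
        new_level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    end
    return loss
end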

This function is doing exactly the same as SES_weight_loss, which we defined earlier. It is worth clarifying that, like that one, the method needs an initial slope to make the estimation of the first value of the time series. Let's see which parameters optimize the model with the data we have!

HLT_loss_ (generic function with 2 methods)

 * Status: success

 * Candidate solution
    Final objective value:     1.284222e+02

 * Algorithm: Fminbox with L-BFGS

As with "SES", we define a wrapper function to be able to perform the optimization. The minimum and maximum values of lo are obtained with the same criterion: the optimal value has to be close to the first value of the time series, since it is going to be its estimation. With respect to the minimum and maximum values of the slope, the criterion is again looking into the data: if you look at the graph of the time series, you can clearly see that the slope has to be between 1 and 3 (you can see that every 5 years that pass, the number of passengers increases by 10, more or less).


Let's see the optimal values for the parameters:


    Perfect! Now that we have the optimal parameters to perform the forecast, we just need to define a function that performs it. For example:

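A sketch along the lines described in the next paragraphs (parameter names and order are assumptions):

function HLT_forecast(time_serie, alpha, beta, l0, b0, n_pred)
    level, trend = l0, b0
    # first loop: run the smoothing equations over the entire series
    for y in time_serie
        new_level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    end
    # second loop: project the last level forward, adding one slope per future period
    preds = [level + trend * i for i in 1:n_pred]
    return vcat(time_serie, preds)
end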

    As you can see in the function, the first part of it (the first for) goes through the entire time series using the parameters already optimized and making the best predictions for each point.


Then, when we reach the end of the time series, the second "for" begins (it iterates over the number of periods we want to predict, a value that we pass in as "n_pred") to make forecasts for periods that have not yet happened. To do this, it simply uses the last "level" that was estimated for the last value of the time series, and adds up as many slopes as periods we want: $y_{pred} = l_t + b_t \cdot i$

    Finally, it returns a concatenation of the time series plus the values we ask it to predict.


    And so it is. We already built a tool that allows us to make predictions for variables that show a trend.


But surely you are thinking that assuming the trend is going to be maintained during all the years we are forecasting is a bit excessive? And it's true that it is.

It is known that this type of method usually overestimates the values of the variable to predict, precisely because it assumes that the tendency continues.

An improvement of this method that helps to deal with this problem is the damped trend method. Basically, what it does is add a coefficient that flattens the curve as we make more distant predictions in time. This improvement usually makes better predictions than the plain trend method, leaving the formulas as:
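In the standard damped formulation (as given in Forecasting: Principles and Practice, chapter 7), a damping parameter $\phi$, with $0 < \phi < 1$, is introduced:

$\hat{y}_{t+h|t} = l_t + (\phi + \phi^2 + ... + \phi^h)\, b_t$

$l_t = \alpha y_t + (1 - \alpha)(l_{t-1} + \phi\, b_{t-1})$

$b_t = \beta (l_t - l_{t-1}) + (1 - \beta)\, \phi\, b_{t-1}$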


For example, with $l_T = 300$, $b_T = 1.5$ and $\phi = 0.9$, the two-step-ahead forecast is:

$\hat{y}_{T+2|T} = 300 + 0.9 \cdot 1.5 + 0.9^2 \cdot 1.5 = 300 + 0.9 \cdot 1.5 + 0.81 \cdot 1.5 = 302.565$

The process continues in this way until the value of $\phi^h$ is practically zero, that is to say that, although we keep going further into future periods, no significant term is added to the forecast anymore. This is why it is said that the damped method tends to make flat forecasts in the long term.

Finally, let's note that the damping parameter takes values between 0 and 1, being completely identical to the simple trend method for a value of 1 and completely flat for a value of 0. Let's see:

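A possible sketch of the damped forecast (function and parameter names are assumptions); note that phi = 1 recovers the plain trend forecast, while values near 0 flatten it quickly:

function Damped_HLT_forecast(time_serie, alpha, beta, phi, l0, b0, n_pred)
    level, trend = l0, b0
    for y in time_serie
        new_level = alpha * y + (1 - alpha) * (level + phi * trend)
        trend = beta * (new_level - level) + (1 - beta) * phi * trend
        level = new_level
    end
    # each horizon h adds a further damped slope: (phi + phi^2 + ... + phi^h) * trend
    preds = [level + sum(phi^j for j in 1:h) * trend for h in 1:n_pred]
    return vcat(time_serie, preds)
end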

    Incredible! Terry calls us and tells us that using the "damped" model was key to a trade he made. He is very excited about our work and gives us one last challenge.


    He tells us that he is trying to enter more complex markets, particularly tourism, but that he doesn't understand how to approach this type of series since they show a lot of variability. For example, we start analysing the visitor nights in Australia spent by international tourists:


    As you can see, the series shows a lot of variability. The values go up and down constantly.


After spending a long time with Terry looking for ideas to address this type of data, we realize that these ups and downs are not random; indeed, the pattern repeats year after year! We are facing a problem with seasonality.

    Seasonality Methods

For this type of model, in addition to the level and trend parameters, it is necessary to add another component that captures the time of year we are in (actually, it can be any other period in which seasonality occurs) and somehow influences the predicted outcome.


    Holt-Winters’ seasonal additive method

    So, the model becomes like this:

$\hat{y}_{T+h|T} = l_t + h\, b_t + s_{t+h-m(k+1)}$

$l_t = \alpha (y_t - s_{t-m}) + (1 - \alpha)(l_{t-1} + b_{t-1})$

$b_t = \beta (l_t - l_{t-1}) + (1 - \beta)\, b_{t-1}$

$s_t = \gamma (y_t - l_{t-1} - b_{t-1}) + (1 - \gamma)\, s_{t-m}$

As we had already anticipated, the seasonality term is added to the forecast equation. The term $k$ is the integer part of $(h-1)/m$ and its function is to ensure that we always make predictions with the values of the parameters in the last year of the time series we have as data.

The level equation is still a weighted average, only now the seasonal component is removed from the observation, giving the seasonally adjusted value $(y_t - s_{t-m})$. The other part of the average is the non-seasonal forecast $(l_{t-1} + b_{t-1})$.

The trend equation remains the same, and the seasonality equation also represents a weighted average between the current index $(y_t - l_{t-1} - b_{t-1})$ and the previous year's index for the same season, $s_{t-m}$. This average is weighted by the $\gamma$ parameter.

So now, let's put these weird equations into code:

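A hedged sketch of what that code could look like, following the equations above (the parameter names, and the use of a length-m vector s0 holding the initial seasonal indices, are assumptions):

function HW_Seasonal(time_serie, alpha, beta, gamma, l0, b0, s0, m)
    level, trend = l0, b0
    season = float.(s0)                # m initial seasonal indices: the "year 0"
    pred = Float64[]
    for (t, y) in enumerate(time_serie)
        s_old = season[mod1(t, m)]                      # index estimated one seasonal cycle ago
        push!(pred, level + trend + s_old)              # prediction of the next point (h = 1)
        new_level = alpha * (y - s_old) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        season[mod1(t, m)] = gamma * (y - level - trend) + (1 - gamma) * s_old
        level, trend = new_level, new_trend
    end
    return pred
end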

    What this code does is, given the optimal values of each parameter, it makes the prediction of the next point (h = 1) and stores them in the "pred" array. Staying a while analyzing how the algorithm is working is an excellent way to ensure full understanding of the method :)


To obtain these parameters, it is necessary to write the loss function as we have been doing for the previous methods. As a challenge for the reader, we propose to write this loss function using this one as a basis. To help you, you can look at the intimate relationship between the functions that store the predictions and the loss functions already written for the previous methods.

In this particular case, in which our data is quarterly, m = 4. Doing the same procedure as always to optimize the function, we obtain:


    It is interesting to stop and look at these values that we obtained.


First of all, it is remarkable how, for the first time, the parameter α takes a relatively low value. This makes perfect sense, as the values are now much more connected to values further away than to the immediately previous one. They are connected precisely because of their seasonality.

It is also interesting to note that for the initial values of the seasonality not just one value is needed, but four. In general, as many values as m, the frequency of the seasonality, will be needed. This can be seen as needing a whole "year 0" to make the estimates for the first year of the time series.

    Now, let's see the function in action:


As you can see, the fit is very good. It is also interesting how you can appreciate the exponential smoothing in the step between the valley and the peak: at first the series doesn't show it, and therefore neither does the fitted series, but as it appears, the model learns it and starts to forecast it as well.


    Excellent! Now that we have our model, it's time to use it and call Terry to tell him what actions to take in his trading strategy:

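A sketch of how the out-of-sample forecast could extend the previous function (again, names and argument order are assumptions): after smoothing the whole series, each future period reuses the last level and trend plus the latest seasonal index of the corresponding quarter.

function HW_Seasonal_forecast(time_serie, alpha, beta, gamma, l0, b0, s0, m, n_pred)
    level, trend = l0, b0
    season = float.(s0)
    for (t, y) in enumerate(time_serie)
        s_old = season[mod1(t, m)]
        new_level = alpha * (y - s_old) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        season[mod1(t, m)] = gamma * (y - level - trend) + (1 - gamma) * s_old
        level, trend = new_level, new_trend
    end
    T = length(time_serie)
    # future period T + h gets h slopes plus the most recent index for its season
    preds = [level + h * trend + season[mod1(T + h, m)] for h in 1:n_pred]
    return vcat(time_serie, preds)
end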

    Well, good! We started this chapter not knowing how to tackle time series forecasting problems and ended up building a wide variety of models for different types of data, all while making our friend Terry a lot of money!


    As a final summary, when dealing with a time series it is very important to be able to define if it has any latent variables such as trend or seasonality. Once we can find that underlying information, we will be able to generate forecasts with confidence. We just need to look deeper.

    Summary


    In this chapter, we have learned the basic foundations of time series analysis. We have defined what a time series is and delved into a particular method, the exponential smoothing, that allows us to take into account the most distant values of our data. Finally, we explained more complex versions of the method and used them to make predictions in different kinds of scenarios. When the processes followed a marked tendency, we used the trend method and the damped trend method to make long term predictions. When the process was highly correlated with the seasonality of the year, like the quantity of air passengers in Australia, we utilized the Holt-Winters’ seasonal method.

    References

From 330c766c57359be37dd2e4f9596d57bcc9ebf15a Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Thu, 18 Mar 2021 16:35:34 -0300 Subject: [PATCH 4/6] corrections --- 13_time_series/13_time_series.jl | 6 +- docs/13_time_series.jl.html | 1056 +++++++++++++++--------------- 2 files changed, 531 insertions(+), 531 deletions(-) diff --git a/13_time_series/13_time_series.jl b/13_time_series/13_time_series.jl index 01b737b4..563fd3bf 100644 --- a/13_time_series/13_time_series.jl +++ b/13_time_series/13_time_series.jl @@ -797,11 +797,11 @@ As a final summary, when dealing with a time series it is very important to be a ### Summary -In this chapter, we have learned the basic foundations of time series analysis. -We have defined what a time series is and delved into a particular method, the exponential smoothing, that allows us to take into account the most distant values of our data. +In this chapter, we learned the basic foundations of time series analysis. +We defined what a time series is and delved into a particular method, the exponential smoothing, that allows us to take into account the most distant values of our data. Finally, we explained more complex versions of the method and used them to make predictions in different kinds of scenarios. When the processes followed a marked tendency, we used the trend method and the damped trend method to make long term predictions. -When the process was highly correlated with the seasonality of the year, like the quantity of air passengers in Australia, we utilized the Holt-Winters’ seasonal method. +When the process exhibited seasonal trends, we utilized the Holt-Winters’ seasonal method. ### References diff --git a/docs/13_time_series.jl.html b/docs/13_time_series.jl.html index 1d0d16df..04596a3f 100644 --- a/docs/13_time_series.jl.html +++ b/docs/13_time_series.jl.html @@ -149,137 +149,137 @@ -


We tell Terry we're in. He smiles and gives us the first series: Peanut Chocolate


As we just said, notice that the variables they are asking us to work with are values that evolve over time, so they take the name of time series. This kind of variable is the one we are going to deal with in this chapter; they have the particularity that, as they evolve in time, the values they take are related to the previous ones. So, can you think of something to solve the problem of predicting the next value the series is going to take?


Many ideas must be coming to your mind. Surely some of you thought of taking the average of all the previous values as the forecast, and some others thought of taking directly the last value of the series, arguing that the very old values do not affect the next ones. That is to say:


$\hat{y}_{T+1|T} = \frac{1}{T}\sum_{t=1}^{T} y_t$

Where T is the number of periods for which we have data. Or:


$\hat{y}_{T+1|T} = y_T$


Where we would always take the last value in the series to predict the next.


If we observe this carefully, we might realize that these are two extreme cases of the same methodology: assigning "weights" to the previous values to predict the next one. Basically this is the way we have to indicate how much we are interested in the old observations and how much in the new ones.

In the case of simple averaging, we would be saying that we care exactly the same about all the observations, since from the first to the last they are all multiplied by the same weights:

$\hat{y}_{T+1|T} = \frac{1}{T} y_1 + \frac{1}{T} y_2 + ... + \frac{1}{T} y_T$


Exponential Smoothing

$\hat{y}_{T+1|T} = \sum_{i=0}^{T-1} \alpha (1-\alpha)^i \, y_{T-i}$

Watch these two formulas for a while until you make sure you understand that they are the same :)

This way of writing the method is especially useful because it allows us to regulate how much weight we want to assign to the past values. What does this mean? That we can control how quickly the value of the weights will decline over time. As a general rule, with alpha values close to 1, a lot of weight will be given to the recent values and the weights will decay very quickly, and with alpha getting close to 0 the decay will be smoother. Let's see it:

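A quick sketch of the idea (a hypothetical helper, not from the notebook): the weight given to the observation i steps in the past is alpha * (1 - alpha)^i.

weights(alpha, n) = [alpha * (1 - alpha)^i for i in 0:n-1]

weights(0.8, 5)   # ≈ [0.8, 0.16, 0.032, 0.0064, 0.00128]  -> decays very fast
weights(0.2, 5)   # ≈ [0.2, 0.16, 0.128, 0.1024, 0.08192]  -> decays much more smoothly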

As we said, as alpha gets closer to one, the importance of the closer values is greater and greater and when it gets closer to zero the opposite happens, that is, the importance is "more distributed" among all the observations. For example, in the case of choosing alpha = 0.5 we would have:


$\hat{y}_{T+1|T} = 0.5\, y_T + 0.25\, y_{T-1} + 0.125\, y_{T-2} + 0.0625\, y_{T-3} + ...$

Well, then it seems we have a method that is good for estimating this kind of time series. But you're probably wondering... How do we choose the optimal alpha?

The basic strategy is to go through the entire time series, predicting each of its values from the previous ones, and then look for the alpha that minimizes the difference between them. Let's see it. For this, we have to introduce two new ways of writing the same method.


Weighted average and Component form

In the case of the Simple Exponential Smoothing it is identical to the Weighted average form, but it will make our work easier later when we want to make the analysis more complex.

Optimization (or Fitting) Process

Let's start by seeing how, using different alpha values, we obtain different predicted value curves, defining a function that will make the predictions of each value in the time series given a starting point ($l_0$) and an alpha:

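A minimal sketch of the function described below (the notebook's own SES_weight may differ in its details):

function SES_weight(alpha, l0, time_serie)
    N = length(time_serie)
    pred = Vector{Float64}(undef, N)   # one prediction per point in the series
    pred[1] = l0                       # the first prediction is just the initial level
    for t in 1:N-1
        pred[t+1] = alpha * time_serie[t] + (1 - alpha) * pred[t]
    end
    return pred
end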

Let's read the above algorithm together to make sure we understand that it is applying the formula explained above.


The function SES_weight receives three parameters: the alpha used for the calculation, the value $l_0$ of the first prediction, and the time series in question.

The algorithm begins by obtaining the number of points that the time series has and defining a vector in which we will be depositing all the predicted values of each point of it. Then it begins to iterate on the time series applying the formula mentioned above:

For the first predicted value, $\hat{y}_{1|0} = y_{pred} = l_0$, and for the following ones we apply the formula $\hat{y}_{t+1|t} = \alpha y_t + (1-\alpha) \hat{y}_{t|t-1}$. Makes sense, right?

Then let's get to work and see how our prediction curve fits the real data.


This is a very cool graphic to look at. It illustrates very clearly how small alphas look mostly like an average of the time series values and, as it starts to get closer to 1, it looks more like taking the last value in the series as a future prediction. As we said at the beginning of the chapter, this method allows us to find intermediate points between the two extremes.


It's very nice to see how the graphs change as the alpha does... but how do we find the best alpha and l0 so that the fit is the best?

Loss functions

Somehow we have to be able to quantify how close the predicted curve is to the actual data curve. A very elegant way to do it is defining an error function or loss function which will return, for a certain value of the parameters to be optimized, a global number that tells us just how similar both curves are.


$SSE = \sum_{t=1}^{T} (y_t - \hat{y}_{t|t-1})^2 = \sum_{t=1}^{T} e_t^2$

$SAE = \sum_{t=1}^{T} |y_t - \hat{y}_{t|t-1}| = \sum_{t=1}^{T} |e_t|$

Great! Now that we have a logical and quantitative methodology to determine how well our model is fitting the data, all that remains is to implement it. Let's go for it!

+26.8 μs From a6a8d7c09074ae4dbd3fdc70476cae7fbfb432db Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Mon, 22 Mar 2021 21:15:49 -0300 Subject: [PATCH 5/6] Added feedback message --- 13_time_series/13_time_series.jl | 12 + docs/13_time_series.jl.html | 1065 +++++++++++++++--------------- 2 files changed, 549 insertions(+), 528 deletions(-) diff --git a/13_time_series/13_time_series.jl b/13_time_series/13_time_series.jl index 563fd3bf..b4cdc161 100644 --- a/13_time_series/13_time_series.jl +++ b/13_time_series/13_time_series.jl @@ -809,6 +809,17 @@ When the process exhibited seasonal trends, we utilized the Holt-Winters’ seas " +# ╔═╡ 8ade9490-8b6c-11eb-30f5-89ca2fab1b48 +md" ### Give us feedback + + +This book is currently in a beta version. We are looking forward to getting feedback and criticism: + * Submit a GitHub issue **[here](https://github.com/unbalancedparentheses/data_science_in_julia_for_hackers/issues)**. + * Mail us to **martina.cantaro@lambdaclass.com** + +Thank you! +" + # ╔═╡ Cell order: # ╟─01cbe438-34d1-11eb-087b-b5294ea7b996 # ╟─477dbb82-34d1-11eb-13f4-41b080ce2e00 @@ -874,3 +885,4 @@ When the process exhibited seasonal trends, we utilized the Holt-Winters’ seas # ╠═cb92e0c2-3f13-11eb-35eb-397810833061 # ╠═4c144296-3f13-11eb-16aa-839e299d6a63 # ╟─ec31dcb0-3fa7-11eb-102d-a3b7787cdf1d +# ╟─8ade9490-8b6c-11eb-30f5-89ca2fab1b48 diff --git a/docs/13_time_series.jl.html b/docs/13_time_series.jl.html index 04596a3f..da7ae303 100644 --- a/docs/13_time_series.jl.html +++ b/docs/13_time_series.jl.html @@ -149,137 +149,137 @@ -

Predicting the future

-
7.4 μs

Let's imagine this situation for a moment. We are sitting quietly in our house thinking about the beauty of data analysis, when an old friend calls us: Terry. We haven't seen him for a long time, but he tells us that he is forming an important investment fund for which he needs to get some experts in statistics.

+

Predicting the future

+
6.3 μs

Let's imagine this situation for a moment. We are sitting quietly in our house thinking about the beauty of data analysis, when an old friend calls us: Terry. We haven't seen him for a long time, but he tells us that he is forming an important investment fund for which he needs to get some experts in statistics.

The idea is simple, he needs to generate predictive models of the supply and demand of certain commodities so that later, with this precious information, he can determine if the prices of those products are going to go down or up. Then you simply buy long or short positions (you don't need to know much about that, basically you are betting on that prices are going to go up or down, respectively) on the stock market, hoping to make a lot of money.

With respect to the data, what you have available are long series with all the values that were taken by the supply and demand over time. In other words, we have time series.

This type of data has the particularity that its values are correlated since they are values of the same variable that changes over time. For example, the total amount of energy demanded by a nation today is closely related to yesterday's demand, and so on.

For a moment we are paralyzed and realize that we never encounter this type of problem. We take a breath for a few seconds and remember everything we learned about Bayesianism, about neural networks, about dynamic systems. It may be complex, but we will certainly succeed.

We tell Terry we're in. He smiles and gives us the first series: Peanut Chocolate

-
15.5 μs
9.0 s
+
12.1 μs
9.8 s
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
9.3 s

As we just said: Notice that the variables they are asking us to work with are values that are evolving over time, so they take the name of Time Series. This kind of variables are the ones we are going to deal with in this chapter and have the particularity that, as they evolve in time, the values they take are related to the previous ones. So can you think of something to solve the problem of predicting the next value the series is going to have?

+
10.1 s

As we just said: Notice that the variables they are asking us to work with are values that are evolving over time, so they take the name of Time Series. This kind of variables are the ones we are going to deal with in this chapter and have the particularity that, as they evolve in time, the values they take are related to the previous ones. So can you think of something to solve the problem of predicting the next value the series is going to have?

Many ideas must be coming to your mind. Surely some of them thought of taking as a forecast value the average of all the previous ones and some others have thought of taking directly the last value of the series, justifying that the very old values do not affect the next ones. That is to say:

-
1.4 ms

yT+1|T=1Tt=1Tyt

+
1.9 ms

yT+1|T=1Tt=1Tyt

Where T is the number of periods for which we have data. Or:

-
6.8 μs

yT+1|T=yT

-
3.8 μs

Where we would always take the last value in the series to predict the next.

+
5.8 μs

yT+1|T=yT

+
4.1 μs

Where we would always take the last value in the series to predict the next.

If we observe this carefully, we might realize that these are two extreme cases of the same methodology: assigning "weights" to the previous values to predict the next one. Basically this is the way we have to indicate how much we are interested in the old observations and how much in the new ones.

In the case of simple averaging, we would be saying that we care exactly the same about all the observations, since from the first to the last they are all multiplied by the same weights:

yT+1|T=1Ty1+1Ty2+...+1TyT

@@ -296,151 +296,151 @@

Exponential Smoothing

yT+1|T=i=0T1α(1α)jyTi

Watch these two formulas for a while until you make sure you understand that they are the same :)

This way of writing the method is especially useful because it allows us to regulate how much weight we want to assign to the past values. What does this mean? That we can control how quickly the value of the weights will decline over time. As a general rule, with alpha values close to 1, a lot of weight will be given to the close values and the weights will decay very quickly, and with alpha getting close to 0 the decay will be smoother. Let´s see it:

-
11.3 μs
6.2 μs
+
51.0 μs
5.2 μs
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + - - - - - - - - - - - - + + + + + + + + + + + - - - - - - - - - - - - + + + + + + + + + + + - - - - - - -
3.0 s

As we said, as alpha gets closer to one, the importance of the closer values is greater and greater and when it gets closer to zero the opposite happens, that is, the importance is "more distributed" among all the observations. For example, in the case of choosing alpha = 0.5 we would have:

+ + + + +
3.5 s

As we said, as alpha gets closer to one, the importance of the closer values is greater and greater and when it gets closer to zero the opposite happens, that is, the importance is "more distributed" among all the observations. For example, in the case of choosing alpha = 0.5 we would have:

yT+1|T=0.5yT+0.25yT1+0.125yT2+0.0625yT3+...

Well, then it seems we have a method that is good for estimating this kind of time series. But you're probably wondering... How do we choose the optimal alpha?

The basic strategy is to go through the entire time series, predicting each of its values from the previous ones, and then look for the alpha that minimizes the difference between the predictions and the observed values. Let's see it. For this, we have to introduce two new ways of writing the same method.


Weighted average and Component form

In the case of Simple Exponential Smoothing the Component form is identical to the Weighted average form, but it will make our work easier later, when we want to make the analysis more complex.
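As a quick reminder (these are the standard forms, following the reference listed at the end of the chapter), the two ways of writing the method are:

Weighted average form:

$$y_{t+1|t} = \alpha y_t + (1-\alpha)\, y_{t|t-1}$$

Component form (forecast and smoothing equations):

$$y_{t+1|t} = l_t$$

$$l_t = \alpha y_t + (1-\alpha)\, l_{t-1}$$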

Optimization (or Fitting) Process

Let's start by seeing how different alpha values yield different curves of predicted values, defining a function that computes the prediction for each value of the time series, given a starting point (l0) and an alpha:

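Something like the following sketch would do the job (the exact signature and variable names are illustrative assumptions, not necessarily the notebook's exact code):

```julia
function SES_weight(α, l0, time_serie)
    # Number of points in the series and a vector to store each predicted value
    T = length(time_serie)
    pred = Vector{Float64}(undef, T)

    for t in 1:T
        if t == 1
            pred[t] = l0                                          # y_{1|0} = l0
        else
            pred[t] = α * time_serie[t-1] + (1 - α) * pred[t-1]   # y_{t|t-1} = α y_{t-1} + (1 - α) y_{t-1|t-2}
        end
    end

    return pred
end
```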

Let's read the above algorithm together to make sure we understand that it is applying the formula explained above.


The function SES_weight receives three parameters: the alpha used in the calculation, the value of the first prediction l0, and the time series in question.

The algorithm begins by obtaining the number of points in the time series and defining a vector in which we will store the predicted value for each point. Then it iterates over the time series, applying the formula mentioned above:

For the first predicted value, $y_{1|0} = l_0$, and for the following ones we apply the formula $y_{t+1|t} = \alpha y_t + (1-\alpha)\, y_{t|t-1}$. Makes sense, right?

Then let's get to work and see how our prediction curve fits the real data.

[Plot: the time series together with the SES-fitted curves for several values of alpha]

This is a very cool graphic to look at. It illustrates very clearly how small alphas look mostly like an average of the time series values and, as it starts to get closer to 1, it looks more like taking the last value in the series as a future prediction. As we said at the beginning of the chapter, this method allows us to find intermediate points between the two extremes.


It's very nice to see how the graphs change as the alpha does... but how do we find the best alpha and l0 so that the fit is the best?

Loss functions

Somehow we have to be able to quantify how close the predicted curve is to the actual data curve. A very elegant way to do it is to define an error function or loss function, which will return, for a given value of the parameters to be optimized, a single number that tells us how similar both curves are.


$$SSE = \sum_{t=1}^{T}(y_t - y_{t|t-1})^2 = \sum_{t=1}^{T} e_t^2$$

$$SAE = \sum_{t=1}^{T}|y_t - y_{t|t-1}| = \sum_{t=1}^{T} |e_t|$$

Great! Now that we have a logical and quantitative methodology to determine how well our model is fitting the data, all that remains is to implement it. Let's go for it!

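A minimal sketch of such a loss function, following the same conventions as SES_weight (again, names and signature are assumptions), could be:

```julia
function SES_weight_loss(α, l0, time_serie)
    T = length(time_serie)
    y_pred = l0        # prediction for the first point
    loss = 0.0

    for t in 1:T
        loss += (time_serie[t] - y_pred)^2              # squared residual of the current point
        y_pred = α * time_serie[t] + (1 - α) * y_pred   # prediction for the next point
    end

    return loss
end
```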

As you can see in the code, we take the sum of squared errors as the error function. This algorithm is very similar to the previous one (with which we obtained each predicted point), only that now, instead of saving the predicted value, we compute its residual with respect to the real data and add it to the variable "loss".


As a result, we have a function that takes a time series, alpha and l0 as parameters, and returns a number telling us the overall error. Obviously, that is the number we want to minimize.

Let's see how the function behaves if we leave l0 fixed (actually it is a value that we have to find with the minimization, but we know it has to be near the first value of the series) and vary alpha.

[Plot: the loss as a function of alpha, with l0 fixed]

It is really very subtle, but the error function is not strictly decreasing. In fact, somewhere between 0.8 and 0.9 the function starts to grow again, so the minimum is there.


For this kind of problem, Julia has a package to optimize functions: Optim.jl.

SES_loss_ (generic function with 2 methods)

 * Status: success

 * Candidate solution
    Final objective value:     1.423677e+04

 * Iterations: 6    f(x) calls: 347    ∇f(x) calls: 347

To use this function more efficiently it is necessary to define a range for the parameters in which the algorithm will perform the search and also a starting point (obviously within that range).


Also, one trick to keep in mind is that this package expects "univariate" functions, that is, the function you pass in has only one argument to optimize. This is not really a limitation since, although only one argument can be passed, it can be a vector, so several parameters can still be optimized at once. This is why we define a wrapper function SES_loss_ that facilitates the calculation.

With everything ready, let's look for the values of alpha and lo that minimize our error function:
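A sketch of how the wrapper and the optimization call might look (the variable data standing for the series, the search ranges and the inner optimizer are assumptions; the actual notebook cell may differ):

```julia
using Optim

# Optim expects a function of a single (possibly vector) argument:
# p[1] is alpha and p[2] is l0.
SES_loss_(p) = SES_weight_loss(p[1], p[2], data)

lower   = [0.0, minimum(data)]   # search range for alpha and l0
upper   = [1.0, maximum(data)]
initial = [0.5, data[1]]         # starting point inside the range

optim = optimize(SES_loss_, lower, upper, initial, Fminbox(LBFGS()))
Optim.minimizer(optim)           # the best (alpha, l0) found
```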


And this is how we came to fit our model, obtaining the best parameters to try to predict the next sales of our beloved peanut chocolate.
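With the fitted parameters in hand, a forecasting function for SES could be sketched like this (the argument n_pred, the number of periods to predict, is an illustrative assumption):

```julia
function SES_weight_forecast(α, l0, time_serie, n_pred)
    level = l0
    # Run the smoothing over the whole series to obtain the last level
    for y in time_serie
        level = α * y + (1 - α) * level
    end
    # SES forecasts are flat: every future point gets the same value
    return vcat(time_serie, fill(level, n_pred))
end
```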


As you can see, the simple exponential smoothing method only gives us a forward value. In other words, it predicts a constant value into the future. This happens because these types of time series have no latent variables defined, such as trend or seasonality. These are variables that add information to the model and allow us to make different predictions for each time we want to predict.


But do not worry about that for now, we will study it in depth shortly. For now, let's see how the prediction would look.

[Plot: the time series together with the flat SES forecast for the next periods]

Perfect! We already have an initial model to attack problems where there is a lot of variability. But this is not always the case.


That same night we talk to Terry and show him the progress, and he loves the direction we are going in. He tells us that he has another time series to analyze, one with a different behavior from the previous one: apparently, this one shows a trend in its values...

Let's see:

Trend Methods

Now that we have built up an intuition of how the Simple Exponential Smoothing logic works, wouldn't it be good to have some additional method that allows us to capture whether our time series follows a certain trend?

But what does this mean in the first place? It means that there are some processes that inherently, by their nature, follow a marked trend beyond the random noise.

For example, if we take the number of air passengers in Australia from 1990 to 2016, we find this graph:

[Plot: air passengers in Australia, 1990 to 2016, showing a clear upward trend]

In this type of problem, it would not make any sense for all the predictions to be constant. There is more latent information that we can extract from the data: the trend.


This will be key since, once obtained, it will allow us to generate new values as we want to make more distant forecasts in time.

But how do we include this in our exponential smoothing model?

Holt’s linear trend method


On the one hand, alpha weights the last real observation $y_t$ against the value predicted for it by the prediction equation with the previous values of $l$ and $b$, to obtain the current value of the level $l_t$.

On the other hand, the beta value tells us how much we are going to let the value of the slope change. This parameter weights the slope currently observed, $l_t - l_{t-1}$, against the slope estimated in the previous period, to calculate the estimate of the slope for the current period. In this way, small beta values indicate that the slope is unlikely to change over time, while high values allow the slope to change freely (the value of the "current" slope $l_t - l_{t-1}$ becomes preponderant in the estimation).

With this method, then, forecasts stop being flat and become trended. With this idea in mind, let's translate the math into code again:

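A sketch of how this loss function could look, using the standard Holt's linear trend equations and the sum of squared errors (the signature and parameter names are assumptions):

```julia
function HLT_loss(α, β, l0, b0, time_serie)
    l, b = l0, b0
    loss = 0.0

    for y in time_serie
        y_pred = l + b                          # forecast: y_{t|t-1} = l_{t-1} + b_{t-1}
        loss += (y - y_pred)^2                  # squared residual
        l_new = α * y + (1 - α) * (l + b)       # level equation
        b = β * (l_new - l) + (1 - β) * b       # trend (slope) equation
        l = l_new
    end

    return loss
end
```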

This function does exactly the same as SES_weight_loss, which we defined earlier. It is worth clarifying that, like before, the method needs an initial slope to estimate the first value of the time series. Let's see which parameters optimize the model with the data we have!

HLT_loss_ (generic function with 2 methods)

 * Status: success

 * Candidate solution
    Final objective value:     1.284222e+02

 * Iterations: 4    f(x) calls: 456    ∇f(x) calls: 456

As with the "SES", we define a wrapper function to be able to perform the optimization. The minimum and maximum values of l0 are obtained with the same criterion: the optimal value has to be close to the first value of the time series, since it is going to be its estimate. For the minimum and maximum values of the slope, the criterion is again to look at the data: if you look at the graph of the time series, you can see that the slope has to be between 1 and 3 (roughly, every 5 years that pass, the number of passengers increases by 10).


Let's see the optimal values for the parameters:


Perfect! Now that we have the optimal parameters to perform the forecast, we just need to define a function that performs it. For example:

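A possible sketch (the argument n_pred, the number of periods to forecast, is an illustrative assumption):

```julia
function HLT_forecast(α, β, l0, b0, time_serie, n_pred)
    l, b = l0, b0

    # First "for": go through the whole series with the optimized parameters
    for y in time_serie
        l_new = α * y + (1 - α) * (l + b)
        b = β * (l_new - l) + (1 - β) * b
        l = l_new
    end

    # Second "for": forecast n_pred periods that have not happened yet
    preds = Vector{Float64}(undef, n_pred)
    for i in 1:n_pred
        preds[i] = l + b * i     # y_pred = l_T + b_T * i
    end

    return vcat(time_serie, preds)
end
```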

As you can see in the function, the first part of it (the first "for") goes through the entire time series using the already optimized parameters, making the best prediction for each point.


Then, when we reach the end of the time series, the second "for" begins (it iterates over the number of periods we want to predict, a value that we enter as "n_pred"), now making forecasts for periods that have not yet happened. To do this, it simply uses the last level estimated for the last value of the time series and adds as many slopes as periods ahead we want: $y_{pred} = l_T + b_T \cdot i$

Finally, it returns a concatenation of the time series plus the values we ask it to predict.

[Plot: the air passengers series together with Holt's linear trend forecast]

And so it is. We already built a tool that allows us to make predictions for variables that show a trend.


But surely you are thinking that assuming the trend will be maintained during all the years we are forecasting is a bit excessive. And it's true, it is.

It is known that this type of method usually overestimates the values of the variable to predict, precisely because it assumes that the trend continues indefinitely.

An improvement of this method that helps to deal with this problem is the Damped trend method. Basically, what it does is add a coefficient that flattens the forecast curve as we make more distant predictions in time. This improvement tends to produce better predictions than the plain trend method. The forecast equation becomes:

$$y_{T+h|T} = l_T + (\phi + \phi^2 + ... + \phi^h)\, b_T$$

For example, with $l_T = 300$, $b_T = 1.5$ and $\phi = 0.9$, forecasting two periods ahead gives:

$$y_{T+2|T} = 300 + 0.9 \cdot 1.5 + 0.9^2 \cdot 1.5 = 300 + 0.9 \cdot 1.5 + 0.81 \cdot 1.5 = 302.565$$

The process continues in this way until the value of $\phi^h$ becomes practically zero, that is to say that, although we keep going further into future periods, no significant term is added to the forecast anymore. This is why the damped method tends to make flat forecasts in the long term.

Finally, let's note that the damping parameter takes values between 0 and 1, being completely identical to the simple trend method for a value of 1 and completely flat for a value of 0. Let's see:
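A sketch of a damped forecasting function, analogous to HLT_forecast (names and signature are assumptions):

```julia
function Damped_HLT_forecast(α, β, ϕ, l0, b0, time_serie, n_pred)
    l, b = l0, b0

    # Fit: damped versions of Holt's level and trend equations
    for y in time_serie
        l_new = α * y + (1 - α) * (l + ϕ * b)
        b = β * (l_new - l) + (1 - β) * ϕ * b
        l = l_new
    end

    # Forecast: y_{T+h|T} = l_T + (ϕ + ϕ^2 + ... + ϕ^h) * b_T
    preds = [l + sum(ϕ^i for i in 1:h) * b for h in 1:n_pred]

    return vcat(time_serie, preds)
end
```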

[Plot: the air passengers series with the damped trend forecast, which flattens as the horizon grows]

Incredible! Terry calls us and tells us that using the "damped" model was key to a trade he made. He is very excited about our work and gives us one last challenge.


He tells us that he is trying to enter more complex markets, particularly tourism, but that he doesn't understand how to approach this type of series since they show a lot of variability. For example, we start analysing the visitor nights in Australia spent by international tourists:

[Plot: visitor nights in Australia spent by international tourists, by quarter]

As you can see, the series shows a lot of variability. The values go up and down constantly.


After spending a long time with Terry looking for ideas to address this type of data, we realize that these ups and downs are not random; in fact, the pattern repeats year after year! We are facing a problem with seasonality.

Seasonality Methods

For this type of model, in addition to the level and trend components, it is necessary to add another component that captures the time of the year we are in (actually, it can be any other period in which seasonality occurs) and lets it influence the predicted outcome.


Holt-Winters’ seasonal additive method

The level equation is still a weighted average, only now it uses the seasonally adjusted observation $(y_t - s_{t-m})$. The other part of the average is the non-seasonal forecast $(l_{t-1} + b_{t-1})$.

The trend equation remains the same, and the seasonality equation also represents a weighted average, this time between the current seasonal index, $(y_t - l_{t-1} - b_{t-1})$, and the index of the same season in the previous period, $s_{t-m}$. This average is weighted by the $\gamma$ parameter.

Now, let's put these equations into code:

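A sketch of the Holt-Winters additive recursion (the signature is an assumption; s0 holds the m initial seasonal indices, the "year 0" of the series):

```julia
function HW_Seasonal(α, β, γ, l0, b0, s0, time_serie, m)
    T = length(time_serie)
    l, b = l0, b0
    s = copy(s0)                    # seasonal indices, one per season
    pred = Vector{Float64}(undef, T)

    for t in 1:T
        s_m = s[mod1(t, m)]         # index of the same season, one period (m steps) ago
        pred[t] = l + b + s_m       # one-step-ahead prediction (h = 1)

        l_new = α * (time_serie[t] - s_m) + (1 - α) * (l + b)        # level
        b_new = β * (l_new - l) + (1 - β) * b                        # trend
        s[mod1(t, m)] = γ * (time_serie[t] - l - b) + (1 - γ) * s_m  # seasonality
        l, b = l_new, b_new
    end

    return pred
end
```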

What this code does is, given the optimal values of each parameter, make the prediction of the next point (h = 1) and store it in the "pred" array. Spending a while analyzing how the algorithm works is an excellent way to ensure a full understanding of the method :)


To obtain these parameters it is necessary to write the loss function, as we did for the previous methods. As a challenge, we propose that you write this loss function using this one as a basis. To help you, look at the intimate relationship between the functions that store the predictions and the loss functions already written for the previous methods.

In this particular case, in which our data is quarterly, m = 4. Doing the same procedure as always to optimize the function, we obtain:


It is interesting to stop and look at these values that we obtained.


First of all, it is remarkable how, for the first time, the parameter α takes a relatively low value. This makes perfect sense, as the values are now much more connected to observations further away than the immediately previous one. They are connected precisely because of their seasonality.

It is also interesting to note that, for the initial values of the seasonality, not just one value is needed but 4. In general, as many values as m, the frequency of the seasonality, will be needed. This can be seen as needing a whole "year 0" to make the estimates for the first year of the time series.

Now, let's see the function in action:

[Plot: the visitor nights data together with the Holt-Winters fitted values]

As you can see, the fit is very good. It is also interesting how you can appreciate the exponential smoothing in the step between the valley and the peak: at the start the data does not show it and therefore neither does the adjusted series, but as it appears, the model learns it and starts to forecast it as well.


Excellent! Now that we have our model, it's time to use it and call Terry to tell him what actions to take in his trading strategy:
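A forecasting function in the spirit of the previous ones might be sketched as follows (same assumptions as for HW_Seasonal, plus the number of periods to predict, n_pred):

```julia
function HW_Seasonal_forecast(α, β, γ, l0, b0, s0, time_serie, m, n_pred)
    T = length(time_serie)
    l, b = l0, b0
    s = copy(s0)

    # Fit: same recursion as HW_Seasonal, keeping the final level, trend and seasonal indices
    for t in 1:T
        s_m = s[mod1(t, m)]
        l_new = α * (time_serie[t] - s_m) + (1 - α) * (l + b)
        b_new = β * (l_new - l) + (1 - β) * b
        s[mod1(t, m)] = γ * (time_serie[t] - l - b) + (1 - γ) * s_m
        l, b = l_new, b_new
    end

    # Forecast: level plus h slopes plus the seasonal index of the corresponding season
    preds = [l + h * b + s[mod1(T + h, m)] for h in 1:n_pred]

    return vcat(time_serie, preds)
end
```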

[Plot: the visitor nights data together with the Holt-Winters forecast for the next periods]

Well, good! We started this chapter not knowing how to tackle time series forecasting problems and ended up building a wide variety of models for different types of data, all while making our friend Terry a lot of money!


As a final summary, when dealing with a time series it is very important to be able to define if it has any latent variables such as trend or seasonality. Once we can find that underlying information, we will be able to generate forecasts with confidence. We just need to look deeper.

Summary

In this chapter, we learned the basic foundations of time series analysis. We defined what a time series is and delved into a particular method, exponential smoothing, which allows us to take into account the most distant values of our data. Then we explained more complex versions of the method and used them to make predictions in different kinds of scenarios. When the process followed a marked trend, we used the trend method and the damped trend method to make long-term predictions. When the process exhibited seasonality, we used the Holt-Winters seasonal method.


References

  • Forecasting: Principles and Practice, Chap 7

Give us feedback

This book is currently in a beta version. We are looking forward to getting feedback and criticism:

  • Submit a GitHub issue here.

  • Mail us to martina.cantaro@lambdaclass.com

Thank you!
    From a445b5869635c5896a6914a532f31520877ca2d7 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Tue, 23 Mar 2021 11:18:42 -0300 Subject: [PATCH 6/6] added to do list on .jl file --- 13_time_series/13_time_series.jl | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/13_time_series/13_time_series.jl b/13_time_series/13_time_series.jl index b4cdc161..5f370491 100644 --- a/13_time_series/13_time_series.jl +++ b/13_time_series/13_time_series.jl @@ -10,6 +10,13 @@ using Plots # ╔═╡ c470f03c-3a33-11eb-3929-7f8c45b6fdcd using Optim +# ╔═╡ 9ad5e3a0-8be2-11eb-0ca1-b9a933682476 +md"### To do list + +We are currently working on: + +"; + # ╔═╡ 01cbe438-34d1-11eb-087b-b5294ea7b996 md"# Predicting the future" @@ -821,6 +828,7 @@ Thank you! " # ╔═╡ Cell order: +# ╟─9ad5e3a0-8be2-11eb-0ca1-b9a933682476 # ╟─01cbe438-34d1-11eb-087b-b5294ea7b996 # ╟─477dbb82-34d1-11eb-13f4-41b080ce2e00 # ╠═bd9c45d2-34da-11eb-1f0f-6bb545666f98