Neural Networks Architectures
Artificial neural networks (NNs) were first introduced by McCulloch and Pitts in 1943 [1]. Since then, numerous NN architectures have been created for various tasks. So many architectures exist for time series forecasting (TSF) that it is not possible to implement all of them in our framework. In the following sections, we list the implemented architectures with recommended hyperparameter values, along with other architectures we considered for implementation. Of the currently implemented architectures, we recommend the LSTM.
All of our models use only the last O observations to predict the F future values. The O and F can be set by the Observations and Future predictions fields respectively. If we denote by S the number of input signals and by T the number of target signals, the inputs of the networks are matrices with S rows and O columns and the outputs of the networks are matrices with T rows and F columns.
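The window shapes described above can be illustrated with a minimal NumPy sketch. The function name `make_windows`, the `target_rows` argument, and the toy data are assumptions made for illustration and are not taken from the framework.

```python
import numpy as np

def make_windows(series, target_rows, O, F):
    """Slice a multi-signal series into (input, output) window pairs.

    series: array of shape (S, N), i.e. S signals over N time steps.
    target_rows: indices of the T target signals among the S input signals.
    Returns inputs of shape (n, S, O) and outputs of shape (n, T, F).
    """
    S, N = series.shape
    n = N - O - F + 1  # number of complete windows
    if n <= 0:
        raise ValueError("series is too short for the requested O and F")
    inputs = np.stack([series[:, i:i + O] for i in range(n)])
    outputs = np.stack([series[target_rows, i + O:i + O + F] for i in range(n)])
    return inputs, outputs

# Toy data: S = 3 signals, 10 time steps; predict T = 2 of them.
series = np.arange(30, dtype=float).reshape(3, 10)
X, Y = make_windows(series, target_rows=[0, 2], O=4, F=2)
print(X.shape, Y.shape)  # (5, 3, 4) (5, 2, 2)
```

Each input window is an S x O matrix and each output window is a T x F matrix, matching the description above.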
The multilayer perceptron (MLP), also known as a fully-connected or dense network, is the most basic type of feed-forward architecture. As there are more advanced architectures for TSF, we currently recommend using the LSTM over the MLP.
The MLP consists of an input layer, several hidden layers of neurons, and an output layer. Successive layers are fully connected, meaning that each neuron in a layer is connected to the outputs of all neurons in the previous layer. There are no connections between neurons in the same layer, and there are no connections between layers that are not immediately successive.
The number of hidden layers determines the depth of the network. Deeper networks have higher capacity and can learn more complex tasks, but they are also more prone to overfitting.
Each of the connections between the neurons in the network has an associated weight. The training algorithm updates the weights of the connections in order to find a mapping from the inputs to the outputs.
As we work with a matrix of data in the input and the hidden layers are only lists of neurons, we have to flatten the inputs into a vector before we feed them into the network. Similarly, the outputs have to be reshaped from the output vector of the network to the time steps.
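The flattening and reshaping steps can be sketched as follows. This is a rough NumPy illustration with random, untrained weights; the layer sizes, the ReLU activation, and the name `mlp_forward` are assumptions, not the framework's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, O, T, F = 3, 4, 2, 2          # input signals, observations, targets, future steps
hidden_sizes = [8, 8]            # hypothetical hidden layer sizes

# One weight matrix and bias vector per layer (random, i.e. untrained).
sizes = [S * O] + hidden_sizes + [T * F]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def mlp_forward(window):
    """Map an (S, O) input window to a (T, F) prediction."""
    h = window.reshape(-1)                     # flatten the matrix into a vector
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)         # fully-connected layer + ReLU (assumed)
    out = h @ weights[-1] + biases[-1]         # linear output layer
    return out.reshape(T, F)                   # back to targets x time steps

prediction = mlp_forward(rng.normal(size=(S, O)))
print(prediction.shape)  # (2, 2)
```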
For TSF, adding a residual (skip) connection around the whole network might be useful. The inputs are then added to the outputs of the network, which allows the network to model the difference between inputs and outputs instead of the output values themselves. As we expect the inputs and outputs to differ only by a small amount, learning to predict the difference can be easier (a smaller number of neurons is sufficient for good performance) than learning to predict the output values directly.
Adding a residual connection around the entire network can be enabled (and even tuned) in the hyperparameters settings.
For our windowed dataset, we implement the residual connection by adding the values of the signals of the last time step of the input window to all the time steps of the output window. This means that the output signals have to be a subset of the input signals (which is recommended anyway).
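A minimal sketch of this residual step, assuming the network output is a T x F matrix of predicted differences. The helper name `add_residual` and the `target_rows` argument (the positions of the target signals among the input signals) are hypothetical.

```python
import numpy as np

def add_residual(window, network_output, target_rows):
    """Add the last observed value of each target signal to every predicted step.

    window: (S, O) input window; network_output: (T, F) predicted differences.
    target_rows: positions of the T target signals among the S input signals.
    """
    last_values = window[target_rows, -1]          # shape (T,)
    return network_output + last_values[:, None]   # broadcast over the F steps

window = np.array([[1.0, 2.0, 3.0],
                   [10.0, 20.0, 30.0]])
differences = np.array([[0.5, 1.0]])               # T = 1, F = 2
forecast = add_residual(window, differences, target_rows=[1])
print(forecast)  # values 30.5 and 31.0 for the two future steps
```

The lookup `window[target_rows, -1]` only works because every target signal is also an input signal, which is the subset requirement stated above.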
- Hidden layers sets the number of hidden layers and the number of neurons in each layer. The neuron counts are repeated if more layers are requested than there are specified neuron counts (as described in the documentation of tunable settings).
- If Predict differences is checked, a residual connection is added around the MLP network. This parameter can also be tuned automatically.
MLPs are a very basic architecture and usually perform worse than more advanced architectures such as the LSTM. It is also not clear how to set the hyperparameters for the best performance, so we recommend leaving that to the tuner. Our default configuration of the hyperparameters is:
Hyperparameter | Default value |
---|---|
Number of hidden layers | 2-5 (tuned) |
Number of neurons in each layer | 8-64 (tuned) |
Predict differences | tuned |
We base our recommendations on Lara-Benítez et al. [2]. They tested several architectures and hyperparameter configurations and the MLP performed the worst. The MLPs might be considered when the training time is crucial as they have the highest computational efficiency of the tested models, but they are still probably outperformed by the CNNs which offer better forecasting accuracy and not significantly worse training times.
Long short-term memory (LSTM) cells are the most common type of cell used in recurrent neural networks (RNNs). They were introduced in 1997 by Hochreiter and Schmidhuber [3]. A great description of the LSTM can be found on colah's blog [4]. We will not focus on the internals of the LSTM here. Rather, we will describe how LSTM cells are used in our RNN architecture for TSF.
RNNs are usually applied to sequences of data. One item from the sequence (one time step in the case of TSF) is given to the network cell at a time, and the network produces an output for the item and keeps an inner state. The state is preserved when the next item from the sequence is passed to the network as the new input.
Depending on the application of the network, the outputs may be produced for each item in the sequence, or the state after inputting all the items is used as a representation of the whole sequence.
Our architecture uses LSTM to produce a representation of the whole input sequence – the state of the last LSTM cell. Then, a dense layer is added to produce the outputs from the state.
More than one LSTM layer can be used to encode the sequence. In such a case, the outputs of the first layer for each time step are used as the inputs to the next layer. The state of the last layer is the sequence representation.
The picture shows an example of the architecture with two LSTM layers. For the purpose of visualization, each LSTM layer has been unrolled, but note that all the LSTM cells in a layer are actually the same LSTM cell, with the input time steps fed in one at a time.
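The data flow through stacked LSTM layers and the final dense layer can be sketched in plain NumPy. This is only an illustration with random, untrained weights: the standard LSTM gate equations are used, the final hidden state `h` is assumed to serve as the sequence representation, and the unit counts and helper names (`init_lstm`, `lstm_layer`, `forecast`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(n_in, n_units):
    """Random (untrained) parameters for one LSTM layer."""
    return {
        "Wx": rng.normal(scale=0.1, size=(n_in, 4 * n_units)),
        "Wh": rng.normal(scale=0.1, size=(n_units, 4 * n_units)),
        "b": np.zeros(4 * n_units),
    }

def lstm_layer(params, sequence):
    """Run one LSTM layer over a (steps, n_in) sequence.

    The same parameters are reused at every time step; only the
    hidden state h and the cell state c change between steps.
    Returns the per-step outputs (steps, n_units) and the final h.
    """
    n_units = params["Wh"].shape[0]
    h = np.zeros(n_units)
    c = np.zeros(n_units)
    outputs = []
    for x in sequence:                       # one time step at a time
        gates = x @ params["Wx"] + h @ params["Wh"] + params["b"]
        i, f, g, o = np.split(gates, 4)      # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs), h

S, O, T, F = 3, 6, 2, 4
layers = [init_lstm(S, 16), init_lstm(16, 16)]   # two stacked layers
W_dense = rng.normal(scale=0.1, size=(16, T * F))

def forecast(window):
    """Map an (S, O) window to a (T, F) prediction."""
    sequence = window.T                      # (O, S): one row per time step
    for params in layers:
        # The per-step outputs of one layer are the inputs of the next.
        sequence, last_state = lstm_layer(params, sequence)
    return (last_state @ W_dense).reshape(T, F)  # dense layer on the final state

print(forecast(rng.normal(size=(S, O))).shape)  # (2, 4)
```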
- LSTM layers sets the number of LSTM layers and the number of units in each layer.
Hyperparameter | Default value |
---|---|
Number of LSTM layers | 1-2 (tuned) |
Number of neurons in each LSTM layer | 32-128* (tuned) |
* For a higher number of input signals, we recommend setting the maximum to a higher value (such as 256).
Our recommendations are again based on Lara-Benítez et al. [2]. They found that networks with 2 hidden LSTM layers perform significantly better than networks with any other number of hidden layers. Their analysis also shows that one hidden layer is better than three or more hidden layers. They also argue that, "In the case of LSTM models, it can be seen that the parameter choice has a minimal impact on the accuracy results."
It should be noted that Lara-Benítez et al. [2] only analyzed time series with a single input/output signal. Vaidhyanathan [5], in his predictions of a time series with 22 signals, uses a single-layer LSTM with 294 neurons. We thus decided to recommend increasing the maximum number of neurons for series with multiple signals.
Convolutional neural networks (CNNs) are a layered feed-forward architecture. The key difference from the MLP is in the way successive layers are connected. CNNs work with the original structure of the data: the sequence of time steps. The neurons in each layer are also organized in a sequence.
Each neuron computes a convolution, which means that its inputs are only a small subset of the outputs from the previous layer: those that are close to this neuron. We can also view this as taking a small sliding window and moving it across the previous layer. For each window position, the output is computed and placed in the corresponding position in the sequence in the next layer, which preserves the structure of the data. Just as the input has multiple signals for each time step, there can be multiple outputs for each window position. These are called filters in CNNs.
The other key concept of the convolutional layers is that the weights are shared among all the neurons in the same layer. In the sliding window point of view, we use the same window with the same weights to compute all the outputs in one layer.
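The sliding-window view with shared weights can be sketched in NumPy as follows. The function `conv1d` is a hypothetical helper written for illustration; it assumes no padding (so the output is shorter than the input) and omits the activation function.

```python
import numpy as np

def conv1d(sequence, kernels, bias):
    """Unpadded 1-D convolution over the time axis.

    sequence: (steps, channels); kernels: (width, channels, filters);
    bias: (filters,). Returns (steps - width + 1, filters).
    """
    width = kernels.shape[0]
    steps = sequence.shape[0] - width + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):                       # slide the window by one step
        window = sequence[t:t + width]           # (width, channels)
        # The same kernels (shared weights) are applied at every position.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return out

sequence = np.arange(10, dtype=float).reshape(5, 2)   # 5 time steps, 2 signals
kernels = np.ones((3, 2, 4))                          # window width 3, 4 filters
features = conv1d(sequence, kernels, np.zeros(4))
print(features.shape)   # (3, 4)
print(features[:, 0])   # window sums: 15, 27, 39
```

With all-ones kernels each output is simply the sum of the values inside the window, which makes the sliding behaviour easy to check by hand.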
The last piece of the CNNs is the pooling layer. Pooling layers are used to reduce the spatial dimensions of the data. They work similarly to convolutional layers in that we take a sliding window and move it across the data. This time, however, we move it by more than one item at a time, which means that fewer output items are produced. A pooling layer can use weights like a convolutional layer, but max pooling (taking the maximum of the window) and average pooling are used more often.
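As a rough illustration, max and average pooling along the time axis could look like this in NumPy. The helper `pool1d` and its parameter names are assumptions made for this sketch.

```python
import numpy as np

def pool1d(sequence, size, stride, reduce=np.max):
    """Pool a (steps, channels) sequence along the time axis."""
    steps = (sequence.shape[0] - size) // stride + 1
    return np.stack([
        reduce(sequence[t * stride:t * stride + size], axis=0)
        for t in range(steps)
    ])

sequence = np.array([[1.0, 8.0],
                     [3.0, 6.0],
                     [2.0, 4.0],
                     [5.0, 2.0]])
max_pooled = pool1d(sequence, size=2, stride=2)                   # rows [3, 8] and [5, 4]
avg_pooled = pool1d(sequence, size=2, stride=2, reduce=np.mean)   # rows [2, 7] and [3.5, 3]
print(max_pooled.shape)  # (2, 2): four time steps reduced to two
```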
Lara-Benítez et al. [2] also tested CNNs, so we can again base our hyperparameter recommendations on their work. "The prediction accuracy is proportional to the number of convolutional layers. The best results have been obtained with models with four layers." They tested layers with 16, 32, and 64 filters and did not observe any significant impact on performance. On the other hand, "Best predictions have been obtained from models without max-pooling, suggesting that this popular image-processing operation is not suitable for time series forecasting." They used fixed kernel sizes (sizes of the convolution window) of 7-5-3-3 in their four-layer models.
As already mentioned in the description of the MLP, CNNs train quickly: their training time is not significantly worse than that of the MLP, and they are faster than the other architectures [2].
The Encoder-Decoder (ED) architecture is a direct improvement of the LSTM architecture mentioned above. The encoder is used to create a representation of the input sequence. This is usually done using one or more LSTM layers. The decoder's job is to create predictions from the sequence representation. In the LSTM architecture described above, the decoder was just one dense layer.
One option for creating a more advanced decoder is to use several dense layers. The MLP decoder works by adding hidden dense layers before the final output dense layer. We basically create an MLP to compute the predictions from the sequence representation.
Another option is to use another RNN as the decoder, often with the same type of cells. The LSTM decoder consists of an LSTM layer which takes the sequence representation as its initial state. This time, we use the LSTM to produce as many outputs as necessary. Apart from the state, the LSTM cells also need inputs in order to produce outputs. One option is to give them dummy inputs, such as a vector of zeros. The more advanced option is to use the prediction for the previous time step as the input for the next time step. This is called an autoregressive architecture. When training the network, so-called teacher forcing is used, which means that the correct outputs are used as the inputs instead of the predictions.
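The difference between autoregressive decoding and teacher forcing can be sketched with a generic decoding loop. Here `toy_step` is a hypothetical stand-in for a trained LSTM cell (a fixed linear recurrence), chosen only so that the numbers are easy to follow.

```python
import numpy as np

def decode(step, state, first_input, n_steps, targets=None):
    """Unroll a decoder cell for n_steps.

    step(x, state) -> (prediction, new_state) is any recurrent cell.
    With targets=None the previous prediction is fed back (autoregressive
    decoding); otherwise the true previous value is fed (teacher forcing).
    """
    x = first_input
    predictions = []
    for t in range(n_steps):
        y, state = step(x, state)
        predictions.append(y)
        x = y if targets is None else targets[t]
    return np.array(predictions)

# Hypothetical stand-in for a trained LSTM cell: a fixed linear recurrence.
def toy_step(x, state):
    new_state = 0.5 * state + x
    return new_state, new_state

free_running = decode(toy_step, state=1.0, first_input=1.0, n_steps=3)
teacher_forced = decode(toy_step, state=1.0, first_input=1.0, n_steps=3,
                        targets=[2.0, 3.0, 4.0])
print(free_running)    # 1.5, 2.25, 3.375
print(teacher_forced)  # 1.5, 2.75, 4.375
```

The two runs agree on the first step and then diverge, because from the second step on the cell sees either its own prediction or the true value as input.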
The experiments of Hewamalage et al. [6] showed that the dense layer decoder (as used in our LSTM architecture) performed better than the LSTM decoder, "owing to the error accumulation issue associated with teacher forcing by the decoder." They tested neither the LSTM decoder without teacher forcing nor the MLP decoder. Similar results, with the autoregressive model performing worse than the simple LSTM, were also obtained in the Time series forecasting tutorial from TensorFlow [7].
In the work of Cook and Hall [8], the ED architecture performed the best. Unfortunately, they don't provide details on the number of units in the LSTM cells and other hyperparameters.
One of the issues for our application of the TSF task is that we have limited training data. Only the historical records for the predicted signal set are available for training, which might not be enough to train a deep neural network. The idea of transfer learning tries to solve this problem by using information from other time series. The network is first trained on a (big) dataset of time series and saved. When we want to do a prediction for a new signal set, we load the connection weights from this pre-trained model and only fine-tune them on the new time series. The idea is that all time series behave similarly to some extent, so using the training data from other time series might help improve predictions of our signal set.
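The pre-train and fine-tune workflow can be illustrated on a toy problem. The sketch below swaps the neural network for a linear model trained by gradient descent, purely to keep the example short; the datasets are synthetic and all names are hypothetical.

```python
import numpy as np

def train(X, y, weights, epochs, lr=0.05):
    """Plain gradient descent on mean squared error for a linear model."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)

# Large source dataset and a small, related target dataset (both synthetic).
X_src = rng.normal(size=(1000, 3))
y_src = X_src @ np.array([1.0, -2.0, 0.5])
X_new = rng.normal(size=(20, 3))
y_new = X_new @ np.array([1.1, -1.9, 0.5])

pretrained = train(X_src, y_src, np.zeros(3), epochs=200)   # train once, then save
fine_tuned = train(X_new, y_new, pretrained, epochs=5)      # short fine-tuning
from_scratch = train(X_new, y_new, np.zeros(3), epochs=5)   # same budget, no transfer

def mse(w):
    """Mean squared error on the small target dataset."""
    return float(np.mean((X_new @ w - y_new) ** 2))

print(mse(fine_tuned) < mse(from_scratch))  # True
```

Because the pre-trained weights already start close to the solution for the related series, the same small training budget reaches a lower error than training from zero.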
The downside of transfer learning is that a new pre-trained model has to be prepared for each set of hyperparameters. It is not easy to share the connection weights between models with a different number of neurons and thus a different number of connections. However, this can be solved by providing only a few pre-trained models that differ in their complexity. In our use case, the model complexity can be a hyperparameter tuned by our tuner.
A similar idea can also be used when training a model for a dataset of multiple time series (which is not our use case). In the literature, such models are often referred to as global models – these are trained on all time series from the dataset. The global model can then be cloned and fine-tuned using local parameters for each time series in the dataset.
Many works on transfer learning in TSF have been published recently. We mention some of them here, even though most of these techniques are quite advanced; if we decided to add transfer learning to our framework, we would probably use only a simple method of pre-training and fine-tuning. He et al. [9] investigate multi-source deep transfer learning for financial time series and propose two multi-source transfer learning methods. Ye and Dai [10] propose a CNN-based transfer learning model with a time series similarity measure to select the appropriate source domain. Sagheer et al. [11] created an LSTM-based transfer learning model for hierarchical time series (sets of data sequences organized by aggregation constraints). Other literature references can be found in Section 2.4 of Hewamalage et al. [6].
Another common approach to improving the accuracy of the models is to use an ensemble of networks instead of just one network. Ensembles combine multiple weak learners to generate a more robust prediction. One of the simplest ways to create an ensemble is bagging, proposed by Breiman [12], which trains each individual network on a random bootstrap sample (drawn with replacement) of the original dataset. The use of bagging and bootstrapping for computing prediction intervals is summarized by Petneházi [13].
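A minimal bagging sketch, assuming each ensemble member is trained on a bootstrap sample drawn with replacement. To keep the example short, a least-squares linear fit stands in for training a neural network; the function names and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares fit standing in for training one network."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def bagging_fit(X, y, n_models):
    """Train each ensemble member on its own bootstrap sample."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # sample rows with replacement
        models.append(fit_linear(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the members; their spread gives a rough uncertainty estimate."""
    predictions = np.stack([X @ coef for coef in models])
    return predictions.mean(axis=0), predictions.std(axis=0)

# Toy regression problem: y = 2*x0 - x1 + noise.
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)
models = bagging_fit(X, y, n_models=10)
mean, spread = bagging_predict(models, X[:5])
print(mean.shape, spread.shape)  # (5,) (5,)
```

The spread of the member predictions is one simple ingredient for the bootstrap-based prediction intervals mentioned above.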
A more advanced ensemble approach was proposed by Smyl [14], who segmented the problem into two parts: developing a group of specialised RNN models, and ensembling them to derive a combined prediction.
Hybrid methods combine neural network models with other machine learning and statistical techniques. The M4 forecasting competition [15], which analysed 61 forecasting methods on 100,000 time series, showed the superiority of hybrid approaches that combine both statistical and machine learning features. The winning model was submitted by Smyl [16] and mixes exponential smoothing with RNNs. The second-best model, by Montero-Manso et al. [17], is an automated method for obtaining weighted forecast combinations using various forecasting methods.
Tealab [18] provides a systematic review of recent neural network models for TSF, most of which are hybrid models.
[1] McCulloch, Warren S., and Walter Pitts. “A Logical Calculus of the Ideas Immanent in Nervous Activity.” The Bulletin of Mathematical Biophysics 5, no. 4 (December 1, 1943): 115–33. https://doi.org/10.1007/BF02478259.
[2] Lara-Benítez, Pedro, Manuel Carranza-García, and José C. Riquelme. “An Experimental Review on Deep Learning Architectures for Time Series Forecasting.” ArXiv:2103.12057 [Cs], April 8, 2021. https://doi.org/10.1142/S0129065721300011.
[3] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation 9, no. 8 (November 15, 1997): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
[4] Olah, Christopher. “Understanding LSTM Networks.” colah’s blog, August 27, 2015. https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[5] Vaidhyanathan, Karthik. “Data-Driven Self-Adaptive Architecting Using Machine Learning.” Doctoral Thesis, GSSI Gran Sasso Science Institute, 2021. https://iris.gssi.it/handle/20.500.12571/15976.
[6] Hewamalage, Hansika, Christoph Bergmeir, and Kasun Bandara. “Recurrent Neural Networks for Time Series Forecasting: Current Status and Future Directions.” International Journal of Forecasting 37, no. 1 (2021): 388–427. https://doi.org/10.1016/j.ijforecast.2020.06.008.
[7] The TensorFlow Authors. “Time Series Forecasting.” TensorFlow. Accessed September 4, 2021. https://www.tensorflow.org/tutorials/structured_data/time_series.
[8] Cook, Thomas R., and Aaron Smalter Hall. “Macroeconomic Indicator Forecasting with Deep Neural Networks.” Accessed August 29, 2021. https://www.kansascityfed.org/research/research-working-papers/macroeconomic-indicator-forecasting-deep-neural-networks-2017/.
[9] He, Qi-Qiao, Patrick Cheong-Iao Pang, and Yain-Whar Si. “Multi-Source Transfer Learning with Ensemble for Financial Time Series Forecasting.” In 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 227–33, 2020. https://doi.org/10.1109/WIIAT50758.2020.00034.
[10] Ye, Rui, and Qun Dai. “Implementing Transfer Learning across Different Datasets for Time Series Forecasting.” Pattern Recognition 109 (January 1, 2021): 107617. https://doi.org/10.1016/j.patcog.2020.107617.
[11] Sagheer, Alaa, Hala Hamdoun, and Hassan Youness. “Deep LSTM-Based Transfer Learning Approach for Coherent Forecasts in Hierarchical Time Series.” Sensors 21, no. 13 (January 2021): 4379. https://doi.org/10.3390/s21134379.
[12] Breiman, Leo. “Bagging Predictors.” Machine Learning 24, no. 2 (August 1, 1996): 123–40. https://doi.org/10.1007/BF00058655.
[13] Petneházi, Gábor. “Recurrent Neural Networks for Time Series Forecasting.” ArXiv:1901.00069 [Cs, Stat], December 31, 2018. http://arxiv.org/abs/1901.00069.
[14] Smyl, Slawek. “Ensemble of Specialized Neural Networks for Time Series Forecasting.” Cairns, 2017. https://forecasters.org/wp-content/uploads/gravity_forms/7-c6dd08fee7f0065037affb5b74fec20a/2017/07/smyl_slawek_ISF2017.pdf.
[15] Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. “The M4 Competition: 100,000 Time Series and 61 Forecasting Methods.” International Journal of Forecasting, M4 Competition, 36, no. 1 (January 1, 2020): 54–74. https://doi.org/10.1016/j.ijforecast.2019.04.014.
[16] Smyl, Slawek. “A Hybrid Method of Exponential Smoothing and Recurrent Neural Networks for Time Series Forecasting.” International Journal of Forecasting, M4 Competition, 36, no. 1 (January 1, 2020): 75–85. https://doi.org/10.1016/j.ijforecast.2019.03.017.
[17] Montero-Manso, Pablo, George Athanasopoulos, Rob J. Hyndman, and Thiyanga S. Talagala. “FFORMA: Feature-Based Forecast Model Averaging.” International Journal of Forecasting, M4 Competition, 36, no. 1 (January 1, 2020): 86–92. https://doi.org/10.1016/j.ijforecast.2019.02.011.
[18] Tealab, Ahmed. “Time Series Forecasting Using Artificial Neural Networks Methodologies: A Systematic Review.” Future Computing and Informatics Journal 3, no. 2 (December 1, 2018): 334–40. https://doi.org/10.1016/j.fcij.2018.10.003.