This is the code repository of my dissertation project concerning multivariate time series classification. The project was undertaken as partial fulfillment of my BSc Mathematics & Computer Science degree at the University of Lincoln.
The project concerns whether the traversability of a level crossing can be approximated by a classifier based on observed data. Modern navigation applications do not distinguish between a level crossing that is traversable and one that is not; in many cases, drivers could be rerouted away from level crossings to save them time, especially when the classifier anticipates an unusually long waiting time. The classifier is based on a simulated time series of how a level crossing's barriers react relative to train traffic. This was realised by gathering the arrival times of trains, which were then transformed into the time series by an algorithm. Such data would most accurately be captured by some sort of IoT device (like a sensor near the tracks or a camera using computer vision), but this was not possible in this case. Following this, I implemented various classifiers on the generated data, which demonstrate that the traversability of level crossings can be modelled. This repository contains the code I used to develop the classifiers, as well as some scripts used to generate graphs for the written document.
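The arrival-time-to-time-series transformation can be illustrated with a minimal, self-contained sketch. This is not the project's actual algorithm; the closing and reopening offsets below are assumptions chosen for illustration.

```python
# Hypothetical sketch: turn a list of train arrival times (minutes past
# midnight) into a minute-resolution barrier-state series, assuming the
# barriers close CLOSE_BEFORE minutes before each arrival and reopen
# OPEN_AFTER minutes after it. These offsets are illustrative, not measured.
CLOSE_BEFORE = 3
OPEN_AFTER = 1

def barrier_series(arrivals, length=24 * 60):
    series = [0] * length  # 0 = barriers open, 1 = barriers closed
    for t in arrivals:
        start = max(0, t - CLOSE_BEFORE)
        end = min(length, t + OPEN_AFTER + 1)
        for minute in range(start, end):
            series[minute] = 1
    return series

# Trains at 08:00, 08:04 and 10:00; the first two closures overlap.
series = barrier_series([480, 484, 600])
print(sum(series))  # 14 minutes with the barriers down in total
```

Overlapping closures merge naturally because the series records state per minute rather than per train, which is why the two morning trains account for nine closed minutes rather than ten.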
There are many dependencies; the `install-all.bat` file installs all of them by running the following commands (or you may run them manually) using pip:
pip install numpy
pip install matplotlib
pip install peewee
pip install schedule
pip install pandas
pip install seaborn
pip install sklearn
pip install tensorflow
pip install gpflow
pip install keras
pip install progress
pip install beautifulsoup4
Following this, all code should be executable using Python 3.8.0. It is suggested that the scripts are run from the command line rather than by double-clicking on them, especially the classifier scripts (`analyse-*.py`), because of progress bar support and better performance. The `.db` file is an SQLite database, which I interacted with using DBeaver.
To explain briefly, the method was as follows:
- Gather data on trains calling at a stop directly next to a level crossing (I used Lincoln Central)
- Convert the data to a time series format using an algorithm
- Load the time series into a dataframe and preprocess it using feature engineering
- Create the classifiers
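The last two steps can be sketched end to end with synthetic data and scikit-learn. The feature names and the model choice below are illustrative assumptions, not the project's final configuration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Hypothetical engineered features per observation window, e.g.
# mean barrier-down time, closures per hour, hour of day (all synthetic here).
X = rng.random((200, 3))
# Hypothetical binary label: 1 = traversable, 0 = long wait expected.
y = (X[:, 0] + 0.2 * X[:, 1] > 0.6).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")
```

The held-out test split mirrors how the `analyse-*` scripts evaluate their models: the classifier is fitted on one portion of the dataframe and scored on unseen rows.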
Each of these tasks is achieved by a separate script file; here is what each one does:
- `init_db.py` initialises the database used later on, `trains.db`, where the train information is saved.
- `scraper.py` is responsible for scraping live information from the National Rail website. To change the station(s) monitored, change the array of URLs in the body.
- `Simulation.py` processes the `trains.db` file into a time series placed in the `/datastream/` folder, broken up by days of the week.
- `create_load_df.py` is a snippet which loads the `/datastream/` data, processes it and saves it, or loads a `pandas` dataframe saved in a `.h5` file. This file is imported by the relevant scripts rather than being used on its own.
- `backup_db.py` is responsible for sending a copy of a file to an email address.
- `analyse-binary-*` scripts are used to create and evaluate the binary classifiers. If you choose to load a model, use `models/bcm` when prompted for the name of the file.
- `analyse-multivariate-*` scripts are used to create and evaluate the multi-class classifiers. In this case, use `models/mcm` when prompted for the name of the file, unless you opt to generate a new model.
The files contained in `/models/` are saved pandas dataframes: they are the data sets used for the training and testing of the models. There are two sets each for the binary and multi-class feature sets; the `_experimental.h5` files contain a much wider range of features, though these are not necessarily used in the final models. If you wish to reproduce some results, or view the contents of the data set used, the recommended files are `bcm.h5` and `mcm.h5`. To easily access the contents, run

`inspect-df <model-name> -d`

from the command line (in the folder of the repository), where `<model-name>` is, for example, `models/mcm.h5`. The `-d` flag means that all data is shown, which may take some time to load on some computers; if you want a quick glance, simply omit the flag.
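As an alternative to `inspect-df`, the `.h5` dataframes can be opened directly in an interactive pandas session. The sketch below uses a temporary file so it is self-contained; it assumes the `tables` (PyTables) package is installed, which pandas requires for HDF5 I/O:

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe standing in for the repository's saved data sets;
# the column names are illustrative, not the project's actual features.
df = pd.DataFrame({"weekday": [0, 1], "mean_wait": [4.2, 6.1]})

# Round-trip through the .h5 format used by the repository.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
df.to_hdf(path, key="df", mode="w")

loaded = pd.read_hdf(path)  # the same call works on e.g. models/mcm.h5
print(loaded.head())
```

`pd.read_hdf` can infer the key when the file contains a single dataframe, so replacing the temporary path with `models/mcm.h5` is enough to inspect the real data set.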