This is my final thesis project for my Bachelor's degree in Geomatics Engineering at KNTU.
- 1. Introduction
- 2. What Data Have I used?
- 3. Methodology & Internal Structure
- 4. Model design
- 5. Evaluation
- 6. Execution Guide
- 7. Conclusions
Taxi Demand Prediction is a deep learning application designed to forecast the number of taxi requests in a specific region. In this case, I chose Manhattan in New York and predict demand for the next week. The predictions are displayed by city zone and broken down into hourly intervals.
I also wanted the project to be useful in the real world. If taxi drivers could accurately predict which boroughs or zones will have the highest demand, they could optimize their workday by focusing on those areas. This would allow them to either earn more money in the same amount of time or save time for family and personal life, ultimately improving their quality of life.
- The purpose of this document is to provide a brief overview of the project.
- 2020-2022 Yellow Taxi data: I downloaded this dataset from NYC Open Data, a free public data source for New York City. The dataset includes 17 fields such as pick-up time and date, pick-up location, drop-off location, fare amount, payment method, etc. You can review the full feature descriptions in the data dictionary here.
- Weather history dataset: I wanted to include weather parameters such as `temp` and `precip` in the training data, as it is reasonable to expect rain to affect taxi demand. I used web scraping to download it from the Visual Crossing website for the period between Jan 2020 and Apr 2022. Since it can take a couple of days for them to respond to a query, you can also download it from the `Data` folder in my repo.
- Polygon shapefile: To visualize the results I needed geometric data. This `.shp` file represents the boundaries of the taxi pickup zones as delimited by the New York City Taxi and Limousine Commission (TLC). You can download the file from several websites or from my repo.
I can divide the project into three main parts: Data Preparation & Processing, Model Design & Tuning, and Evaluation & Visualization. The roadmap below outlines the project steps in more detail.
In this section, the data cleaning steps are carried out, which include:
- Removing outliers and null data
- Deleting data outside the spatial and temporal limits of the study area.
- Identifying erroneous records, based on research and knowledge of the study area, and removing them
After performing the above steps, the resulting graph shows that more than 90% of taxi requests are related to the Manhattan area.
The features used in the second and third steps to unify the dataset were `Trip Distance`, `Passenger count`, `Fare amount` and `Location ID`. As a result, out of a total of 80 million records, 64 million remained for use in this project. Also, by aggregating the requests, we converted the dataset into one-hour intervals using the `Datetime` column.
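To make this aggregation step concrete, here is a minimal pandas sketch of how trips could be binned into one-hour intervals per pickup zone. The file name is a placeholder and the column names (`tpep_pickup_datetime`, `PULocationID`, `passenger_count`) follow the TLC data dictionary; this is an illustration, not the project's actual code.

```python
import pandas as pd

# Hypothetical input: the cleaned trip records (file name is a placeholder)
trips = pd.read_parquet("yellow_taxi_cleaned.parquet")
trips["tpep_pickup_datetime"] = pd.to_datetime(trips["tpep_pickup_datetime"])

# Count pickups and sum passengers per zone and per one-hour interval
hourly = (
    trips.groupby(["PULocationID",
                   pd.Grouper(key="tpep_pickup_datetime", freq="1H")])
         .agg(pickups=("passenger_count", "size"),
              passengers=("passenger_count", "sum"))
         .reset_index()
)
print(hourly.head())
```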
In this section, we analyze the data statistically and determine the appropriate columns based on the resulting graphs. The analyses carried out include:
1. Map pickups by zone: I plotted a choropleth map showing Manhattan taxi zones by number of pickups, highlighting the top ten in red.
2. Linear chart of pickups over time: I analysed the evolution of pickups over different periods of time, looking for patterns.
   - Pickups evolution over Months
   - Pickups evolution over Days
   - Pickups evolution over Hours
3. Pairwise Relationships: As the number of pickups (total demand) is very stable over time, I analysed only one month (so that my system could handle it). These are the relationships found between the variables:
- Total demand - weekend: There are more pickups during the weekend.
- Total demand - weekday: There are more pickups on Saturday, Friday and Thursday, in that order. This variable is related to `weekend` but contains more granularity about the pickup distribution, so I will keep this variable and remove `weekend` when training the models.
- Total demand - hour: There are more pickups between 23:00 and 3:00. This could be because there is no public transport at those hours.
- Total demand - day: There is a clear weekly pattern, so this information is already given by `weekday`. Therefore, I will not use `day` to train the models.
In the end, the treemap below shows the top zones in terms of taxi pick-up count. We will develop a deep neural network model, a CNN-LSTM encoder-decoder with an attention mechanism, to forecast pick-up trips in Manhattan's zones.
In this section, we perform the necessary processing of suitable features and convert the data into the format expected by the model's input.
- Feature engineering
  - First, we remove irrelevant features so that the model is trained properly.
  - Next, we aggregate the passenger column to get the total number of passengers for each one-hour period and add it to our taxi data as a `population` column.
  - Spatio-temporal parameters: the main part of the feature engineering relates to these parameters, which are created as follows (a small pandas sketch of these calendar features appears after this list):
    - Creating year, month, day and hour columns based on the existing datetime
    - Adding weekday, weekend and holiday columns derived from the date
    - Creating a column for each location ID and summing up its number of requests in each time interval
  - Weather parameters: Here we bring in the data from New York's Central Park meteorological station, obtained through web scraping, and perform the necessary processing such as restricting it to the project's time frame. For this project, only `temperature`, `cloudcover` and `precipitation` are suitable features.
- Feature Extraction: First, we can use a correlation diagram to understand the relationships between the features, in order to remove or change some of them if necessary.
  By analyzing the results of these two graphs and the previous statistical comparisons, we can see that, as expected:
  - the `population`, `hour` and `month` features had the greatest impact;
  - cloudy conditions and official holidays, due to their small counts, have little relationship with the number of taxi requests throughout the year;
  - so we can ignore the `day` feature, since the days of the week have a greater impact, and also remove the `rainfall` and `holidays` parameters.
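As a rough illustration of the calendar features described above, the pandas sketch below derives the year/month/day/hour, weekday, weekend and holiday columns from a timestamp column. The `holidays` package and the column name `pickup_hour` are my own assumptions, not necessarily what the project uses.

```python
import pandas as pd
import holidays  # third-party package, assumed here for holiday lookups

us_holidays = holidays.US()

def add_time_features(df: pd.DataFrame, ts_col: str = "pickup_hour") -> pd.DataFrame:
    """Add calendar-based features derived from an hourly timestamp column."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.dayofweek                    # 0 = Monday ... 6 = Sunday
    out["weekend"] = (out["weekday"] >= 5).astype(int)
    out["holiday"] = ts.dt.date.map(lambda d: int(d in us_holidays))
    return out
```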
Then, we delete the extra features from the dataset, sort the remaining columns so that the taxi demand columns are at the end, and normalize them. Finally we convert the data into a time-series format and save it in three sets: train, validation and test. The final dataset for the model looks like this:
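The exact windowing code is not shown in this README, but a minimal sketch of the idea, assuming illustrative window lengths (one week of hourly history in, one week out) and a chronological 70/15/15 split, could look like this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def make_windows(values, n_in=168, n_out=168, n_targets=10):
    """Slice a (timesteps, features) array into supervised (X, y) samples.

    168 hours in / 168 hours out and 10 target zones are placeholder values;
    the demand columns are assumed to sit at the end of the feature matrix.
    """
    X, y = [], []
    for i in range(len(values) - n_in - n_out + 1):
        X.append(values[i : i + n_in])
        y.append(values[i + n_in : i + n_in + n_out, -n_targets:])
    return np.array(X), np.array(y)

# Stand-in for the real feature matrix built in the previous steps
data = np.random.rand(5000, 40)

scaled = MinMaxScaler().fit_transform(data)

# Chronological split so the model never sees the future during training
n = len(scaled)
train, val, test = np.split(scaled, [int(0.7 * n), int(0.85 * n)])
X_train, y_train = make_windows(train)
X_val, y_val = make_windows(val)
X_test, y_test = make_windows(test)
```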
In this step, we design the neural network model using CNN and LSTM layers combined with a multi-head attention mechanism (a rough sketch of such an architecture appears after the list below).
- For the base model, I set default parameters such as the batch size and the `Adam` optimizer, and compile it. Then we run it for 100 epochs on our training and validation data to obtain the initial accuracy of the architecture.
- I also used a `MultiHeadAttention` layer in my model because multiple attention heads operate in parallel, each with its own set of weights and parameters. These heads independently identify different dependencies in the data and examine information from various perspectives, and their outputs are then combined to form a comprehensive view of the data. This approach helps the model recognize short-term and long-term dependencies simultaneously and process diverse information concurrently, leading to more accurate predictions and improved performance.
- Hyperparameter Tuning: In this optimization, the following items are searched using `RandomSearchCV`:
  - The number of `convolutional` layer filters
  - The dropout rate (random removal of `neurons`)
  - The number of neurons in the `LSTM` layers
  - The `LSTM` layer `dropout` value
  - The number of neurons in the hidden (`dense`) layers
  - The output layer `activation` function
  - The model `optimizer` for compilation
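The repository contains the actual model code; the Keras sketch below is only my reading of the architecture described above (a Conv1D block, an LSTM encoder-decoder, and a `MultiHeadAttention` layer). All layer sizes, dropout rates and the optimizer are placeholder choices, not the tuned hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_in=168, n_features=40, n_out=168, n_zones=10):
    """Illustrative CNN-LSTM encoder-decoder with multi-head attention."""
    inputs = layers.Input(shape=(n_in, n_features))

    # CNN block: extract local temporal patterns from the input window
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)

    # Encoder LSTM summarises the (downsampled) input sequence
    encoded = layers.LSTM(128, return_sequences=True, dropout=0.2)(x)

    # Multi-head attention lets the model attend to all encoded time steps
    attended = layers.MultiHeadAttention(num_heads=4, key_dim=32)(encoded, encoded)

    # Decoder: repeat the context vector once per forecast hour
    context = layers.GlobalAveragePooling1D()(attended)
    decoded = layers.RepeatVector(n_out)(context)
    decoded = layers.LSTM(128, return_sequences=True, dropout=0.2)(decoded)

    # One demand value per zone for every forecast hour
    outputs = layers.TimeDistributed(layers.Dense(n_zones, activation="relu"))(decoded)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

model = build_model()
model.summary()
```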
In this step, we save the final trained model, which has better accuracy than the initial model, and display its loss and RMSE charts. Then we load it and run it on our test dataset to get the real accuracy of the model.
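In Keras terms, this save/load/evaluate cycle might look like the snippet below; the file name is a placeholder, and `model`, `X_test` and `y_test` refer to the earlier sketches rather than the project's actual variables.

```python
from tensorflow.keras.models import load_model

# Persist the tuned model after training (file name is a placeholder)
model.save("tuned_cnn_lstm_attention.keras")

# Reload it later and measure the real accuracy on the held-out test windows
best_model = load_model("tuned_cnn_lstm_attention.keras")
test_mae, test_rmse = best_model.evaluate(X_test, y_test, verbose=0)
print(f"Test MAE: {test_mae:.4f}  Test RMSE: {test_rmse:.4f}")
```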
- Metrics: In this project, the criteria used are MAE (as the loss), RMSE and MSLE (a short sketch of computing these metrics appears after this list). The prediction accuracy of the model on the test data is shown in the table below.

  | Model | MAE | RMSE | MSLE |
  | --- | --- | --- | --- |
  | Base Model | 11.2077 | 20.3724 | 0.5004 |
  | Tuned Model | 9.7939 | 327.5211 | 0.1845 |
- As a visual comparison, for a random one-hour period and 30 regions with different demand distributions, we display graphs of the actual and predicted values.
  Based on these graphs, it can be seen that the model has learned the demand pattern well, but in some areas the prediction accuracy decreases due to the lack of data.
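For reference, the three metrics in the table could be computed from the test predictions roughly as follows. This is a generic scikit-learn sketch, not the project's evaluation script, and `best_model`, `X_test` and `y_test` come from the earlier sketches.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error)

# Predict the hourly demand for every test window and flatten to 2-D
y_pred = best_model.predict(X_test)
y_true_flat = y_test.reshape(len(y_test), -1)
y_pred_flat = y_pred.reshape(len(y_pred), -1)

mae = mean_absolute_error(y_true_flat, y_pred_flat)
rmse = np.sqrt(mean_squared_error(y_true_flat, y_pred_flat))
# MSLE requires non-negative values, so negative predictions are clipped to 0
msle = mean_squared_log_error(y_true_flat, np.clip(y_pred_flat, 0, None))
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  MSLE={msle:.4f}")
```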
- Clone or download the repository.
- Navigate to `TaxiDemand-Prediction-Using-DeepLearning\`.
- Type `streamlit run StreamlitMap.py` in the command line.
- Copy the returned Network URL (e.g. `http://172.19.0.1:8501` or `http://localhost:8501`) and paste it into your web browser.
- That's it! The app takes a couple of seconds to load because it works with big data.

Note: There are multiple environments on which you can execute the app and I cannot cover them all, so these steps refer to my personal environment (Windows 11).
As expected, the accuracy of the model is much better in the regions where the number of requests is higher. According to all the available data and results, it can be said that the model is accurate. However, it is possible to increase the accuracy of the model in low demand areas by increasing the amount of data or their time intervals.
Also, to increase the accuracy of the model in general, data such as `traffic` and the number of taxi `drop-offs` in each area can be used as additional effective features.
It seems like there is a big gap between product creation and product use. There are lots of tools for data scientists to analyse, clean and transform data, train models, visualize data, etc. But once all that work is done, we need to put it into production and create a product that someone unskilled in the field can use, for example a web application. Streamlit seems to be the best option, and yet it is still in its early stages. For this reason I encountered a significant number of bugs in Streamlit while trying to integrate an Altair choropleth map. This made me realise how young the data science field and some of its tools still are.
I hope this repo is useful for you, and I will be honored if you share your thoughts about it with me 😄.