generated from INFO-511-F24/project-final
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindex.qmd
166 lines (99 loc) · 12.8 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "Flight Matrix"
subtitle: "INFO 511 - Fall 2024 - Final Project"
author:
- name: "Lean Mean Learning Machines"
affiliations:
- name: "School of Information, University of Arizona"
description: "Project description"
format:
html:
code-tools: true
code-overflow: wrap
embed-resources: true
editor: visual
execute:
warning: false
echo: false
jupyter: python3
---
## Abstract
This study aims to analyze flight data from major Belgian airports using the `flights.csv` dataset, focusing on identifying patterns and trends in flight departures and arrivals. The primary research question investigates how the total number of monthly and yearly flight departures and arrivals at Brussels Airport compare to other airports in Belgium, and whether seasonal trends significantly impact flight volumes. This study also shows the flight distribution across Belgium's airports. To achieve this, a comprehensive data exploration will be conducted, followed by data preprocessing to ensure consistency and accuracy.
## **Introduction**
This project focus on analyzing the flight data from January 2016 to January 2020 for major airports in Belgium, with a focus on Brussels Airport, one of the largest and busiest in Europe. Specifically, this project investigates how the total number of flight activities at Brussels Airport compares to other airports in Belgium and examines the seasonal trends over the year. Insights from this study could assist airport management in anticipating seasonal changes, optimizing resource allocation, and improving operational efficiency.
#### **Data Description**
This data comes from Eurocontrol, an international organization for air traffic control management throughout Europe. Air Traffic Flow Management (ATFM) collects information on daily flights arriving and departing European airports ([Commercial air transport in August 2021: in recovery - Products Eurostat News - Eurostat](https://ec.europa.eu/eurostat/web/products-eurostat-news/-/ddn-20210914-1)), and we sourced from the GitHub repository associated with the Tidytuesday project ([tidytuesday/data/2022/2022-07-12/readme.md at main · rfordatascience/tidytuesday](https://github.com/rfordatascience/tidytuesday/blob/main/data/2022/2022-07-12/readme.md)). This data contains 15 columns and 7306 rows for flights from the 2016 to 2019. The columns contain information such as the date of the flight, airport designator, airport name, country of the airport, and the numbers of departures and arrivals to the airport.
Key variables we have applied in our research are:
**Quantative Variables:**
- FLT_DEP_1 - Contains total number of departing flights
- FLT_ARR_1 - Contains total number of arriving flights
**Categorical Variables:**
- APT_NAME - Contains airport names
- STATE_NAME - Contains name of the country
- MONTH_MON - Contains the month in which the records of departures and arrivals took place
#### Data **Cleaning and** Preprocessing
To ensure data integrity and analytical clarity, several preprocessing steps were undertaken:
- Date Conversion: The variable YEAR_MONTH was converted to a date-time format for temporal analysis.
- Derived Variables: New variables, such as total monthly flights (sum of departures and arrivals), were created to facilitate trend analysis.
- Handling Missing Values: Observations with incomplete data for key variables were excluded.
- Grouping and Aggregation: Data was aggregated by airport and by month and year to examine trends and seasonal patterns.
#### Exploratory Data Analysis
As is shown in Figure 1 below, Brussels Airport consistently showed the highest flight activity, highlighting its dominant role in Belgian air traffic. The chart also exhibits clear seasonal peaks during the summer and declines in winter.
![](images/paste-1.png)
Figure 1. Monthly departures and arrivals at Brussels Airport and other major airports from 2016 to 2019.
![](images/paste-2.png)
Figure 2. The average monthly flight activity for Brussels Airport and other major airports.
Figure 2, the average monthly trend figure also presents strong seasonality at Brussels Airport, with a gradual increase from January to July, peaking in the summer, and a subsequent decline toward December.
These two figures confirm that there are clear correlations between monthly flight activity and seasonal trends were observed across airports.
## **Methodology**
#### **Overview of Analysis**
This analysis provides seasonal trends and monthly flight activity at Brussels Airport compared to other major Belgian Airports for year (2016-2019). It includes exploratory data analysis, statistical tests and modeling to check differences in flight activity among the airports.
#### **Seasonal Analysis**
Goal: To find Seasonal trends in flight activity at Brussels Airport and compare them with other airports.
Steps:
1. Combining the monthly data (like grouping departures, arrivals and total flights) for all years to show seasonal patterns.
2. Calculating monthly averages for Brussels and other airports.
3. Line Plots showing trends with peak in summer ad dips in winter.
This helps us to understand the travel trends and plan for busy or slow times. See Figure 1.
#### **Total number monthly flight departures and arrivals at Brussels Airport**
Goal: To Compare monthly flight activity at Brussels Airport with other major airports.
Steps:
1. Combining the monthly data (like grouping departures, arrivals and total flights) for all years to show seasonal patterns.
2. Comparing Brussels data with average monthly activity of other airports.
3. Creating a bar and line plot to highlight difference in flight activity.
Figure 2 shows how Brussels dominate flight compared other airports compare for many consecutive years.
#### **Hypothesis Testing**
Goal: The hypothesis that we are testing in this section is to demonstrate that there are significant variations in flight distributions among Belgian airports. This test will validate whether the differences in flight activity distribution among airports are due to random chance or reflect meaningful patterns. We hypothesize that that the flight distribution patterns are not due to random chance.
Steps:
1. Creating a table containing summarized flight activity by airport and month.
2. Calculating the expected frequencies by assuming flight activity which is evenly distributed among the airports based in marginal totals in the contingency table.
3. This test compared observed values with expected values using the formula.
4. The test statistic and corresponding p-value were used to determine whether observed differences were statistically significant.
5. Creating bar plot to compare the observed and expected frequencies for each airport making deviations clear.
## Results
### A. How does the total number of monthly flight departures and arrivals at Brussels Airport compare to other major airports in Belgium?
By aggregating the data by the variables `YEAR_MONTH` and `APT_NAME`, we were able to get all the total number of flights for departure, arrivals, and totals for each month by year for each airport. We then segregated the Brussels airport data from the rest of the other Belgium airports into two different sets for comparison. For the comparison, we performed a grouping by `YEAR_MONTH` for each of the sets which gave us a sum of the flights across each month for each year for Brussels and all other airports, respectively. When we place this data on a line graph to represent time, we get Figure 1. This graph highlights a few key aspects: 1) Brussels is the most popular airport by number of flights in Belgium with more than double the number of flights per month (with exception to April of 2016 - see Discussion for more information). 2) There appears to be seasonal patterns across the years of 2016 to 2019 for both Brussels airport as well as the rest of Belgium airports, where the peak season appear to be summer time and the lowest number of flights during winter.
### B. What are the seasonal trends over the year?
To evaluate the seasonal trend, we decided to view the trend over the months of a year and aggregating the years from 2016 to 2019 using the variable `MONTH_NUM`. This allows us to look at the data with a perspective that is more general in terms of months across these years. When we plotted this data onto a line plot (Figure 2), we have a more specific trend that shows the busy season of the summer and fall months that lasts from May to September. The number of flights decrease as winter/spring season begins from October until April.
### C. Hypothesis testing
We used a chi-squared test to examine the flight distribution across different airports in Belgium. Our hypothesis in this section is that majority of the flights are in Brussels. In this section, we had decided to go with the chi-squared test instead of the h-test, as we have sufficient categories to perform upon. We chose to use the `chi2_contingency` function from the `scipy.stats` package, with the expected data being that flights are equally distributed across all airports in Belgium. The chi-squared function compared the expected to the observed, and received the following results:
- Chi-Square Statistic: 2323.9454386944262
- p-value: 0.0
With our p-value set at 0.05, this means that we are able to reject the null hypothesis and that our results show that the flights are not evenly distributed at a significant value of 0.0. The chi-squared statistic result of 2323.94 tells us that the differences that we see from the expected values are large.
### D. Statistical Analysis
In this section, we show the differences between the expected and observed values for flight distributions. Figure 3 is a bar plot that shows the deviation from the expected difference of 0 from the expected numbers for each airport at each month of the year, labeled as `month, airport`, with an aggregation of the years 2016-2019.
![](images/paste-3.png) Figure 3 displays a bar plot representing the amount of deviation that the observed data had from the expected difference of 0 for each airport at each month of the year from 2016 to 2019.
## **Discussion**
#### **Summary of Results**
This project provides insights into flight dynamics at Belgian airports, with specific focus on the Brussels Airport, which is the largest and busiest one in Belgium. The results show that Brussels Airport consistently exhibits the highest monthly flight activity compared to other major airports in Belgium. Seasonal trends are evident, with notable peaks during the summer months and declines in winter. These facts support the hypothesis that Brussels Airport plays a central role in Belgian air traffic and highlight the importance of seasonality in air traffic management. Statistical analysis further confirms significant differences in flight activity across airports, underscoring Brussels Airport’s dominance in departures and arrivals.
#### **Limitations and Possible Improvements**
- **Data scope:** This project performs analysis on flight data spanning from 2016 to 2019. This project intentionally excludes the COVID period, which has a significantly negative impact on the global air industry. To overcome it, we perform this work on larger dataset, to analyze long-term trend.
- **Outliers:** This analysis focuses on overall trends and does not account for specific outliers. For example, in April 2016 (Fig. 1), flight activity decreased at Brussels Airport while increasing at other airports. This anomaly aligns with the terrorist attacks at Brussels Airport, as reported by BBC News ([Brussels explosions: What we know about airport and metro attacks - BBC News](https://www.bbc.com/news/world-europe-35869985)).
- **Statistical Constraints:** Hypothesis testing assumes independence among groups, which may not fully reflect the interconnected nature of air traffic systems. We can try some advanced methods to better mine the data and figure out the deeper relationships.
- **External Variables:** The analysis only focuses on the results (number of changes) and does not account for external factors, such as economic conditions, or weather events, which may impact flight activity. To improve it, we need to combine several datasets, and include additional variables, like passenger volumes, flight delays, or economic indicators.
#### **Future Works**
In the future, we can do further analysis on:
- The impact of flight status on passenger satisfaction.
- Predicting flight activity based on historical records.
- The influence of mobile device services (such as online check-in) on flight operations.
In conclusion, this project focuses on previous flight records of Brussels Airport in Belgian, highlighting the critical role of Brussels Airport in Belgian. The results exhibit the significance of seasonality in flight operations and provide a foundation for further exploration and practical applications in air traffic management and airport planning.