1- World Happiness Report: The happiness scores and rankings use data from the Gallup World Poll Survey, the dataset tries to estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations better.
2- All 250 Country Data: This Dataset contains 250 rows, Each row is a Country. There are 11 columns representing: Name, Region, Subregion, Population, Area, Gini, Real growth rate, Literacy rate, Inflation, Unemployment.
3- Life Expectancy (WHO): Health data from the year 2000-2015 for 193 countries to estimate the life expectancy through immunization factors, mortality factors, economic factors, social factors, and other health-related factors as well.
This project is implemented as a part of the Data Engineering CSEN1095 course at the GUC. Our main goal in this project is to design a complete data engineering process to answer questions we have (mentioned below) about the datasets described above. The Data Engineering process that we implement includes:
- Cleaning, tidying, plotting, integrating, exploring, and finding appropriate research questions.
- Feature Engineering and Visualizing.
- Designing ETL pipeline using Airflow.
- Deploying the project on a Big Data platform.
1- Data Integration: Data from the year "2015" was only considered from the "World Happiness Report" and "All 250 Country Data" datasets. Then, we merged the three datasets on column "Country" to get our integrated dataset.
2- Data Tidying: We converted data in columns("Real Growth Rating(%)", "Literacy Rate(%)", "Inflation(%)", "Unemployement(%)") from string into float to better be able to clean the data and run statistical analysis.
3- Data Cleaning:
A- Missing Data:
- "Unemployment(%)": We imputed the missing values using the mean(normally-distributed data) of the unemployment rate of countries from the same region as the country with the missing value.
- "Literacy Rate(%)": We imputed the missing values using the median(left-skewed data) of the Literacy rate of countries from the same region and same status (Developed/ Undeveloped) as the country with the missing value.
- "Real Growth Rating(%)": We imputed the missing values using the median(right-skewed data) of the Real Growth Rating of countries from the same region and same status (Developed/ Undeveloped) as the country with the missing value.
- "Inflation(%)": We imputed the missing values using the median(right-skewed data) of the Inflation rate of countries from the same region and same status (Developed/ Undeveloped) as the country with the missing value.
- "Alcohol", "Total expenditure": We dropped the 2 columns as they had 133/135 missing values.
- Population: Dropped duplicate column, (We preferred to use the other column as it had no missing values).
- "Gini": We imputed the missing values using the mean(normally-distributed data) of the Gini of countries from the same region and status as the country with the missing value.
- "GDP": We used data provided from World Bank in 2015 for countries, to replace the original column as it included many wrong values.
- "thinness 1-19 years": We imputed the missing values using the mean(normally-distributed data) of the "thinness 1-19 years" of countries from the same region as the country with the missing value.
- "thinness 5-9 years": We imputed the missing values using the mean(normally-distributed data) of the "thinness 5-9 years" of countries from the same region as the country with the missing value.
- "BMI": We imputed the missing values using the mean(normally-distributed data) of the BMI of countries from the same region as the country with the missing value.
- "Hepatitis B": We imputed the missing values using the median(Left skewed data) of the BMI of countries from the same region as the country with the missing value.
B- Outliers:
- Outliers were detected using box-plots for all columns that we use in our analysis separately, then we checked if those outliers contain wrong values that should be removed. If the values were correct we keep the outliers and if not we remove/impute them.
- The decision for every column's outliers is written as a Text cell below the code that we use for detecting the outliers in that column.
- Feature Engineering:
-
Feature 1: We added a new Column (GDP) describing the whole Gross Domestic Product which is the multiplication of GDP per Capita and the population. The feature is visualized using a bar plot where the x-axis represents the Region and the y-axis represents the GDP.
-
Feature 2: We added a new column(Population/KM2) that describes the number of people per km2 and is calculated by dividing the population over the area. The feature is visualized using a bar plot where the x-axis represents the Region and the y-axis represents the Population per Kilometer squared.
-
Feature 3: We added 3 new columns(Schooling Index, Life Expectancy Index, Income Index) to calculate the fourth column and to add it which is The Income Index Per Capita for each country. To compute those indices we used this Link. The feature is visualized using a bar plot where the x-axis represents the Region and the y-axis represents the GNI value.
- ETL Pipeline:
We use airflow to create our Extract-Transform-Load pipeline
-
- Data Extraction: We define three functions in the DAG file "extract_data_250_country", "extract_data_Life_exp", "extract_data_2015", "extract_GDP_per_capita" that extracts the three datasets as CSV and convert them into Dataframes and return them (Task1, Task2, Task3, Task 4).
-
- Data Transformation:
- Data Integration: We defined a function "merge_data" that takes the three Dataframes, merges them, and returns the merged Dataframe (Task 5).
- Data Cleaning: We defined a function "clean_data" that takes the merged Dataframe, cleans it as described above, and returns the cleaned Dataframe (Task 6).
- Feature Engineering: We defined a function "feature_engieering" that takes the cleaned Dataframe, adds the engineered features as described above, and returns the finalized Dataframe (Task 7).
-
- Data Loading: We define a function "store_data" that takes the finalized Dataframe and convert it to CSV and save it (Task 8).
-
Notes:
- Tasks are run in order (Task1>>Task2>>Task3>>Task4>>Task5>>Task6>>Task7>>Task8 to ensure smooth implementation of our ETL pipeline.
- XCOM is used to pass the Dataframes between the tasks.
- Does the literacy rate affect life expectancy, hepatitis b, measles, aids, diphtheria, and polio?
- Do a country's gini and its unemployment rate affect happiness?
- Does the region of a country affect its GDP per capita and the mortality of its adults?
- Does a country's area affect its population?
- Do a country's population and its GDP per capita affect the unemployment rate?
- Does the status of a country affects the schooling years or not?
- Does the Happiness score have a positive or negative relationship with Life Expectancy?
- Does GDP affect a country's Happiness score?
- Does the Population/KM2 increases the number of Measles reported cases?
- How does GNI compare with GDP for countries?
- How did the Happiness score change for Regions from year 2015 to 2019?
- Does Life Expectency increase from year 2000 to year 2015 due to the medical advancements?