This is the final project of the Data Engineering Zoomcamp by DataTalks Club
Chicago is one of the biggest states in the US that is full of life. Developing a data pipeline for the Crash data will help analytics to understand why accidents happen and how to prevent them, which will help save lives and make the city a safer place. It also guides how we plan our streets and neighborhoods.
The traffic crashes dataset is taken from the Chicago data portal, which contains various open-access datasets related to the city. In my project, I used two data sets the Traffic Crashes - Crashes dataset, which shows information about each traffic crash on city streets within the City of Chicago and the Traffic Crashes - People dataset, which shows information about people involved in a crash and if any injuries were sustained.
The Socrata Open Data API (SODA) provides programmatic access to this dataset, including the ability to filter, query, and aggregate data.
The technologies I used in developing the pipeline are:
Cloud: GCP
Data Lake: GCS buckets
Data Warehouse: BigQuery
Infrastructure as Code (IaC): Terraform
Workflow Orchestration: Prefect cloud
Flow containerization: Docker
Data Transformation: DBT cloud
Data Visualization: Looker Studio
- The flow module first ingests the bulk data with two flows one for each dataset (crashes and people data), those two flows are scheduled only once by
cron
dateschedule
expressions. - As the datasets are updated daily, another two flows are scheduled to execute daily for extracting newly updated data for each day
- After extracting data, data are stored in Google Data Storage (buckets) as raw Paquet files and transferred to Bigquery Data warehouse
- At this point transformation takes place through DBT.
stg
models are incremental models to synchronize the data flow pipeline daily updates, this model simply cleans the raw data found in the warehouse. - Another model is the
core
model, which is a group of analytical models that generate aview
from thestg
model for selecting some interesting insights from the data for future visualization
- Finally, I fed the Google Looker Studio with these views and made a report from them in different charts Crashes Report
- Injury severity ratio for each crash prime contribution cause: this heatmap illustrates the percentage of severe injury from total injuries for each reported crash cause. From this report, we can see that bicycles advancing on the red light and the physical condition of the driver are the two most common crash types that result in severe injuries
- Ratio of each sex as a driver involved in the crash: from this pie chart, we can see that more than half of the crashes involve male drivers
- Sex ratio as a driver for each crash cause: In this chart, we can see that the females' highest ratio in crash types are distraction from inside the vehicle and passing stopped school buses. On the other side, the male ratio is the highest in the obstructive crosswalk and under alcohol/drug influence crash types
- Weather influence on crashes: in the below chart, for each weather condition, we can see the percentage of crashes caused by the weather from the total days having that weather condition. We can conclude that severe cross-wind gates and blowing snow are the two most common weather conditions that can contribute to traffic crashes
- Relation between time of the day and month and its effect on crash frequency: from this chart, we can't see a significant effect of the daytime in different seasons on crash frequency except for early-morning crashes their frequency slightly decreases during May, June, and July. As a general overview crashes most often happen in the evening and least likely at midnight