Chicago Crashes Data Engineering project

This is the final project of the Data Engineering Zoomcamp by DataTalks Club

Introduction

Chicago is one of the biggest states in the US that is full of life. Developing a data pipeline for the Crash data will help analytics to understand why accidents happen and how to prevent them, which will help save lives and make the city a safer place. It also guides how we plan our streets and neighborhoods.

Data Source

The traffic crashes dataset is taken from the Chicago data portal, which contains various open-access datasets related to the city. In my project, I used two data sets the Traffic Crashes - Crashes dataset, which shows information about each traffic crash on city streets within the City of Chicago and the Traffic Crashes - People dataset, which shows information about people involved in a crash and if any injuries were sustained.

Data API

The Socrata Open Data API (SODA) provides programmatic access to this dataset, including the ability to filter, query, and aggregate data.

Technologies

The technologies I used in developing the pipeline are:

Cloud: GCP
Data Lake: GCS buckets
Data Warehouse: BigQuery
Infrastructure as Code (IaC): Terraform
Workflow Orchestration: Prefect cloud
Flow containerization: Docker
Data Transformation: DBT cloud
Data Visualization: Looker Studio

Project flow

The flow module first ingests the bulk data with two flows one for each dataset (crashes and people data), those two flows are scheduled only once by cron date schedule expressions.
As the datasets are updated daily, another two flows are scheduled to execute daily for extracting newly updated data for each day
After extracting data, data are stored in Google Data Storage (buckets) as raw Paquet files and transferred to Bigquery Data warehouse
At this point transformation takes place through DBT. stg models are incremental models to synchronize the data flow pipeline daily updates, this model simply cleans the raw data found in the warehouse.
Another model is the core model, which is a group of analytical models that generate a view from the stg model for selecting some interesting insights from the data for future visualization

Finally, I fed the Google Looker Studio with these views and made a report from them in different charts Crashes Report

Some insights from the report

Injury severity ratio for each crash prime contribution cause: this heatmap illustrates the percentage of severe injury from total injuries for each reported crash cause. From this report, we can see that bicycles advancing on the red light and the physical condition of the driver are the two most common crash types that result in severe injuries

Ratio of each sex as a driver involved in the crash: from this pie chart, we can see that more than half of the crashes involve male drivers

Sex ratio as a driver for each crash cause: In this chart, we can see that the females' highest ratio in crash types are distraction from inside the vehicle and passing stopped school buses. On the other side, the male ratio is the highest in the obstructive crosswalk and under alcohol/drug influence crash types

Weather influence on crashes: in the below chart, for each weather condition, we can see the percentage of crashes caused by the weather from the total days having that weather condition. We can conclude that severe cross-wind gates and blowing snow are the two most common weather conditions that can contribute to traffic crashes

Relation between time of the day and month and its effect on crash frequency: from this chart, we can't see a significant effect of the daytime in different seasons on crash frequency except for early-morning crashes their frequency slightly decreases during May, June, and July. As a general overview crashes most often happen in the evening and least likely at midnight

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
dbt		dbt
flows		flows
img		img
report		report
terraform		terraform
.DS_Store		.DS_Store
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Chicago Crashes Data Engineering project

Introduction

Data Source

Data API

Technologies

Project flow

Some insights from the report

About

Releases

Packages

Languages

aya9aladdin/chicago-crashes-DEproject

Folders and files

Latest commit

History

Repository files navigation

Chicago Crashes Data Engineering project

Introduction

Data Source

Data API

Technologies

Project flow

Some insights from the report

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages