Quick Links :-

US Accidents Dataset

This dataset contains records of traffic accidents in the US from 2016 to 2023, sourced from Kaggle. It consists of over 7 billion data points and 46 unique features.

https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents/data

Features of Project :-

Dataset size: 7B data points.

PySpark, MLlib, and a custom GCP architecture.

Integrations with tools such as Hive and Solr.

Highly scalable within the currently allowed infrastructure (an autoscaling policy was added on GCP).

Over 94% accuracy in predicting accident severity using a Random Forest classifier.

State-of-the-art, business-ready architecture and infrastructure.

Architecture :-

Google Cloud Platform (GCP) Infrastructure

Dataproc Cluster

The core of the architecture involves utilizing Google Cloud Dataproc to create and manage clusters. Dataproc provides a fully managed Apache Spark and Hadoop service, allowing for scalable and efficient data processing.
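As a rough illustration of how such a cluster might be created, the sketch below assembles a `gcloud dataproc clusters create` command. The cluster name, region, worker count, and policy name are hypothetical placeholders, not values taken from this project, and the optional components listed are only available on Dataproc image versions that support them.

```python
import shlex


def dataproc_create_command(cluster, region, num_workers, autoscaling_policy=None):
    """Return the argv list for `gcloud dataproc clusters create`."""
    cmd = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--num-workers={num_workers}",
        # Jupyter and Solr are optional components; availability depends on
        # the image version, so treat this flag as an assumption.
        "--optional-components=JUPYTER,SOLR",
        # Component Gateway exposes web UIs such as Jupyter via the console.
        "--enable-component-gateway",
    ]
    if autoscaling_policy:
        # Attach an existing autoscaling policy by its ID.
        cmd.append(f"--autoscaling-policy={autoscaling_policy}")
    return cmd


# Hypothetical names, for illustration only.
command = dataproc_create_command(
    "accidents-cluster", "us-central1", 2, autoscaling_policy="accidents-policy"
)
print(shlex.join(command))
```

Building the command as a list keeps each flag separate, so it can be passed directly to `subprocess.run` without shell quoting issues.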

PySpark

For distributed data processing, PySpark, the Python API for Apache Spark, is used. It makes it easier to create scalable and parallelized data transformations and analytics that take advantage of the Dataproc cluster's processing capability.

Hive

Hive, a Hadoop data warehouse with a SQL-like query language, is integrated into the design. It offers structured querying of the dataset, allowing tables to be created and complex queries to be executed, which improves data retrieval and analysis performance.
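A minimal sketch of the kind of query Hive makes possible is shown below. It only builds the HiveQL text; the table name `us_accidents` is a hypothetical placeholder, and the query assumes the table exposes the dataset's `State` and `Severity` columns.

```python
def severe_accidents_by_state_query(table, min_severity=3, limit=10):
    """Build a HiveQL query counting severe accidents per state."""
    # Only allow simple identifiers, since the table name is interpolated.
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    return (
        "SELECT State, COUNT(*) AS accidents\n"
        f"FROM {table}\n"
        f"WHERE Severity >= {int(min_severity)}\n"
        "GROUP BY State\n"
        "ORDER BY accidents DESC\n"
        f"LIMIT {int(limit)}"
    )


# Hypothetical table name, for illustration only.
query = severe_accidents_by_state_query("us_accidents", min_severity=3, limit=5)
print(query)
```

On the cluster, a query like this could be run from PySpark with a Hive-enabled session, for example `SparkSession.builder.enableHiveSupport().getOrCreate().sql(query)`.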

Solr

Used for full-text search and indexing.
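To make the full-text search concrete, here is a small sketch that builds a request URL for Solr's standard `select` handler. The host, the collection name `accidents`, and the choice of the `Description` field as the indexed text are assumptions for illustration; the README does not state how the index is laid out.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, text, field="Description", rows=10):
    """Build a Solr /select URL for a phrase search on one field."""
    # Escape backslashes and double quotes so the phrase stays well-formed.
    phrase = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{phrase}"', "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection, for illustration only.
url = solr_select_url("http://localhost:8983", "accidents", "lane blocked", rows=5)
print(url)
```

The resulting URL can be fetched with any HTTP client; the `wt=json` parameter asks Solr to return the matches as JSON.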

Google Cloud Storage

Used for storing the dataset and intermediate results; works in tandem with Dataproc.

Identity and Access Management (IAM)

Provides authorized access for users to manage services.

Autoscaling policy

Set as a policy to scale the cluster out by creating new VM nodes.
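The sketch below shows what a Dataproc autoscaling policy of this kind might look like, expressed as a Python dictionary that mirrors the fields of the policy resource. The worker bounds, scaling factors, and timeouts are illustrative assumptions; the project's actual policy values are not given here.

```python
import json


def autoscaling_policy(min_workers, max_workers, scale_up=0.5, scale_down=0.5):
    """Return a dict shaped like a Dataproc autoscaling policy."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        # Bounds on the number of primary worker VMs.
        "workerConfig": {"minInstances": min_workers, "maxInstances": max_workers},
        "basicAlgorithm": {
            # Minimum wait between scaling decisions (assumed value).
            "cooldownPeriod": "120s",
            "yarnConfig": {
                # Fraction of pending YARN memory to add as new capacity.
                "scaleUpFactor": scale_up,
                # Fraction of available YARN memory to remove.
                "scaleDownFactor": scale_down,
                # How long running work may finish before a node is removed.
                "gracefulDecommissionTimeout": "3600s",
            },
        },
    }


# Illustrative bounds only.
policy = autoscaling_policy(min_workers=2, max_workers=10)
print(json.dumps(policy, indent=2))
```

A policy with these fields is typically saved to a file and registered with `gcloud dataproc autoscaling-policies import`, then attached to a cluster by its ID.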

Visualizations using Jupyter Notebook

Provides a user-friendly way to work with the PySpark (Python) code.

Apache Solr

Runs on GCP Dataproc and provides rich indexing and querying capabilities.

Results :-

For more statistics, see https://github.com/regostar/us_accidents_eda/blob/main/eda-latest.pdf

Contributors :-

Jayasyam Reddy Desireddy - Programmed the prediction algorithms. Achieved excellent accuracy using a combination of models.

Harika Samala - Contributed to preliminary data analysis and EDA.

