https://github.com/regostar/us_accidents_eda/blob/main/Presentation.pdf
https://github.com/regostar/us_accidents_eda/blob/main/eda-latest.pdf
This dataset covers traffic accidents in the US from 2016 to 2023 and is sourced from Kaggle. It contains about 7.7 million accident records and 46 features.
https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents/data
Size of the dataset: about 7.7 million accident records
PySpark, MLlib, and a custom GCP architecture.
Integrations with tools such as Hive and Solr.
Highly scalable within the currently allowed infrastructure (an autoscaling policy was added on GCP).
Over 94% accuracy in predicting accident severity using a Random Forest classifier.
State-of-the-art, business-ready architecture and infrastructure.
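The severity classifier listed above could be set up roughly as follows. This is a minimal sketch, not the project's actual training code: the feature columns are an assumed subset of the numeric columns in the Kaggle dataset, the hyperparameters and the 80/20 split are illustrative, and `train_severity_model` is a hypothetical helper name. PySpark is imported inside the function so the module can be loaded on a machine without Spark.

```python
# Assumed label and feature columns; the project may use a different selection.
LABEL_COL = "Severity"
FEATURE_COLS = [
    "Temperature(F)",
    "Humidity(%)",
    "Pressure(in)",
    "Visibility(mi)",
    "Wind_Speed(mph)",
    "Distance(mi)",
]


def train_severity_model(df, num_trees=50, max_depth=10, seed=42):
    """Fit a random forest on a Spark DataFrame and return (model, test accuracy).

    Expects the feature columns to be numeric, e.g. a CSV read with inferSchema=True.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Map the severity levels to zero-based label indices.
    indexer = StringIndexer(inputCol=LABEL_COL, outputCol="label")
    # Pack the numeric columns into one vector; rows with nulls are skipped.
    assembler = VectorAssembler(
        inputCols=FEATURE_COLS, outputCol="features", handleInvalid="skip"
    )
    forest = RandomForestClassifier(
        labelCol="label",
        featuresCol="features",
        numTrees=num_trees,
        maxDepth=max_depth,
        seed=seed,
    )

    # Hold out 20% of the rows for evaluation.
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=seed)
    model = Pipeline(stages=[indexer, assembler, forest]).fit(train_df)

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return model, evaluator.evaluate(model.transform(test_df))
```

On a Dataproc cluster this would be called with the accidents DataFrame, for example `model, acc = train_severity_model(df)`. The accuracy obtained depends on the features and settings chosen, so this sketch should not be expected to reproduce the reported figure.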
The core of the architecture involves utilizing Google Cloud Dataproc to create and manage clusters. Dataproc provides a fully managed Apache Spark and Hadoop service, allowing for scalable and efficient data processing.
For distributed data processing, PySpark, the Python API for Apache Spark, is used. It makes it easier to create scalable and parallelized data transformations and analytics that take advantage of the Dataproc cluster's processing capability.
Hive, a Hadoop data warehouse with a SQL-like query language, is integrated into the design. It offers structured querying of the dataset, allowing tables to be created and complex queries to be run, which improves data retrieval and analysis.
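The PySpark and Hive steps described above can be sketched together as follows. This is an illustration under assumptions, not the project's code: the table name `us_accidents`, the application name, the helper names, and the CSV path passed in are hypothetical, and only a few columns with plain names are stored so the table works with the default storage format.

```python
# Hypothetical table name, used only for this illustration.
TABLE = "us_accidents"


def severity_by_state_query(table=TABLE, limit=10):
    """Build a HiveQL query that counts accidents per state and severity."""
    return (
        "SELECT State, Severity, COUNT(*) AS accidents "
        f"FROM {table} "
        "GROUP BY State, Severity "
        "ORDER BY accidents DESC "
        f"LIMIT {int(limit)}"
    )


def load_and_query(csv_path, table=TABLE):
    """Load the CSV with PySpark, store it as a Hive table, and query it."""
    from pyspark.sql import SparkSession

    # enableHiveSupport() connects the session to the Hive metastore.
    spark = (
        SparkSession.builder.appName("us-accidents-eda")
        .enableHiveSupport()
        .getOrCreate()
    )
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Keep a small set of columns and persist them as a managed table.
    subset = df.select("ID", "State", "Severity", "Start_Time")
    subset.write.mode("overwrite").saveAsTable(table)

    return spark.sql(severity_by_state_query(table))
```

Calling `load_and_query(path).show()` on the cluster would print the most frequent state and severity combinations. Keeping the query in its own function makes it easy to inspect or reuse from the Hive CLI.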
Solr for full-text search and indexing.
for storing intermediate results; works in tandem with Dataproc to store the dataset.
for providing authorized access to users to manage services.
Autoscaling, set as a policy to scale up by creating new VM nodes.
- this provides a user-friendly way to work on the PySpark (Python) code.
Solr on GCP Dataproc, which provides rich indexing and querying capabilities
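A full-text search against the Solr index could be issued through Solr's standard `select` handler. The sketch below only builds the request URL, so it runs without a Solr server. The collection name, the indexed fields `Description` and `State`, and the helper name `solr_select_url` are assumptions for illustration; the project's actual Solr schema is not described here.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, text, state=None, rows=10):
    """Build a Solr select URL that searches accident descriptions."""
    params = [
        ("q", f"Description:({text})"),  # full-text match on the description
        ("rows", rows),                  # maximum number of hits to return
        ("wt", "json"),                  # ask for a JSON response
    ]
    if state:
        # A filter query narrows the hits without affecting relevance scores.
        params.append(("fq", f"State:{state}"))
    return f"{base_url.rstrip('/')}/solr/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and collection; 8983 is Solr's default port.
    print(solr_select_url("http://localhost:8983", "accidents", "lane blocked", state="CA"))
```

The resulting URL can be fetched with any HTTP client from a node that can reach the Solr instance on the cluster.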
For more statistics, see https://github.com/regostar/us_accidents_eda/blob/main/eda-latest.pdf
Jayasyam Reddy Desireddy - Programmed the prediction algorithms. Achieved excellent accuracy using a combination of models.
Harika Samala - Contributed to preliminary data analysis and EDA.