This repository contains our project for Big Data Computing. The project uses Hadoop MapReduce on Google Cloud Platform (GCP) to analyze two large datasets: one of New York City parking violations and one of NBA shot logs. Our goal was to gain a better understanding of Hadoop MapReduce for distributed data processing, the Hadoop Distributed File System (HDFS) for scalable storage, and the advantages of using GCP for managing and processing large datasets efficiently.
Hadoop is an open-source framework designed to store and process large datasets across clusters of computers using simple programming models. It provides a distributed storage and processing system, ensuring fault tolerance and high availability.
- Hadoop Distributed File System (HDFS): A distributed file system that stores data across multiple machines, providing high throughput access to application data.
- MapReduce: A programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
MapReduce is the core component of Hadoop that allows for scalable and efficient processing of large datasets. It is composed of two primary functions:
- Map Function: The Map function processes input data and produces a set of intermediate key/value pairs. Each input data element is independently processed by the map function.
- Reduce Function: The Reduce function takes the intermediate key/value pairs produced by the map function and merges them to form a possibly smaller set of output key/value pairs. The reduce function is applied in parallel, typically after the map phase is complete.
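As a concrete (if well-worn) illustration, here is word count written as a pair of plain Python functions. This is a generic example, not code from this repository, and `map_fn`/`reduce_fn` are names we chose for the sketch:

```python
def map_fn(line):
    """For each word in an input line, emit an intermediate (word, 1) pair."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Merge every count observed for one key into a single total."""
    return (word, sum(counts))
```

Within a full Hadoop job, these two functions run inside the following pipeline: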
- Input Split: The input data is split into fixed-size pieces called input splits or chunks. Each split is processed by a separate map task.
- Map Phase: Each map task processes an input split and generates intermediate key/value pairs. These pairs are grouped by key before being passed to the reduce phase.
- Shuffle and Sort: The intermediate key/value pairs are shuffled (i.e., transferred across the network to reducers) and sorted by key. This step ensures that all values associated with a particular key are sent to the same reducer.
- Reduce Phase: The reduce task processes the intermediate key/value pairs, merging them to produce the final output. The reduce function is applied to each unique key, aggregating its values.
- Output: The final output is written to the HDFS, consisting of key/value pairs that are stored across the distributed file system.
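To make the data flow concrete, the sketch below runs the whole pipeline in memory over a couple of lines, reusing the `map_fn` and `reduce_fn` defined above. Real Hadoop distributes each phase across the cluster and spills to disk, but the grouping semantics are the same:

```python
from itertools import groupby
from operator import itemgetter

def run_job(lines):
    # Map phase: every input record yields intermediate pairs.
    intermediate = [pair for line in lines for pair in map_fn(line)]
    # Shuffle and sort: ordering by key makes equal keys adjacent, mimicking
    # how Hadoop routes all values for one key to the same reducer.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: merge the values for each unique key.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(run_job(["the quick brown fox", "the lazy dog"]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```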
Hadoop MapReduce provides a robust and scalable approach to processing vast amounts of data across distributed systems. Its simplicity and effectiveness have made it a cornerstone of big data processing, enabling organizations to derive insights from their data efficiently.
We used Google Cloud Platform (GCP) to deploy and manage a Hadoop cluster, giving us the distributed compute needed to run Hadoop MapReduce jobs on large datasets.
Our goal here was to analyze parking violation data from New York City to gain insights into patterns and trends. We used data published by NYC OpenData covering parking violations issued between July 1, 2023 and June 30, 2024. The dataset contains features such as car model, color, location, time, and more.
We aimed to answer the following questions:
- When are tickets most likely to be issued?
- What are the most common years and types of cars to be ticketed?
- Where are tickets most commonly issued?
- Which vehicle color is most likely to be ticketed?
Doing so helped us understand the distribution and frequency of parking violations across the city, which can help in guiding policy decisions and improving enforcement strategies.
For downloading the NYC parking violation dataset, do the following:
- Start the 3-node cluster from your Google Cloud account and connect from the command line using the external IP of the manager node (instance-1). Then run `su root` and provide the password.
- Navigate to the data folder
cd mapreduce-test/mapreduce-test-data
- Download the data using the fetch_data.sh file we created
bash fetch_data.sh
- Ensure NYC_parking_violations.csv has been created and is stored within mapreduce-test/mapreduce-test-data
In fetch_data.sh, we use the curl command with NYC OpenData's API. By default the API returns at most 1,000 rows per request, but this can be adjusted with the offset and limit parameters in the API call. Our script raises the limit to 100,000 rows per request: it pulls the first 100,000 rows, writes them to parking_violations.csv, then loops through the rest of the data, increasing the offset on every iteration and appending to the file. Since the dataset has about 10.7 million rows and the script pulls batches of 100,000, it needs 107 iterations. We print the iteration number and the URL on each pass so we can monitor progress and confirm that the offset is advancing.
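fetch_data.sh itself is a shell script built around curl, but the same pagination logic is easy to sketch in Python. The endpoint below is a placeholder (substitute the dataset's real Socrata resource URL); `$limit` and `$offset` are the paging parameters of the NYC OpenData (Socrata) API:

```python
import urllib.request

# Placeholder URL: substitute the dataset's real Socrata resource id.
ENDPOINT = "https://data.cityofnewyork.us/resource/XXXX-XXXX.csv"
BATCH = 100_000   # rows per request (the API default is 1,000)
ITERATIONS = 107  # 107 batches of 100,000 cover the ~10.7M rows

with open("parking_violations.csv", "wb") as out:
    for i in range(ITERATIONS):
        url = f"{ENDPOINT}?$limit={BATCH}&$offset={i * BATCH}"
        print(f"iteration {i}: {url}")  # progress log, as in fetch_data.sh
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
            if i > 0:
                # Every response repeats the CSV header row;
                # keep only the first copy.
                data = data.partition(b"\n")[2]
            out.write(data)
```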
- Navigate to mapreduce-test/mapreduce-test-python/nycpark
- Ensure that there are 3 files there: mapper.py, reducer.py, and test.sh
- Run the code by executing the command
bash test.sh
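The actual mapper.py and reducer.py are in the repo; the sketches below only show the general shape of a Hadoop Streaming job for this question. The issue-date column index and date format are assumptions about the CSV layout, not taken from the repo:

```python
#!/usr/bin/env python3
# Mapper sketch: emit (weekday, 1) for every violation record.
import csv
import sys
from datetime import datetime

for row in csv.reader(sys.stdin):
    try:
        # Column 4 and the MM/DD/YYYY format are assumed, not verified.
        day = datetime.strptime(row[4], "%m/%d/%Y").strftime("%A")
        print(f"{day}\t1")
    except (IndexError, ValueError):
        continue  # skip the header and malformed rows
```

```python
#!/usr/bin/env python3
# Reducer sketch: Hadoop delivers input sorted by key, so totals can be
# accumulated over consecutive runs of the same weekday.
import sys

current, total = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = key, 0
    total += int(value)
if current is not None:
    print(f"{current}\t{total}")
```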
From our output, we can see that tickets are most likely to be issued on Thursdays, with 1,561,910 parking tickets issued on that day, and least likely on Sundays, with just 956,015.
- Navigate to mapreduce-test/mapreduce-test-python/nycpark
- Ensure that there are 3 files there: mapper_1.py, reducer_1.py, and test_1.sh
- Run the code by executing the command
bash test_1.sh
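This job follows the same pattern as the previous one; the interesting difference is the composite key. A sketch of a mapper keyed on (vehicle year, make), with hypothetical column positions, could look like the following, after which the counting reducer sketched above works unchanged:

```python
#!/usr/bin/env python3
# Mapper sketch: emit ("YEAR MAKE", 1) so the shuffle groups each
# year/make pair; column positions 35 and 7 are hypothetical.
import csv
import sys

for row in csv.reader(sys.stdin):
    try:
        year, make = row[35].strip(), row[7].strip()
        if year and make:
            print(f"{year} {make}\t1")
    except IndexError:
        continue  # skip malformed rows
```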
We can see that 2021 Toyotas had the highest number of tickets in our dataset. Toyotas and Hondas were the most frequently ticketed makes, and most tickets were issued to cars from the last few model years.
- Navigate to mapreduce-test/mapreduce-test-python/nycpark
- Ensure that there are 3 files there: mapper_2.py, reducer_2.py, and test_2.sh
- Run the code by executing the command
bash test_2.sh
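Counting works exactly as before with the street as the key; to surface the worst streets, the reducer output can be ranked afterwards. A small post-processing sketch, assuming the reducer emits tab-separated "STREET\tcount" lines:

```python
#!/usr/bin/env python3
# Ranking sketch: read "street\tcount" lines and print the top 10.
import sys

rows = []
for line in sys.stdin:
    name, _, count = line.rstrip("\n").partition("\t")
    if count.isdigit():
        rows.append((name, int(count)))

for name, count in sorted(rows, key=lambda r: r[1], reverse=True)[:10]:
    print(f"{name}\t{count}")
```

It can be piped after the job output, e.g. `hdfs dfs -cat output/part-* | python3 rank.py` (rank.py being a hypothetical name for the script above).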
The street with the highest number of tickets is ‘WB N Conduit Ave @ 8’ in Queens, with 193,732 parking tickets. The second-highest count is on Broadway in Manhattan, one of the most popular streets in NYC. This output shows which streets rack up the most tickets, giving us insight into where not to park!
- Navigate to mapreduce-test/mapreduce-test-python/nycpark
- Ensure that there are 3 files there: mapper_3.py, reducer_3.py, and test_3.sh
- Run the code by executing the command
bash test_3.sh
From this output, we can see that black vehicles are the most frequently ticketed in our dataset, with 2,653,385 tickets. Gray, white, red, and blue round out the top five most frequently ticketed vehicle colors.
Here we use Hadoop MapReduce to analyze NBA shot logs from the 2014-2015 NBA season. The dataset, obtained from Kaggle, contains details such as shot location, shooter, nearest defender, shot-clock time, and more; each row represents a single shot.
The main objective of this project was to find each player's "most feared defender", which we defined through a fear score. For each pair of players (A, B), the fear score of A when facing B is A's hit rate: shots made divided by total attempts with B as the closest defender. The most feared defender of player A is then the defender with the lowest fear score among those who were the closest defender on at least 5 of A's shots. We set the threshold at 5 instances to ensure some statistical significance and to filter out isolated incidents.
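The reducer in the repo may be structured differently, but a minimal sketch of the scoring step, assuming the mapper emits one "shooter,defender\tmade" line per shot (made being 1 or 0), looks like this:

```python
#!/usr/bin/env python3
# Fear-score sketch: tally each (shooter, defender) pair, then keep,
# for every shooter, the qualifying defender with the lowest hit rate.
import sys
from collections import defaultdict

attempts = defaultdict(int)  # (shooter, defender) -> contested shots
made = defaultdict(int)      # (shooter, defender) -> shots made

for line in sys.stdin:
    key, _, hit = line.rstrip("\n").partition("\t")
    shooter, _, defender = key.partition(",")
    attempts[(shooter, defender)] += 1
    made[(shooter, defender)] += int(hit)

feared = {}  # shooter -> (score, defender, makes, attempts)
for (shooter, defender), n in attempts.items():
    if n < 5:
        continue  # threshold: need at least 5 matchups to qualify
    score = made[(shooter, defender)] / n
    if shooter not in feared or score < feared[shooter][0]:
        feared[shooter] = (score, defender, made[(shooter, defender)], n)

for shooter, (score, defender, m, n) in sorted(feared.items()):
    # Player A, most feared defender, FG made, FG%, attempts.
    print(f"{shooter}\t{defender}\t{m}\t{score:.3f}\t{n}")
```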
- Start the 3-node cluster from your Google Cloud account and connect from the command line using the external IP of the manager node (instance-1). Then run `su root` and provide the password.
- Ensure that dataset is saved in mapreduce-test-data folder and named 'nba_shot_logs.csv'
- Navigate to /mapreduce-test/mapreduce-test-python/nba folder
- Ensure that there are 3 files there: mapper.py, reducer.py, and test.sh
- Run the code by executing the command
bash test.sh
For each player in the first column, the player in the second column is their most feared defender. Column 3 shows the number of field goals player A made when player B was the closest defender, column 4 the field goal percentage for this pairing, and column 5 the total number of field goal attempts. This data helps identify which defenders are most effective against specific players, allowing coaches and analysts to plan matchups and defensive assignments.