Big-Data

Repository for Big Data Projects

Map Reduce using Hadoop (in Java)

The analysis was performed on the Quickdraw Dataset. This dataset consists of a detailed description of each drawing in NDJSON (Newline Delimited JSON) format.

Task 1: Object Instances

Sub Task 1 : Records that satisfy all of the following criteria are treated as clean records:
- word: Should contain only alphabetic characters and whitespace
- countrycode: Should contain exactly two uppercase letters
- recognized: Should be a boolean value, i.e. either true or false
- key_id: Should be a numeric string of exactly 16 characters
Sub Task 2 : Count the occurrences of the word in the cleaned dataset that are recognized.
Sub Task 3 : Count the occurrences of the word in the cleaned records that are not recognized but whose timestamp falls on a weekend.

The dataset is cleaned against these criteria using regular expressions during the map stage.
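The four cleaning criteria could be expressed as regular expressions roughly as follows. This is a minimal sketch, not the repository's actual code: the class name RecordCleaner and the method isClean are hypothetical, and it assumes the four fields have already been extracted from the JSON record as strings.

```java
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative and not taken from the repository.
class RecordCleaner {
    // word: letters and whitespace only
    private static final Pattern WORD = Pattern.compile("[A-Za-z\\s]+");
    // countrycode: exactly two uppercase letters
    private static final Pattern COUNTRY = Pattern.compile("[A-Z]{2}");
    // recognized: the literal text true or false
    private static final Pattern RECOGNIZED = Pattern.compile("true|false");
    // key_id: exactly 16 digits
    private static final Pattern KEY_ID = Pattern.compile("[0-9]{16}");

    // Returns true only if every field matches its pattern in full.
    static boolean isClean(String word, String countrycode, String recognized, String keyId) {
        return word != null && WORD.matcher(word).matches()
            && countrycode != null && COUNTRY.matcher(countrycode).matches()
            && recognized != null && RECOGNIZED.matcher(recognized).matches()
            && keyId != null && KEY_ID.matcher(keyId).matches();
    }
}
```

In a mapper, a record that fails isClean would simply be skipped, so that only clean records reach the counting logic.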

Arguments that are to be passed in the command line:
i. The word whose occurrences are to be counted and on which the further processing is done.
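The weekend test in Sub Task 3 could be sketched as below. This is an illustration under assumptions: the class name WeekendCheck is hypothetical, and the code assumes the record's timestamp string begins with an ISO date (yyyy-MM-dd), for example "2017-03-04 23:25:10.07453 UTC". The day of the week is taken from that date as written, with no time zone conversion.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;

// Hypothetical helper: names are illustrative and not taken from the repository.
class WeekendCheck {
    // Assumes the first 10 characters of the timestamp are a yyyy-MM-dd date.
    static boolean isWeekend(String timestamp) {
        LocalDate date = LocalDate.parse(timestamp.substring(0, 10));
        DayOfWeek day = date.getDayOfWeek();
        return day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY;
    }
}
```

A clean record would then count towards Sub Task 3 when its recognized field is false and isWeekend returns true for its timestamp.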

Task 2: Object Instances by Country

Sub Task 1 : A record is considered only if the Euclidean distance between the origin and the 0th coordinates of its first stroke (strokes are stored in the "drawing" array of the dataset) is greater than a given minimum distance. This distance is passed on the command line.
Sub Task 2 : Count the number of occurrences of a specific word per country in the cleaned dataset.

Arguments that are to be passed in the command line:
i. The word whose occurrences are to be counted and on which the further processing is done.
ii. The minimum distance based on which records are filtered (as mentioned above).
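The distance filter in Sub Task 1 could be sketched as follows. This is only an illustration: the class name DistanceFilter is hypothetical, and the code assumes the "drawing" array has already been parsed into a nested array in which drawing[0] is the first stroke, drawing[0][0] holds its x values and drawing[0][1] holds its y values.

```java
// Hypothetical helper: names are illustrative and not taken from the repository.
class DistanceFilter {
    // Returns true when the first point of the first stroke lies
    // strictly farther from the origin (0, 0) than minDistance.
    static boolean isFarEnough(double[][][] drawing, double minDistance) {
        double x = drawing[0][0][0]; // 0th x coordinate of the first stroke
        double y = drawing[0][1][0]; // 0th y coordinate of the first stroke
        // Math.hypot computes sqrt(x*x + y*y), the Euclidean distance to the origin
        return Math.hypot(x, y) > minDistance;
    }
}
```

For example, a first stroke starting at (60, 80) is exactly 100 units from the origin, so it is dropped for a minimum distance of 100 and kept for a minimum distance of 99.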

Running the Project

Download JSON-Java from https://github.com/stleary/JSON-java and add it to the Hadoop classpath. This library is used to parse the JSON records and extract their fields.

export HADOOP_CLASSPATH="$JAVA_HOME/lib/tools.jar:json-java.jar"

Compile the Java file

bin/hadoop com.sun.tools.javac.Main <path to java file>

Create a jar

jar cf <name of jar>.jar <name of java file>*.class

Place the NDJSON file in the HDFS input folder

bin/hdfs dfs -mkdir -p input
bin/hdfs dfs -put input/plane_carriers.ndjson input

Execute
For Task 1:
bin/hadoop jar <path to jar file> <class name> /user/ubuntu/input/plane_carriers.ndjson user/ubuntu/output <word>
For Task 2:
bin/hadoop jar <path to jar file> <class name> /user/ubuntu/input/plane_carriers.ndjson user/ubuntu/output <word> <distance>
Example: word = airplane and distance = 100
Finally, the output can be viewed with:

bin/hdfs dfs -cat user/ubuntu/output/part-r-00000
