Course work for Big Data Analytics with Spark
This repository contains assignments for the edX course Big Data Analytics with Spark, completed in December 2018.
Given an input file containing an arbitrary set of coordinates, the task is to write a Python 3 program that uses PySpark library functions to determine whether three or more of the points are collinear.
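The collinearity check can be illustrated without a cluster. The sketch below is plain Python, not the assignment's PySpark solution: it assumes integer coordinates, and the helper names `line_key` and `find_collinear` are hypothetical. It maps every pair of points to a canonical key for the line through them, groups points by that key, and keeps the lines that hold three or more points. The same map-then-group pattern is what a Spark job would distribute.

```python
from collections import defaultdict
from itertools import combinations
from math import gcd


def line_key(p, q):
    """Canonical integer coefficients (a, b, c) of the line a*x + b*y = c through p and q."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    c = a * x1 + b * y1
    # Divide out the common factor so every pair on the same line agrees.
    g = gcd(gcd(abs(a), abs(b)), abs(c)) or 1
    a, b, c = a // g, b // g, c // g
    # Fix the sign so the same line always yields the same key.
    if a < 0 or (a == 0 and b < 0):
        a, b, c = -a, -b, -c
    return (a, b, c)


def find_collinear(points):
    """Return {line_key: sorted points} for every line holding three or more points."""
    groups = defaultdict(set)
    for p, q in combinations(sorted(set(points)), 2):
        groups[line_key(p, q)].update((p, q))
    return {key: sorted(pts) for key, pts in groups.items() if len(pts) >= 3}


# Hypothetical sample input: two lines of three points each, sharing (2, 2).
points = [(0, 0), (1, 1), (2, 2), (3, 1), (1, 3)]
result = find_collinear(points)
```

Comparing every pair is quadratic in the number of points, which is acceptable for a sketch but is the part that benefits most from being distributed.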
This project attempts to classify geographical locations according to their predicted tree cover using Gradient Boosting and Random Forest classifiers.
This project estimates intrinsic dimension by calculating the mean squared distance from every point in the dataset to its representative center. The representative centers are found with the K-Means API in Spark.
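As a rough illustration of the quantity involved, the sketch below computes the mean squared distance from each point to its nearest center in plain Python. The centers are hard-coded here as an assumption; in the project they would come from a K-Means fit in Spark. The function name `mean_squared_distance` and the sample data are hypothetical.

```python
def mean_squared_distance(points, centers):
    """Average, over all points, of the squared Euclidean distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total / len(points)


# Hypothetical data: two tight clusters around x = 0 and x = 10.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two centers, one per cluster: every point is exactly 1 unit from its center.
msd_two = mean_squared_distance(points, [(0.0, 1.0), (10.0, 1.0)])

# A single center in the middle is much further from every point.
msd_one = mean_squared_distance(points, [(5.0, 1.0)])
```

Repeating the calculation for different numbers of centers shows how quickly the distance falls as centers are added.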
This assignment covers a set of steps to analyze Twitter feed data.
- Parsing JSON strings to JSON objects
- Number of posts from each user partition
- Tokens that are relatively popular in each user partition
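A minimal plain-Python sketch of the first two steps, plus raw token counts per partition, might look like the following. The field names `user` and `text`, the sample posts, and the `partition_of` mapping are all assumptions made for illustration; the real feed format and partitioning are not described here. The measure of relative popularity is not defined in this description either, so the sketch stops at raw counts.

```python
import json
from collections import Counter

# Hypothetical raw input: one JSON string per post.
raw_lines = [
    '{"user": "alice", "text": "spark is fast"}',
    '{"user": "bob", "text": "spark spark jobs"}',
    '{"user": "carol", "text": "slow jobs"}',
]

# Hypothetical mapping from user to user partition.
partition_of = {"alice": "A", "bob": "A", "carol": "B"}

# Step 1: parse JSON strings into Python objects.
posts = [json.loads(line) for line in raw_lines]

# Step 2: count the posts in each user partition.
posts_per_partition = Counter(partition_of[p["user"]] for p in posts)

# Raw token counts per partition, split on whitespace.
tokens_per_partition = {}
for p in posts:
    part = partition_of[p["user"]]
    tokens_per_partition.setdefault(part, Counter()).update(p["text"].split())
```

In a Spark version, the list comprehension and the counting would typically become a `map` followed by a reduce-by-key over an RDD of lines.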
TensorFlow code to distinguish between a signal process that produces Higgs bosons and a background process that does not. The task is modeled as binary classification.