This repository contains my personal solutions to the three practical assignments of the ET4310 "Supercomputing for Big Data" course at TU Delft, taught in Q1 2016/17.
The three Lab folders contain the code and the report for each of the assignments.
Data exploration on Wikipedia page view statistics
Using Spark Streaming to collect tweets and compute statistics
Exploration of the output of part 2 to compute more statistics
Using the IMDb dumps to compute the degrees of separation from Kevin Bacon to a specified actor
Cluster-Based Apache Spark implementation of the GATK DNA Analysis Pipeline
Interleaving DNA reads from FASTQ files
DNA analysis pipeline implementation
Interacting with HDFS
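
For readers unfamiliar with the "degrees of separation" task listed above, it can be pictured as a breadth-first search over a co-star graph. The sketch below is plain Python and is not the code in this repository; the function name `degrees_of_separation`, the `cast_by_movie` input shape, and the toy data are all assumptions made for illustration, and the real assignment runs on Spark against the full dumps.

```python
from collections import deque


def degrees_of_separation(cast_by_movie, source, target):
    """Return the number of co-star hops from source to target.

    cast_by_movie is a hypothetical dict mapping a movie title to the
    list of actors in it. Returns None if no connection exists.
    """
    # Invert the movie -> cast mapping into actor -> movies.
    movies_by_actor = {}
    for movie, cast in cast_by_movie.items():
        for actor in cast:
            movies_by_actor.setdefault(actor, set()).add(movie)

    if source not in movies_by_actor or target not in movies_by_actor:
        return None

    # Standard breadth-first search: the first time an actor is
    # reached, the recorded distance is the shortest one.
    distance = {source: 0}
    queue = deque([source])
    while queue:
        actor = queue.popleft()
        if actor == target:
            return distance[actor]
        for movie in movies_by_actor[actor]:
            for co_star in cast_by_movie[movie]:
                if co_star not in distance:
                    distance[co_star] = distance[actor] + 1
                    queue.append(co_star)
    return None


# Made-up toy data, not taken from the IMDb dumps.
toy_cast = {
    "Movie A": ["Kevin Bacon", "Actor One"],
    "Movie B": ["Actor One", "Actor Two"],
    "Movie C": ["Actor Three"],
}

print(degrees_of_separation(toy_cast, "Kevin Bacon", "Actor Two"))  # prints 2
```

An actor with no chain of shared movies back to the source (here "Actor Three") yields `None` rather than a number.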
See LICENSE.