[ET4310] Supercomputing for Big Data

This repository contains my personal solutions to the three practical assignments of the ET4310 "Supercomputing for Big Data" course at TU Delft, taught in Q1 2016/17.

Each of the three Lab folders (Lab1, Lab2, Lab3) contains the code and the report for one assignment.

First assignment

Part 1

Data exploration on Wikipedia page view statistics

Part 2

Using Spark Streaming to collect tweets and compute statistics

Part 3

Exploration of the output of Part 2 to compute more statistics

Second assignment

Using the IMDb dumps to compute the degrees of separation between Kevin Bacon and a specified actor
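
For readers unfamiliar with the problem, the idea can be sketched as a breadth-first search over a co-star graph. This is a minimal single-machine illustration in plain Python, not the code in Lab2. The `cast_lists` mapping, the movie keys, and the "Actor A/B/C" names are hypothetical placeholders, and the real assignment works on the much larger IMDb dumps.

```python
from collections import defaultdict, deque


def build_costar_graph(cast_lists):
    """Link every pair of actors that appear in the same movie."""
    graph = defaultdict(set)
    for actors in cast_lists.values():
        for actor in actors:
            graph[actor].update(a for a in actors if a != actor)
    return graph


def degrees_of_separation(graph, source, target):
    """Return the number of co-starring hops from source to target.

    Returns None when the two actors are not connected.
    """
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        actor, distance = queue.popleft()
        # .get avoids adding unknown actors to the defaultdict.
        for costar in graph.get(actor, ()):
            if costar == target:
                return distance + 1
            if costar not in visited:
                visited.add(costar)
                queue.append((costar, distance + 1))
    return None


# Hypothetical toy data: movie id -> list of actors in that movie.
cast_lists = {
    "movie_1": ["Kevin Bacon", "Actor A"],
    "movie_2": ["Actor A", "Actor B"],
    "movie_3": ["Actor C"],
}
graph = build_costar_graph(cast_lists)
to_actor_b = degrees_of_separation(graph, "Kevin Bacon", "Actor B")
to_actor_c = degrees_of_separation(graph, "Kevin Bacon", "Actor C")
```

Breadth-first search visits actors in order of increasing distance, so the first time the target is reached the hop count is the shortest one.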

Third assignment

Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline

Part 1

Interleaving DNA reads from FASTQ files
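
As a rough illustration of what interleaving means here: a FASTQ record spans four lines (header, sequence, separator, quality), and interleaving two paired-end files alternates one record from each. The sketch below is plain Python working on in-memory lists, not the code in Lab3. The helper names and the sample reads are hypothetical, and it assumes both inputs hold the same number of records.

```python
from itertools import islice


def read_records(lines):
    """Group an iterable of FASTQ lines into 4-line records."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(lines_1, lines_2):
    """Yield lines alternating one record from each input.

    Assumes both inputs contain the same number of records;
    zip stops at the shorter input without raising an error.
    """
    for rec_1, rec_2 in zip(read_records(lines_1), read_records(lines_2)):
        yield from rec_1
        yield from rec_2


# Hypothetical sample reads, two records per file.
reads_1 = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "TTGA", "+", "IIII"]
reads_2 = ["@read1/2", "TGCA", "+", "IIII", "@read2/2", "CCAT", "+", "IIII"]
interleaved = list(interleave(reads_1, reads_2))
```

The result keeps each four-line record intact and places the two mates of a pair next to each other.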

Part 2

DNA analysis pipeline implementation

Part 3

Interacting with HDFS

License and Copyright

See LICENSE.
