Simple Sequence Repeats identification using a MapReduce based Distributed Implementation of K-mer algorithm
The goal of this project is to understand the performance and feasibility of finding Simple Sequence Repeats in biological sequences using a distributed architecture. The project aims to identify any potential benefits of using a distributed model rather than the traditional approach.
- Devanshu Singh
- Rachana Gugale
- Mohammad Uzair Fasih
You will need to set up the project before using the make commands. See Installation and Setup.
The code contains implementations of the k-mer algorithm that run in both a distributed and a centralized manner.
All the normal (non-distributed) implementations are hosted inside the standard-implementations folder. The distributed implementations are hosted in the map-reduce-python folder.
The datasets used for implementing and testing the code are in the dataset folder.
To run the normal implementation of the k-mer algorithm:
make k-mer source=<source-file-path> kval=<size-of-k-mer> [output=<output-file-path>]
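For orientation, the idea behind a centralized k-mer count can be sketched in a few lines of Python. This is a minimal illustration, not the code in the standard-implementations folder: the function name `count_kmers` is hypothetical, and the sketch assumes the input is a plain nucleotide string already loaded into memory.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence.

    Hypothetical helper for illustration; the project's own
    implementation may read files and handle input differently.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


if __name__ == "__main__":
    # k-mers with a high count are candidates for repeated motifs.
    for kmer, count in count_kmers("ATATATGC", 2).most_common():
        print(kmer, count)
```

In this sketch, `k` plays the role of the `kval` argument, and a k-mer that occurs many times points at a repeated motif worth inspecting.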
To run the distributed implementation of the k-mer algorithm:
make hadoop-k-mer source=<source-file-path> kval=<size-of-k-mer> [jobcount=<number-of-files>]
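The distributed variant follows the usual MapReduce shape: a mapper emits a `(k-mer, 1)` pair for each k-mer it sees, the framework sorts the pairs by key, and a reducer sums the counts per key. The sketch below simulates those three steps in a single Python process. It is an illustration under assumptions, not the code in map-reduce-python: the function names are hypothetical, and it assumes FASTA-like input where lines starting with `>` are headers.

```python
from itertools import groupby
from operator import itemgetter


def map_line(line, k):
    """Mapper step: emit (k-mer, 1) for each overlapping k-mer in one line."""
    seq = line.strip().upper()
    # Assumption: lines starting with '>' are FASTA headers and carry no sequence.
    if seq.startswith(">"):
        return
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], 1


def reduce_pairs(sorted_pairs):
    """Reducer step: sum the counts per k-mer; input must be sorted by key."""
    for kmer, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield kmer, sum(count for _, count in group)


def run_local(lines, k):
    """Simulate map, shuffle/sort, and reduce in one process."""
    mapped = [pair for line in lines for pair in map_line(line, k)]
    return dict(reduce_pairs(sorted(mapped)))


if __name__ == "__main__":
    print(run_local([">seq1", "ATATAT", "ATGC"], 2))
```

Because this sketch maps each line independently, k-mers that span a line break are not counted. A real job has to deal with that, for example by preprocessing the input so that each record holds a complete sequence.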
Check the output of the distributed k-mer algorithm using the following commands:
docker exec -it namenode bash
hadoop fs -cat "./output/*"
This project uses Hadoop to implement a distributed architecture. To ensure portability, it uses Docker to run Hadoop and the Python-based MapReduce jobs.
To run this project, you need to install:
- Docker and Docker Compose (on Windows machines, WSL2 is required)
- Git
- Clone this repo
- Start the Docker containers for Hadoop
cd docker-hadoop
docker-compose up -d
- After building the containers, use the following command to verify that all the containers are up and running
docker ps
You should see the following running containers:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1250c84d3206 docker-hadoop_historyserver "/entrypoint.sh /run…" About a minute ago Up About a minute (healthy) 0.0.0.0:8188->8188/tcp historyserver
332c9c511e1e docker-hadoop_resourcemanager "/entrypoint.sh /run…" About a minute ago Up About a minute (healthy) 0.0.0.0:8089->8088/tcp resourcemanager
1ae1d648dd42 docker-hadoop_nodemanager "/entrypoint.sh /run…" About a minute ago Up About a minute (healthy) 0.0.0.0:8042->8042/tcp nodemanager
1ce3576b38c5 docker-hadoop_datanode "/entrypoint.sh /run…" About a minute ago Up About a minute (healthy) 9864/tcp datanode
efab4803b9c6 docker-hadoop_namenode "/entrypoint.sh /run…" 4 minutes ago Up About a minute (healthy) 0.0.0.0:9000->9000/tcp, 0.0.0.0:9870->9870/tcp namenode
You can also visit the Hadoop dashboard at http://localhost:9870
- You can tear down your running containers using the following command.
docker-compose down