Simple Sequence Repeats identification using a MapReduce based Distributed Implementation of K-mer algorithm

Overview

The goal of this project is to evaluate the performance and feasibility of finding Simple Sequence Repeats in biological sequences using a distributed architecture. The project aims to identify any potential benefits of using a distributed model rather than the traditional centralized approach.

Contributors

Devanshu Singh
Rachana Gugale
Mohammad Uzair Fasih

Instructions for running the implementation

You will need to set up the project before using the make commands. See Installation and Setup.

The code contains implementations of the K-mer algorithm that run in both a distributed and a centralized manner.

All the normal non-distributed implementations are hosted inside the standard-implementations folder. The distributed implementations are hosted in the map-reduce-python folder.

Datasets

The datasets used for implementing and testing the code are in the dataset folder.

Algorithms

K-mer Algorithm

To run the normal implementation of the k-mer algorithm:

make k-mer source=<source-file-path> kval=<size-of-k-mer> [output=<output-file-path>]
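
For orientation, here is a minimal Python sketch of what a centralized k-mer pass might look like. It is not the code in the standard-implementations folder, which may work differently. The function names count_kmers and find_tandem_runs, and the min_repeats parameter, are hypothetical and chosen only for illustration. The sketch assumes that a Simple Sequence Repeat can be approximated as a k-mer that repeats back to back.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # range() is empty when the sequence is shorter than k, giving an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def find_tandem_runs(sequence, k, min_repeats=2):
    """Return (start, motif, repeats) for each run where a k-mer repeats back to back.

    Assumption: an SSR is approximated here as a motif of length k that occurs
    at least min_repeats times consecutively.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    runs = []
    i = 0
    while i + k <= len(sequence):
        motif = sequence[i:i + k]
        repeats = 1
        j = i + k
        # Extend the run while the next k characters equal the motif.
        while sequence[j:j + k] == motif:
            repeats += 1
            j += k
        if repeats >= min_repeats:
            runs.append((i, motif, repeats))
            i = j  # continue scanning after the run
        else:
            i += 1
    return runs
```

For example, count_kmers("ACACACGT", 2) counts "AC" three times, and find_tandem_runs("ACACACGT", 2) reports a single run of "AC" starting at position 0 with three repeats.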

Distributed K-mer Algorithm

To run the distributed implementation of the k-mer algorithm:

make hadoop-k-mer source=<source-file-path> kval=<size-of-k-mer> [jobcount=<number-of-files>]
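
The sketch below shows one way k-mer counting can be expressed as a map step and a reduce step. It is an assumption-laden illustration, not the code in the map-reduce-python folder: it assumes the common pattern in which a mapper emits (k-mer, 1) pairs and a reducer sums the counts for each key after the pairs have been sorted. The names map_line, reduce_pairs, and run_local are hypothetical, as is the choice to skip lines starting with ">" (FASTA headers).

```python
from itertools import groupby


def map_line(line, k):
    """Map step: yield (kmer, 1) for every overlapping k-mer in one input line."""
    seq = line.strip().upper()
    # Assumption: blank lines and FASTA header lines carry no sequence data.
    if not seq or seq.startswith(">"):
        return
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], 1


def reduce_pairs(sorted_pairs):
    """Reduce step: sum the counts for each k-mer.

    The pairs must already be sorted by key, as they would be after the
    shuffle phase of a MapReduce job.
    """
    for kmer, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield kmer, sum(count for _, count in group)


def run_local(lines, k):
    """Simulate map -> sort -> reduce in a single process for testing."""
    mapped = [pair for line in lines for pair in map_line(line, k)]
    return dict(reduce_pairs(sorted(mapped)))
```

Because this sketch maps each line independently, k-mers that span a line break are not counted. A real job would need to handle chunk boundaries, for instance by overlapping adjacent chunks by k - 1 characters.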

Check the output of the distributed k-mer algorithm using the following commands:

docker exec -it namenode bash
hadoop fs -cat "./output/*"

Installation

This project uses Hadoop to implement a distributed architecture. To ensure portability, it uses Docker to run Hadoop and the Python-based MapReduce jobs.

To run this project, you need to install:

Docker and Docker Compose (on Windows, WSL2 is required)
Git

Setup

Clone this repo
Run the Docker containers for Hadoop

cd docker-hadoop
docker-compose up -d

After building the containers, use the following command to verify that all the containers are up and running:

docker ps

You should see the following running containers:

CONTAINER ID   IMAGE                           COMMAND                  CREATED              STATUS                        PORTS                                            NAMES
1250c84d3206   docker-hadoop_historyserver     "/entrypoint.sh /run…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:8188->8188/tcp                           historyserver
332c9c511e1e   docker-hadoop_resourcemanager   "/entrypoint.sh /run…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:8089->8088/tcp                           resourcemanager
1ae1d648dd42   docker-hadoop_nodemanager       "/entrypoint.sh /run…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:8042->8042/tcp                           nodemanager
1ce3576b38c5   docker-hadoop_datanode          "/entrypoint.sh /run…"   About a minute ago   Up About a minute (healthy)   9864/tcp                                         datanode
efab4803b9c6   docker-hadoop_namenode          "/entrypoint.sh /run…"   4 minutes ago        Up About a minute (healthy)   0.0.0.0:9000->9000/tcp, 0.0.0.0:9870->9870/tcp   namenode

Alternatively, if you have Docker Desktop, you can check the container status there.

You can also visit the Hadoop dashboard at http://localhost:9870
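
The docker ps check can also be scripted. The following is a small sketch, not part of the repository: it assumes the five container names shown in the listing above, and the helper names missing_containers and running_container_names are hypothetical. It relies on the docker ps --format "{{.Names}}" option to print one container name per line.

```python
import subprocess

# Container names taken from the docker ps listing in this README.
EXPECTED = {"namenode", "datanode", "resourcemanager", "nodemanager", "historyserver"}


def missing_containers(docker_ps_names, expected=EXPECTED):
    """Return the expected container names absent from docker ps output.

    docker_ps_names is the text printed by: docker ps --format "{{.Names}}"
    """
    running = {line.strip() for line in docker_ps_names.splitlines() if line.strip()}
    return sorted(set(expected) - running)


def running_container_names():
    """Ask the local Docker daemon for the names of running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,  # raise if the docker command fails
    )
    return result.stdout
```

Calling missing_containers(running_container_names()) returns an empty list when all five containers are running, and otherwise lists the ones that are not.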

You can tear down your running containers using the following command:

docker-compose down
