Skip to content
Alex Paul edited this page May 7, 2018 · 2 revisions

Scalable Overlap-Graph Reduction Algorithms

Introduction

The advent of high-throughput and long fragmented DNA sequencing techniques has permitted nearly perfect or very high quality de novo assemblies of genomes. High-throughput sequencers generate a very large number of fragmented sequences with deep sequencing depth for high quality analysis, but raise an issue of requiring large amounts of computer memory to resolve the large genome graphs generated by most overlap graph de novo assemblers.

To address these limitations, we present a novel algorithmic approach; Scalable Overlap-graph Reduction Algorithms (SORA). SORA adapts string graph reduction algorithms for the genome assembly using a distributed computing platform. To efficiently compute coverage for enormous paths in the graphs, SORA uses Apache Spark which is a cluster-based engine designed on top of Hadoop to handle very large datasets in the cloud. The experimental results show that SORA can process a nearly one billion edge graph in a distributed cloud cluster as well as smaller graphs on a local cluster with a short turnaround time. Moreover, our algorithms scale almost linearly with increasing numbers of virtual instances on the cloud.

SORA is designed to be used in two different modules. The first module contains the Transitive Edge Reduction(TER). The second module contains the Dead End Removal(DER) and the Composite Edge Contraction(CEC).