ArrowSAM

ArrowSAM is an in-memory Sequence Alignment/Map (SAM) representation. It uses the Apache Arrow framework (a cross-language development platform for in-memory data) and the Plasma shared-memory object store to store and process SAM data in columnar form in memory.
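To make "columnar" concrete, here is a minimal sketch using plain shell tools rather than Arrow itself. The two records are made up, and the helper names `sam_records` and `sam_column` are hypothetical. It only illustrates the difference between reading whole alignment records (rows) and reading one SAM field across all records (a column), which is the layout idea ArrowSAM relies on.

```shell
# Two made-up SAM records: tab-separated, the 11 mandatory fields only
# (QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL).
sam_records() {
    printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read2\t16\tchr1\t250\t30\t4M\t*\t0\t0\tTTGA\tIIII\n'
}

# Column view: every value of one field, selected by its 1-based position.
sam_column() {
    sam_records | cut -f "$1"
}

# Row view: each output line is one complete alignment record.
sam_records

# Column view of field 4 (POS) and field 6 (CIGAR).
sam_column 4
sam_column 6
```

A tool that only needs positions or CIGAR strings touches just that column instead of parsing every full record.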

Citing ArrowSAM

The following papers describe the ArrowSAM format and its use to speed up genomics pipelines. If you use ArrowSAM in your work, please cite them.

Ahmad et al., (2020). "ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow", ICCAIS. https://doi.org/10.1109/ICCAIS48893.2020.9096725

Ahmad et al., "Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework", BMC Genomics, presented at APBC2020. https://doi.org/10.1186/s12864-020-07013-y

This repo contains the following three components:

  1. BWA-MEM, Picard and GATK tools integrated with ArrowSAM (the in-memory SAM data representation).

  2. A Singularity container definition file, used to create an environment with all the Apache Arrow related tools and libraries that ArrowSAM needs.

  3. Scripts to run the workflows recommended by the GATK best practices as a complete DNA analysis pipeline. The scripts use different in-memory data placement techniques (ArrowSAM, ramDisk and pipes) for fast processing.

Note: ArrowSAM and all other workflows target single-node, multi-core machines.

How to run

  1. Install Singularity.

  2. Download our Singularity definition file and generate a Singularity image from it. The image contains all the Arrow related packages needed to build BWA-MEM, Picard and GATK.

  3. Enter the generated image with:

     sudo singularity shell <image_name>.simg
    
  4. Download BWA-MEM inside the image:

     git clone https://github.com/tahashmi/bwa.git
    
  5. Go into the bwa directory and compile BWA-MEM:

     cd bwa
     make
    
  6. Now you can run BWA-MEM.
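
Steps 4 and 5 can also be wrapped in a small script to run inside the image shell from step 3. This is a sketch, not part of the repo: the `run` and `build_bwa` helpers and the `DRY_RUN` switch are hypothetical names introduced here, and `make -C bwa` stands in for `cd bwa` followed by `make`. By default it only prints the commands, so you can check them before running anything.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical switch for this sketch:
#   1 (default) prints each command, 0 actually executes it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 4: clone BWA-MEM. Step 5: compile it in the bwa directory.
build_bwa() {
    run git clone https://github.com/tahashmi/bwa.git &&
    run make -C bwa
}

build_bwa
```

If the script is saved under a name of your choice, for example `build_bwa.sh`, then `DRY_RUN=0 sh build_bwa.sh` performs the clone and build for real.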