
NGS Analyzer-MINI

About NGS Analyzer

This miniapp is based on NGS Analyzer, a genome analysis tool developed at the Center for Genomic Medicine, RIKEN. NGS Analyzer rapidly analyzes the output of a next-generation genome sequencer and identifies genetic differences between individuals, or mutations in cancer cells, with high precision.

Original NGS Analyzer

Install

Compile

Compiling this program requires a C/C++ compiler that supports OpenMP, an MPI library, and GNU Make. This package provides Makefiles for the K computer/Fujitsu FX10 and for Intel x86_64 machines.

  1. Download this package and extract its contents

  2. make

    • K computer/FX10

         $ make -f makefile.k
      
    • X86_64 machines with GCC

         $ make -f makefile.x86_64_gcc
      

If make succeeds, you can find the executables under the 'bin/' directory.
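For example, you can quickly verify the build by listing that directory; the 'workflow' driver and the step scripts used later in this document should be present (the exact set of files may vary):

     $ ls ./bin
     01-alignment.sh  02-rmdup.sh  03-snp.sh  workflow  ...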

Prepare data

As this package does not provide input data, you need to create it yourself. Refer to README_DATA.md for instructions.

What NGS Analyzer-MINI does - brief explanation

This miniapp provides shell scripts that execute the genome analysis workflow of the original NGS Analyzer. The scripts fall into two groups: one script that executes the whole workflow, and three scripts that each execute a single step of it.

The workflow consists of the following steps.

  1. Mapping
    Performs sequence alignment; that is, it calculates the best matches of the input reads against the reference genome. After alignment, it splits the results into files by contig (DNA segment).

  2. Merge
    Merges the alignment results per contig. This step is needed when the input genome is split and aligned in parallel.

  3. Remove duplicates
    Removes duplicated sequence patterns from the alignment results.

  4. Detect mutations
    Statistically identifies genetic differences.

Each step is implemented with well-known, widely used tools, such as the BWA aligner and tools for the generic SAM/BAM alignment formats.
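As an illustration only, the steps correspond roughly to the following classic BWA/SAMtools-style commands; the exact invocations, options, and file names inside this package's scripts may differ:

     # 1. Mapping: compute SA coordinates for each read file, pair the
     #    alignments, then split the resulting SAM by contig (illustrative)
     $ bwa aln reference.fa part_1.00 > part_1.00.sai
     $ bwa aln reference.fa part_2.00 > part_2.00.sai
     $ bwa sampe reference.fa part_1.00.sai part_2.00.sai \
     part_1.00 part_2.00 > aln.sam
     # 2. Merge: combine per-contig results from parallel alignments
     $ samtools merge merged.bam aln_part0.bam aln_part1.bam
     # 3. Remove duplicates
     $ samtools rmdup merged.bam rmdup.bam
     # 4. Detect mutations: pile up alignments against the reference and filter
     $ samtools pileup -f reference.fa rmdup.bam > result.pileup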

How to Run the program

Running the whole workflow

This package provides a program, 'workflow', that sequentially executes Mapping, Merge, Remove duplicates, and Detect mutations. 'workflow' can be run in parallel with multiple MPI processes.

How to run

  1. Place input files
    Create directories whose names are MPI rank numbers, and then copy the file pairs from the ./input/00-read directory into them. In the following example, we assume that 12 read file pairs are processed by 6 nodes (MPI processes); an equivalent scripted setup is sketched after this list.

     $ cd ./input
     $ mkdir -p 00-read-rank/0
     $ mkdir -p 00-read-rank/1
     $ mkdir -p 00-read-rank/2
     $ mkdir -p 00-read-rank/3
     $ mkdir -p 00-read-rank/4
     $ mkdir -p 00-read-rank/5
     $ cp 00-read/part_1.00 00-read-rank/0
     $ cp 00-read/part_2.00 00-read-rank/0
     $ cp 00-read/part_1.01 00-read-rank/0
     $ cp 00-read/part_2.01 00-read-rank/0
     $ cp 00-read/part_1.02 00-read-rank/1
     $ cp 00-read/part_2.02 00-read-rank/1
     $ cp 00-read/part_1.03 00-read-rank/1
     $ cp 00-read/part_2.03 00-read-rank/1
     $ cp 00-read/part_1.04 00-read-rank/2
     $ cp 00-read/part_2.04 00-read-rank/2
     $ cp 00-read/part_1.05 00-read-rank/2
     $ cp 00-read/part_2.05 00-read-rank/2
     $ cp 00-read/part_1.06 00-read-rank/3
     $ cp 00-read/part_2.06 00-read-rank/3
     $ cp 00-read/part_1.07 00-read-rank/3
     $ cp 00-read/part_2.07 00-read-rank/3
     $ cp 00-read/part_1.08 00-read-rank/4
     $ cp 00-read/part_2.08 00-read-rank/4
     $ cp 00-read/part_1.09 00-read-rank/4
     $ cp 00-read/part_2.09 00-read-rank/4
     $ cp 00-read/part_1.10 00-read-rank/5
     $ cp 00-read/part_2.10 00-read-rank/5
     $ cp 00-read/part_1.11 00-read-rank/5
     $ cp 00-read/part_2.11 00-read-rank/5
     $ cd ..
    
  2. Run
    Run as an interactive job.

     $ mpiexec -n 6 ./bin/workflow ./input/bwa_db/reference.fa \
     ./input/seq_contig.md ./input/reference.fa ./input/reference.fa.fai \
     ./input/00-read-rank
    

    After the job completes, a workflow_OUT directory is created in the current directory, where you can find the output.
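The rank-directory layout from step 1 can also be produced with a short loop; a minimal bash sketch, assuming GNU seq and the 12-pair/6-process example above:

     $ cd ./input
     $ for i in $(seq -w 0 11); do
     >   rank=$((10#$i / 2))    # two read pairs per rank: 00,01 -> 0; 02,03 -> 1; ...
     >   mkdir -p 00-read-rank/$rank
     >   cp 00-read/part_1.$i 00-read/part_2.$i 00-read-rank/$rank
     > done
     $ cd ..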

A sample jobscript for the K computer that performs steps 1 and 2 above using rank directories is provided as k_job_script/workflow-job.sh. This jobscript uses 12 nodes.

    $ pjsub ./k_job_script/workflow-job.sh

Running each step of the workflow

This package provides three scripts that run Mapping, Remove duplicates, and Detect mutations individually. By running and profiling these scripts, you can estimate the performance of a large-scale analysis.

Mapping

File IO
  • Input files
    • Target read pair (two files)
    • Reference sequence file formatted for BWA (six files)
    • Information on contigs in the reference sequence (one file)
  • Output files
    • Mapping results, split by contig
  • Intermediate files
    • Suffix Array (SA) files of the input read pair (two files)
    • BWA alignment result (one file)
How to run

Run bin/01-alignment.sh. You can specify the number of threads used for the SA calculation with the -t option (default: 8).

$ ./bin/01-alignment.sh -t 8 ./input/bwa_db/reference.fa \
./input/seq_contig.md ./input/01-alignment/sra_1.fastq.0 \
./input/01-alignment/sra_2.fastq.0

The intermediate files are generated in the 01-alignment.sh_MED/ directory and the output files in the 01-alignment.sh_OUT/ directory.

A jobscript for the K computer is provided as k_job_script/01-alignment-job.sh.

Remove duplicates

File IO
  • Input files
    • SAM file
    • Index of reference sequence
  • Output file
    • BAM file (duplicates are removed)
  • Intermediate files
    • BAM file (including duplicates)
    • Sorted BAM file (including duplicates)
How to run

Run bin/02-rmdup.sh. You can specify the amount of memory used for sorting (default: 800 MB).

$ ./bin/02-rmdup.sh ./input/reference.fa.fai \
./input/02-rmdup/NT_008818.16.sam

The intermediate files are generated in the 02-rmdup.sh_MED/ directory and the output files in the 02-rmdup.sh_OUT/ directory.

A jobscript for the K computer is provided as k_job_script/02-rmdup-job.sh.

Detect mutations

File IO
  • Input files
    • BAM file
    • Reference sequence
  • Output files
    • Indel, SNP and summary files
  • Intermediate files
    • pileup file
How to run

Run bin/03-snp.sh.

$ ./bin/03-snp.sh ./input/reference.fa ./input/03-snp/NT_008818.16.bam

The intermediate files are generated in the 03-snp.sh_MED/ directory and the output files in the 03-snp.sh_OUT/ directory.

A jobscript for the K computer is provided as k_job_script/03-snp-job.sh.

Estimate the performance of large scale analysis

This package also provides a method for estimating the performance of a large-scale analysis from the results of a small analysis obtained with the provided scripts. To do so, run each program in 'Running each step of the workflow', obtain their profiles, and enter the results into the attached Excel sheet, model/perf-model.xlsx.

The model makes the following assumptions.

  • The model does not account for load on the storage devices. As a result, the calculated IO performance represents the maximum required performance; moreover, there can be a large difference between the modeled and the real execution time when multiple nodes are used.
  • The cost for 'Merge' is not included in the model.
  • In the 'Remove duplicates' and 'Detect mutations' steps, input file sizes vary widely depending on the patterns in the input genome. The model estimates performance for the case where all files are of nearly the same size; it does not cover the case where file sizes differ significantly.

Obtain profiles

You need to obtain the elapsed time, FLOPS, MIPS, and memory throughput of the three programs in 'Running each step of the workflow' using a profiler. You also need to measure the sizes of the input, intermediate, and output files of each program.
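For the file sizes, standard tools suffice; for example, with GNU coreutils (paths follow the Mapping example; adjust to the files you actually used):

     $ du -b ./input/01-alignment/sra_1.fastq.0 ./input/01-alignment/sra_2.fastq.0
     $ du -sb ./01-alignment.sh_MED ./01-alignment.sh_OUT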

This package provides scripts for obtaining these profiles on the K computer and Fujitsu FX10 using the Fujitsu profiler. Use them as follows.

  1. Execute each program with the '-p' profiling option.

     $ ./bin/01-alignment.sh -p -t 8 ./input/bwa_db/reference.fa \
     ./input/seq_contig.md ./input/01-alignment/sra_1.fastq.0 \
     ./input/01-alignment/sra_2.fastq.0
     $ ./bin/02-rmdup.sh ../data/reference.fa.fai \
     ../data/02-rmdup/NT_005120.16.sam
     $ ./bin/03-snp.sh ../data/reference.fa ../data/03-snp/NT_005120.16.bam
    

    Running with this option generates profiles (01-alignment.sh_PROF, 02-rmdup.sh_PROF, 03-snp.sh_PROF) in addition to the intermediate and output files. Jobscripts for the K computer are provided as k_job_script/01-alignment-job-profile.sh, k_job_script/02-rmdup-job-profile.sh, and k_job_script/03-snp-job-profile.sh.

  2. Obtain the information by running the profile parsers.

     $ ./bin/parse-01-alignment.rb ./input/01-alignment/sra_1.fastq.0 \
     ./input/01-alignment/sra_2.fastq.0 01-alignment-job.sh_MED \
     01-alignment-job.sh_OUT 01-alignment-job.sh_PROF
     $ ./bin/parse-02-rmdup.rb ./input/02-rmdup/NT_008818.16.sam \
     ./02-rmdup-job.sh_MED ./02-rmdup-job.sh_OUT ./02-rmdup-job.sh_PROF
     $ ./bin/parse-03-snp.rb ./input/03-snp/NT_008818.16.bam \
     ./03-snp-job.sh_MED ./03-snp-job.sh_OUT ./03-snp-job.sh_PROF
    

Enter the results into the highlighted cells on sheets 2-4 of the Excel workbook.

Estimate the performance of large scale analysis

Enter the number of input files and the number of CPUs for each step into the highlighted cells of the 'Parameters for a large scale run' table on the Summary sheet. The values in the 'Calculated performance of a large scale run' table are then updated automatically.

Estimate the required performance when a maximum elapsed time is given

This model calculates the required performance of each workflow step when the execution must be completed within a specified elapsed time.

Enter the elapsed time and the number of files for each step into the highlighted cells of the 'Parameters for time restriction' table on the Summary sheet. The values in the 'Estimated performance under given time threshold' table are then updated automatically.

Estimate the required performance for staging when a maximum staging time is given

This model calculates the required IO performance for stage-in/out of the input/output files, on a system that employs staging, when staging must be completed within a specified time.

Enter the required time for stage-in/out, the number of target files, and the number of nodes into the highlighted cells of the 'Parameters for time restriction - Staging' table on the Summary sheet. The values in the 'Estimated time for staging under time threshold' table are then updated automatically.

Target exa-scale problem setting

  • Size of one genome data set
    • 100 times bigger than the current size (100 TB/human)
  • Number of genomes: 200,000
    • Total data size is 20,000 times bigger than the current size (2,000 PB)
  • Required computational power
    • When analyzing in three years, 190 genomes must be processed per day.
      • 455 seconds/genome
    • When analyzing in five years, 110 genomes must be processed per day.
      • 785 seconds/genome
    • The analyses can be performed at multiple sites in parallel.
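The per-genome time budgets follow directly from the daily rates (86,400 seconds in a day); a quick shell check:

     $ echo $((86400 / 190))    # -> 454, i.e. ~455 seconds/genome for the three-year target
     $ echo $((86400 / 110))    # -> 785 seconds/genome for the five-year target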

Changes from original NGS Analyzer

  • Remove unnecessary sub-programs
  • Remove unnecessary code
    • Reduce command line options
    • Remove comments and never-executed code
  • Remove shell scripts and configuration files used when running the original NGS Analyzer on the K computer
  • Implement scripts for executing the genome analysis workflow
    • a script for running the whole workflow with multiple MPI processes
    • scripts for individually running the major steps of the workflow