- version: 1.1 (based on NGS Analyzer 1.0)
- update: 2019/03/20
- contact: [email protected]
This miniapp is based on NGS Analyzer, a genome analysis tool developed at the Center for Genomic Medicine, RIKEN. NGS Analyzer processes the output of a next-generation genome sequencer at high speed and identifies genetic differences among individuals, or mutations in cancer cells, with high precision.
Compiling this program requires a C/C++ compiler that supports OpenMP, an MPI library, and GNU Make. This package provides Makefiles for the K computer/Fujitsu FX10 and for Intel x86_64 machines.
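Before running make, you can sanity-check the toolchain. This is only a sketch assuming GCC-style MPI compiler wrappers (mpicc/mpicxx) and the -fopenmp flag; on the K computer/FX10 the Fujitsu compilers and flags differ.

    $ mpicc --version     # MPI-wrapped C compiler
    $ mpicxx --version    # MPI-wrapped C++ compiler
    $ make --version      # should report GNU Make
    $ echo 'int main(){return 0;}' | mpicc -fopenmp -xc - -o /dev/null

If the last command exits without error, the compiler accepts OpenMP.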
- Download this package and extract its contents.
- Run make:
  - K computer/FX10

        $ make -f makefile.k

  - x86_64 machines with GCC

        $ make -f makefile.x86_64_gcc
If make succeeds, you can find the executables under the 'bin/' directory.
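For example, listing the directory should show the 'workflow' program together with the per-step scripts and profile parsers used below (a sketch; the exact file set may differ):

    $ ls ./bin/
    01-alignment.sh  02-rmdup.sh  03-snp.sh  parse-01-alignment.rb  ...  workflow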
This package does not provide input data, so you need to create it yourself. To create the data, refer to README_DATA.md.
This miniapp provides shell scripts that execute the genome analysis workflow of the original NGS Analyzer. They come in two kinds: a script that executes the whole workflow, and three scripts that each execute one step of the workflow.
The workflow consists of the following steps.
- Mapping
  Performs sequence alignment, i.e., calculates the best matches between the input genome and the reference genome. After alignment, it splits the results into files by contig (DNA segment).
- Merge
  Merges the alignment results per contig. This step is performed when the input genome is split and then aligned in parallel.
- Remove duplicates
  Removes duplicated sequence patterns from the alignment results.
- Detect mutations
  Statistically identifies genetic differences.
Each step is implemented using well-known, widely used tools, such as the BWA alignment tool and generic alignment-format (SAM/BAM) utilities.
This package provides a program, 'workflow', that sequentially executes Mapping, Merge, Remove duplicates, and Detect mutations. 'workflow' can be run in parallel with multiple MPI processes.
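Conceptually, each MPI rank processes the read pairs in its own rank directory through the per-step scripts documented below. The following is an illustrative sketch of that control flow only, not the actual implementation of 'workflow' (which handles the Merge step and per-contig file plumbing internally):

    #!/bin/bash
    # Illustrative sketch only: the real 'workflow' binary implements this
    # flow internally, including Merge and per-contig file handling.
    RANK_DIR=./input/00-read-rank/0            # one directory per MPI rank
    for r1 in "$RANK_DIR"/part_1.*; do
      r2=${r1/part_1/part_2}                   # matching second read of the pair
      # Mapping (see 'Running each step of the workflow' below)
      ./bin/01-alignment.sh -t 8 ./input/bwa_db/reference.fa \
        ./input/seq_contig.md "$r1" "$r2"
    done
    # Merge: per-contig alignment results are merged (done inside 'workflow')
    for sam in 01-alignment.sh_OUT/*; do
      ./bin/02-rmdup.sh ./input/reference.fa.fai "$sam"   # Remove duplicates
    done
    for bam in 02-rmdup.sh_OUT/*; do
      ./bin/03-snp.sh ./input/reference.fa "$bam"         # Detect mutations
    done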
- Place input files
  Create directories whose names are MPI rank numbers, and then copy the file pairs under the ./input/00-read directory into them. In the following example, we assume that 12 read-file pairs are processed by 6 nodes (MPI processes).

        $ cd ./input
        $ mkdir -p 00-read-rank/0
        $ mkdir -p 00-read-rank/1
        $ mkdir -p 00-read-rank/2
        $ mkdir -p 00-read-rank/3
        $ mkdir -p 00-read-rank/4
        $ mkdir -p 00-read-rank/5
        $ cp 00-read/part_1.00 00-read-rank/0
        $ cp 00-read/part_2.00 00-read-rank/0
        $ cp 00-read/part_1.01 00-read-rank/0
        $ cp 00-read/part_2.01 00-read-rank/0
        $ cp 00-read/part_1.02 00-read-rank/1
        $ cp 00-read/part_2.02 00-read-rank/1
        $ cp 00-read/part_1.03 00-read-rank/1
        $ cp 00-read/part_2.03 00-read-rank/1
        $ cp 00-read/part_1.04 00-read-rank/2
        $ cp 00-read/part_2.04 00-read-rank/2
        $ cp 00-read/part_1.05 00-read-rank/2
        $ cp 00-read/part_2.05 00-read-rank/2
        $ cp 00-read/part_1.06 00-read-rank/3
        $ cp 00-read/part_2.06 00-read-rank/3
        $ cp 00-read/part_1.07 00-read-rank/3
        $ cp 00-read/part_2.07 00-read-rank/3
        $ cp 00-read/part_1.08 00-read-rank/4
        $ cp 00-read/part_2.08 00-read-rank/4
        $ cp 00-read/part_1.09 00-read-rank/4
        $ cp 00-read/part_2.09 00-read-rank/4
        $ cp 00-read/part_1.10 00-read-rank/5
        $ cp 00-read/part_2.10 00-read-rank/5
        $ cp 00-read/part_1.11 00-read-rank/5
        $ cp 00-read/part_2.11 00-read-rank/5
        $ cd ..
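  The same rank layout can also be produced with a short loop, as sketched below (this assumes the example above: 12 pairs distributed two per rank over 6 ranks; adjust the counts for other runs).

        cd ./input
        i=0
        while [ $i -lt 12 ]; do          # 12 read pairs in the example above
          rank=$((i / 2))                # two pairs per rank -> ranks 0..5
          nn=$(printf "%02d" $i)
          mkdir -p 00-read-rank/$rank
          cp 00-read/part_1.$nn 00-read/part_2.$nn 00-read-rank/$rank
          i=$((i + 1))
        done
        cd ..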
- Run
  Run as an interactive job.

        $ mpiexec -n 6 ./bin/workflow ./input/bwa_db/reference.fa \
          ./input/seq_contig.md ./input/reference.fa ./input/reference.fa.fai \
          ./input/00-read-rank
After the job completes, a workflow_OUT directory is created in the current directory; you can find the output there.
A sample jobscript for the K computer that performs the 'Place input files' and 'Run' steps above using rank directories is provided as k_job_script/workflow-job.sh. This jobscript uses 12 nodes.
$ pjsub ./k_job_script/workflow-job.sh
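For reference, the overall shape of such a PJM jobscript is sketched below. The resource directives and time limit are assumptions for illustration; the provided k_job_script/workflow-job.sh is the authoritative version.

    #!/bin/sh
    #PJM --rsc-list "node=12"           # assumed: 12 nodes, as noted above
    #PJM --rsc-list "elapse=01:00:00"   # assumed time limit
    # (the real jobscript also handles input placement / staging)
    mpiexec -n 12 ./bin/workflow ./input/bwa_db/reference.fa \
      ./input/seq_contig.md ./input/reference.fa ./input/reference.fa.fai \
      ./input/00-read-rank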
This package provides three scripts that run Mapping, Remove duplicates, and Detect mutations, respectively. You can estimate the performance of a large-scale analysis by running and profiling these scripts.
- Input files
  - Target read pair (two files)
  - Reference sequence files formatted for BWA (six files)
  - Information on contigs in the reference sequence (one file)
- Output files
  - Mapping results split by contig
- Intermediate files
  - Suffix Array (SA) files of the input read pair (two files)
  - BWA alignment result (one file)
Run bin/01-alignment.sh. You can specify the number of threads used for the SA calculation with the -t option (default: 8).
$ ./bin/01-alignment.sh -t 8 ./input/bwa_db/reference.fa \
./input/seq_contig.md ./input/01-alignment/sra_1.fastq.0 \
./input/01-alignment/sra_2.fastq.0
The intermediate files are generated in the 01-alignment.sh_MED/ directory and the output files in the 01-alignment.sh_OUT/ directory.
A jobscript for K computer is provided as k_job_script/01-alignment-job.sh.
- Input files
  - SAM file
  - Index of the reference sequence
- Output file
  - BAM file (duplicates removed)
- Intermediate files
  - BAM file (including duplicates)
  - Sorted BAM file (including duplicates)
Run bin/02-rmdup.sh. You can specify the amount of memory used for sorting (default: 800 MB).
$ ./bin/02-rmdup.sh ./input/reference.fa.fai \
./input/02-rmdup/NT_008818.16.sam
The intermediate files are generated in the 02-rmdup.sh_MED/ directory and the output files in the 02-rmdup.sh_OUT/ directory.
A jobscript for K computer is provided as k_job_script/02-rmdup-job.sh.
- Input files
  - BAM file
  - Reference sequence
- Output files
  - Indel, SNP, and summary files
- Intermediate files
  - pileup file
Run bin/03-snp.sh.
$ ./bin/03-snp.sh ./input/reference.fa ./input/03-snp/NT_008818.16.bam
The intermediate files are generated in the 03-snp.sh_MED/ directory and the output files in the 03-snp.sh_OUT/ directory.
A jobscript for K computer is provided as k_job_script/03-snp-job.sh.
This package also provides a method for estimating the performance of a large-scale analysis from the results of the small analyses obtained with the provided scripts. To do so, run each program in 'Running each step of the workflow', obtain their profiles, and enter them into the attached Excel sheet, model/perf-model.xlsx.
The model makes the following assumptions.
- The model does not account for load on the storage devices. The calculated I/O performance therefore represents the maximum required performance. Moreover, the model may differ substantially from the real execution time when multiple nodes are used.
- The cost of 'Merge' is not included in the model.
- In the 'Remove duplicates' and 'Detect mutations' steps, input file sizes vary widely depending on the input genome. The model estimates performance for the case where all files are of nearly equal size; it does not cover cases where the file sizes differ significantly.
You need to obtain the elapsed time, FLOPS, MIPS, and memory throughput of the above three programs in 'Running each step of the workflow' using a profiler. You also need to measure the sizes of the input, intermediate, and output files of each program.
This package provides scripts for obtaining these profiles on the K computer and Fujitsu FX10 using the Fujitsu profiler. You can use them as follows.
- Execute each program with the '-p' profiling option.

        $ ./bin/01-alignment.sh -p -t 8 ./input/bwa_db/reference.fa \
          ./input/seq_contig.md ./input/01-alignment/sra_1.fastq.0 \
          ./input/01-alignment/sra_2.fastq.0
        $ ./bin/02-rmdup.sh -p ../data/reference.fa.fai \
          ../data/02-rmdup/NT_005120.16.sam
        $ ./bin/03-snp.sh -p ../data/reference.fa ../data/03-snp/NT_005120.16.bam
Executing with this option generates profiles (01-alignment.sh_PROF, 02-rmdup.sh_PROF, 03-snp.sh_PROF) in addition to the intermediate and output files. Jobscripts for the K computer are provided as k_job_script/01-alignment-job-profile.sh, k_job_script/02-rmdup-job-profile.sh and k_job_script/03-snp-job-profile.sh.
- Obtain information by running the profile parsers.

        $ ./bin/parse-01-alignment.rb ./input/01-alignment/sra_1.fastq.0 \
          ./input/01-alignment/sra_2.fastq.0 01-alignment-job.sh_MED \
          01-alignment-job.sh_OUT 01-alignment-job.sh_PROF
        $ ./bin/parse-02-rmdup.rb ./input/02-rmdup/NT_008818.16.sam \
          ./02-rmdup-job.sh_MED ./02-rmdup-job.sh_OUT ./02-rmdup-job.sh_PROF
        $ ./bin/parse-03-snp.rb ./input/03-snp/NT_008818.16.bam \
          ./03-snp-job.sh_MED ./03-snp-job.sh_OUT ./03-snp-job.sh_PROF
Enter the results into the highlighted cells on sheets 2-4 of the Excel file.
Enter the number of input files and the number of CPUs for each step into the highlighted cells of the 'Parameters for a large scale run' table in the Summary sheet. The values in the 'Calculated performance of a large scale run' table are then updated automatically.
This model calculates the required performance of each workflow step for the execution to complete within a specified elapsed time.
Enter the elapsed time and the number of files for each step into the highlighted cells of the 'Parameters for time restriction' table in the Summary sheet. The values in the 'Estimated performance under given time threshold' table are then updated automatically.
This model calculates the required I/O performance of stage-in/out of the input/output files, on a system that employs staging, for the staging to complete within a specified time.
Enter the required time for stage-in/out, the number of target files, and the number of nodes into the highlighted cells of the 'Parameters for time restriction - Staging' table in the Summary sheet. The values in the 'Estimated time for staging under time threshold' table are then updated automatically.
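As a rough illustration of what such a staging calculation looks like (hypothetical numbers; the spreadsheet is the authoritative model), the required aggregate bandwidth is essentially the total data volume divided by the allowed staging time, and the per-node requirement divides that evenly across nodes:

    # Hypothetical illustration only; the spreadsheet is authoritative.
    # Example: 24 GB of input staged in within 600 s on 6 nodes.
    echo "scale=1; 24 * 1024 / 600" | bc        # 40.9 MB/s aggregate
    echo "scale=1; 24 * 1024 / 600 / 6" | bc    # 6.8 MB/s per node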
- Size of one genome data
  - 100 times bigger than the current size (100 TB/human)
- Number of genomes: 200,000
  - Total data size is 20,000 times bigger than the current size (2,000 PB)
- Required computational power
  - When analyzing in three years, 190 genomes should be processed per day.
    - 455 seconds/genome
  - When analyzing in five years, 110 genomes should be processed per day.
    - 785 seconds/genome
  - The analyses can be performed at multiple sites in parallel.
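For reference, the per-day and per-genome figures above follow from simple arithmetic (a sketch assuming 365-day years; the per-day figures include some headroom over the exact quotients):

    # 3-year case: 200,000 genomes over 3 * 365 days
    echo $(( 200000 / (3 * 365) ))   # 182 genomes/day (exact), padded to 190
    echo $(( 86400 / 190 ))          # 454 -> ~455 seconds/genome
    # 5-year case: 200,000 genomes over 5 * 365 days
    echo $(( 200000 / (5 * 365) ))   # 109 genomes/day (exact), ~110
    echo $(( 86400 / 110 ))          # 785 seconds/genome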
- Remove unnecessary sub-programs
- Remove unnecessary code
  - Reduce command-line options
  - Remove comments and never-executed code
  - Remove shell scripts and configuration files used when running the original NGS Analyzer on the K computer
- Implement scripts for executing the genome analysis workflow
  - a script for running the whole workflow with multiple MPI processes
  - scripts for individually running major steps of the workflow