Getting started

martineh edited this page Mar 18, 2015 · 36 revisions

This 5-minute tutorial shows how to use some of the most important features of HPG Aligner. It assumes that HPG Aligner is already installed; if it is not, visit the Downloads section to install the binaries or to compile the source code. The tutorial uses the most recent version of HPG Aligner, so command lines and parameters may differ slightly in other versions.

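To run this worked example you also need the compressed FASTQ dataset (test_paired_chr20.tar.gz) and the compressed human chromosome 20 sequence (homo_sapiens.grch37.70.dna.chromosome.20.fa.tar.gz).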

Preparing the environment

Create a folder, download the compressed FASTQ genomic DNA dataset and the human chromosome into it, and then uncompress both files:

mkdir tutorial
cd tutorial
tar zxvf test_paired_chr20.tar.gz
tar zxvf homo_sapiens.grch37.70.dna.chromosome.20.fa.tar.gz

This tutorial assumes that the HPG Aligner binary is in the same directory as the data, so copy it into the folder. You should now have these files:

test_1.fq
test_2.fq
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa
hpg-aligner
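
Before building the index, it can help to confirm that the four files listed above are in place. The following is only a sketch of such a check, not part of HPG Aligner; the function name check_tutorial_files is made up for illustration.

```shell
# Print the name of every expected tutorial file that is missing from the
# given directory (default: current directory). Returns 1 if any is missing.
# check_tutorial_files is a hypothetical helper, not an HPG Aligner command.
check_tutorial_files() {
    dir="${1:-.}"
    missing=0
    for f in test_1.fq test_2.fq \
             Homo_sapiens.GRCh37.70.dna.chromosome.20.fa hpg-aligner; do
        if [ ! -e "$dir/$f" ]; then
            echo "$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from inside the tutorial folder:
#   check_tutorial_files . && echo "all files present"
```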

Building the index

Before mapping, we need to create the index. HPG Aligner is one of the fastest tools at creating the index, and it uses a multi-threaded implementation to speed up the process. Depending on the size of the genome, index creation may take some time and need a lot of memory, but you should not have any problem with this chromosome. Run the command build-sa-index and specify the FASTA genome (-g) and the output directory for the index (-i):

mkdir chr20-index
./hpg-aligner build-sa-index -g Homo_sapiens.GRCh37.70.dna.chromosome.20.fa -i chr20-index/

If the index generation succeeds, you'll have the following files in the folder chr20-index:

dna_compression.bin
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.A
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.CHROM
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.IA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.JA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.S
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.SA
index
params.info
params.txt

DNA Aligning

To map the FASTQ files, run the tool with the command dna, because the dataset contains genomic DNA sequences, and pass the index folder with -i. The default options should provide good performance; for example, all available cores are used to speed up the mapping. We create a folder called mapped to store the results:

mkdir mapped
./hpg-aligner dna -f test_1.fq -i chr20-index -o mapped

The execution should not take long because the files are small. If you have a multi-core machine, you can check that all cores are being used with the command htop. HPG Aligner prints a small report:

----------------------------------------------
Loading SA tables...
End of loading SA tables in 0.02 min. Done!!
----------------------------------------------
Starting mapping...
End of mapping in 0.03 min. Done!!
----------------------------------------------
Output file        : mapped/alignments.sam

Num. reads         : 20359
Num. mapped reads  : 20357 (99.99 %)
Num. unmapped reads: 2 (0.01 %)

Num. mappings      : 21106
Num. multihit reads: 145
----------------------------------------------

The output file containing the resulting mappings is located in the folder specified by the parameter -o, in our case, the folder mapped:

ls -l mapped
alignments.sam

As you can see, SAM is the default output file format. With the parameter --bam-format the output is written as BAM instead, which makes the process slower.

For paired-end mapping, pass the second mate file with the parameter -j:

./hpg-aligner dna -f test_1.fq -j test_2.fq -i chr20-index -o mapped

----------------------------------------------
Loading SA tables...
End of loading SA tables in 0.02 min. Done!!
----------------------------------------------
Starting mapping...
End of mapping in 0.04 min. Done!!
----------------------------------------------
Output file        : mapped/alignments.sam

Num. reads         : 40718
Num. mapped reads  : 40717 (100.00 %)
Num. unmapped reads: 1 (0.00 %)

Num. mappings      : 41075
Num. multihit reads: 136
----------------------------------------------
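
The report rounds percentages to two decimals, so the paired-end run shows 100.00 % even though one read is unmapped. As a sketch, the exact rate can be recomputed from the two read counts. This assumes the report text has been captured somewhere (the file name report.txt in the usage comment is hypothetical), and the function name mapping_rate is made up for illustration.

```shell
# Read an HPG Aligner DNA report on stdin and print
# "<mapped> <total> <percent>", with the percentage to three decimals.
# mapping_rate is a hypothetical helper, not an HPG Aligner command.
mapping_rate() {
    awk -F: '
        /^Num\. reads/        { total  = $2 + 0 }
        /^Num\. mapped reads/ { mapped = $2 + 0 }
        END {
            if (total > 0)
                printf "%d %d %.3f\n", mapped, total, 100 * mapped / total
        }'
}

# Usage, assuming the report was saved to report.txt:
#   mapping_rate < report.txt
```

For the paired-end report above this gives 99.998 rather than the rounded 100.00.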

To see all available parameters, type:

./hpg-aligner -h

RNA-seq Aligning

To map RNA-seq FASTQ files, run the tool with the command rna and pass the index folder with -i. The default options should provide good performance; for example, all available cores are used to speed up the mapping. We create a folder called mapped to store the results:

mkdir -p mapped
./hpg-aligner rna -f test_1.fq -i chr20-index -o mapped

+===============================================================+
|                           RNA MODE                            |
+===============================================================+
|      ___  ___  ___                                            |
|    \/ H \/ P \/ G \/                                          |
|    /\___/\___/\___/\                                          |
|      ___  ___  ___  ___  ___  ___  ___                        |
|    \/ A \/ L \/ I \/ G \/ N \/ E \/ R \/                      |
|    /\___/\___/\___/\___/\___/\___/\___/\                      |
|                                                               |
+===============================================================+

Load Genome Status
[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100%

Mapping Status (First Phase)
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100%

Mapping Status (Second Phase)
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100%

+===============================================================+
|                        GLOBAL STATISTICS                      |
+===============================================================+
Loading Time (s)      : 0.92 (s)
Alignment Time (s)    : 0.24 (s)
Total Time (s)        : 1.16 (s)
+===============================================================+
Total Reads Processed : 20359
+===============================================================+
Total Reads Mapped in First State   : 13192 (64.80%)
+-------------------------------------------+-------------------+
Total Reads Mapped in Second State  : 7167 (35.20%)
+-------------------------------------------+-------------------+
Total Reads Mapped                   : 20279 (99.61%)
Total Reads Unmapped                 : 80 (0.39%)
Total Reads with one alignment       : 20127 (98.86%)
Total Reads with multiple alignment  : 152 (0.75%)
+===============================================================+
|    S P L I C E    J U N C T I O N S    S T A T I S T I C S    |
+===============================================================+
Total splice junctions                  :  109
Total cannonical splice junctions       :  57 (52.29%)
Total semi-cannonical splice junctions  :  52 (47.71%)
+===============================================================+

In addition to alignments.sam, which contains the mappings found, RNA mode creates the file exact_junctions.bed to store the splice junctions.

head mapped/exact_junctions.bed

1	788280	902890	JUNCTION_0	1	+	GC-AG
1	2098272	2165953	JUNCTION_1	1	+	GT-AG
1	2140550	2387339	JUNCTION_2	1	+	GC-AG
1	2203605	2244638	JUNCTION_3	1	+	GC-AG
1	2456314	2511821	JUNCTION_4	1	+	GC-AG
1	2572619	2581286	JUNCTION_5	1	+	GT-AG
1	2629112	2629360	JUNCTION_6	1	+	GT-AG
1	2662365	2765776	JUNCTION_7	1	+	GT-AG
1	2890164	2927429	JUNCTION_8	1	+	GT-AG
1	2970031	3038288	JUNCTION_9	1	+	GT-AG
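
To get a quick overview of the junction file, the junctions can be counted per splice-site motif. The sketch below assumes the file is tab-separated and that the motif is the seventh column, as in the sample lines above; the function name count_junction_motifs is made up for illustration.

```shell
# Count junctions per splice-site motif. Reads junction lines on stdin and
# assumes tab-separated columns with the motif in column 7, as in the
# sample output above. Prints "<motif> <count>", sorted by motif.
# count_junction_motifs is a hypothetical helper, not an HPG Aligner command.
count_junction_motifs() {
    awk -F'\t' 'NF >= 7 { n[$7]++ } END { for (m in n) print m, n[m] }' | sort
}

# Usage:
#   count_junction_motifs < mapped/exact_junctions.bed
```

Applied to the ten lines shown above, this counts four GC-AG and six GT-AG junctions.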

For paired-end mapping, use the parameter -j:

./hpg-aligner rna -f test_1.fq -j test_2.fq -i chr20-index -o mapped

MPI RNA-seq Aligning

You can follow the same steps as in RNA-seq Aligning, but run the application with the mpirun command. The number of processes must be one more than the number of nodes used (to use two nodes, the number of processes must be three), and the first and last processes should run on the same node.

Example of running MPI RNA on two nodes:

mpirun -np 3 -hosts compute-0,compute-1,compute-0 ./bin/hpg-aligner rna -i index_path/ -f reads.fq

Example of running MPI RNA on four nodes:

mpirun -np 5 -hosts compute-0,compute-1,compute-2,compute-3,compute-0 ./bin/hpg-aligner rna -i index_path/ -f reads.fq
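
The rule behind both examples (one process per node, plus one extra process placed on the first node again) can be sketched as a small shell function that builds the mpirun prefix from a list of node names. The function name mpirun_prefix is made up for illustration; it only prints the command prefix and does not start anything.

```shell
# Print the mpirun prefix for the given node names: the process count is the
# number of nodes plus one, and the first node is repeated at the end of the
# host list so that the first and last processes share a node.
# mpirun_prefix is a hypothetical helper, not an HPG Aligner command.
mpirun_prefix() {
    if [ "$#" -lt 1 ]; then
        echo "usage: mpirun_prefix NODE [NODE...]" >&2
        return 1
    fi
    np=$(( $# + 1 ))
    first=$1
    hosts=""
    for node in "$@"; do
        hosts="$hosts$node,"
    done
    echo "mpirun -np $np -hosts $hosts$first"
}

# Usage:
#   $(mpirun_prefix compute-0 compute-1) ./bin/hpg-aligner rna -i index_path/ -f reads.fq
```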

ATTENTION: MVAPICH2 has an incompatibility with tcmalloc. You can either install MVAPICH2 with the --disable-registration-cache option or execute this before running MPI HPG Aligner:

export MV2_USE_LAZY_MEM_UNREGISTER=0

If you compile MPI HPG Aligner without tcmalloc, performance will be degraded.