Getting started
You can learn how to use some of the most important features in this five-minute tutorial. It is assumed that you have HPG Aligner installed; otherwise, please visit the Downloads section to install the binaries or compile the source code. This tutorial uses the most recent version of HPG Aligner; command lines and parameters for other versions may differ slightly.
To run this worked example you also need:
- A small FASTQ paired-end dataset from Chromosome 20: test_paired_chr20.tar.gz
- Human chromosome 20 reference sequence: homo_sapiens.grch37.70.dna.chromosome.20.fa.tar.gz
Create a folder, download the compressed FASTQ genomic DNA dataset and the human chromosome into it, then uncompress both files:
mkdir tutorial
cd tutorial
tar zxvf test_paired_chr20.tar.gz
tar zxvf homo_sapiens.grch37.70.dna.chromosome.20.fa.tar.gz
In this tutorial the HPG Aligner binary is assumed to be in the same directory as the data, so copy it into the folder. You should now have these files:
test_1.fq
test_2.fq
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa
hpg-aligner
Before mapping, we need to create the index. HPG Aligner is one of the fastest tools at index creation thanks to a multi-threaded implementation, but the process may still take some time and need a lot of memory, depending on the size of the genome. You should not have any problem with this chromosome. Run the command build-sa-index, specifying the FASTA genome with -g and the output directory for the index with -i:
mkdir chr20-index
./hpg-aligner build-sa-index -g Homo_sapiens.GRCh37.70.dna.chromosome.20.fa -i chr20-index/
If the index generation succeeds, you'll have the following files in the folder chr20-index:
dna_compression.bin
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.A
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.CHROM
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.IA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.JA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.S
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.SA
index
params.info
params.txt
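Before mapping, it can be worth confirming that every index file in the listing above is present. The following is a minimal sketch, not part of HPG Aligner: the function name check_index is hypothetical, and the expected file names are simply taken from the listing, so they may differ for other versions.

```shell
# check_index DIR FASTA_NAME
# Hypothetical helper: reports each expected index file that is missing from DIR.
# The expected names are copied from the listing above (an assumption for other versions).
check_index() {
    dir=$1
    fasta=$2
    status=0
    for f in dna_compression.bin index params.info params.txt \
             "$fasta.A" "$fasta.CHROM" "$fasta.IA" "$fasta.JA" "$fasta.S" "$fasta.SA"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f"
            status=1
        fi
    done
    # Exit status is 0 only when nothing is missing
    return "$status"
}

# Usage in this tutorial:
#   check_index chr20-index Homo_sapiens.GRCh37.70.dna.chromosome.20.fa
```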
To map the FASTQ files, run the tool with the command dna, since the dataset contains genomic DNA sequences. The index folder is passed as an argument with -i. The default options should provide good performance, i.e. all available cores will be used. We create a folder called mapped to store the results:
mkdir mapped
./hpg-aligner dna -f test_1.fq -i chr20-index -o mapped
The execution should not take long, as the files are small. If you have a multi-core machine, you can check that all cores are being used with the command htop. HPG Aligner prints a small report:
----------------------------------------------
Loading SA tables...
End of loading SA tables in 0.02 min. Done!!
----------------------------------------------
Starting mapping...
End of mapping in 0.03 min. Done!!
----------------------------------------------
Output file : mapped/alignments.sam
Num. reads : 20359
Num. mapped reads : 20357 (99.99 %)
Num. unmapped reads: 2 (0.01 %)
Num. mappings : 21106
Num. multihit reads: 145
----------------------------------------------
The output file containing the resulting mappings is located in the folder specified by the parameter -o, in our case, the folder mapped:
ls -l mapped
alignments.sam
As you can see, SAM is the default output format; with the parameter --bam-format the output is written as BAM instead. Using this parameter makes the process slower.
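To cross-check the report against the SAM file itself, you can count alignment records by their FLAG field. This is a minimal sketch: the function name sam_summary is hypothetical, and it assumes that unmapped reads are written to alignments.sam with FLAG bit 0x4 set, as the SAM format defines. It counts records rather than reads, so its mapped figure is closer to "Num. mappings" than to "Num. mapped reads".

```shell
# sam_summary FILE
# Hypothetical helper: counts mapped and unmapped alignment records in a SAM file.
sam_summary() {
    awk -F '\t' '
        /^@/ { next }                              # skip header lines
        int($2 / 4) % 2 == 1 { unmapped++; next }  # FLAG bit 0x4 set: unmapped
        { mapped++ }                               # everything else is a mapping
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}

# Usage in this tutorial:
#   sam_summary mapped/alignments.sam
```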
For paired-end mapping, pass the second mate file with the parameter -j:
./hpg-aligner dna -f test_1.fq -j test_2.fq -i chr20-index -o mapped
----------------------------------------------
Loading SA tables...
End of loading SA tables in 0.02 min. Done!!
----------------------------------------------
Starting mapping...
End of mapping in 0.04 min. Done!!
----------------------------------------------
Output file : mapped/alignments.sam
Num. reads : 40718
Num. mapped reads : 40717 (100.00 %)
Num. unmapped reads: 1 (0.00 %)
Num. mappings : 41075
Num. multihit reads: 136
----------------------------------------------
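The paired-end report shows 40718 reads, twice the 20359 of the single-end run, which is what you expect when both mate files hold the same number of reads. A quick way to check that before mapping is sketched below. The function names are hypothetical, and the count assumes the usual unwrapped FASTQ layout of exactly four lines per read.

```shell
# count_fastq_reads FILE
# Hypothetical helper: number of reads, assuming four lines per FASTQ record.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# mates_match FILE1 FILE2
# Succeeds only if both mate files contain the same number of reads.
mates_match() {
    [ "$(count_fastq_reads "$1")" -eq "$(count_fastq_reads "$2")" ]
}

# Usage in this tutorial:
#   mates_match test_1.fq test_2.fq || echo "mate files differ in read count"
```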
To see all available parameters, type:
./hpg-aligner -h
To map RNA-seq reads, run the tool with the command rna instead of dna. As before, the index folder is passed as an argument and the default options should provide good performance, i.e. all available cores will be used. The results are again stored in a folder called mapped:
mkdir -p mapped
./hpg-aligner rna -f test_1.fq -i chr20-index -o mapped
+===============================================================+
| RNA MODE |
+===============================================================+
| ___ ___ ___ |
| \/ H \/ P \/ G \/ |
| /\___/\___/\___/\ |
| ___ ___ ___ ___ ___ ___ ___ |
| \/ A \/ L \/ I \/ G \/ N \/ E \/ R \/ |
| /\___/\___/\___/\___/\___/\___/\___/\ |
| |
+===============================================================+
Load Genome Status
[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100%
Mapping Status (First Phase)
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100%
Mapping Status (Second Phase)
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100%
+===============================================================+
| GLOBAL STATISTICS |
+===============================================================+
Loading Time (s) : 0.92 (s)
Alignment Time (s) : 0.24 (s)
Total Time (s) : 1.16 (s)
+===============================================================+
Total Reads Processed : 20359
+===============================================================+
Total Reads Mapped in First State : 13192 (64.80%)
+-------------------------------------------+-------------------+
Total Reads Mapped in Second State : 7167 (35.20%)
+-------------------------------------------+-------------------+
Total Reads Mapped : 20279 (99.61%)
Total Reads Unmapped : 80 (0.39%)
Total Reads with one alignment : 20127 (98.86%)
Total Reads with multiple alignment : 152 (0.75%)
+===============================================================+
| S P L I C E J U N C T I O N S S T A T I S T I C S |
+===============================================================+
Total splice junctions : 109
Total cannonical splice junctions : 57 (52.29%)
Total semi-cannonical splice junctions : 52 (47.71%)
+===============================================================+
In addition to alignments.sam, which contains the mappings found, RNA mode creates the file exact_junctions.bed to store the splice junctions.
head mapped/exact_junctions.bed
1 788280 902890 JUNCTION_0 1 + GC-AG
1 2098272 2165953 JUNCTION_1 1 + GT-AG
1 2140550 2387339 JUNCTION_2 1 + GC-AG
1 2203605 2244638 JUNCTION_3 1 + GC-AG
1 2456314 2511821 JUNCTION_4 1 + GC-AG
1 2572619 2581286 JUNCTION_5 1 + GT-AG
1 2629112 2629360 JUNCTION_6 1 + GT-AG
1 2662365 2765776 JUNCTION_7 1 + GT-AG
1 2890164 2927429 JUNCTION_8 1 + GT-AG
1 2970031 3038288 JUNCTION_9 1 + GT-AG
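The last column of each line holds the splice motif, so the junction file is easy to summarise. The sketch below tallies junctions per motif. The function name junction_motifs is hypothetical, and it assumes the seven-column layout shown in the listing above, with the motif as the final field.

```shell
# junction_motifs FILE
# Hypothetical helper: prints "<motif> <count>" per splice motif, sorted by motif.
# Assumes at least seven whitespace-separated columns with the motif last.
junction_motifs() {
    awk 'NF >= 7 { count[$NF]++ }
         END { for (m in count) print m, count[m] }' "$1" | sort
}

# Usage in this tutorial:
#   junction_motifs mapped/exact_junctions.bed
```

Applied to the ten lines shown above, this prints GC-AG 4 and GT-AG 6.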
For paired-end mapping, use the parameter -j:
./hpg-aligner rna -f test_1.fq -j test_2.fq -i chr20-index -o mapped
You can follow the same steps as for RNA-seq alignment, but you should run the application with the mpirun command. The number of processes must be one more than the number of nodes that will be used (to use two nodes, the number of processes must be three), and the first and last processes should run on the same node:
Example of running MPI RNA on two nodes:
mpirun -np 3 -hosts compute-0,compute-1,compute-0 ./bin/hpg-aligner rna -i index_path/ -f reads.fq
Example of running MPI RNA on four nodes:
mpirun -np 5 -hosts compute-0,compute-1,compute-2,compute-3,compute-0 ./bin/hpg-aligner rna -i index_path/ -f reads.fq
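The process-count rule in the two examples above can be derived mechanically from a list of node names. The sketch below does that; the function name mpi_layout is hypothetical and only builds the -np and -hosts values, it does not launch anything.

```shell
# mpi_layout NODE...
# Hypothetical helper: prints "<np> <hosts>", where np is the number of nodes
# plus one and hosts lists every node followed by the first node again.
mpi_layout() {
    [ "$#" -ge 1 ] || return 1
    first=$1
    hosts=""
    for node in "$@"; do
        hosts="${hosts}${node},"
    done
    echo "$(( $# + 1 )) ${hosts}${first}"
}

# Usage, reproducing the two-node example:
#   set -- $(mpi_layout compute-0 compute-1)
#   mpirun -np "$1" -hosts "$2" ./bin/hpg-aligner rna -i index_path/ -f reads.fq
```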
ATTENTION: MVAPICH2 has an incompatibility with tcmalloc. Either install MVAPICH2 with the --disable-registration-cache option, or run the following before launching MPI HPG Aligner:
export MV2_USE_LAZY_MEM_UNREGISTER=0
If you compile MPI HPG Aligner without tcmalloc, performance will be degraded.