Rescue potential false negative unmapped reads in alignment tools
Manuscript available now on bioRxiv: https://www.biorxiv.org/content/early/2018/06/13/345876
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Python 3 is required, along with the following libraries:
- biopython
- pysam
- intervaltree
These are listed in requirements.txt. Run the following command to install them:
pip install -r requirements.txt
Alternatively, type the following command to install these libraries:
pip3 install --upgrade biopython pysam intervaltree
The alignment tool that you will be using is also required. The following aligners are currently supported:
- STAR
- Subread
Make sure that the aligner you are using is in your PATH. BLASTN is also required for rescuing reads, so make sure that blastn is in your PATH as well.
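
A quick way to confirm that the required executables are visible before starting a run is sketched below. The executable names (`STAR`, `subread-align`, `subread-buildindex`, `blastn`) are assumptions about a typical installation and are not taken from this project's code, so adjust them to match your setup.

```python
import shutil

# Assumed executable names per aligner; edit these to match your installation.
REQUIRED_TOOLS = {
    "star": ["STAR", "blastn"],
    "subread": ["subread-align", "subread-buildindex", "blastn"],
}


def missing_tools(names):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Report, for each supported aligner, which executables are missing (if any).
for aligner, tools in REQUIRED_TOOLS.items():
    missing = missing_tools(tools)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{aligner}: {status}")
```

If a tool is reported as missing, add its install directory to PATH and run the check again.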
Rescue unmapped reads. When running the script, you can specify a pre-built aligner index. If you do not specify one, the script will first build the index automatically from the given FASTA file.
The script will produce a new SAM file called <prefix>_rescued.sam and some counting information.
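
To sanity-check the rescued SAM file independently of the script's own counts, the records can be tallied with plain Python. This is only a sketch: it relies on the SAM convention that FLAG bit 0x4 marks an unmapped record, it counts records rather than unique reads, and the file name `sample_rescued.sam` is a hypothetical placeholder.

```python
from collections import Counter


def count_sam_records(path):
    """Count mapped and unmapped records in a SAM file using FLAG bit 0x4."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith("@"):  # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 11:  # skip lines lacking the 11 mandatory fields
                continue
            flag = int(fields[1])
            counts["unmapped" if flag & 0x4 else "mapped"] += 1
    return counts


# Example (hypothetical file name):
# print(count_sam_records("sample_rescued.sam"))
```

This reads SAM text only. For output written with --bam, a library such as pysam would be needed instead.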
python3 scavenger.py [options] -G/--genome_file <genome_file> -i/--input <input> -at/--aligner_tool <aligner>
Required arguments:
Option | Description |
---|---|
-G/--genome_file <genome_file> | Genome FASTA file |
-i/--input <input> | A comma-separated list of input reads (Example: readA.fq,readB.fq). If the reads are paired, use a space to separate reads 1 and 2 (Example: readA_1.fq,readB_1.fq readA_2.fq,readB_2.fq) |
-at/--aligner_tool <aligner> | The alignment tool used to perform the alignment |
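
The -i format described above can be made concrete with a small parser. This is an illustrative sketch, not the parser used by scavenger.py; the function name `parse_reads` is hypothetical. It takes the whitespace-separated tokens given to -i and pairs up the mates by position.

```python
def parse_reads(tokens):
    """Turn the tokens given to -i into a list of per-sample file tuples.

    One token means single-end reads; two tokens mean paired-end reads,
    with the first token listing reads 1 and the second listing reads 2.
    """
    if len(tokens) not in (1, 2):
        raise ValueError("expected one token (single-end) or two (paired-end)")
    mates = [token.split(",") for token in tokens]
    if len(mates) == 1:
        return [(path,) for path in mates[0]]
    if len(mates[0]) != len(mates[1]):
        raise ValueError("reads 1 and reads 2 must list the same number of files")
    return list(zip(mates[0], mates[1]))


print(parse_reads(["readA.fq,readB.fq"]))
print(parse_reads(["readA_1.fq,readB_1.fq", "readA_2.fq,readB_2.fq"]))
```

In the paired case, the first file of reads 1 is matched with the first file of reads 2, so both lists need to be in the same sample order.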
Optional arguments:
Option | Description |
---|---|
-g/--genome_index <genome_index> | The directory of the aligner's index |
-a/--annotation <annotation> | Annotation file to be used by the index builder |
-be/--builder_extra_args <extra_args> | Extra arguments for building the aligner index. Use this option with quotes (Example: "-be=<extra_args>") |
-c/--consensus_threshold | Consensus threshold (Default: 0.6) |
--blast_perc_identity | Minimum percentage of identity for BLASTN |
--blast_perc_query_coverage | Minimum percentage of query coverage for BLASTN |
-r/--repeat_db <repeat_index> | The location of the index for a repetitive sequence database, e.g. RepBase. Including this argument filters out reads that align to the repetitive sequence database. |
-ae/--aligner_extra_args <extra_args> | Extra arguments for the aligner. Use this option with quotes (Example: "-ae=<extra_args>") |
-o/--output_dir <output_dir> | The output directory (Default: current directory) |
-p/--output_prefix <prefix> | The prefix for the output files (Default: uses the first input file as the prefix) |
--bam | Write the output in BAM format (Default: SAM format) |
--clean | Keep the alignment file but remove other files produced by the aligner (Default: keep all files) |
-t/--threads | The number of threads to use (Default: 4) |
For rescuing reads using STAR
python3 scavenger.py -G genome.fa -i readA.fq -at star -t 8
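
When launching a run from another script, the quoted "-ae=<extra_args>" form can be kept intact by passing it as a single list element. The sketch below only builds and prints the command. The helper name `scavenger_command` is hypothetical, and the STAR option `--outFilterMultimapNmax 20` is just an illustrative value, not a recommended setting.

```python
import shlex


def scavenger_command(genome, reads, aligner, threads=4, aligner_extra=None):
    """Build the argument list for a single-end scavenger.py run."""
    cmd = [
        "python3", "scavenger.py",
        "-G", genome,
        "-i", ",".join(reads),
        "-at", aligner,
        "-t", str(threads),
    ]
    if aligner_extra:
        # Keep the option and its value together as one "-ae=<extra_args>" token.
        cmd.append(f"-ae={aligner_extra}")
    return cmd


cmd = scavenger_command(
    "genome.fa", ["readA.fq"], "star",
    threads=8, aligner_extra="--outFilterMultimapNmax 20",
)
# shlex.join (Python 3.8+) shows how the command would be typed in a shell.
print(shlex.join(cmd))
```

The list can be handed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting altogether.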
Creates the index for a specified aligner
python3 utils/build_aligner_index.py [options] -G/--genome_file <genome_file> -at/--aligner_tool <aligner>
Required arguments:
Option | Description |
---|---|
-G/--genome_file <genome_file> | The reference genome file in FASTA format |
-at/--aligner_tool <aligner> | The alignment tool to build the index for |
Optional arguments:
Option | Description |
---|---|
-be/--builder_extra_args <extra_args> | Extra arguments for building the aligner index. Use this option with quotes (Example: "-be=<extra_args>") |
-a/--annotation <annotation> | The annotation file in GTF/GFF format |
-o/--output_dir <output_dir> | The output directory for the index (Default: current directory) |
-p/--output_prefix <prefix> | The prefix for the output index folder (Default: uses the genome file name as the prefix) |
-q/--quiet | Suppress logging output (Default: False) |
-t/--threads | The number of threads to be used by the index builder (Default: 4) |
For building an index for STAR
python3 utils/build_aligner_index.py -G genome.fa -at star -t 8
Runs a specific aligner
python3 utils/run_aligner.py [options] -i/--input <input> -g/--genome_index <genome_index> -at/--aligner_tool <aligner>
Required arguments:
Option | Description |
---|---|
-i/--input <input> | A comma-separated list of input reads (Example: readA.fq,readB.fq). If the reads are paired, use a space to separate reads 1 and 2 (Example: readA_1.fq,readB_1.fq readA_2.fq,readB_2.fq) |
-g/--genome_index <genome_index> | The directory of the aligner's index |
-at/--aligner_tool <aligner> | The alignment tool used to perform the alignment |
Optional arguments:
Option | Description |
---|---|
-ae/--aligner_extra_args <extra_args> | Extra arguments for the aligner. Use this option with quotes (Example: "-ae=<extra_args>") |
-o/--output_dir <output_dir> | The output directory (Default: current directory) |
-p/--output_prefix <prefix> | The prefix for the output files (Default: uses the first input file as the prefix) |
-q/--quiet | Suppress logging output (Default: False) |
-t/--threads | The number of threads to use (Default: 4) |
For a single single-end file using STAR
python3 utils/run_aligner.py -i readA.fq -g star_index/ -at star -t 8
For multiple single-end files using Subread
python3 utils/run_aligner.py -i readA.fq,readB.fq -g subread_index/ -at subread -t 8
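
The examples above cover single-end input only. For paired-end input, the -i description implies that reads 1 and reads 2 are passed as two space-separated lists. A minimal sketch of building such a command follows; the helper name `run_aligner_command` is hypothetical, and the file and index names are placeholders.

```python
import shlex


def run_aligner_command(index_dir, aligner, mate1, mate2=None, threads=4):
    """Build the argument list for utils/run_aligner.py.

    `mate1` lists the reads 1 files (or the single-end files);
    `mate2` optionally lists the matching reads 2 files in the same order.
    """
    cmd = ["python3", "utils/run_aligner.py", "-i", ",".join(mate1)]
    if mate2 is not None:
        if len(mate1) != len(mate2):
            raise ValueError("mate1 and mate2 must list the same number of files")
        cmd.append(",".join(mate2))  # second token after -i holds reads 2
    cmd += ["-g", index_dir, "-at", aligner, "-t", str(threads)]
    return cmd


paired = run_aligner_command(
    "star_index/", "star",
    mate1=["readA_1.fq", "readB_1.fq"],
    mate2=["readA_2.fq", "readB_2.fq"],
    threads=8,
)
print(shlex.join(paired))
```

Checking that both lists have the same length before launching catches a mismatched sample early, before the aligner starts.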