HTSQualC is an automated quality control analysis tool for single and paired-end high-throughput sequencing (HTS) data generated from Illumina sequencing platforms.
- Simultaneously filter and/or trim reads for adapter or primer contamination, uncalled bases (N), and low-quality reads
- Supports single and paired-end reads
- Analyze multiple samples simultaneously
- Parallel computation to speed up analysis
- Visualization and statistics
- Docker image is available
- Available on CyVerse Discovery Environment (DE)
- No dependency on external open-source tools
You need Python 3 (tested on 3.6 and 3.7) to install and run HTSQualC. The following Python 3 packages must be installed before running HTSQualC. If you have not installed these packages, HTSQualC will guide you to install them.
numpy
pysam
matplotlib
termcolor
datetime
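Before installing, it can help to see which of the packages listed above are already importable. The sketch below is not part of HTSQualC: the function name `missing_packages` is hypothetical, and it only checks that each module can be found, not which version is installed.

```python
import importlib.util

# Packages named in the requirements list above.
REQUIRED = ["numpy", "pysam", "matplotlib", "termcolor", "datetime"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any names it reports can then be installed with your usual package manager before running HTSQualC.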
Clone or download HTSQualC using the following command:
git clone https://github.com/reneshbedre/HTSQualC.git
To install HTSQualC, run the following command in the root folder:
python setup.py install
Alternatively, install using conda:
conda install -c bioconda htseqqc
Print the help message to see all required and optional parameters:
filter.py -h
usage: filter.py [-h] [-a INPUT_FILES_1] [-b INPUT_FILES_2] [-c QUAL_FMT]
[-e N_CONT] [-f ADPT_SEQS] [-d MIN_SIZE] [-g ADPT_MATCH]
[-i QUAL_THRESH] [-n TRIM_OPT] [-p WIND_SIZE]
[-r MIN_LEN_FILT] [-q CPU] [-m OUT_FMT] [-v VIS_OPT] [-z COMPRESS]
[--version]
Quality control analysis of single and paired-end sequence data
optional arguments:
-h, --help show this help message and exit
-a INPUT_FILES_1, --p1 INPUT_FILES_1
Single end input files or left files for paired-end
data (.fastq, .fq). Multiple sample files must be
separated by comma or space
-b INPUT_FILES_2, --p2 INPUT_FILES_2
Right files for paired-end data (.fastq, .fq).
Multiple files must be separated by comma or space
-c QUAL_FMT, --qfmt QUAL_FMT
Quality value format [1= Illumina 1.8, 2= Illumina
1.3,3= Sanger]. If quality format not provided, it
will automatically detect based on sequence data
-e N_CONT, --nb N_CONT
Filter the reads containing given % of uncalled bases
(N)
-f ADPT_SEQS, --adp ADPT_SEQS
Trim the adapter and truncate the read sequence
(multiple adapter sequences must be separated by
comma)
-d MIN_SIZE, --msz MIN_SIZE
Filter the reads which are lesser than minimum size
-g ADPT_MATCH, --per ADPT_MATCH
Truncate the read sequence if it matches to adapter
sequence equal or more than given percent (0.0-1.0)
[default=0.9]
-i QUAL_THRESH, --qthr QUAL_THRESH
Filter the read sequence if average quality of bases
in reads is lower than threshold (1-40) [default:20]
-n TRIM_OPT, --trim TRIM_OPT
If trim option set to True, the reads with low quality
(as defined by option --qthr) will be trimmed instead
of discarding [True|False] [default: False]
-p WIND_SIZE, --wsz WIND_SIZE
The window size for trimming (5->3) the reads. This
option should always set when -trim option is defined
[default: 5]
-r MIN_LEN_FILT, --mlk MIN_LEN_FILT
Minimum length of the reads to retain after trimming
-q CPU, --cpu CPU Number of CPU [default:2]
-m OUT_FMT, --ofmt OUT_FMT
Output file format (fastq/fasta) [default:fastq]
-v VIS_OPT, --no-vis VIS_OPT
No figures will be produced [True|False]
[default:False]
-z COMPRESS, --compress COMPRESS
Compress (.gz) the filtered FASTQ output [True|False]
[default:False]
--version show program's version number and exit
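To make the --nb, --qthr, and --trim/--wsz options more concrete, the sketch below shows one plausible way such criteria could be computed for a single read. It is an illustration under stated assumptions, not HTSQualC's actual implementation. The function names are hypothetical, and qualities are assumed to be Phred+33 encoded. The trimming rule, which cuts at the first window whose mean quality falls below the threshold, is a common sliding-window scheme that may differ from what the tool does. Whether the N-content boundary is inclusive is likewise an assumption.

```python
def percent_n(seq):
    """Percentage of uncalled bases (N) in a read sequence."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def phred_scores(qual, offset=33):
    """Decode an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def mean_quality(qual, offset=33):
    """Average Phred quality of a read; 0.0 for an empty read."""
    scores = phred_scores(qual, offset)
    return sum(scores) / len(scores) if scores else 0.0


def window_trim(seq, qual, threshold=20, window=5, offset=33):
    """Scan 5'->3' and cut the read at the first window whose mean quality is below `threshold`."""
    scores = phred_scores(qual, offset)
    for start in range(0, max(len(scores) - window + 1, 1)):
        chunk = scores[start:start + window]
        if chunk and sum(chunk) / len(chunk) < threshold:
            return seq[:start], qual[:start]
    return seq, qual


def keep_read(seq, qual, max_n_percent=5.0, threshold=20):
    """Keep a read if its N content does not exceed the limit and its mean quality reaches the threshold."""
    return percent_n(seq) <= max_n_percent and mean_quality(qual) >= threshold
```

For example, `window_trim("ACGTACGTAC", "IIIII!!!!!")` returns `("ACG", "III")`: the window starting at the fourth base is the first whose mean quality drops below 20, so the read is cut there.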
Filter single-end reads
# for single sample
filter.py OPTIONS -a fastq_file
# for multiple samples
filter.py OPTIONS -a fastq_file_1,fastq_file_2
Filter paired-end reads
# for single sample
filter.py OPTIONS -a fastq_file_left -b fastq_file_right
# for multiple samples
filter.py OPTIONS -a fastq_file_left_1,fastq_file_left_2 -b fastq_file_right_1,fastq_file_right_2
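When several paired-end samples are processed at once, the -a and -b lists need to name the samples in the same order. The helper below is a hypothetical sketch for building the two comma-separated values. It assumes file names ending in `_1.fastq` and `_2.fastq`, which may not match your own naming scheme.

```python
def paired_arguments(files, left_suffix="_1.fastq", right_suffix="_2.fastq"):
    """Return (left_list, right_list) as comma-separated strings in matching sample order."""
    available = set(files)
    lefts = sorted(f for f in files if f.endswith(left_suffix))
    rights = []
    for left in lefts:
        # Derive the expected mate name by swapping the suffix.
        right = left[: -len(left_suffix)] + right_suffix
        if right not in available:
            raise ValueError(f"No right-hand file found for {left}")
        rights.append(right)
    return ",".join(lefts), ",".join(rights)
```

The two returned strings can then be passed as the values of -a and -b.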
HTSQualC produces the filtered, cleaned HTS data as FASTQ/FASTA files, along with statistics and visualizations of the filtered datasets. The output is saved in a folder whose name ends with filtering_out.
This project is available under the MIT License. See the LICENSE file for complete details.
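After a run, the result folders can be collected programmatically. The sketch below is a hypothetical helper, not part of HTSQualC. It relies only on the stated convention that output folder names end with filtering_out and it looks one directory level deep. Where HTSQualC creates those folders is not assumed, so pass the directory you want searched.

```python
from pathlib import Path


def find_output_dirs(root="."):
    """Return directories directly under `root` whose names end with 'filtering_out'."""
    return sorted(
        p for p in Path(root).iterdir()
        if p.is_dir() and p.name.endswith("filtering_out")
    )


if __name__ == "__main__":
    for out_dir in find_output_dirs():
        print(out_dir)
```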
Download the paired-end and single-end test data using the NCBI SRA Toolkit:
fastq-dump --split-files SRR2165176
fastq-dump --split-files SRR2165177
fastq-dump --split-files SRR2165178
fastq-dump SRR1805340
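Once the files are downloaded, a quick sanity check is to confirm that both files of a pair contain the same number of reads. The sketch below is a hypothetical helper, not part of HTSQualC or the SRA Toolkit. It assumes uncompressed FASTQ files with exactly four lines per record.

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming four lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def pairs_match(left_path, right_path):
    """True if both files of a pair contain the same number of reads."""
    return count_fastq_reads(left_path) == count_fastq_reads(right_path)
```

For instance, `pairs_match("SRR2165176_1.fastq", "SRR2165176_2.fastq")` compares the two files produced for the first paired-end sample.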
Run HTSQualC as a command-line tool (Linux and Mac)
- for paired-end data with default parameters (setting 1)
filter.py --cpu 18 --p1 SRR2165176_1.fastq --p2 SRR2165176_2.fastq
- for paired-end data with quality threshold, adapter sequence, and uncalled base parameters (setting 2)
filter.py --cpu 18 --qthr 25 --nb 5 --adp AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --p1 SRR2165176_1.fastq --p2 SRR2165176_2.fastq
- for paired-end data with default parameters and multiple samples (setting 3)
filter.py --cpu 18 --p1 SRR2165176_1.fastq,SRR2165177_1.fastq,SRR2165178_1.fastq --p2 SRR2165176_2.fastq,SRR2165177_2.fastq,SRR2165178_2.fastq
- for single-end data with default parameters (setting 4)
filter.py --cpu 18 --p1 SRR1805340.fastq
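The settings above can also be assembled from a script, which is convenient when the sample list is generated elsewhere. The sketch below is a hypothetical wrapper, not part of HTSQualC. It uses only options that appear in the help message (--cpu, --qthr, --nb, --adp, --p1, --p2) and assumes `filter.py` is installed and on the PATH when the command is run.

```python
import subprocess


def build_filter_command(p1, p2=None, cpu=2, qthr=None, nb=None, adapters=None):
    """Assemble a filter.py argument list from Python values."""
    cmd = ["filter.py", "--cpu", str(cpu)]
    if qthr is not None:
        cmd += ["--qthr", str(qthr)]
    if nb is not None:
        cmd += ["--nb", str(nb)]
    if adapters:
        cmd += ["--adp", ",".join(adapters)]
    # Multiple samples are passed as one comma-separated value.
    cmd += ["--p1", ",".join(p1)]
    if p2:
        if len(p2) != len(p1):
            raise ValueError("p1 and p2 must list the same number of files")
        cmd += ["--p2", ",".join(p2)]
    return cmd


def run_filter(cmd):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)
```

For example, `build_filter_command(["SRR2165176_1.fastq"], ["SRR2165176_2.fastq"], cpu=18)` yields the same arguments as setting 1.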