Short Reads:
This workflow uses the program rqcfilter2 from BBTools to perform quality control on raw Illumina short reads. The workflow performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using BBDuk), and performs human/cat/dog/mouse/microbe removal (using BBMap).
The following parameters are used for rqcfilter2 in this workflow:
- qtrim=r : Quality-trim from right ends before mapping.
- trimq=0 : Trim quality threshold.
- maxns=3 : Reads with more Ns than this will be discarded.
- maq=3 : Reads with average quality (before trimming) below this will be discarded.
- minlen=51 : Reads shorter than this after trimming will be discarded. Pairs will be discarded only if both are shorter.
- mlf=0.33 : Reads shorter than this fraction of original length after trimming will be discarded.
- phix=true : Remove reads containing phiX kmers.
- khist=true : Generate a kmer-frequency histogram of the output data.
- kapa=true : Remove and quantify kapa tags.
- trimpolyg=5 : Trim reads that start or end with a G polymer at least this long.
- clumpify=true : Run Clumpify; all deduplication flags require this.
- removehuman=true : Remove human reads via mapping.
- removedog=true : Remove dog reads via mapping.
- removecat=true : Remove cat reads via mapping.
- removemouse=true : Remove mouse reads via mapping.
- barcodefilter=false : Do not filter out reads with improper barcodes.
- chastityfilter=false : Do not remove Illumina reads failing the chastity filter.
- trimfragadapter=true : Trim all known Illumina adapter sequences, including TruSeq and Nextera.
- removemicrobes=true : Remove common contaminant microbial reads via mapping, and place them in a separate file.
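For reference, the sketch below shows how these options could be passed to rqcfilter2.sh on the command line. The input file, output path, and database location are placeholders, and this is an illustration rather than the exact command issued by the WDL tasks:

# illustrative only; adjust paths and memory to your environment
rqcfilter2.sh in=SRR7877884-int-0.1.fastq.gz path=filtered/ \
    rqcfilterdata=refdata/RQCFilterData \
    qtrim=r trimq=0 maxns=3 maq=3 minlen=51 mlf=0.33 \
    phix=true khist=true kapa=true trimpolyg=5 clumpify=true \
    removehuman=true removedog=true removecat=true removemouse=true \
    barcodefilter=false chastityfilter=false \
    trimfragadapter=true removemicrobes=true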
Long Reads:
This workflow performs quality control on long reads from PacBio. The workflow performs duplicate removal (using pbmarkdup), inverted repeat filtering (using BBTools icecreamfinder.sh), adapter trimming, and final filtering of reads with residual adapter sequences (using bbduk). The workflow handles input files in various formats, including .bam, .fq, and .fq.gz.
The following parameters are used for each stage in the workflow:
- rmdup=true : Enables duplicate removal in the initial filtering.
- k=20, mink=12 : K-mer sizes for adapter detection and trimming.
- edist=1 : Error distance for k-mer matches.
- ktrimtips=60 : Trims adapters from the ends of reads.
- phix=true : Removes reads containing PhiX sequences.
- json=true : Outputs statistics in JSON format for easier parsing.
- chastityfilter=true : Removes reads failing the chastity filter.
- removehuman=true : Removes human reads in contamination analysis (optional).
- removemicrobes=true : Removes common microbial contaminants.
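The sketch below chains these stages on the command line. File names are placeholders, and the flag spellings for icecreamfinder.sh and bbduk.sh are assumptions based on the parameters listed above, not a transcript of the WDL tasks:

# illustrative only; the real tasks run inside the bbtools container
pbmarkdup --rmdup input.bam dedup.fastq.gz
icecreamfinder.sh in=dedup.fastq.gz out=noinvrep.fastq.gz json=true
bbduk.sh in=noinvrep.fastq.gz out=filtered.fastq.gz ref=adapters \
    k=20 mink=12 edist=1 ktrimtips=60 json=true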
The workflow uses Docker images to run all third-party tools.
The workflow is available on GitHub: https://github.com/microbiomedata/ReadsQC
The Docker image is available on DockerHub: https://hub.docker.com/r/microbiomedata/bbtools.
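To obtain the workflow code and the container image locally, something like the following should work (the untagged pull defaults to the latest image; pin a specific tag for reproducible runs):

git clone https://github.com/microbiomedata/ReadsQC.git
docker pull microbiomedata/bbtools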
Requirements for execution (recommendations are in parentheses):
- WDL-capable Workflow Execution Tool (Cromwell)
- Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)
- Disk space: 106 GB for the RQCFilterData database
- Memory: >40 GB RAM
- BBTools v38.96 (License: BSD-3-Clause-LBNL)
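Once these requirements and the database below are in place, a typical Cromwell launch looks like the following sketch; the jar, WDL, and JSON file names are placeholders:

java -jar cromwell.jar run rqcfilter.wdl -i input.json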
The RQCFilterData Database must be downloaded and installed. This is a 106 GB tar file which includes reference datasets of artifacts, adapters, contaminants, the phiX genome, and some host genomes.
The following commands will download the database:
mkdir refdata
wget http://portal.nersc.gov/dna/microbial/assembly/bushnell/RQCFilterData.tar
tar -xvf RQCFilterData.tar -C refdata
rm RQCFilterData.tar
Short Reads:
- small dataset: E. coli 10x. (Input/output included in tar.gz file).
- large dataset: Zymobiomics mock-community DNA control (SRR7877884); the original gzipped dataset is ~5.7 GB. (Input/output included in tar.gz file). For testing purposes and for the following examples, we used a 10% sub-sampling of the above dataset: SRR7877884-int-0.1.fastq.gz. This dataset is already interleaved.
Note
If the input data is paired-end data, it must be in interleaved format. The following command will interleave the files, using the above dataset as an example:
paste <(zcat SRR7877884_1.fastq.gz | paste - - - -) <(zcat SRR7877884_2.fastq.gz | paste - - - -) | tr '\t' '\n' | gzip -c > SRR7877884-int.fastq.gz
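A 10% subsample like the one used above can be produced with, for example, seqtk (an illustration; not necessarily how the NMDC test file was generated). Sampling each mate file with the same seed keeps pairs matched before interleaving:

# the same seed (-s100) on both files preserves read pairing
seqtk sample -s100 SRR7877884_1.fastq.gz 0.1 | gzip -c > sub_1.fastq.gz
seqtk sample -s100 SRR7877884_2.fastq.gz 0.1 | gzip -c > sub_2.fastq.gz
# then interleave sub_1/sub_2 as shown above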
Long Reads:
Zymobiomics synthetic metagenome (SRR13128014). The original dataset is ~18 GB; for testing we subsampled it.
A JSON file containing the following information:
- the path to the interleaved fastq file (long reads and short reads)
- forward reads fastq file (when input_interleaved is false)
- reverse reads fastq file (when input_interleaved is false)
- project id
- if the input is interleaved (boolean)
- if the input is shortreads (boolean)
An example input JSON file is shown below:
Short Reads, Interleaved:
{
"rqcfilter.input_files": ["https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"],
"rqcfilter.input_fq1": [],
"rqcfilter.input_fq2": [],
"rqcfilter.proj": "nmdc:xxxxxxx",
"rqcfilter.interleaved": true,
"rqcfilter.shortRead": true
}
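Long Reads (a hypothetical example; the .bam path is a placeholder, and setting "rqcfilter.interleaved" to false for long reads is an assumption):
{
"rqcfilter.input_files": ["/path/to/pacbio_reads.bam"],
"rqcfilter.input_fq1": [],
"rqcfilter.input_fq2": [],
"rqcfilter.proj": "nmdc:xxxxxxx",
"rqcfilter.interleaved": false,
"rqcfilter.shortRead": false
}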
Note
In an HPC environment, multiple samples can be processed in parallel, both interleaved and non-interleaved short reads. The "rqcfilter.input_files" parameter is an array; multiple samples can be supplied as comma-separated entries.
- Example:
Interleaved: "rqcfilter.input_files": ["first-int.fastq","second-int.fastq"]
Non-Interleaved: "rqcfilter.input_fq1": ["first-R1.fastq","second-R1.fastq"], "rqcfilter.input_fq2": ["first-R2.fastq","second-R2.fastq"]
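Putting this together, a complete non-interleaved short-read input JSON might look like the following (leaving "rqcfilter.input_files" empty in this mode is an assumption):
{
"rqcfilter.input_files": [],
"rqcfilter.input_fq1": ["first-R1.fastq","second-R1.fastq"],
"rqcfilter.input_fq2": ["first-R2.fastq","second-R2.fastq"],
"rqcfilter.proj": "nmdc:xxxxxxx",
"rqcfilter.interleaved": false,
"rqcfilter.shortRead": true
}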
The output directory will contain the following files for short reads:
output/
├── nmdc_xxxxxxx_filtered.fastq.gz
├── nmdc_xxxxxxx_filterStats.txt
├── nmdc_xxxxxxx_filterStats2.txt
├── nmdc_xxxxxxx_readsQC.info
└── nmdc_xxxxxxx_qa_stats.json
The output directory will contain the following files for long reads:
output/
├── nmdc_xxxxxxx_pbmarkdupStats.txt
├── nmdc_xxxxxxx_readsQC.info
├── nmdc_xxxxxxx_bbdukEndsStats.json
├── nmdc_xxxxxxx_icecreamStats.json
├── nmdc_xxxxxxx_filtered.fastq.gz
└── nmdc_xxxxxxx_stats.json
An example output txt file (filterStats.txt) for short reads is shown below:
inputReads=250000
inputBases=37109226
qtrimmedReads=0
qtrimmedBases=0
qfilteredReads=208
qfilteredBases=10798
ktrimmedReads=456
ktrimmedBases=7726
kfilteredReads=0
kfilteredBases=0
outputReads=249398
outputBases=37003919
gcPolymerRatio=0.165888
Below is an example of all the output directory files with descriptions to the right.
FileName | Description |
---|---|
Short Reads | |
nmdc_xxxxxxx_filtered.fastq.gz | main output (clean data) |
nmdc_xxxxxxx_filterStats.txt | summary statistics |
nmdc_xxxxxxx_filterStats2.txt | more detailed summary statistics |
nmdc_xxxxxxx_readsQC.info | summary of parameters used in BBTools rqcfilter2 |
nmdc_xxxxxxx_qa_stats.json | summary statistics of output bases, input reads, input bases, output reads |
Long Reads | |
nmdc_xxxxxxx_filtered.fastq.gz | main output (clean data) |
nmdc_xxxxxxx_pbmarkdupStats.txt | statistics from the pbmarkdup duplicate removal |
nmdc_xxxxxxx_readsQC.info | summary of parameters and tools used in QC |
nmdc_xxxxxxx_bbdukEndsStats.json | JSON statistics from bbduk adapter trimming on ends |
nmdc_xxxxxxx_icecreamStats.json | JSON statistics from inverted repeat filtering |
nmdc_xxxxxxx_stats.json | summary statistics of output bases, input reads, input bases, output reads |
Version History:
- 1.0.13 (release date 11/07/2024; previous versions: 1.0.12)
Point of contact:
- Original author: Brian Bushnell <[email protected]>
- Package maintainer: Chienchi Lo <[email protected]>