Update Documentation #41

Merged
merged 5 commits into from Nov 18, 2024
68 changes: 38 additions & 30 deletions README.md
@@ -1,8 +1,13 @@
# The Data Preprocessing workflow
# The Data Preprocessing Workflow

## Summary

This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/data-preprocessing/) implemented at JGI for Illumina reads and use the program “rqcfilter2” from BBTools(38:96) which implements them as a pipeline.
This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/data-preprocessing/) implemented at JGI for Illumina reads.

This workflow utilizes the program `rqcfilter2` from BBTools to perform quality control on raw Illumina reads for **shortreads**. The workflow performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using `BBDuk`), and performs human/cat/dog/mouse/microbe removal (using `BBMap`).

This workflow performs quality control on long reads from PacBio. The workflow performs duplicate removal (using `pbmarkdup`), inverted repeat filtering (using BBTools
`icecreamfinder.sh`), adapter trimming, and final filtering of reads with residual adapter sequences (using `bbduk`). The workflow is designed to handle input files in various formats, including .bam, .fq, or .fq.gz.
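
For orientation, below is a minimal sketch of those long-read stages as standalone commands. It is an assumption-laden illustration, not the workflow's actual invocation: file names, the adapter reference, and the exact `pbmarkdup` flags are placeholders, and the WDL wires these steps together with additional options.

```bash
# Hypothetical long-read QC sketch; file names and flags are illustrative placeholders.
set -euo pipefail

# 1) Duplicate removal with pbmarkdup (--rmdup drops duplicates rather than marking them).
pbmarkdup --rmdup reads.ccs.fastq.gz dedup.fastq.gz

# 2) Inverted-repeat ("ice cream") filtering with BBTools icecreamfinder.sh.
icecreamfinder.sh in=dedup.fastq.gz out=noir.fastq.gz json=t

# 3) Adapter trimming, then filtering of reads with residual adapter sequence, with bbduk.
bbduk.sh in=noir.fastq.gz out=filtered.fastq.gz ref=adapters.fa k=20 mink=12 edist=1 ktrimtips=60
```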

## Required Database

@@ -17,33 +22,27 @@ This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-t
rm RQCFilterData.tgz
```
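
If fetching the database manually, the full sequence is roughly as follows; the download URL is an assumption (take the authoritative one from the doc site), and only the extraction/cleanup steps appear in this README:

```bash
# Sketch of the database download; the URL is assumed, confirm it on the doc site.
mkdir -p refdata && cd refdata
wget https://portal.nersc.gov/cfs/m3408/db/RQCFilterData.tgz   # large download
tar -xzf RQCFilterData.tgz
rm RQCFilterData.tgz
```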

## Running Workflow in Cromwell

Description of the files:
- `.wdl` file: the WDL file for workflow definition
- `.json` file: the example input for the workflow
- `.conf` file: the conf file for running Cromwell.
- `.sh` file: the shell script for running the example workflow
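
Assuming the repository's files keep the roles listed above, a typical local Cromwell invocation would look like this (the jar version and file names are placeholders):

```bash
# Placeholder names; substitute the actual .wdl/.json/.conf files from this repository.
java -Dconfig.file=cromwell.conf -jar cromwell-86.jar run rqcfilter.wdl -i input.json
```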

## The Docker image and Dockerfile can be found here

[microbiomedata/bbtools:38.92](https://hub.docker.com/r/microbiomedata/bbtools)
[microbiomedata/bbtools:38.96](https://hub.docker.com/r/microbiomedata/bbtools)

## Input files

1. database path,
2. fastq (illumina paired-end interleaved fastq),
3. project name
4. resource where the workflow is run
5. informed_by
1. the path to the interleaved fastq file (longreads and shortreads)
2. forward reads fastq file (when input_interleaved is false)
3. reverse reads fastq file (when input_interleaved is false)
4. project id
5. if the input is interleaved (boolean)
6. if the input is shortreads (boolean)

```
{
"nmdc_rqcfilter.database": "/global/cfs/projectdirs/m3408/aim2/database",
"nmdc_rqcfilter.input_files": "/global/cfs/cdirs/m3408/ficus/8434.3.102077.AGTTCC.fastq.gz",
"nmdc_rqcfilter.proj":"nmdc:xxxxxxx",
"nmdc_rqcfilter.resouce":"NERSC -- perlmutter",
"nmdc_rqcfilter.informed_by": "nmdc:xxxxxxxx"
"rqcfilter.input_files": ["https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"],
"rqcfilter.input_fq1": [],
"rqcfilter.input_fq2": [],
"rqcfilter.proj": "nmdc:xxxxxxx",
"rqcfilter.interleaved": true,
"rqcfilter.shortRead": true
}
```
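
Before submitting, the inputs file can be checked against the WDL with Cromwell's `womtool` (a sketch; the jar version and file names are placeholders):

```bash
# Validate workflow syntax and the inputs JSON with womtool; names are placeholders.
java -jar womtool-86.jar validate rqcfilter.wdl -i input.json
```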

@@ -54,13 +53,22 @@ The output will have one directory named by prefix of the fastq input file and a
The main QC fastq output is named prefix.anqdpht.fastq.gz.

```
|-- 8434.1.102069.ACAGTG.anqdpht.fastq.gz
|-- filterStats.txt
|-- filterStats.json
|-- filterStats2.txt
|-- adaptersDetected.fa
|-- reproduce.sh
|-- spikein.fq.gz
|-- status.log
|-- ...
# Short Reads
output/
├── nmdc_xxxxxxx_filtered.fastq.gz
├── nmdc_xxxxxxx_filterStats.txt
├── nmdc_xxxxxxx_filterStats2.txt
├── nmdc_xxxxxxx_readsQC.info
└── nmdc_xxxxxxx_qa_stats.json
# Long Reads
output/
├── nmdc_xxxxxxx_pbmarkdupStats.txt
├── nmdc_xxxxxxx_readsQC.info
├── nmdc_xxxxxxx_bbdukEndsStats.json
├── nmdc_xxxxxxx_icecreamStats.json
├── nmdc_xxxxxxx_filtered.fastq.gz
└── nmdc_xxxxxxx_stats.json
```

## Link to Doc Site
Please refer to the [doc site](https://nmdc-workflow-documentation.readthedocs.io/en/latest/chapters/1_RQC_index.html) for more information.
126 changes: 75 additions & 51 deletions docs/index.rst
@@ -1,4 +1,4 @@
Reads QC Workflow (v1.0.12)
Reads QC Workflow (v1.0.13)
=============================

.. image:: rqc_workflow.png
@@ -9,9 +9,12 @@ Reads QC Workflow (v1.0.12)
Workflow Overview
-----------------

This workflow utilizes the program “rqcfilter2” from BBTools to perform quality control on raw Illumina reads for **shortreads** and raw PacBio reads for **longreads**. The workflow performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using BBDuk), and performs human/cat/dog/mouse/microbe removal (using BBMap).
**Short Reads:**

This workflow utilizes the program :literal:`rqcfilter2` from BBTools to perform quality control on raw Illumina reads for **shortreads**. The workflow performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using :literal:`BBDuk`), and performs human/cat/dog/mouse/microbe removal (using :literal:`BBMap`).

The following parameters are used for :literal:`rqcfilter2` in this workflow::

The following parameters are used for "rqcfilter2" in this workflow::
- qtrim=r : Quality-trim from right ends before mapping.
- trimq=0 : Trim quality threshold.
- maxns=3 : Reads with more Ns than this will be discarded.
@@ -32,13 +35,32 @@ The following parameters are used for "rqcfilter2" in this workflow::
- trimfragadapter=true: Trim all known Illumina adapter sequences, including TruSeq and Nextera.
- removemicrobes=true : Remove common contaminant microbial reads via mapping, and place them in a separate file.
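
Put together, a bare-bones :literal:`rqcfilter2.sh` call using the flags above would look roughly like the following sketch; the input, output directory, and database paths are placeholders, and the actual WDL task passes additional options:

.. code-block:: bash

   # Sketch only: paths are placeholders and the workflow adds further flags.
   rqcfilter2.sh in=reads.fastq.gz path=out_dir rqcfilterdata=/path/to/RQCFilterData \
       qtrim=r trimq=0 maxns=3 trimfragadapter=t removemicrobes=t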

**Long Reads:**

This workflow performs quality control on long reads from PacBio. The workflow performs duplicate removal (using :literal:`pbmarkdup`), inverted repeat filtering (using BBTools
:literal:`icecreamfinder.sh`), adapter trimming, and final filtering of reads with residual adapter sequences (using :literal:`bbduk`). The workflow is designed to handle input files in various formats, including .bam, .fq, or .fq.gz.

The following parameters are used for each stage in the workflow::

- rmdup=true : Enables duplicate removal in the initial filtering.
- k=20, mink=12 : K-mer sizes for adapter detection and trimming.
- edist=1 : Error distance for k-mer matches.
- ktrimtips=60 : Trims adapters from the ends of reads.
- phix=true : Removes reads containing PhiX sequences.
- json=true : Outputs statistics in JSON format for easier parsing.
- chastityfilter=true : Removes reads failing the chastity filter.
- removehuman=true : Removes human reads in contamination analysis (optional).
- removemicrobes=true : Removes common microbial contaminants.
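
As a rough illustration of where these flags land, the adapter-trimming stage alone might be invoked as below; the file names and adapter reference are placeholder assumptions:

.. code-block:: bash

   # Sketch of the bbduk end-trimming stage; names are placeholders.
   bbduk.sh in=noir.fastq.gz out=trimmed.fastq.gz ref=adapters.fa \
       k=20 mink=12 edist=1 ktrimtips=60 json=t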


Workflow Availability
---------------------

The workflow from GitHub uses all the listed docker images to run all third-party tools.
The workflow is available in GitHub: https://github.com/microbiomedata/ReadsQC; the corresponding
Docker image is available in DockerHub: https://hub.docker.com/r/microbiomedata/bbtools.

The workflow is available in GitHub: https://github.com/microbiomedata/ReadsQC

The Docker image is available in DockerHub: https://hub.docker.com/r/microbiomedata/bbtools.

Requirements for Execution
--------------------------
@@ -77,10 +99,10 @@ The following commands will download the database::

Sample dataset(s)
-----------------
**Short Reads:**

- small dataset: `Ecoli 10x <https://portal.nersc.gov/cfs/m3408/test_data/ReadsQC_small_test_data.tgz>`_ . You can find input/output in the downloaded tar gz file.

- large dataset: Zymobiomics mock-community DNA control (`SRR7877884 <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_); the `original gzipped dataset <https://portal.nersc.gov/cfs/m3408/test_data/ReadsQC_large_test_data.tgz>`_ is ~5.7 GB. You can find input/output in the downloaded tar gz file.
- small dataset: `Ecoli 10x <https://portal.nersc.gov/cfs/m3408/test_data/ReadsQC_small_test_data.tgz>`_. (Input/output included in tar.gz file).
- large dataset: Zymobiomics mock-community DNA control (`SRR7877884 <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_); the `original gzipped dataset <https://portal.nersc.gov/cfs/m3408/test_data/ReadsQC_large_test_data.tgz>`_ is ~5.7 GB. (Input/output included in tar.gz file). For testing purposes and for the following examples, we used a 10% sub-sampling of the above dataset: `SRR7877884-int-0.1.fastq.gz <https://portal.nersc.gov/cfs/m3408/test_data/SRR7877884-int-0.1.fastq.gz>`_. This dataset is already interleaved.


.. note::
@@ -90,20 +112,22 @@ Sample dataset(s)
.. code-block:: bash

paste <(zcat SRR7877884_1.fastq.gz | paste - - - -) <(zcat SRR7877884_2.fastq.gz | paste - - - -) | tr '\t' '\n' | gzip -c > SRR7877884-int.fastq.gz

For testing purposes and for the following examples, we used a 10% sub-sampling of the above dataset: `SRR7877884-int-0.1.fastq.gz <https://portal.nersc.gov/cfs/m3408/test_data/SRR7877884-int-0.1.fastq.gz>`_. This dataset is already interleaved.
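
If BBTools is available, :literal:`reformat.sh` interleaves the same pair in one step (an alternative sketch, not part of the workflow itself):

.. code-block:: bash

   # Interleave paired FASTQs with BBTools reformat.sh, equivalent to the paste one-liner above.
   reformat.sh in1=SRR7877884_1.fastq.gz in2=SRR7877884_2.fastq.gz out=SRR7877884-int.fastq.gz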

Inputs
**Long Reads:**

Zymobiomics synthetic metagenome (`SRR13128014 <https://portal.nersc.gov/project/m3408//test_data/SRR13128014.pacbio.subsample.ccs.fastq.gz>`_). For testing we have subsampled the dataset; the original dataset is ~18 GB.

Input
------

A JSON file containing the following information:
A `JSON file <https://github.com/microbiomedata/ReadsQC/blob/documentation/input.json>`_ containing the following information:

1. the path to the interleaved fastq file (longreads and shortreads)
2. forward reads fastq file (when input_interleaved is false)
3. reverse reads fastq file (when input_interleaved is false)
4. project id
5. if the input is interleaved (boolean)
6. if the input is shortreads (boolean)
1. the path to the interleaved fastq file (longreads and shortreads)
2. forward reads fastq file (when input_interleaved is false)
3. reverse reads fastq file (when input_interleaved is false)
4. project id
5. if the input is interleaved (boolean)
6. if the input is shortreads (boolean)


An example input JSON file is shown below:
@@ -124,15 +148,35 @@

In an HPC environment, parallel processing allows multiple samples, both interleaved and non-interleaved, to be processed at once for **shortreads**. The "rqcfilter.input_files" parameter is an array; multiple sample files can be supplied as input, separated by commas (,).

Ex: **Interleaved**: "rqcfilter.input_files":[“first-int.fastq”,”second-int.fastq”]; **Non-Interleaved**: "rqcfilter.input_fq1": [“first-int-R1.fastq”,”second-int-R1.fastq”], "rqcfilter.input_fq2": [“first-int-R2.fastq”,”second-int-R2.fastq”]
Example:
**Interleaved**: "rqcfilter.input_files": ["first-int.fastq", "second-int.fastq"]

**Non-Interleaved**: "rqcfilter.input_fq1": ["first-int-R1.fastq", "second-int-R1.fastq"], "rqcfilter.input_fq2": ["first-int-R2.fastq", "second-int-R2.fastq"]
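
Concretely, a two-sample non-interleaved run could use an inputs file like the sketch below (written as a shell heredoc to keep it self-contained; the sample paths and nmdc id are placeholders):

.. code-block:: bash

   # Hypothetical two-sample, non-interleaved inputs file; ids and paths are placeholders.
   cat > inputs.json <<'EOF'
   {
     "rqcfilter.input_files": [],
     "rqcfilter.input_fq1": ["first-R1.fastq", "second-R1.fastq"],
     "rqcfilter.input_fq2": ["first-R2.fastq", "second-R2.fastq"],
     "rqcfilter.proj": "nmdc:xxxxxxx",
     "rqcfilter.interleaved": false,
     "rqcfilter.shortRead": true
   }
   EOF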


Output
------

A directory named with the prefix of the FASTQ input file will be created and multiple output files are generated; the main QC FASTQ output is named prefix.anqdpht.fastq.gz. Using the dataset above as an example, the main output would be named SRR7877884-int-0.1.anqdpht.fastq.gz. Other files include statistics on the quality of the data; what was trimmed, detected, and filtered in the data; a status log, and a shell script documenting the steps implemented so the workflow can be reproduced.
The output directory will contain the following files for short reads::

output/
├── nmdc_xxxxxxx_filtered.fastq.gz
├── nmdc_xxxxxxx_filterStats.txt
├── nmdc_xxxxxxx_filterStats2.txt
├── nmdc_xxxxxxx_readsQC.info
└── nmdc_xxxxxxx_qa_stats.json

An example output txt file (filterStats.txt) is shown below:
The output directory will contain the following files for long reads::

output/
├── nmdc_xxxxxxx_pbmarkdupStats.txt
├── nmdc_xxxxxxx_readsQC.info
├── nmdc_xxxxxxx_bbdukEndsStats.json
├── nmdc_xxxxxxx_icecreamStats.json
├── nmdc_xxxxxxx_filtered.fastq.gz
└── nmdc_xxxxxxx_stats.json

An example output txt file (:literal:`filterStats.txt`) for short reads is shown below:

.. code-block:: text

@@ -156,46 +200,26 @@ Below is an example of all the output directory files with descriptions to the right
==================================== ============================================================================
FileName Description
==================================== ============================================================================
**Short Reads**
nmdc_xxxxxxx_filtered.fastq.gz main output (clean data)
nmdc_xxxxxxx_filterStats.txt summary statistics
nmdc_xxxxxxx_filterStats2.txt more detailed summary statistics
nmdc_xxxxxxx_readsQC.info summary of parameters used in BBTools rqcfilter2
nmdc_xxxxxxx_readsQC.info summary of parameters used in :literal:`BBTools rqcfilter2`
nmdc_xxxxxxx_qa_stats.json summary statistics of output bases, input reads, input bases, output reads

adaptersDetected.fa adapters detected and removed
bhist.txt base composition histogram by position
cardinality.txt estimation of the number of unique kmers
commonMicrobes.txt detected common microbes
file-list.txt output file list for rqcfilter2.sh
gchist.txt GC content histogram
human.fq.gz detected human sequence reads
ihist_merge.txt insert size histogram
khist.txt kmer-frequency histogram
kmerStats1.txt synthetic molecule (phix, linker, lambda, pJET) filter run log
kmerStats2.txt synthetic molecule (short contamination) filter run log
ktrim_kmerStats1.txt detected adapters filter run log
ktrim_scaffoldStats1.txt detected adapters filter statistics
microbes.fq.gz detected common microbes sequence reads
microbesUsed.txt common microbes list for detection
peaks.txt number of unique kmers in each peak on the histogram
phist.txt polymer length histogram
refStats.txt human reads filter statistics
reproduce.sh the shell script to reproduce the run
scaffoldStats1.txt detected synthetic molecule (phix, linker, lambda, pJET) statistics
scaffoldStats2.txt detected synthetic molecule (short contamination) statistics
scaffoldStatsSpikein.txt detected spike-in kapa tag statistics
sketch.txt mash-type sketch results scanned against nt, refseq, silva database sketches
spikein.fq.gz detected spike-in kapa tag sequence reads
status.log rqcfilter2.sh running log
synth1.fq.gz detected synthetic molecule (phix, linker, lambda, pJET) sequence reads
synth2.fq.gz detected synthetic molecule (short contamination) sequence reads
**Long Reads**
nmdc_xxxxxxx_filtered.fastq.gz main output (clean data)
nmdc_xxxxxxx_pbmarkdupStats.txt statistics from the :literal:`pbmarkdup` duplicate removal
nmdc_xxxxxxx_readsQC.info summary of parameters and tools used in QC
nmdc_xxxxxxx_bbdukEndsStats.json :literal:`JSON` statistics from :literal:`bbduk` adapter trimming on ends
nmdc_xxxxxxx_icecreamStats.json :literal:`JSON` statistics from inverted repeat filtering
nmdc_xxxxxxx_stats.json summary statistics of output bases, input reads, input bases, output reads
==================================== ============================================================================


Version History
---------------

- 1.0.12 (release date **09/30/2024**; previous versions: 1.0.11)
- 1.0.13 (release date **11/07/2024**; previous versions: 1.0.12)


Point of contact
Binary file modified docs/rqc_workflow.png