Update Documentation #41

Merged
merged 5 commits into from Nov 18, 2024
68 changes: 38 additions & 30 deletions README.md
@@ -1,8 +1,13 @@
# The Data Preprocessing workflow
# The Data Preprocessing Workflow

## Summary

This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/data-preprocessing/) implemented at JGI for Illumina reads and use the program “rqcfilter2” from BBTools(38:96) which implements them as a pipeline.
This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/data-preprocessing/) implemented at JGI for Illumina reads.

This workflow utilizes the program `rqcfilter2` from BBTools to perform quality control on raw Illumina reads for **shortreads**. The workflow performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using `BBDuk`), and performs human/cat/dog/mouse/microbe removal (using `BBMap`).

This workflow performs quality control on long reads from PacBio. The workflow performs duplicate removal (using `pbmarkdup`), inverted repeat filtering (using BBTools
`icecreamfinder.sh`), adapter trimming, and final filtering of reads with residual adapter sequences (using `bbduk`). The workflow is designed to handle input files in various formats, including .bam, .fq, or .fq.gz.
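
For orientation, below is a minimal sketch of those long-read stages as standalone commands. It is an assumption-laden illustration, not the workflow's actual invocation: file names, the adapter reference, and the exact `pbmarkdup` flags are placeholders, and the WDL wires these steps together with additional options.

```bash
# Hypothetical long-read QC sketch; file names and flags are illustrative placeholders.
set -euo pipefail

# 1) Duplicate removal with pbmarkdup (--rmdup drops duplicates rather than marking them).
pbmarkdup --rmdup reads.ccs.fastq.gz dedup.fastq.gz

# 2) Inverted-repeat ("ice cream") filtering with BBTools icecreamfinder.sh.
icecreamfinder.sh in=dedup.fastq.gz out=noir.fastq.gz json=t

# 3) Adapter trimming, then filtering of reads with residual adapter sequence, with bbduk.
bbduk.sh in=noir.fastq.gz out=filtered.fastq.gz ref=adapters.fa k=20 mink=12 edist=1 ktrimtips=60
```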

## Required Database

@@ -17,33 +22,27 @@ This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-t
rm RQCFilterData.tgz
```
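
If fetching the database manually, the full sequence is roughly as follows; the download URL is an assumption (take the authoritative one from the doc site), and only the extraction/cleanup steps appear in this README:

```bash
# Sketch of the database download; the URL is assumed, confirm it on the doc site.
mkdir -p refdata && cd refdata
wget https://portal.nersc.gov/cfs/m3408/db/RQCFilterData.tgz   # large download
tar -xzf RQCFilterData.tgz
rm RQCFilterData.tgz
```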

## Running Workflow in Cromwell

Description of the files:
- `.wdl` file: the WDL file for workflow definition
- `.json` file: the example input for the workflow
- `.conf` file: the conf file for running Cromwell.
- `.sh` file: the shell script for running the example workflow
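
Assuming the repository's files keep the roles listed above, a typical local Cromwell invocation would look like this (the jar version and file names are placeholders):

```bash
# Placeholder names; substitute the actual .wdl/.json/.conf files from this repository.
java -Dconfig.file=cromwell.conf -jar cromwell-86.jar run rqcfilter.wdl -i input.json
```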

## The Docker image and Dockerfile can be found here

[microbiomedata/bbtools:38.92](https://hub.docker.com/r/microbiomedata/bbtools)
[microbiomedata/bbtools:38.96](https://hub.docker.com/r/microbiomedata/bbtools)

## Input files

1. database path,
2. fastq (illumina paired-end interleaved fastq),
3. project name
4. resource where the workflow is run
5. informed_by
1. the path to the interleaved fastq file (longreads and shortreads)
2. forward reads fastq file (when input_interleaved is false)
3. reverse reads fastq file (when input_interleaved is false)
4. project id
5. if the input is interleaved (boolean)
6. if the input is shortreads (boolean)

```
{
"nmdc_rqcfilter.database": "/global/cfs/projectdirs/m3408/aim2/database",
"nmdc_rqcfilter.input_files": "/global/cfs/cdirs/m3408/ficus/8434.3.102077.AGTTCC.fastq.gz",
"nmdc_rqcfilter.proj":"nmdc:xxxxxxx",
"nmdc_rqcfilter.resouce":"NERSC -- perlmutter",
"nmdc_rqcfilter.informed_by": "nmdc:xxxxxxxx"
"rqcfilter.input_files": ["https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"],
"rqcfilter.input_fq1": [],
"rqcfilter.input_fq2": [],
"rqcfilter.proj": "nmdc:xxxxxxx",
"rqcfilter.interleaved": true,
"rqcfilter.shortRead": true
}
```
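
Before submitting, the inputs file can be checked against the WDL with Cromwell's `womtool` (a sketch; the jar version and file names are placeholders):

```bash
# Validate workflow syntax and the inputs JSON with womtool; names are placeholders.
java -jar womtool-86.jar validate rqcfilter.wdl -i input.json
```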

@@ -54,13 +53,22 @@ The output will have one directory named by prefix of the fastq input file and a
The main QC fastq output is named prefix.anqdpht.fastq.gz.

```
|-- 8434.1.102069.ACAGTG.anqdpht.fastq.gz
|-- filterStats.txt
|-- filterStats.json
|-- filterStats2.txt
|-- adaptersDetected.fa
|-- reproduce.sh
|-- spikein.fq.gz
|-- status.log
|-- ...
# Short Reads
output/
├── nmdc_xxxxxxx_filtered.fastq.gz
├── nmdc_xxxxxxx_filterStats.txt
├── nmdc_xxxxxxx_filterStats2.txt
├── nmdc_xxxxxxx_readsQC.info
└── nmdc_xxxxxxx_qa_stats.json
# Long Reads
output/
├── nmdc_xxxxxxx_pbmarkdupStats.txt
├── nmdc_xxxxxxx_readsQC.info
├── nmdc_xxxxxxx_bbdukEndsStats.json
├── nmdc_xxxxxxx_icecreamStats.json
├── nmdc_xxxxxxx_filtered.fastq.gz
└── nmdc_xxxxxxx_stats.json
```

## Link to Doc Site
Please refer to the [doc site](https://nmdc-workflow-documentation.readthedocs.io/en/latest/chapters/1_RQC_index.html) for more information.
126 changes: 75 additions & 51 deletions docs/index.rst
@@ -1,4 +1,4 @@
Reads QC Workflow (v1.0.12)
Reads QC Workflow (v1.0.13)
=============================

.. image:: rqc_workflow.png
@@ -9,9 +9,12 @@ Reads QC Workflow (v1.0.12)
Workflow Overview
-----------------

This workflow utilizes the program “rqcfilter2” from BBTools to perform quality control on raw Illumina reads for **shortreads** and raw PacBio reads for **longreads**. The workflow performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using BBDuk), and performs human/cat/dog/mouse/microbe removal (using BBMap).
**Short Reads:**

This workflow utilizes the program :literal:`rqcfilter2` from BBTools to perform quality control on raw Illumina reads for **shortreads**. The workflow performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using :literal:`BBDuk`), and performs human/cat/dog/mouse/microbe removal (using :literal:`BBMap`).

The following parameters are used for :literal:`rqcfilter2` in this workflow::

The following parameters are used for "rqcfilter2" in this workflow::
- qtrim=r : Quality-trim from right ends before mapping.
- trimq=0 : Trim quality threshold.
- maxns=3 : Reads with more Ns than this will be discarded.
@@ -32,13 +35,32 @@ The following parameters are used for "rqcfilter2" in this workflow::
- trimfragadapter=true: Trim all known Illumina adapter sequences, including TruSeq and Nextera.
- removemicrobes=true : Remove common contaminant microbial reads via mapping, and place them in a separate file.
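
Put together, a bare-bones :literal:`rqcfilter2.sh` call using the flags above would look roughly like the following sketch; the input, output directory, and database paths are placeholders, and the actual WDL task passes additional options:

.. code-block:: bash

   # Sketch only: paths are placeholders and the workflow adds further flags.
   rqcfilter2.sh in=reads.fastq.gz path=out_dir rqcfilterdata=/path/to/RQCFilterData \
       qtrim=r trimq=0 maxns=3 trimfragadapter=t removemicrobes=t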

**Long Reads:**

This workflow performs quality control on long reads from PacBio. The workflow performs duplicate removal (using :literal:`pbmarkdup`), inverted repeat filtering (using BBTools
:literal:`icecreamfinder.sh`), adapter trimming, and final filtering of reads with residual adapter sequences (using :literal:`bbduk`). The workflow is designed to handle input files in various formats, including .bam, .fq, or .fq.gz.

The following parameters are used for each stage in the workflow::

- rmdup=true : Enables duplicate removal in the initial filtering.
- k=20, mink=12 : K-mer sizes for adapter detection and trimming.
- edist=1 : Error distance for k-mer matches.
- ktrimtips=60 : Trims adapters from the ends of reads.
- phix=true : Removes reads containing PhiX sequences.
- json=true : Outputs statistics in JSON format for easier parsing.
- chastityfilter=true : Removes reads failing the chastity filter.
- removehuman=true : Removes human reads in contamination analysis (optional).
- removemicrobes=true : Removes common microbial contaminants.
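
As a rough illustration of where these flags land, the adapter-trimming stage alone might be invoked as below; the file names and adapter reference are placeholder assumptions:

.. code-block:: bash

   # Sketch of the bbduk end-trimming stage; names are placeholders.
   bbduk.sh in=noir.fastq.gz out=trimmed.fastq.gz ref=adapters.fa \
       k=20 mink=12 edist=1 ktrimtips=60 json=t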


Workflow Availability
---------------------

The workflow from GitHub uses all the listed docker images to run all third-party tools.
The workflow is available in GitHub: https://github.com/microbiomedata/ReadsQC; the corresponding
Docker image is available in DockerHub: https://hub.docker.com/r/microbiomedata/bbtools.

The workflow is available in GitHub: https://github.com/microbiomedata/ReadsQC

The Docker image is available in DockerHub: https://hub.docker.com/r/microbiomedata/bbtools.

Requirements for Execution
--------------------------
@@ -77,10 +99,10 @@ The following commands will download the database::

Sample dataset(s)
-----------------
**Short Reads:**

- small dataset: `Ecoli 10x <https://portal.nersc.gov/cfs/m3408/test_data/ReadsQC_small_test_data.tgz>`_ . You can find input/output in the downloaded tar gz file.

- large dataset: Zymobiomics mock-community DNA control (`SRR7877884 <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_); the `original gzipped dataset <https://portal.nersc.gov/cfs/m3408/test_data/ReadsQC_large_test_data.tgz>`_ is ~5.7 GB. You can find input/output in the downloaded tar gz file.
- small dataset: `Ecoli 10x <https://portal.nersc.gov/cfs/m3408/test_data/ReadsQC_small_test_data.tgz>`_. (Input/output included in tar.gz file).
- large dataset: Zymobiomics mock-community DNA control (`SRR7877884 <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_); the `original gzipped dataset <https://portal.nersc.gov/cfs/m3408/test_data/ReadsQC_large_test_data.tgz>`_ is ~5.7 GB. (Input/output included in tar.gz file). For testing purposes and for the following examples, we used a 10% sub-sampling of the above dataset: `SRR7877884-int-0.1.fastq.gz <https://portal.nersc.gov/cfs/m3408/test_data/SRR7877884-int-0.1.fastq.gz>`_. This dataset is already interleaved.


.. note::
@@ -90,20 +112,22 @@ Sample dataset(s)
.. code-block:: bash

paste <(zcat SRR7877884_1.fastq.gz | paste - - - -) <(zcat SRR7877884_2.fastq.gz | paste - - - -) | tr '\t' '\n' | gzip -c > SRR7877884-int.fastq.gz

For testing purposes and for the following examples, we used a 10% sub-sampling of the above dataset: `SRR7877884-int-0.1.fastq.gz <https://portal.nersc.gov/cfs/m3408/test_data/SRR7877884-int-0.1.fastq.gz>`_. This dataset is already interleaved.
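
If BBTools is available, :literal:`reformat.sh` interleaves the same pair in one step (an alternative sketch, not part of the workflow itself):

.. code-block:: bash

   # Interleave paired FASTQs with BBTools reformat.sh, equivalent to the paste one-liner above.
   reformat.sh in1=SRR7877884_1.fastq.gz in2=SRR7877884_2.fastq.gz out=SRR7877884-int.fastq.gz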

Inputs
**Long Reads:**

Zymobiomics synthetic metagenome (`SRR13128014 <https://portal.nersc.gov/project/m3408//test_data/SRR13128014.pacbio.subsample.ccs.fastq.gz>`_). For testing we have subsampled the dataset; the original dataset is ~18 GB.

Input
------

A JSON file containing the following information:
A `JSON file <https://github.com/microbiomedata/ReadsQC/blob/documentation/input.json>`_ containing the following information:

1. the path to the interleaved fastq file (longreads and shortreads)
2. forward reads fastq file (when input_interleaved is false)
3. reverse reads fastq file (when input_interleaved is false)
4. project id
5. if the input is interleaved (boolean)
6. if the input is shortreads (boolean)
1. the path to the interleaved fastq file (longreads and shortreads)
2. forward reads fastq file (when input_interleaved is false)
3. reverse reads fastq file (when input_interleaved is false)
4. project id
5. if the input is interleaved (boolean)
6. if the input is shortreads (boolean)


An example input JSON file is shown below:
@@ -124,15 +148,35 @@

In an HPC environment, parallel processing allows multiple samples, both interleaved and non-interleaved, to be processed at once for **shortreads**. The "rqcfilter.input_files" parameter is an array; multiple sample files can be supplied as input, separated by commas (,).

Ex: **Interleaved**: "rqcfilter.input_files":[“first-int.fastq”,”second-int.fastq”]; **Non-Interleaved**: "rqcfilter.input_fq1": [“first-int-R1.fastq”,”second-int-R1.fastq”], "rqcfilter.input_fq2": [“first-int-R2.fastq”,”second-int-R2.fastq”]
Example:
**Interleaved**: "rqcfilter.input_files": ["first-int.fastq", "second-int.fastq"]

**Non-Interleaved**: "rqcfilter.input_fq1": ["first-int-R1.fastq", "second-int-R1.fastq"], "rqcfilter.input_fq2": ["first-int-R2.fastq", "second-int-R2.fastq"]
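
Concretely, a two-sample non-interleaved run could use an inputs file like the sketch below (written as a shell heredoc to keep it self-contained; the sample paths and nmdc id are placeholders):

.. code-block:: bash

   # Hypothetical two-sample, non-interleaved inputs file; ids and paths are placeholders.
   cat > inputs.json <<'EOF'
   {
     "rqcfilter.input_files": [],
     "rqcfilter.input_fq1": ["first-R1.fastq", "second-R1.fastq"],
     "rqcfilter.input_fq2": ["first-R2.fastq", "second-R2.fastq"],
     "rqcfilter.proj": "nmdc:xxxxxxx",
     "rqcfilter.interleaved": false,
     "rqcfilter.shortRead": true
   }
   EOF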


Output
------

A directory named with the prefix of the FASTQ input file will be created and multiple output files are generated; the main QC FASTQ output is named prefix.anqdpht.fastq.gz. Using the dataset above as an example, the main output would be named SRR7877884-int-0.1.anqdpht.fastq.gz. Other files include statistics on the quality of the data; what was trimmed, detected, and filtered in the data; a status log, and a shell script documenting the steps implemented so the workflow can be reproduced.
The output directory will contain the following files for short reads::

output/
├── nmdc_xxxxxxx_filtered.fastq.gz
├── nmdc_xxxxxxx_filterStats.txt
├── nmdc_xxxxxxx_filterStats2.txt
├── nmdc_xxxxxxx_readsQC.info
└── nmdc_xxxxxxx_qa_stats.json

An example output txt file (filterStats.txt) is shown below:
The output directory will contain the following files for long reads::

output/
├── nmdc_xxxxxxx_pbmarkdupStats.txt
├── nmdc_xxxxxxx_readsQC.info
├── nmdc_xxxxxxx_bbdukEndsStats.json
├── nmdc_xxxxxxx_icecreamStats.json
├── nmdc_xxxxxxx_filtered.fastq.gz
└── nmdc_xxxxxxx_stats.json

An example output txt file (:literal:`filterStats.txt`) for short reads is shown below:

.. code-block:: text

@@ -156,46 +200,26 @@ Below is an example of all the output directory files with descriptions to the right
==================================== ============================================================================
FileName Description
==================================== ============================================================================
**Short Reads**
nmdc_xxxxxxx_filtered.fastq.gz main output (clean data)
nmdc_xxxxxxx_filterStats.txt summary statistics
nmdc_xxxxxxx_filterStats2.txt more detailed summary statistics
nmdc_xxxxxxx_readsQC.info summary of parameters used in BBTools rqcfilter2
nmdc_xxxxxxx_readsQC.info summary of parameters used in :literal:`BBTools rqcfilter2`
nmdc_xxxxxxx_qa_stats.json summary statistics of output bases, input reads, input bases, output reads

adaptersDetected.fa adapters detected and removed
bhist.txt base composition histogram by position
cardinality.txt estimation of the number of unique kmers
commonMicrobes.txt detected common microbes
file-list.txt output file list for rqcfilter2.sh
gchist.txt GC content histogram
human.fq.gz detected human sequence reads
ihist_merge.txt insert size histogram
khist.txt kmer-frequency histogram
kmerStats1.txt synthetic molecule (phix, linker, lambda, pJET) filter run log
kmerStats2.txt synthetic molecule (short contamination) filter run log
ktrim_kmerStats1.txt detected adapters filter run log
ktrim_scaffoldStats1.txt detected adapters filter statistics
microbes.fq.gz detected common microbes sequence reads
microbesUsed.txt common microbes list for detection
peaks.txt number of unique kmers in each peak on the histogram
phist.txt polymer length histogram
refStats.txt human reads filter statistics
reproduce.sh the shell script to reproduce the run
scaffoldStats1.txt detected synthetic molecule (phix, linker, lambda, pJET) statistics
scaffoldStats2.txt detected synthetic molecule (short contamination) statistics
scaffoldStatsSpikein.txt detected spike-in kapa tag statistics
sketch.txt mash-type sketch results scanned against nt, refseq, silva database sketches
spikein.fq.gz detected spike-in kapa tag sequence reads
status.log rqcfilter2.sh running log
synth1.fq.gz detected synthetic molecule (phix, linker, lambda, pJET) sequence reads
synth2.fq.gz detected synthetic molecule (short contamination) sequence reads
**Long Reads**
nmdc_xxxxxxx_filtered.fastq.gz main output (clean data)
nmdc_xxxxxxx_pbmarkdupStats.txt statistics from the :literal:`pbmarkdup` duplicate removal
nmdc_xxxxxxx_readsQC.info summary of parameters and tools used in QC
nmdc_xxxxxxx_bbdukEndsStats.json :literal:`JSON` statistics from :literal:`bbduk` adapter trimming on ends
nmdc_xxxxxxx_icecreamStats.json :literal:`JSON` statistics from inverted repeat filtering
nmdc_xxxxxxx_stats.json summary statistics of output bases, input reads, input bases, output reads
==================================== ============================================================================


Version History
---------------

- 1.0.12 (release date **09/30/2024**; previous versions: 1.0.11)
- 1.0.13 (release date **11/07/2024**; previous versions: 1.0.12)


Point of contact
Binary file modified docs/rqc_workflow.png