Merge pull request #8 from microbiomedata/develop

Update readme files
microbiomedata · Feb 18, 2021 · b5c7844 · b5c7844
2 parents 8aff2ea + 8ccc934
commit b5c7844
Show file tree

Hide file tree

Showing 2 changed files with 183 additions and 128 deletions.
diff --git a/README.rst b/README.rst
@@ -1,74 +1,92 @@
 The Read-based Analysis Workflow
 ================================
 
-Summary
--------
-
-The pipeline takes sequencing files (single- or paired-end) and profiles them using multiple taxonomic classification tools with the Cromwell as the workflow manager.
-
-Workflow Diagram
-----------------
-
 .. image:: docs/readbased_analysis_workflow.png
    :align: center
    :scale: 50%
 
+Workflow Overview
+-----------------
+The pipeline takes in sequencing files (single- or paired-end) and profiles them using multiple taxonomic classification tools with the Cromwell as the workflow manager.
+
+Workflow Availability
+---------------------
+The workflow is available in GitHub: https://github.com/microbiomedata/ReadbasedAnalysis; the corresponding Docker image is available in DockerHub: https://hub.docker.com/r/microbiomedata/nmdc_taxa_profilers
+
+Requirements for Execution:  
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+(recommendations are in **bold**)
+
+- WDL-capable Workflow Execution Tool (**Cromwell**)
+- Container Runtime that can load Docker images (**Docker v2.1.0.3 or higher**)
+
+Hardware Requirements:
+~~~~~~~~~~~~~~~~~~~~~~
+- Disk space: 152 GB for databases (55 GB, 89 GB, and 8 GB for GOTTCHA2, Kraken2 and Centrifuge databases, respectively)
+- 60 GB RAM
+
 Workflow Dependencies
 ---------------------
 
-Third party software
+Third party software:
+~~~~~~~~~~~~~~~~~~~~~
+
+(These are included in the Docker image.)
+
+- `GOTTCHA2 v2.1.6 <https://github.com/poeli/GOTTCHA2>`_  (License: `BSD-3-Clause-LANL <https://github.com/poeli/GOTTCHA2/blob/master/LICENSE>`_)
+- `Kraken2 v2.0.8 <http://ccb.jhu.edu/software/kraken2>`_ (License: `MIT <https://github.com/DerrickWood/kraken2/blob/master/LICENSE>`_)
+- `Centrifuge v1.0.4 <http://www.ccb.jhu.edu/software/centrifuge>`_ (License: `GPL-3 <https://github.com/DaehwanKimLab/centrifuge/blob/master/LICENSE>`_)
+
+Requisite databases:
 ~~~~~~~~~~~~~~~~~~~~
 
-- GOTTCHA2: 2.1.6 `(BSD-3-Clause-LANL) <https://github.com/poeli/GOTTCHA2/blob/master/LICENSE>`_
-- Kraken2: 2.0.8 `(MIT) <https://github.com/DerrickWood/kraken2/blob/master/LICENSE>`_
-- Centrifuge: 1.0.4 `(GPL-3) <https://github.com/DaehwanKimLab/centrifuge/blob/master/LICENSE>`_
+The database for each tool must be downloaded and installed. These databases total 152 GB.
+- GOTTCHA2 database (gottcha2/):
 
-Database 
-~~~~~~~~
+The database RefSeqr90.cg.BacteriaArchaeaViruses.species.fna contains complete genomes of bacteria, archaea and viruses from RefSeq Release 90. The following commands will download the database:
 
-Each profiling tool requires databases stored in sub-directories at `/global/cfs/projectdirs/m3408/aim2/database/`.
+::
 
-- GOTTCHA2 database (gottcha2/): The database `RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna` is built by complete genomes of bacteria, archaea and viruses from RefSeq Release 90.
-- Kraken2 database (kraken2/): This is a standard Kraken 2 database, built by NCBI RefSeq genomes.
-- Centrifuge database (centrifuge/)
+    wget https://edge-dl.lanl.gov/GOTTCHA2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
+    tar -xvf RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
+    rm RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
 
-Workflow Availability
----------------------
+- Kraken2 database (kraken2/):
+
+This is a standard Kraken 2 database, built from NCBI RefSeq genomes. The following commands will download the database:
 
-The workflow is available in GitHub:
-https://github.com/microbiomedata/ReadbasedAnalysis
+::
 
-The container is available at Docker Hub (microbiomedata/nmdc_taxa_profilers):
-https://hub.docker.com/r/microbiomedata/nmdc_taxa_profilers
+    mkdir kraken2
+    wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20201202.tar.gz
+    tar -xzvf k2_standard_20201202.tar.gz -C kraken2
+    rm k2_standard_20201202.tar.gz
 
+- Centrifuge database (centrifuge/):
 
-Running Workflow in Cromwell
-----------------------------
+This is a compressed database built from RefSeq genomes of Bacteria and Archaea. The following commands will download the database:
 
-Description of the files:
+::
 
-- `.wdl`: the WDL file for read-based analysis pipeline.
-- `.wdl`: the WDL file for tasks of each tool.
-- `.json`: the example inputs.json file for the pipeline.
-- `.conf`: the conf file for running cromwell.
-- `.job`: example sbatch file.
+    mkdir centrifuge
+    wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz 
+    tar -xzvf p_compressed_2018_4_15.tar.gz -C centrifuge
+    rm p_compressed_2018_4_15.tar.gz
 
-Test datasets
--------------
 
-Zymobiomics mock-community DNA control `(SRR7877884) <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_
+Sample dataset(s):
+~~~~~~~~~~~~~~~~~~
 
-Inputs
-~~~~~~
+Zymobiomics mock-community DNA control (SRR7877884); this dataset is ~7 GB.
 
-The input is a json file:
-
-- `ReadbasedAnalysis.enabled_tools`: set the value of the tool as `true` to enable different profiling tools
-- `ReadbasedAnalysis.db`: specify the path of the database
-- `ReadbasedAnalysis.reads`: specify the path of the reads
-- `ReadbasedAnalysis.prefix`: specify the prefix of output file names
-- `ReadbasedAnalysis.outdir`: specify the path of output directory
-- `ReadbasedAnalysis.cpu`: cpu numbers
+Input: A JSON file containing the following information:
+1. selection of profiling tools (set as true if selected)
+2. the paths to the required database(s) for the tools selected 
+3. the paths to the input fastq file(s) (paired-end data is shown; this can be the output of the Reads QC workflow in interleaved format which will be treated as single-end data.)
+4. the prefix for the output file names
+5. the path of the output directory
+6. CPU number requested for the run.
 
 .. code-block:: JSON
 
@@ -79,9 +97,9 @@ The input is a json file:
             "centrifuge": true
         },
         "ReadbasedAnalysis.db": {
-            "gottcha2": "/global/cfs/projectdirs/m3408/aim2/database/gottcha2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
-            "kraken2": "/global/cfs/projectdirs/m3408/aim2/database/kraken2/",
-            "centrifuge": "/global/cfs/projectdirs/m3408/aim2/database/centrifuge/p_compressed"
+            "gottcha2": "/path/to/database/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
+            "kraken2": " /path/to/kraken2",
+            "centrifuge": "/path/to/centrifuge/p_compressed"
         },
         "ReadbasedAnalysis.reads": [
             "/path/to/SRR7877884.1.fastq.gz",
@@ -90,41 +108,52 @@ The input is a json file:
         "ReadbasedAnalysis.paired": true,
         "ReadbasedAnalysis.prefix": "SRR7877884",
         "ReadbasedAnalysis.outdir": "/path/to/ReadbasedAnalysis",
-        "ReadbasedAnalysis.cpu": 8
+        "ReadbasedAnalysis.cpu": 4
     }
 
-
-Outputs
+Output:
 ~~~~~~~
 
-The workflow creates individual output directories for each tool, including classification results, logs.::
+The workflow creates an output JSON file and individual output sub-directories for each tool which include tabular classification results, a tabular report, and a Krona plot (html).::
 
     ReadbasedAnalysis/
+    |-- SRR7877884.json
     |-- centrifuge
-    |   |-- SRR7877884.classification.csv
-    |   |-- SRR7877884.krona.html
-    |   `-- SRR7877884.report.tsv
+    |   |-- SRR7877884.classification.tsv
+    |   |-- SRR7877884.report.tsv
+    |   `-- SRR7877884.krona.html
+    |   
     |-- gottcha2
     |   |-- SRR7877884.full.tsv
     |   |-- SRR7877884.krona.html
     |   `-- SRR7877884.tsv
+    |   
     `-- kraken2
-        |-- SRR7877884.classification.csv
+        |-- SRR7877884.classification.tsv
         |-- SRR7877884.krona.html
-        `-- SRR7877884.report.csv
+        `-- SRR7877884.report.tsv
+
 
+Below is an example of the output directory files with descriptions to the right.
 
-Requirements for Execution
---------------------------
+========================================  ==============================================
+FileName                                  Description
+----------------------------------------  ----------------------------------------------
+SRR7877884.json	                          ReadbasedAnalysis result JSON file
+centrifuge/SRR7877884.classification.tsv  Centrifuge output read classification TSV file
+centrifuge/SRR7877884.report.tsv          Centrifuge output report TSV file
+centrifuge/SRR7877884.krona.html          Centrifuge krona plot HTML file
+gottcha2/SRR7877884.full.tsv              GOTTCHA2 detail output TSV file
+gottcha2/SRR7877884.tsv                   GOTTCHA2 output report TSV file
+gottcha2/SRR7877884.krona.html            GOTTCHA2 krona plot HTML file
+kraken2/SRR7877884.classification.tsv     Kraken2 output read classification TSV file
+========================================  ==============================================
 
-- Docker or other Container Runtime
-- Cromwell or other WDL-capable Workflow Execution Tool
-- 50 GB RAM
 
 Version History
 ---------------
 
-- 0.0.1
+1.0.1 (release date 01/14/2021; previous versions: 1.0.0)
 
 Point of contact
 ----------------