diff --git a/README.rst b/README.rst
index 06000db..bd7a6c7 100644
--- a/README.rst
+++ b/README.rst
@@ -1,74 +1,92 @@
 The Read-based Analysis Workflow
 ================================
-Summary
--------
-
-The pipeline takes sequencing files (single- or paired-end) and profiles them using multiple taxonomic classification tools with the Cromwell as the workflow manager.
-
-Workflow Diagram
-----------------
-
 .. image:: docs/readbased_analysis_workflow.png
    :align: center
    :scale: 50%
+Workflow Overview
+-----------------
+The pipeline takes in sequencing files (single- or paired-end) and profiles them using multiple taxonomic classification tools with Cromwell as the workflow manager.
+
+Workflow Availability
+---------------------
+The workflow is available in GitHub: https://github.com/microbiomedata/ReadbasedAnalysis; the corresponding Docker image is available in DockerHub: https://hub.docker.com/r/microbiomedata/nmdc_taxa_profilers
+
+Requirements for Execution:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+(recommendations are in **bold**)
+
+- WDL-capable Workflow Execution Tool (**Cromwell**)
+- Container Runtime that can load Docker images (**Docker v2.1.0.3 or higher**)
+
+Hardware Requirements:
+~~~~~~~~~~~~~~~~~~~~~~
+- Disk space: 152 GB for databases (55 GB, 89 GB, and 8 GB for GOTTCHA2, Kraken2 and Centrifuge databases, respectively)
+- 60 GB RAM
+
 Workflow Dependencies
 ---------------------
-Third party software
+Third party software:
+~~~~~~~~~~~~~~~~~~~~~
+
+(These are included in the Docker image.)
+
+- `GOTTCHA2 v2.1.6 `_ (License: `BSD-3-Clause-LANL `_)
+- `Kraken2 v2.0.8 `_ (License: `MIT `_)
+- `Centrifuge v1.0.4 `_ (License: `GPL-3 `_)
+
+Requisite databases:
 ~~~~~~~~~~~~~~~~~~~~
-- GOTTCHA2: 2.1.6 `(BSD-3-Clause-LANL) `_
-- Kraken2: 2.0.8 `(MIT) `_
-- Centrifuge: 1.0.4 `(GPL-3) `_
+The database for each tool must be downloaded and installed. These databases total 152 GB.
+- GOTTCHA2 database (gottcha2/):
-Database
-~~~~~~~~
+The database RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna contains complete genomes of bacteria, archaea and viruses from RefSeq Release 90. The following commands will download the database:
-Each profiling tool requires databases stored in sub-directories at `/global/cfs/projectdirs/m3408/aim2/database/`.
+::
-- GOTTCHA2 database (gottcha2/): The database `RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna` is built by complete genomes of bacteria, archaea and viruses from RefSeq Release 90.
-- Kraken2 database (kraken2/): This is a standard Kraken 2 database, built by NCBI RefSeq genomes.
-- Centrifuge database (centrifuge/)
+    wget https://edge-dl.lanl.gov/GOTTCHA2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
+    tar -xvf RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
+    rm RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
-Workflow Availability
----------------------
+- Kraken2 database (kraken2/):
+
+This is a standard Kraken 2 database, built from NCBI RefSeq genomes. The following commands will download the database:
-The workflow is available in GitHub:
-https://github.com/microbiomedata/ReadbasedAnalysis
+::
-The container is available at Docker Hub (microbiomedata/nmdc_taxa_profilers):
-https://hub.docker.com/r/microbiomedata/nmdc_taxa_profilers
+    mkdir kraken2
+    wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20201202.tar.gz
+    tar -xzvf k2_standard_20201202.tar.gz -C kraken2
+    rm k2_standard_20201202.tar.gz
+- Centrifuge database (centrifuge/):
-Running Workflow in Cromwell
-----------------------------
+This is a compressed database built from RefSeq genomes of Bacteria and Archaea. The following commands will download the database:
-Description of the files:
+::
-- `.wdl`: the WDL file for read-based analysis pipeline.
-- `.wdl`: the WDL file for tasks of each tool.
-- `.json`: the example inputs.json file for the pipeline.
-- `.conf`: the conf file for running cromwell.
-- `.job`: example sbatch file.
+    mkdir centrifuge
+    wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz
+    tar -xzvf p_compressed_2018_4_15.tar.gz -C centrifuge
+    rm p_compressed_2018_4_15.tar.gz
-Test datasets
--------------
-Zymobiomics mock-community DNA control `(SRR7877884) `_
+Sample dataset(s):
+~~~~~~~~~~~~~~~~~~
-Inputs
-~~~~~~
+Zymobiomics mock-community DNA control (SRR7877884); this dataset is ~7 GB.
-The input is a json file:
-
-- `ReadbasedAnalysis.enabled_tools`: set the value of the tool as `true` to enable different profiling tools
-- `ReadbasedAnalysis.db`: specify the path of the database
-- `ReadbasedAnalysis.reads`: specify the path of the reads
-- `ReadbasedAnalysis.prefix`: specify the prefix of output file names
-- `ReadbasedAnalysis.outdir`: specify the path of output directory
-- `ReadbasedAnalysis.cpu`: cpu numbers
+Input: A JSON file containing the following information:
+1. selection of profiling tools (set as true if selected)
+2. the paths to the required database(s) for the tools selected
+3. the paths to the input fastq file(s) (paired-end data is shown; this can be the output of the Reads QC workflow in interleaved format which will be treated as single-end data.)
+4. the prefix for the output file names
+5. the path of the output directory
+6. CPU number requested for the run.
 .. code-block:: JSON
@@ -79,9 +97,9 @@ The input is a json file:
         "centrifuge": true
       },
       "ReadbasedAnalysis.db": {
-        "gottcha2": "/global/cfs/projectdirs/m3408/aim2/database/gottcha2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
-        "kraken2": "/global/cfs/projectdirs/m3408/aim2/database/kraken2/",
-        "centrifuge": "/global/cfs/projectdirs/m3408/aim2/database/centrifuge/p_compressed"
+        "gottcha2": "/path/to/database/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
+        "kraken2": "/path/to/kraken2",
+        "centrifuge": "/path/to/centrifuge/p_compressed"
       },
       "ReadbasedAnalysis.reads": [
         "/path/to/SRR7877884.1.fastq.gz",
@@ -90,41 +108,52 @@ The input is a json file:
       "ReadbasedAnalysis.paired": true,
       "ReadbasedAnalysis.prefix": "SRR7877884",
       "ReadbasedAnalysis.outdir": "/path/to/ReadbasedAnalysis",
-      "ReadbasedAnalysis.cpu": 8
+      "ReadbasedAnalysis.cpu": 4
     }
-
-Outputs
+Output:
 ~~~~~~~
-The workflow creates individual output directories for each tool, including classification results, logs.::
+The workflow creates an output JSON file and individual output sub-directories for each tool which include tabular classification results, a tabular report, and a Krona plot (html).::
     ReadbasedAnalysis/
+    |-- SRR7877884.json
     |-- centrifuge
-    |   |-- SRR7877884.classification.csv
-    |   |-- SRR7877884.krona.html
-    |   `-- SRR7877884.report.tsv
+    |   |-- SRR7877884.classification.tsv
+    |   |-- SRR7877884.report.tsv
+    |   `-- SRR7877884.krona.html
+    |
     |-- gottcha2
     |   |-- SRR7877884.full.tsv
     |   |-- SRR7877884.krona.html
     |   `-- SRR7877884.tsv
+    |
     `-- kraken2
-        |-- SRR7877884.classification.csv
+        |-- SRR7877884.classification.tsv
         |-- SRR7877884.krona.html
-        `-- SRR7877884.report.csv
+        `-- SRR7877884.report.tsv
+
+Below is an example of the output directory files with descriptions to the right.
-Requirements for Execution
---------------------------
+======================================== ==============================================
+FileName                                 Description
+---------------------------------------- ----------------------------------------------
+SRR7877884.json                          ReadbasedAnalysis result JSON file
+centrifuge/SRR7877884.classification.tsv Centrifuge output read classification TSV file
+centrifuge/SRR7877884.report.tsv         Centrifuge output report TSV file
+centrifuge/SRR7877884.krona.html         Centrifuge krona plot HTML file
+gottcha2/SRR7877884.full.tsv             GOTTCHA2 detail output TSV file
+gottcha2/SRR7877884.tsv                  GOTTCHA2 output report TSV file
+gottcha2/SRR7877884.krona.html           GOTTCHA2 krona plot HTML file
+kraken2/SRR7877884.classification.tsv    Kraken2 output read classification TSV file
+======================================== ==============================================
-- Docker or other Container Runtime
-- Cromwell or other WDL-capable Workflow Execution Tool
-- 50 GB RAM
 Version History
 ---------------
-- 0.0.1
+1.0.1 (release date 01/14/2021; previous versions: 1.0.0)
 Point of contact
 ----------------
diff --git a/docs/index.rst b/docs/index.rst
index 266401f..91988b6 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,74 +1,92 @@
 The Read-based Analysis Workflow
 ================================
-Summary
--------
-
-The pipeline takes sequencing files (single- or paired-end) and profiles them using multiple taxonomic classification tools with the Cromwell as the workflow manager.
-
-Workflow Diagram
-----------------
-
 .. image:: readbased_analysis_workflow.png
    :align: center
    :scale: 50%
+Workflow Overview
+-----------------
+The pipeline takes in sequencing files (single- or paired-end) and profiles them using multiple taxonomic classification tools with Cromwell as the workflow manager.
+
+Workflow Availability
+---------------------
+The workflow is available in GitHub: https://github.com/microbiomedata/ReadbasedAnalysis; the corresponding Docker image is available in DockerHub: https://hub.docker.com/r/microbiomedata/nmdc_taxa_profilers
+
+Requirements for Execution:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+(recommendations are in **bold**)
+
+- WDL-capable Workflow Execution Tool (**Cromwell**)
+- Container Runtime that can load Docker images (**Docker v2.1.0.3 or higher**)
+
+Hardware Requirements:
+~~~~~~~~~~~~~~~~~~~~~~
+- Disk space: 152 GB for databases (55 GB, 89 GB, and 8 GB for GOTTCHA2, Kraken2 and Centrifuge databases, respectively)
+- 60 GB RAM
+
 Workflow Dependencies
 ---------------------
-Third party software
+Third party software:
+~~~~~~~~~~~~~~~~~~~~~
+
+(These are included in the Docker image.)
+
+- `GOTTCHA2 v2.1.6 `_ (License: `BSD-3-Clause-LANL `_)
+- `Kraken2 v2.0.8 `_ (License: `MIT `_)
+- `Centrifuge v1.0.4 `_ (License: `GPL-3 `_)
+
+Requisite databases:
 ~~~~~~~~~~~~~~~~~~~~
-- GOTTCHA2: 2.1.6 `(BSD-3-Clause-LANL) `_
-- Kraken2: 2.0.8 `(MIT) `_
-- Centrifuge: 1.0.4 `(GPL-3) `_
+The database for each tool must be downloaded and installed. These databases total 152 GB.
+- GOTTCHA2 database (gottcha2/):
-Database
-~~~~~~~~
+The database RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna contains complete genomes of bacteria, archaea and viruses from RefSeq Release 90. The following commands will download the database:
-Each profiling tool requires databases stored in sub-directories at `/global/cfs/projectdirs/m3408/aim2/database/`.
+::
-- GOTTCHA2 database (gottcha2/): The database `RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna` is built by complete genomes of bacteria, archaea and viruses from RefSeq Release 90.
-- Kraken2 database (kraken2/): This is a standard Kraken 2 database, built by NCBI RefSeq genomes.
-- Centrifuge database (centrifuge/)
+    wget https://edge-dl.lanl.gov/GOTTCHA2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
+    tar -xvf RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
+    rm RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
-Workflow Availability
----------------------
+- Kraken2 database (kraken2/):
-The workflow is available in GitHub:
-https://github.com/microbiomedata/ReadbasedAnalysis
+This is a standard Kraken 2 database, built from NCBI RefSeq genomes. The following commands will download the database:
-The container is available at Docker Hub (microbiomedata/nmdc_taxa_profilers):
-https://hub.docker.com/r/microbiomedata/nmdc_taxa_profilers
+::
+    mkdir kraken2
+    wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20201202.tar.gz
+    tar -xzvf k2_standard_20201202.tar.gz -C kraken2
+    rm k2_standard_20201202.tar.gz
-Running Workflow in Cromwell
-----------------------------
+- Centrifuge database (centrifuge/):
-Description of the files:
+This is a compressed database built from RefSeq genomes of Bacteria and Archaea. The following commands will download the database:
-- `.wdl`: the WDL file for read-based analysis pipeline.
-- `.wdl`: the WDL file for tasks of each tool.
-- `.json`: the example inputs.json file for the pipeline.
-- `.conf`: the conf file for running cromwell.
-- `.job`: example sbatch file.
+::
-Test datasets
--------------
+    mkdir centrifuge
+    wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz
+    tar -xzvf p_compressed_2018_4_15.tar.gz -C centrifuge
+    rm p_compressed_2018_4_15.tar.gz
-Zymobiomics mock-community DNA control `(SRR7877884) `_
-Inputs
-~~~~~~
+Sample dataset(s):
+~~~~~~~~~~~~~~~~~~
-The input is a json file:
-
-- `ReadbasedAnalysis.enabled_tools`: set the value of the tool as `true` to enable different profiling tools
-- `ReadbasedAnalysis.db`: specify the path of the database
-- `ReadbasedAnalysis.reads`: specify the path of the reads
-- `ReadbasedAnalysis.prefix`: specify the prefix of output file names
-- `ReadbasedAnalysis.outdir`: specify the path of output directory
-- `ReadbasedAnalysis.cpu`: cpu numbers
+Zymobiomics mock-community DNA control (SRR7877884); this dataset is ~7 GB.
+
+Input: A JSON file containing the following information:
+1. selection of profiling tools (set as true if selected)
+2. the paths to the required database(s) for the tools selected
+3. the paths to the input fastq file(s) (paired-end data is shown; this can be the output of the Reads QC workflow in interleaved format which will be treated as single-end data.)
+4. the prefix for the output file names
+5. the path of the output directory
+6. CPU number requested for the run.
 .. code-block:: JSON
@@ -79,9 +97,9 @@ The input is a json file:
         "centrifuge": true
       },
       "ReadbasedAnalysis.db": {
-        "gottcha2": "/global/cfs/projectdirs/m3408/aim2/database/gottcha2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
-        "kraken2": "/global/cfs/projectdirs/m3408/aim2/database/kraken2/",
-        "centrifuge": "/global/cfs/projectdirs/m3408/aim2/database/centrifuge/p_compressed"
+        "gottcha2": "/path/to/database/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
+        "kraken2": "/path/to/kraken2",
+        "centrifuge": "/path/to/centrifuge/p_compressed"
       },
       "ReadbasedAnalysis.reads": [
         "/path/to/SRR7877884.1.fastq.gz",
@@ -93,41 +111,49 @@ The input is a json file:
       "ReadbasedAnalysis.cpu": 4
     }
-
-Outputs
+Output:
 ~~~~~~~
-The workflow creates individual output directories for each tool, including classification results, logs.::
+The workflow creates an output JSON file and individual output sub-directories for each tool which include tabular classification results, a tabular report, and a Krona plot (html).::
     ReadbasedAnalysis/
+    |-- SRR7877884.json
     |-- centrifuge
-    |   |-- SRR7877884.classification.csv
-    |   |-- SRR7877884.kreport.csv
-    |   |-- SRR7877884.krona.html
-    |   `-- SRR7877884.tsv
+    |   |-- SRR7877884.classification.tsv
+    |   |-- SRR7877884.report.tsv
+    |   `-- SRR7877884.krona.html
+    |
     |-- gottcha2
     |   |-- SRR7877884.full.tsv
     |   |-- SRR7877884.krona.html
-    |   |-- SRR7877884.summary.tsv
     |   `-- SRR7877884.tsv
+    |
     `-- kraken2
-        |-- SRR7877884.classification.csv
+        |-- SRR7877884.classification.tsv
         |-- SRR7877884.krona.html
-        |-- SRR7877884.report.csv
-        `-- SRR7877884.tsv
+        `-- SRR7877884.report.tsv
-Requirements for Execution
---------------------------
+Below is an example of the output directory files with descriptions to the right.
+
+======================================== ==============================================
+FileName                                 Description
+---------------------------------------- ----------------------------------------------
+SRR7877884.json                          ReadbasedAnalysis result JSON file
+centrifuge/SRR7877884.classification.tsv Centrifuge output read classification TSV file
+centrifuge/SRR7877884.report.tsv         Centrifuge output report TSV file
+centrifuge/SRR7877884.krona.html         Centrifuge krona plot HTML file
+gottcha2/SRR7877884.full.tsv             GOTTCHA2 detail output TSV file
+gottcha2/SRR7877884.tsv                  GOTTCHA2 output report TSV file
+gottcha2/SRR7877884.krona.html           GOTTCHA2 krona plot HTML file
+kraken2/SRR7877884.classification.tsv    Kraken2 output read classification TSV file
+======================================== ==============================================
-- Docker or other Container Runtime
-- Cromwell or other WDL-capable Workflow Execution Tool
-- 60 GB RAM
 Version History
 ---------------
-- 1.0.0
+1.0.1 (release date 01/14/2021; previous versions: 1.0.0)
 Point of contact
 ----------------