Merge pull request #8 from microbiomedata/develop
Update readme files
poeli authored Feb 18, 2021
2 parents 8aff2ea + 8ccc934 commit b5c7844
Showing 2 changed files with 183 additions and 128 deletions.
155 changes: 92 additions & 63 deletions README.rst
@@ -1,74 +1,92 @@
The Read-based Analysis Workflow
================================

Summary
-------

The pipeline takes sequencing files (single- or paired-end) and profiles them using multiple taxonomic classification tools, with Cromwell as the workflow manager.

Workflow Diagram
----------------

.. image:: docs/readbased_analysis_workflow.png
   :align: center
   :scale: 50%

Workflow Overview
-----------------
The pipeline takes in sequencing files (single- or paired-end) and profiles them using multiple taxonomic classification tools, with Cromwell as the workflow manager.

Workflow Availability
---------------------
The workflow is available on GitHub: https://github.com/microbiomedata/ReadbasedAnalysis; the corresponding Docker image is available on Docker Hub: https://hub.docker.com/r/microbiomedata/nmdc_taxa_profilers

Requirements for Execution:
~~~~~~~~~~~~~~~~~~~~~~~~~~~

(recommendations are in **bold**)

- WDL-capable Workflow Execution Tool (**Cromwell**)
- Container Runtime that can load Docker images (**Docker v2.1.0.3 or higher**)

Hardware Requirements:
~~~~~~~~~~~~~~~~~~~~~~
- Disk space: 152 GB for databases (55 GB, 89 GB, and 8 GB for GOTTCHA2, Kraken2 and Centrifuge databases, respectively)
- 60 GB RAM
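
The disk figure above is the sum of the three database sizes (55 + 89 + 8 = 152 GB). The following is a minimal Python sketch of a pre-flight check against that figure. The function names `required_gb` and `has_room_for_databases` are hypothetical helpers for illustration and are not part of the workflow.

```python
import shutil

# Database sizes in GB, as listed in the hardware requirements above.
DB_SIZES_GB = {"gottcha2": 55, "kraken2": 89, "centrifuge": 8}


def required_gb(tools=None):
    """Sum the database sizes for the selected tools (all tools if None)."""
    selected = DB_SIZES_GB if tools is None else tools
    return sum(DB_SIZES_GB[name] for name in selected)


def has_room_for_databases(path=".", tools=None):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(tools)
```

Passing a subset such as `["kraken2"]` checks only the space needed for that tool's database.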

Workflow Dependencies
---------------------

Third party software
Third party software:
~~~~~~~~~~~~~~~~~~~~~

(These are included in the Docker image.)

- `GOTTCHA2 v2.1.6 <https://github.com/poeli/GOTTCHA2>`_ (License: `BSD-3-Clause-LANL <https://github.com/poeli/GOTTCHA2/blob/master/LICENSE>`_)
- `Kraken2 v2.0.8 <http://ccb.jhu.edu/software/kraken2>`_ (License: `MIT <https://github.com/DerrickWood/kraken2/blob/master/LICENSE>`_)
- `Centrifuge v1.0.4 <http://www.ccb.jhu.edu/software/centrifuge>`_ (License: `GPL-3 <https://github.com/DaehwanKimLab/centrifuge/blob/master/LICENSE>`_)

Requisite databases:
~~~~~~~~~~~~~~~~~~~~

- GOTTCHA2: 2.1.6 `(BSD-3-Clause-LANL) <https://github.com/poeli/GOTTCHA2/blob/master/LICENSE>`_
- Kraken2: 2.0.8 `(MIT) <https://github.com/DerrickWood/kraken2/blob/master/LICENSE>`_
- Centrifuge: 1.0.4 `(GPL-3) <https://github.com/DaehwanKimLab/centrifuge/blob/master/LICENSE>`_
The database for each tool must be downloaded and installed. These databases total 152 GB.
- GOTTCHA2 database (gottcha2/):

Database
~~~~~~~~
The database RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna contains complete genomes of bacteria, archaea and viruses from RefSeq Release 90. The following commands will download the database:

Each profiling tool requires databases stored in sub-directories at `/global/cfs/projectdirs/m3408/aim2/database/`.
::

- GOTTCHA2 database (gottcha2/): The database `RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna` is built by complete genomes of bacteria, archaea and viruses from RefSeq Release 90.
- Kraken2 database (kraken2/): This is a standard Kraken 2 database, built by NCBI RefSeq genomes.
- Centrifuge database (centrifuge/)
    wget https://edge-dl.lanl.gov/GOTTCHA2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
    tar -xvf RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
    rm RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar

Workflow Availability
---------------------
- Kraken2 database (kraken2/):

This is a standard Kraken 2 database, built from NCBI RefSeq genomes. The following commands will download the database:

The workflow is available in GitHub:
https://github.com/microbiomedata/ReadbasedAnalysis
::

The container is available at Docker Hub (microbiomedata/nmdc_taxa_profilers):
https://hub.docker.com/r/microbiomedata/nmdc_taxa_profilers
    mkdir kraken2
    wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20201202.tar.gz
    tar -xzvf k2_standard_20201202.tar.gz -C kraken2
    rm k2_standard_20201202.tar.gz

- Centrifuge database (centrifuge/):

Running Workflow in Cromwell
----------------------------
This is a compressed database built from RefSeq genomes of Bacteria and Archaea. The following commands will download the database:

Description of the files:
::

- `.wdl`: the WDL file for read-based analysis pipeline.
- `.wdl`: the WDL file for tasks of each tool.
- `.json`: the example inputs.json file for the pipeline.
- `.conf`: the conf file for running cromwell.
- `.job`: example sbatch file.
    mkdir centrifuge
    wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz
    tar -xzvf p_compressed_2018_4_15.tar.gz -C centrifuge
    rm p_compressed_2018_4_15.tar.gz
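
For setups where the unpacking step is scripted rather than typed by hand, the `mkdir` and `tar -x... -C <dir>` pattern used above can be sketched in Python with the standard `tarfile` module. This is an illustrative sketch only: the helper name `extract_database` is hypothetical, and it assumes the archive has already been downloaded.

```python
import tarfile
from pathlib import Path


def extract_database(archive, dest_dir):
    """Extract a .tar or .tar.gz archive into dest_dir, creating it if needed.

    Mirrors `mkdir <dir>` followed by `tar -xzvf <archive> -C <dir>`.
    Returns the sorted list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (none, gzip, ...) itself.
    with tarfile.open(archive, mode="r:*") as tar:
        names = sorted(tar.getnames())
        if hasattr(tarfile, "data_filter"):
            # Use the safer extraction filter where this Python supports it.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

The archive is left in place; removing it afterwards (the `rm` step above) is left to the caller.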

Test datasets
-------------

Zymobiomics mock-community DNA control `(SRR7877884) <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_
Sample dataset(s):
~~~~~~~~~~~~~~~~~~

Inputs
~~~~~~
Zymobiomics mock-community DNA control (SRR7877884); this dataset is ~7 GB.

The input is a json file:

- `ReadbasedAnalysis.enabled_tools`: set the value of the tool as `true` to enable different profiling tools
- `ReadbasedAnalysis.db`: specify the path of the database
- `ReadbasedAnalysis.reads`: specify the path of the reads
- `ReadbasedAnalysis.prefix`: specify the prefix of output file names
- `ReadbasedAnalysis.outdir`: specify the path of output directory
- `ReadbasedAnalysis.cpu`: cpu numbers
Input: A JSON file containing the following information:

1. Selection of profiling tools (set to true to enable a tool)
2. Paths to the required database(s) for the selected tools
3. Paths to the input FASTQ file(s). Paired-end data is shown in the example; interleaved output from the Reads QC workflow can also be used, but it is treated as single-end data.
4. Prefix for the output file names
5. Path of the output directory
6. Number of CPUs requested for the run

.. code-block:: JSON
@@ -79,9 +97,9 @@ The input is a json file:
"centrifuge": true
},
"ReadbasedAnalysis.db": {
"gottcha2": "/global/cfs/projectdirs/m3408/aim2/database/gottcha2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
"kraken2": "/global/cfs/projectdirs/m3408/aim2/database/kraken2/",
"centrifuge": "/global/cfs/projectdirs/m3408/aim2/database/centrifuge/p_compressed"
"gottcha2": "/path/to/database/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
"kraken2": "/path/to/kraken2",
"centrifuge": "/path/to/centrifuge/p_compressed"
},
"ReadbasedAnalysis.reads": [
"/path/to/SRR7877884.1.fastq.gz",
@@ -90,41 +108,52 @@ The input is a json file:
"ReadbasedAnalysis.paired": true,
"ReadbasedAnalysis.prefix": "SRR7877884",
"ReadbasedAnalysis.outdir": "/path/to/ReadbasedAnalysis",
"ReadbasedAnalysis.cpu": 8
"ReadbasedAnalysis.cpu": 4
}
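
An inputs file of this shape can also be generated programmatically. The sketch below is a minimal Python illustration that uses only the keys shown in the example above. The helper name `build_inputs` and its validation rules are assumptions made for illustration; the README does not say whether the workflow accepts a `db` map that omits disabled tools.

```python
import json

# Tool names as they appear under "ReadbasedAnalysis.enabled_tools".
TOOLS = ("gottcha2", "kraken2", "centrifuge")


def build_inputs(enabled, db_paths, reads, prefix, outdir, paired=True, cpu=4):
    """Assemble the inputs dictionary for the ReadbasedAnalysis workflow."""
    unknown = set(enabled) - set(TOOLS)
    if unknown:
        raise ValueError(f"unknown tool(s): {sorted(unknown)}")
    missing = [tool for tool in enabled if tool not in db_paths]
    if missing:
        raise ValueError(f"no database path given for: {missing}")
    return {
        "ReadbasedAnalysis.enabled_tools": {t: t in enabled for t in TOOLS},
        # Strip accidental surrounding whitespace from each path.
        "ReadbasedAnalysis.db": {t: p.strip() for t, p in db_paths.items()},
        "ReadbasedAnalysis.reads": list(reads),
        "ReadbasedAnalysis.paired": paired,
        "ReadbasedAnalysis.prefix": prefix,
        "ReadbasedAnalysis.outdir": outdir,
        "ReadbasedAnalysis.cpu": cpu,
    }


def write_inputs(inputs, path):
    """Write the inputs dictionary to `path` as indented JSON."""
    with open(path, "w") as handle:
        json.dump(inputs, handle, indent=2)
```

Failing early on a tool that is enabled without a database path is a design choice of this sketch, intended to catch the mistake before a run is submitted.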
Outputs
Output:
~~~~~~~

The workflow creates individual output directories for each tool, including classification results, logs.::
The workflow creates an output JSON file and individual output sub-directories for each tool which include tabular classification results, a tabular report, and a Krona plot (html).::

ReadbasedAnalysis/
|-- SRR7877884.json
|-- centrifuge
| |-- SRR7877884.classification.csv
| |-- SRR7877884.krona.html
| `-- SRR7877884.report.tsv
| |-- SRR7877884.classification.tsv
| |-- SRR7877884.report.tsv
| `-- SRR7877884.krona.html
|
|-- gottcha2
| |-- SRR7877884.full.tsv
| |-- SRR7877884.krona.html
| `-- SRR7877884.tsv
|
`-- kraken2
|-- SRR7877884.classification.csv
|-- SRR7877884.classification.tsv
|-- SRR7877884.krona.html
`-- SRR7877884.report.csv
`-- SRR7877884.report.tsv


Below is an example of the output directory files with descriptions to the right.

Requirements for Execution
--------------------------
======================================== ==============================================
FileName Description
======================================== ==============================================
SRR7877884.json ReadbasedAnalysis result JSON file
centrifuge/SRR7877884.classification.tsv Centrifuge output read classification TSV file
centrifuge/SRR7877884.report.tsv Centrifuge output report TSV file
centrifuge/SRR7877884.krona.html Centrifuge krona plot HTML file
gottcha2/SRR7877884.full.tsv GOTTCHA2 detail output TSV file
gottcha2/SRR7877884.tsv GOTTCHA2 output report TSV file
gottcha2/SRR7877884.krona.html GOTTCHA2 krona plot HTML file
kraken2/SRR7877884.classification.tsv Kraken2 output read classification TSV file
======================================== ==============================================
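
After a run, the output layout listed above can be checked with a short script. The following Python sketch assumes the `.tsv` file names from the updated listing. The helper names `expected_outputs` and `missing_outputs` are hypothetical and are not part of the workflow.

```python
from pathlib import Path

# File suffixes per tool, taken from the output directory listing above.
EXPECTED_SUFFIXES = {
    "centrifuge": ("classification.tsv", "report.tsv", "krona.html"),
    "gottcha2": ("full.tsv", "tsv", "krona.html"),
    "kraken2": ("classification.tsv", "report.tsv", "krona.html"),
}


def expected_outputs(outdir, prefix, tools=None):
    """List the files a run is expected to create for the given tools."""
    out = Path(outdir)
    selected = EXPECTED_SUFFIXES if tools is None else tools
    paths = [out / f"{prefix}.json"]
    for tool in selected:
        paths += [out / tool / f"{prefix}.{suffix}" for suffix in EXPECTED_SUFFIXES[tool]]
    return paths


def missing_outputs(outdir, prefix, tools=None):
    """Return the expected files that do not exist under outdir."""
    return [p for p in expected_outputs(outdir, prefix, tools) if not p.is_file()]
```

An empty list from `missing_outputs("/path/to/ReadbasedAnalysis", "SRR7877884")` means every listed file is present.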

- Docker or other Container Runtime
- Cromwell or other WDL-capable Workflow Execution Tool
- 50 GB RAM

Version History
---------------

- 0.0.1
1.0.1 (release date 01/14/2021; previous versions: 1.0.0)

Point of contact
----------------