Update README.md
vlilanl authored Nov 15, 2024
1 parent a360f3c commit 3701987
Showing 1 changed file with 38 additions and 30 deletions.
68 changes: 38 additions & 30 deletions README.md
@@ -1,8 +1,13 @@
# The Data Preprocessing workflow
# The Data Preprocessing Workflow

## Summary

This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/data-preprocessing/) implemented at JGI for Illumina reads and use the program “rqcfilter2” from BBTools(38:96) which implements them as a pipeline.
This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/data-preprocessing/) implemented at JGI for Illumina reads.

For **short reads**, this workflow uses the program `rqcfilter2` from BBTools to perform quality control on raw Illumina reads. It performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using `BBDuk`), and removes human/cat/dog/mouse/microbe reads (using `BBMap`).

For **long reads**, this workflow performs quality control on PacBio reads. It performs duplicate removal (using `pbmarkdup`), inverted repeat filtering (using BBTools `icecreamfinder.sh`), adapter trimming, and final filtering of reads with residual adapter sequences (using `bbduk`). The workflow is designed to handle input files in various formats, including .bam, .fq, or .fq.gz.

## Required Database

@@ -17,33 +22,27 @@ This workflow is a replicate of the [QA protocol](https://jgi.doe.gov/data-and-t
rm RQCFilterData.tgz
```

## Running Workflow in Cromwell

Description of the files:
- `.wdl` file: the WDL file for workflow definition
- `.json` file: the example input for the workflow
- `.conf` file: the conf file for running Cromwell.
- `.sh` file: the shell script for running the example workflow

## The Docker image and Dockerfile can be found here

[microbiomedata/bbtools:38.92](https://hub.docker.com/r/microbiomedata/bbtools)
[microbiomedata/bbtools:38.96](https://hub.docker.com/r/microbiomedata/bbtools)

## Input files

1. database path,
2. fastq (illumina paired-end interleaved fastq),
3. project name
4. resource where run the workflow
5. informed_by
1. the path to the interleaved fastq file (used for both long reads and short reads)
2. forward reads fastq file (when input_interleaved is false)
3. reverse reads fastq file (when input_interleaved is false)
4. project id
5. whether the input is interleaved (boolean)
6. whether the input is short reads (boolean)

```
{
"nmdc_rqcfilter.database": "/global/cfs/projectdirs/m3408/aim2/database",
"nmdc_rqcfilter.input_files": "/global/cfs/cdirs/m3408/ficus/8434.3.102077.AGTTCC.fastq.gz",
"nmdc_rqcfilter.proj":"nmdc:xxxxxxx",
"nmdc_rqcfilter.resouce":"NERSC -- perlmutter",
"nmdc_rqcfilter.informed_by": "nmdc:xxxxxxxx"
"rqcfilter.input_files": ["https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"],
"rqcfilter.input_fq1": [],
"rqcfilter.input_fq2": [],
"rqcfilter.proj": "nmdc:xxxxxxx",
"rqcfilter.interleaved": true,
"rqcfilter.shortRead": true
}
```
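
The inputs listed above can also be assembled programmatically. The following is a minimal Python sketch, not part of the workflow itself: the helper name `build_inputs` and its validation rules (interleaved input requires `input_files`; non-interleaved input requires equally long forward and reverse lists) are assumptions made for illustration. Only the `rqcfilter.*` keys are taken from the example JSON.

```python
import json


def build_inputs(proj, input_files=None, fq1=None, fq2=None,
                 interleaved=True, short_read=True):
    """Build an inputs dict using the keys from the example JSON.

    Hypothetical helper: the checks below are assumed rules,
    not behavior confirmed by the workflow.
    """
    input_files = list(input_files or [])
    fq1 = list(fq1 or [])
    fq2 = list(fq2 or [])

    if interleaved:
        # Assumed rule: interleaved input is passed via input_files.
        if not input_files:
            raise ValueError("interleaved input requires input_files")
    else:
        # Assumed rule: forward and reverse files must pair up one to one.
        if not fq1 or len(fq1) != len(fq2):
            raise ValueError("non-interleaved input requires paired fq1/fq2 lists")

    return {
        "rqcfilter.input_files": input_files,
        "rqcfilter.input_fq1": fq1,
        "rqcfilter.input_fq2": fq2,
        "rqcfilter.proj": proj,
        "rqcfilter.interleaved": interleaved,
        "rqcfilter.shortRead": short_read,
    }


if __name__ == "__main__":
    inputs = build_inputs(
        "nmdc:xxxxxxx",
        input_files=["https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"],
    )
    # Print the JSON document; redirect to a file to use it as workflow input.
    print(json.dumps(inputs, indent=2))
```

Running the script prints a JSON document with the same keys as the example, which can be redirected to a file and passed to the workflow runner as its inputs file.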

@@ -54,13 +53,22 @@ The output will have one directory named by prefix of the fastq input file and a
The main QC fastq output is named prefix.anqdpht.fastq.gz.

```
|-- 8434.1.102069.ACAGTG.anqdpht.fastq.gz
|-- filterStats.txt
|-- filterStats.json
|-- filterStats2.txt
|-- adaptersDetected.fa
|-- reproduce.sh
|-- spikein.fq.gz
|-- status.log
|-- ...
# Short Reads
output/
├── nmdc_xxxxxxx_filtered.fastq.gz
├── nmdc_xxxxxxx_filterStats.txt
├── nmdc_xxxxxxx_filterStats2.txt
├── nmdc_xxxxxxx_readsQC.info
└── nmdc_xxxxxxx_qa_stats.json
# Long Reads
output/
├── nmdc_xxxxxxx_pbmarkdupStats.txt
├── nmdc_xxxxxxx_readsQC.info
├── nmdc_xxxxxxx_bbdukEndsStats.json
├── nmdc_xxxxxxx_icecreamStats.json
├── nmdc_xxxxxxx_filtered.fastq.gz
└── nmdc_xxxxxxx_stats.json
```
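
To check a finished run against the listings above, the expected file names can be generated from the project id. This is a hedged sketch: the function `expected_outputs` is hypothetical, and the rule that the file prefix is the project id with `:` replaced by `_` is inferred from the example names (`nmdc:xxxxxxx` and `nmdc_xxxxxxx_...`), not confirmed by the workflow.

```python
# File name suffixes copied from the output listings above.
SHORT_READ_SUFFIXES = [
    "filtered.fastq.gz",
    "filterStats.txt",
    "filterStats2.txt",
    "readsQC.info",
    "qa_stats.json",
]
LONG_READ_SUFFIXES = [
    "pbmarkdupStats.txt",
    "readsQC.info",
    "bbdukEndsStats.json",
    "icecreamStats.json",
    "filtered.fastq.gz",
    "stats.json",
]


def expected_outputs(proj, short_read=True, outdir="output"):
    """Return the expected output paths for a project id.

    Assumption: the prefix is the project id with ':' replaced by '_'.
    """
    prefix = proj.replace(":", "_")
    suffixes = SHORT_READ_SUFFIXES if short_read else LONG_READ_SUFFIXES
    return [f"{outdir}/{prefix}_{suffix}" for suffix in suffixes]


if __name__ == "__main__":
    for path in expected_outputs("nmdc:xxxxxxx", short_read=False):
        print(path)
```

Comparing this list with the actual contents of the output directory gives a quick completeness check for a run.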

## Link to Doc Site
Please refer to the [documentation site](https://nmdc-workflow-documentation.readthedocs.io/en/latest/chapters/1_RQC_index.html) for more information.
