add fastp #210

Merged
merged 12 commits into from
Dec 5, 2022
3 changes: 2 additions & 1 deletion CHANGELOG.md
Expand Up @@ -7,7 +7,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

### `Added`

- Template update to nf-core tools v2.6
- [#209](https://github.com/nf-core/airrflow/pull/209) Template update to nf-core tools v2.6.
- [#210](https://github.com/nf-core/airrflow/pull/210) Add fastp for read QC, adapter trimming and read clipping.

## [2.3.0] - 2022-09-22 "Expelliarmus"

Expand Down
4 changes: 4 additions & 0 deletions CITATIONS.md
Expand Up @@ -12,6 +12,10 @@

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [Fastp](https://doi.org/10.1093/bioinformatics/bty560)

> Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics. 2018 Sept 1; 34(17):i884–i890. doi: 10.1093/bioinformatics/bty560.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
Expand Down
3 changes: 2 additions & 1 deletion README.md
Expand Up @@ -24,7 +24,7 @@ On release, automated continuous integration tests run the pipeline on a full-si

By default, the pipeline currently performs the following steps:

- Raw read quality control (`FastQC`)
- Raw read quality control, adapter trimming and read clipping (`fastp`)
- Pre-processing (`pRESTO`)
- Filtering sequences by sequencing quality.
- Masking amplicon primers.
Expand All @@ -35,6 +35,7 @@ By default, the pipeline currently performs the following steps:
- Assembling R1 and R2 read mates.
- Removing and annotating read duplicates with different UMI barcodes.
- Filtering out sequences that do not have at least 2 duplicates.
- Post-assembly read quality control (`FastQC`)
- Assigning gene segment alleles with `IgBlast` using the IMGT database (`Change-O`).
- Finding the Hamming distance threshold for clone definition (`SHazaM`).
- Clonal assignment: defining clonal lineages of the B-cell / T-cell populations (`Change-O`).
Expand Down
32 changes: 30 additions & 2 deletions conf/modules.config
Expand Up @@ -36,8 +36,36 @@ process {
]
}

withName: FASTQC {
ext.args = '--quiet'
withName: 'FASTP' {
publishDir = [
[
path: { "${params.outdir}/fastp/${meta.id}" },
mode: params.publish_dir_mode,
pattern: "*.{html,json,log}"
],
[
enabled: params.save_trimmed,
path: { "${params.outdir}/fastp/${meta.id}/" },
mode: params.publish_dir_mode,
pattern: "*.fastp.fastq.gz"
]
]
ext.args = [ "--disable_quality_filtering --disable_length_filtering",
params.trim_fastq ? "" : "--disable_adapter_trimming", // Adapter trimming stays on unless trim_fastq is false
params.clip_r1 > 0 ? "--trim_front1 ${params.clip_r1}" : "", // Remove bp from the 5' end of read 1
params.clip_r2 > 0 ? "--trim_front2 ${params.clip_r2}" : "", // Remove bp from the 5' end of read 2
params.three_prime_clip_r1 > 0 ? "--trim_tail1 ${params.three_prime_clip_r1}" : "", // Remove bp from the 3' end of read 1
params.three_prime_clip_r2 > 0 ? "--trim_tail2 ${params.three_prime_clip_r2}" : "", // Remove bp from the 3' end of read 2
params.trim_nextseq ? "--trim_poly_g" : "", // Force poly-G tail trimming (relevant for NextSeq/NovaSeq data)
].join(" ").trim()
}

withName: 'GUNZIP_*' {
publishDir = [
[
enabled: false
]
]
}

withName: FASTQC_POSTASSEMBLY {
Expand Down
48 changes: 31 additions & 17 deletions docs/output.md
Expand Up @@ -10,7 +10,7 @@ The directories listed below will be created in the results directory after the

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [FastQC](#fastqc) - read quality control
- [fastp](#fastp) - read quality control, adapter trimming and read clipping
- [pRESTO](#presto) - read pre-processing
- [Filter by sequence quality](#filter-by-sequence-quality) - filter sequences by quality
- [Mask primers](#mask-primers) - Masking primers
Expand All @@ -21,6 +21,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Assemble mates](#assemble-mates) - Assemble sequence mates.
- [Remove duplicates](#remove-duplicates) - Remove and annotate read duplicates.
- [Filter sequences for at least 2 representative](#filter-sequences-for-at-least-2-representative) Filter sequences that do not have at least 2 duplicates.
- [FastQC](#fastqc) - read quality control post-assembly
- [Change-O](#change-o) - Assign genes and clonotyping
- [Assign genes with Igblast](#assign-genes-with-igblast)
- [Make database from assigned genes](#make-database-from-assigned-genes)
Expand All @@ -39,29 +40,20 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [MultiQC](#MultiQC) - MultiQC
- [Pipeline information](#pipeline-information) - Pipeline information

## FastQC
## fastp

<details markdown="1">
<summary>Output files</summary>

- `fastqc/`
- `*_fastqc.html`: FastQC report containing quality metrics for the raw unmated reads.
- `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images for the raw unmated reads.
- `postassembly/`
- `*_ASSEMBLED_fastqc.html`: FastQC report containing quality metrics for the mated and quality filtered reads.
- `*_ASSEMBLED_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images for the mated and quality filtered reads.
- `fastp/`
- `<sample_id>/`
- `*.fastp.html`: fastp report containing quality metrics for the reads before and after processing.
- `*.fastp.json`: The same report data in JSON format.
- `*.fastp.log`: fastp log file.
- `*.fastp.fastq.gz`: Processed reads, published only when `save_trimmed` is enabled.

</details>
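
The `*.fastp.json` file listed above can be inspected programmatically. The following is a minimal Python sketch, not part of the pipeline: the helper name `summarize_fastp` is hypothetical, the numbers are made up, and the field names (`summary`, `before_filtering`, `after_filtering`, `total_reads`) are assumed from fastp's JSON layout, so check them against one of your own reports.

```python
import json


def summarize_fastp(report: dict) -> dict:
    """Return read counts before/after processing from a parsed fastp JSON report."""
    summary = report["summary"]  # assumed top-level key in fastp's JSON
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    return {
        "reads_before": before,
        "reads_after": after,
        # Guard against an empty input file.
        "fraction_kept": after / before if before else 0.0,
    }


# Tiny stand-in for the contents of a `*.fastp.json` file (made-up numbers).
example_report = json.loads("""
{"summary": {"before_filtering": {"total_reads": 1000},
             "after_filtering": {"total_reads": 950}}}
""")

print(summarize_fastp(example_report))
```

For a real run, replace `example_report` with `json.load(open(path))`, where `path` points to a report under `fastp/<sample_id>/`.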

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).

![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)

![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)

![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)

> **NB:** Two sets of FastQC plots are displayed in the MultiQC report: first for the raw _untrimmed_ and unmated reads and secondly for the assembled and QC filtered reads (but before collapsing duplicates). They may contain adapter sequence and potentially regions with low quality.
[fastp](https://doi.org/10.1093/bioinformatics/bty560) reports general quality metrics for your sequenced reads and can also filter reads by quality, trim adapters and clip reads at the 5' or 3' ends. Its report covers the quality score distribution across your reads, per-base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. In this pipeline, fastp's quality and length filtering are disabled (see `conf/modules.config`), so it is used for QC reporting, adapter trimming and read clipping only. For further reading and documentation see the [fastp documentation](https://github.com/OpenGene/fastp).
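
How the trimming parameters added to `nextflow.config` in this PR are meant to translate into fastp flags can be sketched as follows. This is an illustrative Python rendering of the intended `ext.args` mapping in `conf/modules.config`, not the pipeline's own code; the function name `build_fastp_args` and the `defaults` dictionary are hypothetical, while the flag names and default values are taken from the config hunks in this diff.

```python
def build_fastp_args(params: dict) -> str:
    """Assemble a fastp argument string from pipeline-style parameters."""
    # Quality and length filtering are always disabled in this pipeline.
    parts = ["--disable_quality_filtering", "--disable_length_filtering"]

    # Adapter trimming is on by default; it is switched off only on request.
    if not params.get("trim_fastq", True):
        parts.append("--disable_adapter_trimming")

    # Each clipping parameter maps to one fastp flag and is added only if > 0.
    clip_flags = {
        "clip_r1": "--trim_front1",
        "clip_r2": "--trim_front2",
        "three_prime_clip_r1": "--trim_tail1",
        "three_prime_clip_r2": "--trim_tail2",
    }
    for name, flag in clip_flags.items():
        n_bases = params.get(name, 0)
        if n_bases > 0:
            parts.append(f"{flag} {n_bases}")

    if params.get("trim_nextseq", False):
        parts.append("--trim_poly_g")

    return " ".join(parts)


# Defaults as introduced in nextflow.config by this PR.
defaults = {
    "trim_fastq": True,
    "clip_r1": 0,
    "clip_r2": 0,
    "three_prime_clip_r1": 0,
    "three_prime_clip_r2": 0,
    "trim_nextseq": False,
}

print(build_fastp_args(defaults))
print(build_fastp_args({**defaults, "clip_r1": 5, "trim_nextseq": True}))
```

With the defaults, only the two `--disable_*_filtering` flags are emitted, which leaves fastp's adapter trimming active.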

## presto

Expand Down Expand Up @@ -193,6 +185,28 @@ Remove duplicates using [CollapseSeq](https://presto.readthedocs.io/en/version-0

Remove sequences that do not have at least 2 representatives using [SplitSeq](https://presto.readthedocs.io/en/version-0.5.11/tools/SplitSeq.html) from the pRESTO Immcantation toolset.

## FastQC

<details markdown="1">
<summary>Output files</summary>

- `fastqc/`
- `postassembly/`
- `*_ASSEMBLED_fastqc.html`: FastQC report containing quality metrics for the mated and quality filtered reads.
- `*_ASSEMBLED_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images for the mated and quality filtered reads.

</details>

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).

![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)

![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)

![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)

> **NB:** Two sets of FastQC plots are displayed in the MultiQC report: first for the raw _untrimmed_ and unmated reads and secondly for the assembled and QC filtered reads (but before collapsing duplicates). They may contain adapter sequence and potentially regions with low quality.

## Change-O

### Assign genes with Igblast
Expand Down
4 changes: 4 additions & 0 deletions modules.json
Expand Up @@ -9,6 +9,10 @@
"branch": "master",
"git_sha": "8022c68e7403eecbd8ba9c49496f69f8c49d50f0"
},
"fastp": {
"branch": "master",
"git_sha": "1e49f31e93c56a3832833eef90a02d3cde5a3f7e"
},
"fastqc": {
"branch": "master",
"git_sha": "5e34754d42cd2d5d248ca8673c0a53cdf5624905"
Expand Down
103 changes: 103 additions & 0 deletions modules/nf-core/fastp/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

73 changes: 73 additions & 0 deletions modules/nf-core/fastp/meta.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 10 additions & 0 deletions nextflow.config
Expand Up @@ -40,6 +40,16 @@ params {
umi_length = -1
umi_start = 0

// trimming options
trim_fastq = true
adapter_fasta = null
clip_r1 = 0
clip_r2 = 0
three_prime_clip_r1 = 0
three_prime_clip_r2 = 0
trim_nextseq = false
save_trimmed = false

// pRESTO options
filterseq_q = 20
primer_maxerror = 0.2
Expand Down