nf-core · ggabernet · Dec 5, 2022 · Dec 3, 2022 · Dec 3, 2022 · Dec 4, 2022
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,7 +7,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ### `Added`
 
-- Template update to nf-core tools v2.6
+- [#209](https://github.com/nf-core/airrflow/pull/209) Template update to nf-core tools v2.6.
+- [#210](https://github.com/nf-core/airrflow/pull/210) Add fastp for read QC, adapter trimming and read clipping.
 
 ## [2.3.0] - 2022-09-22 "Expelliarmus"
 

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -12,6 +12,10 @@
 
 - [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
 
+- [fastp](https://doi.org/10.1093/bioinformatics/bty560)
+
+  > Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics. 2018 Sept 1; 34(17):i884–i890. doi: 10.1093/bioinformatics/bty560.
+
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 
   > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

diff --git a/README.md b/README.md
@@ -24,7 +24,7 @@ On release, automated continuous integration tests run the pipeline on a full-si
 
 By default, the pipeline currently performs the following steps:
 
-- Raw read quality control (`FastQC`)
+- Raw read quality control, adapter trimming and read clipping (`fastp`)
 - Pre-processing (`pRESTO`)
   - Filtering sequences by sequencing quality.
   - Masking amplicon primers.
@@ -35,6 +35,7 @@ By default, the pipeline currently performs the following steps:
   - Assembling R1 and R2 read mates.
   - Removing and annotating read duplicates with different UMI barcodes.
   - Filtering out sequences that do not have at least 2 duplicates.
+- Post-assembly read quality control (`FastQC`)
 - Assigning gene segment alleles with `IgBlast` using the IMGT database (`Change-O`).
 - Finding the Hamming distance threshold for clone definition (`SHazaM`).
 - Clonal assignment: defining clonal lineages of the B-cell / T-cell populations (`Change-O`).

diff --git a/conf/modules.config b/conf/modules.config
@@ -36,8 +36,36 @@ process {
             ]
         }
 
-        withName: FASTQC {
-            ext.args = '--quiet'
+        withName: 'FASTP' {
+            publishDir = [
+                [
+                    path: { "${params.outdir}/fastp/${meta.id}" },
+                    mode: params.publish_dir_mode,
+                    pattern: "*.{html,json,log}"
+                ],
+                [
+                    enabled: params.save_trimmed,
+                    path: { "${params.outdir}/fastp/${meta.id}/" },
+                    mode: params.publish_dir_mode,
+                    pattern: "*.fastp.fastq.gz"
+                ]
+            ]
+            ext.args = [ "--disable_quality_filtering --disable_length_filtering",
+                params.trim_fastq              ? ""                                           : "--disable_adapter_trimming", // Adapter trimming stays enabled unless trim_fastq is false
+                params.clip_r1 > 0             ? "--trim_front1 ${params.clip_r1}"            : "", // Remove bp from the 5' end of read 1
+                params.clip_r2 > 0             ? "--trim_front2 ${params.clip_r2}"            : "", // Remove bp from the 5' end of read 2
+                params.three_prime_clip_r1 > 0 ? "--trim_tail1 ${params.three_prime_clip_r1}" : "", // Remove bp from the 3' end of read 1
+                params.three_prime_clip_r2 > 0 ? "--trim_tail2 ${params.three_prime_clip_r2}" : "", // Remove bp from the 3' end of read 2
+                params.trim_nextseq            ? "--trim_poly_g"                              : "", // Trim poly-G tails, as produced by NextSeq/NovaSeq instruments
+            ].join(" ").trim()
+        }
+
+        withName: 'GUNZIP_*' {
+            publishDir = [
+                [
+                    enabled: false
+                ]
+            ]
         }
 
         withName: FASTQC_POSTASSEMBLY {
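
To make the resulting fastp command line easier to review, the `ext.args` assembly in the `FASTP` block above can be modelled outside Nextflow. This is a minimal Python sketch, not pipeline code: `build_fastp_args` is a hypothetical helper, the parameter names and defaults are taken from this PR's `nextflow.config`, and it assumes the intent is to pass `--disable_adapter_trimming` only when `trim_fastq` is false.

```python
def build_fastp_args(
    trim_fastq=True,
    clip_r1=0,
    clip_r2=0,
    three_prime_clip_r1=0,
    three_prime_clip_r2=0,
    trim_nextseq=False,
):
    """Hypothetical model of the FASTP `ext.args` string (illustration only)."""
    parts = [
        # Quality and length filtering are always disabled in this config.
        "--disable_quality_filtering --disable_length_filtering",
        # Assumed intent: adapter trimming is switched off only when trim_fastq is false.
        "" if trim_fastq else "--disable_adapter_trimming",
        f"--trim_front1 {clip_r1}" if clip_r1 > 0 else "",
        f"--trim_front2 {clip_r2}" if clip_r2 > 0 else "",
        f"--trim_tail1 {three_prime_clip_r1}" if three_prime_clip_r1 > 0 else "",
        f"--trim_tail2 {three_prime_clip_r2}" if three_prime_clip_r2 > 0 else "",
        "--trim_poly_g" if trim_nextseq else "",
    ]
    # Unlike join(" ").trim() in the config, empty entries are dropped here,
    # so the result has no repeated spaces; the options passed are the same.
    return " ".join(p for p in parts if p)


print(build_fastp_args())
print(build_fastp_args(trim_fastq=False, clip_r1=5, trim_nextseq=True))
```

With the defaults, only the two `--disable_*_filtering` flags are emitted; every other option appears only when its parameter is set.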

diff --git a/docs/output.md b/docs/output.md
@@ -10,7 +10,7 @@ The directories listed below will be created in the results directory after the
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [FastQC](#fastqc) - read quality control
+- [fastp](#fastp) - read quality control, adapter trimming and read clipping
 - [pRESTO](#presto) - read pre-processing
   - [Filter by sequence quality](#filter-by-sequence-quality) - filter sequences by quality
   - [Mask primers](#mask-primers) - Masking primers
@@ -21,6 +21,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
   - [Assemble mates](#assemble-mates) - Assemble sequence mates.
   - [Remove duplicates](#remove-duplicates) - Remove and annotate read duplicates.
   - [Filter sequences for at least 2 representative](#filter-sequences-for-at-least-2-representative) Filter sequences that do not have at least 2 duplicates.
+- [FastQC](#fastqc) - read quality control post-assembly
 - [Change-O](#change-o) - Assign genes and clonotyping
   - [Assign genes with Igblast](#assign-genes-with-igblast)
   - [Make database from assigned genes](#make-database-from-assigned-genes)
@@ -39,29 +40,20 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [MultiQC](#MultiQC) - MultiQC
 - [Pipeline information](#pipeline-information) - Pipeline information
 
-## FastQC
+## Fastp
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `fastqc/`
-  - `*_fastqc.html`: FastQC report containing quality metrics for the raw unmated reads.
-  - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images for the raw unmated reads.
-  - `postassembly/`
-    - `*_ASSEMBLED_fastqc.html`: FastQC report containing quality metrics for the mated and quality filtered reads.
-    - `*_ASSEMBLED_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images for the mated and quality filtered reads.
+- `fastp/`
+  - `<sample_id>/`
+    - `*.fastp.html`: fastp report containing quality metrics for the raw reads before and after trimming.
+    - `*.fastp.json`: The same fastp report in JSON format, suitable for parsing by other tools.
+    - `*.fastp.log`: fastp log file.
 
 </details>
 
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
-
-![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
-
-![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
-
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
-
-> **NB:** Two sets of FastQC plots are displayed in the MultiQC report: first for the raw _untrimmed_ and unmated reads and secondly for the assembled and QC filtered reads (but before collapsing duplicates). They may contain adapter sequence and potentially regions with low quality.
+[fastp](https://doi.org/10.1093/bioinformatics/bty560) reports general quality metrics for your sequenced reads and can also filter reads by quality, trim adapters and clip reads at the 5' or 3' ends. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [fastp documentation](https://github.com/OpenGene/fastp).
 
 ## presto
 
@@ -193,6 +185,28 @@ Remove duplicates using [CollapseSeq](https://presto.readthedocs.io/en/version-0
 
 Remove sequences which do not have 2 representative using [SplitSeq](https://presto.readthedocs.io/en/version-0.5.11/tools/SplitSeq.html) from the pRESTO Immcantation toolset.
 
+## FastQC
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `fastqc/`
+  - `postassembly/`
+    - `*_ASSEMBLED_fastqc.html`: FastQC report containing quality metrics for the mated and quality filtered reads.
+    - `*_ASSEMBLED_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images for the mated and quality filtered reads.
+
+</details>
+
+[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+
+![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
+
+![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
+
+![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
+
+> **NB:** The FastQC plots displayed in the MultiQC report are for the assembled and QC filtered reads (but before collapsing duplicates). They may contain adapter sequence and potentially regions with low quality.
+
 ## Change-O
 
 ### Assign genes with Igblast
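
The `*.fastp.json` report listed in the fastp output section can also be inspected programmatically, for example to check how many reads survive trimming. The sketch below is a hedged illustration in Python: `read_retention` is a hypothetical helper, and the `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads` keys are assumed from the usual layout of fastp JSON reports, so verify them against a report from your own run.

```python
import json


def read_retention(report):
    """Return (reads_before, reads_after, fraction_kept) from a parsed fastp JSON report."""
    summary = report["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    # Guard against empty inputs to avoid dividing by zero.
    fraction = after / before if before else 0.0
    return before, after, fraction


# Inline example with made-up numbers. In practice, parse a real report
# found under <outdir>/fastp/<sample_id>/ with json.load().
example = json.loads(
    '{"summary": {"before_filtering": {"total_reads": 1000},'
    ' "after_filtering": {"total_reads": 950}}}'
)
print(read_retention(example))
```

Because quality and length filtering are disabled in this PR's `FASTP` configuration, a large drop in read count would be worth a closer look.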

diff --git a/modules.json b/modules.json
@@ -9,6 +9,10 @@
                         "branch": "master",
                         "git_sha": "8022c68e7403eecbd8ba9c49496f69f8c49d50f0"
                     },
+                    "fastp": {
+                        "branch": "master",
+                        "git_sha": "1e49f31e93c56a3832833eef90a02d3cde5a3f7e"
+                    },
                     "fastqc": {
                         "branch": "master",
                         "git_sha": "5e34754d42cd2d5d248ca8673c0a53cdf5624905"

diff --git a/modules/nf-core/fastp/main.nf b/modules/nf-core/fastp/main.nf
diff --git a/modules/nf-core/fastp/meta.yml b/modules/nf-core/fastp/meta.yml
diff --git a/nextflow.config b/nextflow.config
@@ -40,6 +40,16 @@ params {
     umi_length = -1
     umi_start = 0
 
+    // trimming options
+    trim_fastq = true
+    adapter_fasta = null
+    clip_r1 = 0
+    clip_r2 = 0
+    three_prime_clip_r1 = 0
+    three_prime_clip_r2 = 0
+    trim_nextseq = false
+    save_trimmed = false
+
     // pRESTO options
     filterseq_q = 20
     primer_maxerror = 0.2