Merge branch 'dev' into MarieLataretu/issue55
MarieLataretu committed Sep 30, 2023
2 parents 001cbd6 + 28777b2 commit ca353fd
Showing 19 changed files with 195 additions and 88 deletions.
27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,27 @@
# Changelog

## [v1.0.0-alpha] - 2023-09-30

### Changed

- changed input parameter usage:
- before: `--[nano|illumina|illumina_single_end|fasta]`
- now: `--input_type [nano|illumina|illumina_single_end|fasta] --input *.fastq`
- changed workflow figure to a nicer figure
- changed workflow structure (introducing subworkflows)
- input files with the suffix `clean` are not allowed

### Added

- added CHANGELOG.md, Citations.md and citation information
- added `--cleanup_work_dir` to remove work dir files after a successful run
- added `--min_clip` to filter mapped reads by soft-clipped length
- added `--dcs_strict` to use only DCS reads with artificial ends
- added `stub` command for Nextflow prototyping
- added `idxstats`

### Fixed

- pipeline report with timestamp
- `--split-prefix` parameter for `minimap2`
- make concat contamination more efficient
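The changed input parameter usage listed above can be illustrated with a small migration sketch. This is not part of CLEAN: `migrate_args` and `OLD_INPUT_FLAGS` are hypothetical names, and the sketch assumes that the old-style flag (for example `--nano`) was followed directly by the input path.

```python
# Hypothetical helper, not part of CLEAN: rewrites an old-style argument list
# into the new `--input_type ... --input ...` form described in the changelog.
# Assumption: the old flag (e.g. `--nano`) was followed directly by the input path.
OLD_INPUT_FLAGS = ("nano", "illumina", "illumina_single_end", "fasta")


def migrate_args(args):
    """Return a new argument list that uses --input_type and --input."""
    migrated = []
    i = 0
    while i < len(args):
        flag = args[i].lstrip("-")
        if args[i].startswith("--") and flag in OLD_INPUT_FLAGS:
            if i + 1 >= len(args):
                raise ValueError(f"{args[i]} is missing its input path")
            # The old flag name becomes the value of --input_type.
            migrated += ["--input_type", flag, "--input", args[i + 1]]
            i += 2
        else:
            # Every other argument is passed through unchanged.
            migrated.append(args[i])
            i += 1
    return migrated
```

Under that assumption, `migrate_args(["--nano", "reads.fastq", "--host", "hsa"])` yields the new-style list `["--input_type", "nano", "--input", "reads.fastq", "--host", "hsa"]`; the file name and host value here are placeholders.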
53 changes: 53 additions & 0 deletions CITATIONS.md
@@ -0,0 +1,53 @@
# CLEAN: Citations

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
## Pipeline tools

- [BBMap](https://sourceforge.net/projects/bbmap/)

- [BEDTools](https://www.ncbi.nlm.nih.gov/pubmed/20110278/)

> Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.
- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [Minimap2](https://pubmed.ncbi.nlm.nih.gov/29750242/)
> Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191. PMID: 29750242; PMCID: PMC6137996.
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
- [NanoPlot](https://pubmed.ncbi.nlm.nih.gov/37171891/)

> De Coster W, Rademakers R. NanoPack2: Population scale evaluation of long-read sequencing data. Bioinformatics. 2023 May 12;39(5):btad311. doi: 10.1093/bioinformatics/btad311. Epub ahead of print. PMID: 37171891; PMCID: PMC10196664.
- [SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)

> Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
- [QUAST](https://pubmed.ncbi.nlm.nih.gov/23422339/)

> Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806.
## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

> Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

> Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.
- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

> da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.
- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
> Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
3 changes: 0 additions & 3 deletions LICENCE

This file was deleted.

28 changes: 28 additions & 0 deletions LICENSE
@@ -0,0 +1,28 @@
BSD 3-Clause License

Copyright (c) 2022, Martin Hölzer

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
20 changes: 16 additions & 4 deletions README.md
@@ -2,11 +2,11 @@

A decontamination workflow for short reads, long reads and assemblies.

![](https://img.shields.io/badge/nextflow-19.10.0-brightgreen)
![](https://img.shields.io/badge/nextflow-21.04.0-brightgreen)
![](https://img.shields.io/badge/uses-docker-blue.svg)
![](https://img.shields.io/badge/uses-conda-yellow.svg)

Email: [email protected], marie.lataretu@uni-jena.de
Email: [email protected], lataretum@rki.de

## Objective

@@ -102,8 +102,20 @@ Included in this repository are:

... for reasons. More can be easily added! Just write me, add an issue or make a pull request.

## Flowchart
## Workflow

![chart](data/figures/workflow.png)

<sub><sub>The icons and diagram components that make up the schematic view were originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).</sub></sub>

## Citations

If you use `CLEAN` in your work, please consider citing our preprint:

> Targeted decontamination of sequencing data with CLEAN
>
> Marie Lataretu, Sebastian Krautwurst, Adrian Viehweger, Christian Brandt, Martin Hölzer
>
> bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089
Additionally, an extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
13 changes: 10 additions & 3 deletions clean.nf
@@ -10,7 +10,7 @@ Author: [email protected]

// Parameters sanity checking

Set valid_params = ['max_cores', 'cores', 'max_memory', 'memory', 'profile', 'help', 'input', 'input_type', 'list', 'host', 'own', 'control', 'keep', 'rm_rrna', 'bbduk', 'bbduk_kmer', 'bbduk_qin', 'reads_rna', 'min_clip', 'dcs_strict', 'output', 'multiqc_dir', 'nf_runinfo_dir', 'databases', 'condaCacheDir', 'singularityCacheDir', 'singularityCacheDir', 'cloudProcess', 'conda-cache-dir', 'singularity-cache-dir', 'cloud-process', 'publish_dir_mode'] // don't ask me why there is also 'conda-cache-dir', 'singularity-cache-dir', 'cloud-process'
Set valid_params = ['max_cores', 'cores', 'max_memory', 'memory', 'profile', 'help', 'input', 'input_type', 'list', 'host', 'own', 'control', 'keep', 'rm_rrna', 'bbduk', 'bbduk_kmer', 'bbduk_qin', 'reads_rna', 'min_clip', 'dcs_strict', 'output', 'multiqc_dir', 'nf_runinfo_dir', 'databases', 'cleanup_work_dir','condaCacheDir', 'singularityCacheDir', 'singularityCacheDir', 'cloudProcess', 'conda-cache-dir', 'singularity-cache-dir', 'cloud-process', 'publish_dir_mode'] // don't ask me why there is also 'conda-cache-dir', 'singularity-cache-dir', 'cloud-process'
def parameter_diff = params.keySet() - valid_params
if (parameter_diff.size() != 0){
exit 1, "ERROR: Parameter(s) $parameter_diff is/are not valid in the pipeline!\n"
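The sanity check in this hunk is a set difference: any parameter name not listed in `valid_params` aborts the run, which is why `cleanup_work_dir` had to be added to the list alongside the new option. A minimal Python restatement of that logic follows; `VALID_PARAMS` is a shortened, illustrative subset and the function names are hypothetical, not pipeline code.

```python
# Illustrative restatement of the check in clean.nf. VALID_PARAMS is a
# shortened, hypothetical subset of the real `valid_params` list.
VALID_PARAMS = {"input", "input_type", "host", "min_clip", "cleanup_work_dir"}


def unknown_params(params):
    """Return the parameter names that are not in VALID_PARAMS, sorted."""
    return sorted(set(params) - VALID_PARAMS)


def check_params(params):
    """Abort with an error message if any parameter name is unknown."""
    diff = unknown_params(params)
    if diff:
        raise SystemExit(
            f"ERROR: Parameter(s) {diff} is/are not valid in the pipeline!"
        )
```

With this subset, a misspelled name such as `cleanup_workdir` ends up in the difference and stops the run, while `cleanup_work_dir` passes.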
@@ -143,6 +143,8 @@ if ( params.rm_rrna ){

if ( params.host ) {
hostNameChannel = Channel.from( params.host ).splitCsv().flatten()
} else {
hostNameChannel = Channel.empty()
}

// user defined fasta sequence
@@ -189,7 +191,7 @@ include { qc } from './workflows/qc_wf'
**************************/

workflow {
prepare_contamination(nanoControlFastaChannel, illuminaControlFastaChannel, rRNAChannel)
prepare_contamination(nanoControlFastaChannel, illuminaControlFastaChannel, rRNAChannel, hostNameChannel, ownFastaChannel)
contamination = prepare_contamination.out

clean(input_ch, contamination, nanoControlBedChannel)
@@ -266,7 +268,7 @@ def helpMSG() {
${c_green}--bbduk_qin${c_reset} set quality ASCII encoding for bbduk [default: $params.bbduk_qin; options are: 64, 33, auto]
${c_green}--reads_rna${c_reset} add this flag for noisy direct RNA-Seq Nanopore data [default: $params.reads_rna]
${c_green}--min_clip${c_reset} filter mapped reads by soft-clipped lenth (left + right). If >= 1 total
${c_green}--min_clip${c_reset} filter mapped reads by soft-clipped length (left + right). If >= 1 total
number; if < 1 relative to read length
${c_green}--dcs_strict${c_reset} filter out alignments that cover artificial ends of the ONT DCS to discriminate between Lambda Phage and DCS
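The `--min_clip` help text distinguishes an absolute threshold (values >= 1) from a relative one (values < 1, as a fraction of the read length). The sketch below shows one way to read that rule. It is not the pipeline's implementation: the function names are hypothetical, the read length is passed in explicitly, and treating "exceeds" as a strict `>` comparison is an assumption.

```python
import re

# Matches one CIGAR operation, e.g. "15S" or "120M".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_length(cigar):
    """Sum the soft-clipped bases (left + right) of a CIGAR string.

    In valid SAM records soft clips only occur at the ends of the CIGAR
    (optionally flanked by hard clips), so summing all S operations
    gives the left + right total.
    """
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")


def exceeds_min_clip(cigar, read_length, min_clip):
    """Return True if the soft-clipped length is above the threshold.

    Assumption: min_clip >= 1 is an absolute number of bases,
    min_clip < 1 is a fraction of the read length, and the
    comparison is strict.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    clipped = soft_clipped_length(cigar)
    if min_clip >= 1:
        return clipped > min_clip
    return clipped / read_length > min_clip
```

For a 100-base read with CIGAR `10S80M10S`, 20 bases are soft-clipped: that is above an absolute threshold of 15 and above a relative threshold of 0.1, but not above 0.25.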
@@ -287,6 +289,10 @@ def helpMSG() {
--condaCacheDir defines the path where environments (conda) are cached [default: $params.condaCacheDir]
--singularityCacheDir defines the path where images (singularity) are cached [default: $params.singularityCacheDir]
${c_yellow}Miscellaneous:${c_reset}
--cleanup_work_dir deletes all files in the work directory after a successful completion of a run [default: $params.cleanup_work_dir]
${c_dim}warning: if true, the option will prevent the use of the resume feature!${c_reset}
${c_yellow}Profile:${c_reset}
You can merge different profiles for different setups, e.g.
@@ -303,6 +309,7 @@ def helpMSG() {
docker
singularity
conda
mamba
ebi (lsf,singularity; preconfigured for the EBI cluster)
yoda (lsf,singularity; preconfigured for the EBI YODA cluster)
1 change: 0 additions & 1 deletion configs/conda.config
@@ -1,5 +1,4 @@
process {
withLabel: basics { conda = "$baseDir/envs/basics.yaml" }
withLabel: minimap2 { conda = "$baseDir/envs/minimap2.yaml" }
withLabel: bbmap { conda = "$baseDir/envs/bbmap.yaml" }
withLabel: pysam { conda = "$baseDir/envs/pysam.yaml" }
3 changes: 1 addition & 2 deletions configs/container.config
@@ -1,7 +1,6 @@
process {
withLabel: basics { container = 'nanozoo/basics:1.0--9992be5' }
withLabel: smallTask { container = 'nanozoo/samtools:1.14--d8fb865' }
withLabel: minimap2 { container = 'nanozoo/minimap2:2.18--618fb68' }
withLabel: minimap2 { container = 'nanozoo/minimap2:2.26--ef54a1d' }
withLabel: bbmap { container = 'nanozoo/bbmap:38.79--8e915d7' }
withLabel: multiqc { container = 'nanozoo/multiqc:1.9--aba729b' }
withLabel: fastqc { container = 'nanozoo/fastqc:0.11.9--f61b8b4' }
1 change: 0 additions & 1 deletion configs/local.config
@@ -1,5 +1,4 @@
process {
withLabel: basics { cpus = params.cores }
withLabel: minimap2 { cpus = params.cores }
withLabel: bbmap { cpus = params.cores ; memory = params.memory }
withLabel: samclipy { cpus = 1 }
1 change: 0 additions & 1 deletion configs/node.config
@@ -1,5 +1,4 @@
process {
withLabel: basics { cpus = 8; memory = 8.GB }
withLabel: minimap2 { cpus = 24; memory = 24.GB }
withLabel: bbmap { cpus = 24; memory = 24.GB }
withLabel: smallTask { cpus = 1; memory = 2.GB }
6 changes: 0 additions & 6 deletions envs/basics.yaml

This file was deleted.

4 changes: 2 additions & 2 deletions envs/minimap2.yaml
@@ -3,6 +3,6 @@ channels:
- bioconda
- conda-forge
dependencies:
- minimap2=2.18
- samtools=1.11
- minimap2=2.26
- samtools=1.17
- pigz=2.3.4
Binary file added figures/clean_workflow_latest.png
28 changes: 18 additions & 10 deletions modules/alignment_processing.nf
@@ -34,7 +34,7 @@ process merge_bam {
tuple val(name), val(type), path(bam)

output:
tuple val(name), val(type), path("${bam[0].baseName}_merged.bam")
tuple val(name), val(type), path("${bam[0].baseName}_merged.bam")

script:
"""
@@ -111,34 +111,41 @@ process filter_true_dcs_alignments {

process fastq_from_bam {
label 'minimap2'
publishDir "${params.output}/${params.tool}/${name}", mode: 'copy', pattern: "*.gz"

input:
tuple val(name), val(type), path(bam)

output:
tuple val(name), val(type), path('*.fastq')
tuple val(name), val(type), path('*.fast*.gz')

script:
if ( params.lib_pairedness == 'paired' ) {
"""
samtools fastq -@ ${task.cpus} -1 ${bam.baseName}_1.fastq -2 ${bam.baseName}_2.fastq -s ${bam.baseName}_singleton.fastq ${bam}
gzip --no-name *.fastq
"""
} else if ( params.lib_pairedness == 'single' ) {
dtype = (params.input_type == 'fasta') ? 'a' : 'q'
"""
samtools fastq -@ ${task.cpus} -0 ${bam.baseName}.fastq ${bam}
samtools fastq -@ ${task.cpus} -0 ${bam.baseName}.fast${dtype} ${bam}
gzip --no-name *.fast${dtype}
"""
} else {
error "Invalid pairedness: ${params.lib_pairedness}"
}
stub:
dtype = (params.input_type == 'fasta') ? 'a' : 'q'
"""
touch ${bam.baseName}_1.fastq ${bam.baseName}_2.fastq
touch ${bam.baseName}_1.fast${dtype}.gz ${bam.baseName}_2.fast${dtype}.gz
"""
}

process idxstats_from_bam {
label 'minimap2'

publishDir "${params.output}/minimap2/${name}", mode: 'copy', pattern: "${bam.baseName}.idxstats.tsv"

input:
tuple val(name), val(type), path(bam), path(bai)

@@ -147,11 +154,11 @@ process idxstats_from_bam {

script:
"""
samtools idxstats ${bam} > ${bam.baseName}_idxstats.tsv
samtools idxstats ${bam} > ${bam.baseName}.idxstats.tsv
"""
stub:
"""
touch ${bam.baseName}_idxstats.tsv
touch ${bam.baseName}.idxstats.tsv
"""
}
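The `idxstats_from_bam` process above now publishes `samtools idxstats` output as `<bam basename>.idxstats.tsv`. That output is tab-delimited with four columns per line (reference name, sequence length, mapped read-segments, unmapped read-segments), and a final `*` row for reads without coordinates. A minimal sketch for reading such a file follows; the function names are hypothetical and not part of the pipeline.

```python
import csv
import io


def parse_idxstats(text):
    """Parse `samtools idxstats` output into {reference: (length, mapped, unmapped)}."""
    stats = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        name, length, mapped, unmapped = row
        stats[name] = (int(length), int(mapped), int(unmapped))
    return stats


def mapped_per_reference(stats):
    """Keep only real references (skip the '*' row) with at least one mapped read."""
    return {
        name: mapped
        for name, (_, mapped, _) in stats.items()
        if name != "*" and mapped > 0
    }
```

A possible use is to read the published file with `parse_idxstats(open(path).read())` and pass the result to `mapped_per_reference` to see which contamination references actually attracted reads.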

@@ -168,11 +175,11 @@ process flagstats_from_bam {

script:
"""
samtools flagstats ${bam} > ${bam.baseName}_flagstats.txt
samtools flagstats ${bam} > ${bam.baseName}.flagstats.txt
"""
stub:
"""
touch ${bam.baseName}_flagstats.txt
touch ${bam.baseName}.flagstats.txt
"""
}

@@ -187,11 +194,12 @@ process sort_bam {

script:
"""
samtools sort -@ ${task.cpus} ${bam} > ${bam.baseName}.sorted.bam
mv ${bam} ${bam}.tmp
samtools sort -@ ${task.cpus} ${bam}.tmp > ${bam.baseName}.bam
"""
stub:
"""
touch ${bam.baseName}.sorted.bam
touch ${bam.baseName}.bam
"""
}
