Reorganize the results directory #67

Merged
merged 26 commits into from
Dec 14, 2023
Changes from 21 commits
fb759f5
Disable all publishDir directives. Add back the publishing of 'keep' …
matthuska Nov 15, 2023
df2497e
Place fastq files that map to 'keep' genome and to the host/control/o…
matthuska Nov 16, 2023
f739f8b
Re-publish multiqc report to the qc/ directory (by default)
matthuska Nov 16, 2023
b2ec355
Only publish sorted.bam{,.bai} intermediate bam files
matthuska Nov 16, 2023
d91f94f
Publish idxstats and flagstats so they are next to their bam files
matthuska Nov 16, 2023
82b60c0
Handle --strict_dcs and --min_clip output in results dir
matthuska Nov 17, 2023
fc1318b
First steps towards making illumina/bbduk results dir look right
matthuska Nov 17, 2023
f03c8a5
Add bed_samtools container bump minimap2 container (from main)
matthuska Nov 19, 2023
619c205
--keep is now working with bbduk. results/ is properly populated.
matthuska Nov 21, 2023
4611647
Add seqkit conda environment file
matthuska Nov 21, 2023
e5e5659
Output fasta when input is fasta
matthuska Nov 21, 2023
753f97c
Add --no_intermediate argument to skip publishing intermediate files …
matthuska Nov 22, 2023
bb499ac
Add singularity/ dir to .gitignore
matthuska Nov 22, 2023
14840fb
Tweak filter_fastq_by_name to work properly, avoid reading and writin…
matthuska Nov 22, 2023
bb074a7
Replace scary zcat/paste/grep/tr with seqkit grep
matthuska Nov 22, 2023
40276ea
Add clarifying comments to the filter_fastq_by_name process
matthuska Nov 22, 2023
dc59d28
Remove overwrite: false from publishDir directives
matthuska Nov 22, 2023
39bd52e
fix stub touch
MarieLataretu Dec 10, 2023
d29a5ef
added seqkit to citations
MarieLataretu Dec 10, 2023
d2ae118
Add -keep to help mssg. Reformated --help a bit. Removed old profiles…
Dec 10, 2023
95ce2ee
adjusted the README a bit
Dec 10, 2023
975e05a
change sed command for --keep and add empty BED channel for phix control
Dec 10, 2023
814e37b
Publish host genome and index for loading into IGV. Skipped if interm…
matthuska Dec 11, 2023
5d87ac4
Switch to seqkit for check_own and concat_contamination
matthuska Dec 11, 2023
4228c4b
bump seqkit container
Dec 11, 2023
26bd1d0
Update fastq_from_bam to publish illumina data properly. Regexes are …
matthuska Dec 13, 2023
3 changes: 2 additions & 1 deletion .gitignore
@@ -9,4 +9,5 @@ work/
results/
centrifuge-cloud/
conda/
.vscode/
singularity/
.vscode/
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -29,6 +29,10 @@

> Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

- [SeqKit](https://pubmed.ncbi.nlm.nih.gov/27706213/)

> Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. PMID: 27706213; PMCID: PMC5051824.

- [QUAST](https://pubmed.ncbi.nlm.nih.gov/23422339/)

> Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806.
55 changes: 49 additions & 6 deletions README.md
100755 → 100644
@@ -15,7 +15,7 @@ Technologies ([DNA CS (DCS)](https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5

## What this workflow does for you

With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formated sequence date. The results are the clean sequences and the sequences identified as contaminated.
With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formated sequence data. The results are the clean sequences and the sequences identified as contaminated.
Per default [minimap2](https://github.com/lh3/minimap2) is used for aligning your sequences to reference sequences but I recommend using `bbduk`, part of [BBTools](https://github.com/BioInfoTools/BBMap), to clean short-read data (_--bbduk_).

You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped against the specified host, control and user defined FASTA files. All reads that map are considered as contamination. In case of Illumina paired-end reads, both mates need to be aligned.
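
The paired-end rule can be illustrated on plain SAM text. This is a minimal sketch, not the pipeline's own implementation: it assumes standard tab-separated SAM records whose second column is the FLAG field, and the function name `both_mates_mapped` is hypothetical. It keeps only records where neither the "read unmapped" (0x4) nor the "mate unmapped" (0x8) bit is set.

```bash
# Hypothetical helper, not part of CLEAN: read SAM records on stdin and print
# only those where both the read and its mate are mapped.
both_mates_mapped() {
  awk -F'\t' '
    /^@/ { next }                        # skip SAM header lines
    {
      flag = $2
      read_unmapped = int(flag / 4) % 2  # bit 0x4: this read is unmapped
      mate_unmapped = int(flag / 8) % 2  # bit 0x8: the mate is unmapped
      if (!read_unmapped && !mate_unmapped) print
    }'
}
```

If samtools is available, `samtools view -F 12` applies the same flag filter directly to a SAM or BAM file.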
@@ -35,7 +35,7 @@ We saw many soft-clipped reads after the mapping, that probably aren't contamina

### Dependencies management

- [Conda](https://docs.conda.io/en/latest/miniconda.html)
- [Conda](https://docs.conda.io/en/latest/miniconda.html)

and/or

@@ -57,7 +57,7 @@ Get help:
nextflow run hoelzer/clean --help
```

Clean Nanopore data by filtering against a combined reference of the _E. coli_ genome and the Nanopore DNA CS spike-in.
Clean Nanopore data by filtering against a combined reference of the _E. coli_ genome and the Nanopore DNA CS spike-in.

```bash
# uses Docker per default
@@ -77,7 +77,7 @@ nextflow run hoelzer/clean --input_type illumina --input '/home/martin/.nextflow
--own ~/.nextflow/assets/hoelzer/clean/test/ref.fasta.gz --bbduk
```

Clean some Illumina, Nanopore, and assembly files against the mouse and phiX genomes.
Clean some Illumina, Nanopore, and assembly files against the mouse and phiX genomes.

## Supported species and control sequences

@@ -108,14 +108,57 @@ Included in this repository are:

<sub><sub>The icons and diagram components that make up the schematic view were originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).</sub></sub>

## Results

Running the pipeline will create a directory called `results/` in the current directory with some or all of the following directories and files (plus additional files, such as indices):

```text
results/
├── clean/
│   └── <sample_name>.fastq.gz
├── removed/
│   └── <sample_name>.fastq.gz
├── intermediate/
│   ├── map-to-remove/
│   │   ├── <sample_name>.mapped.fastq.gz
│   │   ├── <sample_name>.unmapped.fastq.gz
│   │   ├── <sample_name>.mapped.bam
│   │   ├── <sample_name>.unmapped.bam
│   │   ├── strict-dcs/
│   │   │   ├── <sample_name>.no-dcs.bam
│   │   │   ├── <sample_name>.true-dcs.bam
│   │   │   └── <sample_name>.false-dcs.bam
│   │   └── soft-clipped/
│   │       ├── <sample_name>.soft-clipped.bam
│   │       └── <sample_name>.passed-clipped.bam
│   └── map-to-keep/
│       ├── <sample_name>.mapped.fastq.gz
│       ├── <sample_name>.unmapped.fastq.gz
│       ├── <sample_name>.mapped.bam
│       ├── <sample_name>.unmapped.bam
│       ├── strict-dcs/
│       │   ├── <sample_name>.no-dcs.bam
│       │   ├── <sample_name>.true-dcs.bam
│       │   └── <sample_name>.false-dcs.bam
│       └── soft-clipped/
│           ├── <sample_name>.soft-clipped.bam
│           └── <sample_name>.passed-clipped.bam
├── logs/*.html
└── qc/multiqc_report.html
```

The files you are most likely interested in are `results/clean/<sample_name>.fastq.gz`, which contain the "cleaned" reads. These are the input reads that *do not* map to the host, control, own FASTA or rRNA files (or the subset of these that you provided), plus those reads that map to the "keep" sequence if you used the `--keep` option. Any reads that were removed from your input file are placed in `results/removed/<sample_name>.fastq.gz`.

For debugging purposes we also provide various intermediate results in the `intermediate/` folder.
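
To sanity-check a run, the number of reads in `results/clean/` and `results/removed/` can be compared against the input. The following is a minimal sketch, assuming gzip-compressed FASTQ output with exactly four lines per record (no wrapped sequences); `count_fastq_reads` is a hypothetical helper and not part of the pipeline.

```bash
# Hypothetical helper, not part of CLEAN: print the number of reads in a
# gzip-compressed FASTQ file, assuming four lines per record.
count_fastq_reads() {
  local lines
  lines=$(gzip -dc "$1" | wc -l)  # decompress to stdout and count lines
  echo $(( lines / 4 ))           # four lines per FASTQ record
}

# Example usage ("sample" is a placeholder for <sample_name>):
# for f in results/clean/sample.fastq.gz results/removed/sample.fastq.gz; do
#   printf '%s\t%s\n' "$f" "$(count_fastq_reads "$f")"
# done
```

For single-end input, the two counts should add up to the number of reads in the input file if every read ends up in exactly one of the two directories.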

## Citations

If you use `CLEAN` in your work, please consider citing our preprint:

> Targeted decontamination of sequencing data with CLEAN
>
> Marie Lataretu, Sebastian Krautwurst, Adrian Viehweger, Christian Brandt, Martin Hölzer
>
> bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089
> bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089

Additionally, an extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.