Merge pull request #67 from hoelzer/feature/reorganize-results

Reorganize the results directory
rki-mf1 · Dec 14, 2023 · c22c7c1
2 parents e87a382 + 26bd1d0
commit c22c7c1

Showing 15 changed files with 401 additions and 325 deletions.
diff --git a/.gitignore b/.gitignore
@@ -9,4 +9,5 @@ work/
 results/
 centrifuge-cloud/
 conda/
-.vscode/
+singularity/
+.vscode/
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -29,6 +29,10 @@
 
   > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
 
+- [SeqKit](https://pubmed.ncbi.nlm.nih.gov/27706213/)
+
+  > Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. PMID: 27706213; PMCID: PMC5051824.
+
 - [QUAST](https://pubmed.ncbi.nlm.nih.gov/23422339/)
 
   > Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806.

diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ Technologies ([DNA CS (DCS)](https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5
 
 ## What this workflow does for you
 
-With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formated sequence date. The results are the clean sequences and the sequences identified as contaminated.
+With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formatted sequence data. The results are the clean sequences and the sequences identified as contaminated.
 Per default [minimap2](https://github.com/lh3/minimap2) is used for aligning your sequences to reference sequences but I recommend using `bbduk`, part of [BBTools](https://github.com/BioInfoTools/BBMap), to clean short-read data (_--bbduk_).
 
 You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped against the specified host, control and user defined FASTA files. All reads that map are considered as contamination. In case of Illumina paired-end reads, both mates need to be aligned.
@@ -35,7 +35,7 @@ We saw many soft-clipped reads after the mapping, that probably aren't contamina
 
 ### Dependencies management
 
-- [Conda](https://docs.conda.io/en/latest/miniconda.html) 
+- [Conda](https://docs.conda.io/en/latest/miniconda.html)
 
 and/or
 
@@ -57,7 +57,7 @@ Get help:
 nextflow run hoelzer/clean --help
 ```
 
-Clean Nanopore data by filtering against a combined reference of the _E. coli_ genome and the Nanopore DNA CS spike-in.  
+Clean Nanopore data by filtering against a combined reference of the _E. coli_ genome and the Nanopore DNA CS spike-in.
 
 ```bash
 # uses Docker per default
@@ -77,7 +77,7 @@ nextflow run hoelzer/clean --input_type illumina --input '/home/martin/.nextflow
 --own ~/.nextflow/assets/hoelzer/clean/test/ref.fasta.gz --bbduk
 ```
 
-Clean some Illumina, Nanopore, and assembly files against the mouse and phiX genomes.  
+Clean some Illumina, Nanopore, and assembly files against the mouse and phiX genomes.
 
 ## Supported species and control sequences
 
@@ -108,14 +108,57 @@ Included in this repository are:
 
 <sub><sub>The icons and diagram components that make up the schematic view were originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).</sub></sub>
 
+## Results
+
+Running the pipeline will create a directory called `results/` in the current directory with some or all of the following directories and files (plus additional files for indices, ...):
+
+```text
+results/
+├── clean/
+│   └── <sample_name>.fastq.gz
+├── removed/
+│   └── <sample_name>.fastq.gz
+├── intermediate/
+│   ├── map-to-remove/
+│   │   ├── <sample_name>.mapped.fastq.gz
+│   │   ├── <sample_name>.unmapped.fastq.gz
+│   │   ├── <sample_name>.mapped.bam
+│   │   ├── <sample_name>.unmapped.bam
+│   │   ├── strict-dcs/
+│   │   │   ├── <sample_name>.no-dcs.bam
+│   │   │   ├── <sample_name>.true-dcs.bam
+│   │   │   └── <sample_name>.false-dcs.bam
+│   │   └── soft-clipped/
+│   │       ├── <sample_name>.soft-clipped.bam
+│   │       └── <sample_name>.passed-clipped.bam
+│   └── map-to-keep/
+│       ├── <sample_name>.mapped.fastq.gz
+│       ├── <sample_name>.unmapped.fastq.gz
+│       ├── <sample_name>.mapped.bam
+│       ├── <sample_name>.unmapped.bam
+│       ├── strict-dcs/
+│       │   ├── <sample_name>.no-dcs.bam
+│       │   ├── <sample_name>.true-dcs.bam
+│       │   └── <sample_name>.false-dcs.bam
+│       └── soft-clipped/
+│           ├── <sample_name>.soft-clipped.bam
+│           └── <sample_name>.passed-clipped.bam
+├── logs/*.html
+└── qc/multiqc_report.html
+```
+
+The files you are most likely interested in are `results/clean/<sample_name>.fastq.gz`, which contain the "cleaned" reads. These are the input reads that *do not* map to the host, control, own FASTA or rRNA files (or the subset of these that you provided), plus those reads that map to the "keep" sequence if you used the `--keep` option. Any reads that were removed from your input file are placed in `results/removed/<sample_name>.fastq.gz`.
+
+For debugging purposes we also provide various intermediate results in the `intermediate/` folder.
+
 ## Citations
 
 If you use `CLEAN` in your work, please consider citing our preprint:
- 
+
 > Targeted decontamination of sequencing data with CLEAN
 >
 > Marie Lataretu, Sebastian Krautwurst, Adrian Viehweger, Christian Brandt, Martin Hölzer
 >
-> bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089 
+> bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089
 
 Additionally, an extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
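
The reorganized layout described in the new `## Results` section lends itself to a quick sanity check after a run. The sketch below is only an illustration, not part of CLEAN: `summarize_results` is a hypothetical helper, and it assumes the documented `clean/<sample_name>.fastq.gz` and `removed/<sample_name>.fastq.gz` naming holds for your input type (paired-end runs may name files differently).

```bash
#!/usr/bin/env bash
# Sketch: report, per sample, whether both a cleaned and a removed read file exist.
# Assumes the layout documented in the README's "Results" section:
#   <results>/clean/<sample_name>.fastq.gz
#   <results>/removed/<sample_name>.fastq.gz
# summarize_results is a hypothetical helper name, not something CLEAN provides.
summarize_results() {
  local results_dir="${1:-results}"
  local clean_file sample removed_file
  for clean_file in "$results_dir"/clean/*.fastq.gz; do
    # Skip the unexpanded glob pattern when no cleaned files are present.
    [ -e "$clean_file" ] || continue
    sample="$(basename "$clean_file" .fastq.gz)"
    removed_file="$results_dir/removed/$sample.fastq.gz"
    if [ -e "$removed_file" ]; then
      printf '%s\tclean+removed\n' "$sample"
    else
      printf '%s\tclean-only\n' "$sample"
    fi
  done
}
```

After sourcing the function, `summarize_results results` prints one tab-separated line per sample found under `results/clean/`. A missing or empty directory simply produces no output.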