diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
index 13b406be4..97f223d86 100644
--- a/.github/CONTRIBUTING.md
+++ b/.github/CONTRIBUTING.md
@@ -18,8 +18,8 @@ If you'd like to write some code for nf-core/eager, the standard workflow is as
 1. Check that there isn't already an issue about your idea in the [nf-core/eager issues](https://github.com/nf-core/eager/issues) to avoid duplicating work
     * If there isn't one already, please create one so that others know you're working on this
 2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [nf-core/eager repository](https://github.com/nf-core/eager) to your GitHub account
-3. Make the necessary changes / additions within your forked repository (following [code contribution guidelines](https://github.com/nf-core/eager/blob/dev/.github/CONTRIBUTING.md))
-4. Use `nf-core schema build .` and add any new parameters to the pipeline JSON schema (requires nf-core tools >= 1.10).
+3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions)
+4. Use `nf-core schema build .` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10).
 5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged

 If you're not used to this workflow with git, you can start with some [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests) or even their [excellent `git` resources](https://try.github.io/).
@@ -31,14 +31,14 @@ Typically, pull-requests are only fully reviewed when these tests are passing, t

 There are typically two types of tests that run:

-### Lint Tests
+### Lint tests

 `nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to.
 To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint ` command.

 If any failures or warnings are encountered, please follow the listed URL for more documentation.

-### Pipeline Tests
+### Pipeline tests

 Each `nf-core` pipeline should be set up with a minimal set of test-data.
 `GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully.
@@ -57,19 +57,19 @@ These tests are run both with the latest available version of `Nextflow` and als

 For further information/help, please consult the [nf-core/eager documentation](https://nf-co.re/eager/usage) and don't hesitate to get in touch on the nf-core Slack [#eager](https://nfcore.slack.com/channels/eager) channel ([join our Slack here](https://nf-co.re/join/slack)).

-# Code Contribution Guidelines
+## Pipeline contribution conventions

-To make the EAGER2 code and processing logic more understandable for new contributors, and to ensure quality. We are making an attempt to somewhat-standardise the way the code is written.
+To make the nf-core/eager code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written.

-If you wish to contribute a new module, please use the following coding standards.
+### Adding a new step

-The typical workflow for adding a new module is as follows:
+If you wish to contribute a new step, please use the following coding standards:

-1. Define the corresponding input channel into your new process from the expected previous process channel (or re-routing block, see below).
+1. Define the corresponding input channel into your new process from the expected previous process channel
 2. Write the process block (see below).
 3. Define the output channel if needed (see below).
 4. Add any new flags/options to `nextflow.config` with a default (see below).
-5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build .`)
+5. Add any new flags/options to `nextflow_schema.json` **with help text** (with `nf-core schema build .`)
 6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter).
 7. Add sanity checks for all relevant parameters.
 8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`.
@@ -77,16 +77,60 @@ The typical workflow for adding a new module is as follows:
 10. Add a new test command in `.github/workflow/ci.yaml`.
 11. If applicable add a [MultiQC](https://https://multiqc.info/) module.
 12. Update MultiQC config `assets/multiqc_config.yaml` so relevant suffixes, name clean up, General Statistics Table column order, and module figures are in the right order.
-13. Add new flags/options to 'usage' documentation under `docs/usage.md`.
-14. Add any descriptions of MultiQC report sections and output files to `docs/output.md`.
+13. Optional: Add any descriptions of MultiQC report sections and output files to `docs/output.md`.

-## Default Values
+### Default values

-Default values should go in `nextflow.config` under the `params` scope, and `nextflow_schema.json` (latter with `nf-core schema build .`)
+Parameters should be initialised / defined with default values in `nextflow.config` under the `params` scope.

-## Default resource processes
+Once there, use `nf-core schema build .` to add to `nextflow_schema.json`.

-Defining recommended 'minimum' resource requirements (CPUs/Memory) for a process should be defined in `conf/base.config`. This can be utilised within the process using `${task.cpu}` or `${task.memory}` variables in the `script:` block.
+### Default processes resource requirements
+
+Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.
+
+:warning: Note that in nf-core/eager we currently have our own custom process labels, so please check `base.config`!
+
+The process resources can be passed on to the tool dynamically within the process with the `${task.cpu}` and `${task.memory}` variables in the `script:` block.
+
+### Naming schemes
+
+Please use the following naming schemes, to make it easy to understand what is going where.
+
+* initial process channel: `ch_output_from_`
+* intermediate and terminal channels: `ch__for_`
+* skipped process output: `ch__for_`(this goes out of the bypass statement described above)
+
+### Nextflow version bumping
+
+If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`
+
+### Software version reporting
+
+If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process.
+
+Add to the script block of the process, something like the following:
+
+```bash
+ --version &> v_.txt 2>&1 || true
+```
+
+or
+
+```bash
+ --help | head -n 1 &> v_.txt 2>&1 || true
+```
+
+You then need to edit the script `bin/scrape_software_versions.py` to:
+
+1. Add a Python regex for your tool's `--version` output (as in stored in the `v_.txt` file), to ensure the version is reported as a `v` and the version number e.g. `v2.1.1`
+2. Add a HTML entry to the `OrderedDict` for formatting in MultiQC.
+
+### Images and figures
+
+For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).
+
+For all internal nf-core/eager documentation images we are using the 'Kalam' font by the Indian Type Foundry and licensed under the Open Font License. It can be found for download here [here](https://fonts.google.com/specimen/Kalam).

 ## Process Concept

@@ -164,44 +208,3 @@ if (params.run_fastp) {
 }
 ```
-
-## Naming Schemes
-
-Please use the following naming schemes, to make it easy to understand what is going where.
-
-* process output: `ch_output_from_`(this should always go into the bypass statement described above).
-* skipped process output: `ch__for_`(this goes out of the bypass statement described above)
-* process inputs: `ch__for_` (this goes into a process)
-
-## Nextflow Version Bumping
-
-If you have agreement from reviewers, you may bump the 'default' minimum version of nextflow (e.g. for testing), with `nf-core bump-version`.
-
-## Software Version Reporting
-
-If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process.
-
-Add to the script block of the process, something like the following:
-
-```bash
- --version &> v_.txt 2>&1 || true
-```
-
-or
-
-```bash
- --help | head -n 1 &> v_.txt 2>&1 || true
-```
-
-You then need to edit the script `bin/scrape_software_versions.py` to
-
-1. add a (python) regex for your tools --version output (as in stored in the `v_.txt` file), to ensure the version is reported as a `v` and the version number e.g. `v2.1.1`
-2. add a HTML block entry to the `OrderedDict` for formatting in MultiQC.
-
-> If a tool does not unfortunately offer any printing of version data, you may add this 'manually' e.g. with `echo "v1.1" > v_.txt`
-
-## Images and Figures
-
-For all internal nf-core/eager documentation images we are using the 'Kalam' font by the Indian Type Foundry and licensed under the Open Font License. It can be found for download here [here](https://fonts.google.com/specimen/Kalam).
-
-For the overview image we follow the nf-core [style guidelines](https://nf-co.re/developers/design_guidelines).
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
index afbcc7c7f..f00ef2e57 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -15,12 +15,11 @@ Please delete this text and anything that's not relevant from the template below

 ## Check Documentation

-Have you checked in the following places for your error?:
+I have checked the following places for your error:

-- [ ] [Frequently Asked Questions](https://github.com/nf-core/eager/blob/master/docs/usage.md#troubleshooting-and-faqs)
-  (for nf-core/eager specific information)
-- [ ] [Troubleshooting](https://nf-co.re/usage/troubleshooting)
-  (for nf-core specific information)
+- [ ] [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting)
+- [ ] [nf-core/eager pipeline documentation](https://nf-co.re/nf-core/eager/usage)
+  - nf-core/eager FAQ/troubleshooting can be found [here](https://nf-co.re/eager/usage#troubleshooting-and-faqs)

 ## Description of the bug

@@ -39,9 +38,11 @@ Steps to reproduce the behaviour:

 ## Log files

-1. Command line:
-2. The `.nextflow.log` file (which is a hidden file in whichever place you _ran_ the pipeline from - not necessarily in the output directory!)
-3. See error:
+Have you provided the following extra information/files:
+
+- [ ] The command used to run the pipeline
+- [ ] The `.nextflow.log` file
+- [ ] The exact error:

 ## System

diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 0fa8432a3..57a13ac3e 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -13,13 +13,14 @@ Learn more about contributing: [CONTRIBUTING.md](https://github.com/nf-core/eage

 ## PR checklist

-- [ ] This comment contains a description of changes (with reason)
+- [ ] This comment contains a description of changes (with reason).
 - [ ] If you've fixed a bug or added code that should be tested, add tests!
-- [ ] If necessary, also make a PR on the [nf-core/eager branch on the nf-core/test-datasets repo](https://github.com/nf-core/test-datasets/pull/new/nf-core/eager)
-- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --paired_end`).
-- [ ] Make sure your code lints ([`nf-core lint .`](https://nf-co.re/tools)).
-- [ ] Documentation in `docs` is updated
-- [ ] `CHANGELOG.md` is updated
-- [ ] `README.md` is updated
-
-**Learn more about contributing:** [CONTRIBUTING.md](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md)
+  - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py`
+  - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md)
+  - [ ] If necessary, also make a PR on the nf-core/eager _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository.
+- [ ] Make sure your code lints (`nf-core lint .`).
+- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`).
+- [ ] Usage Documentation in `docs/usage.md` is updated.
+- [ ] Output Documentation in `docs/output.md` is updated.
+- [ ] `CHANGELOG.md` is updated.
+- [ ] `README.md` is updated (including new tool citations and authors/contributors).
diff --git a/.github/markdownlint.yml b/.github/markdownlint.yml
index 0967bbbb8..8d7eb53b0 100644
--- a/.github/markdownlint.yml
+++ b/.github/markdownlint.yml
@@ -1,9 +1,9 @@
 # Markdownlint configuration file
-default: true,
+default: true
 line-length: false
 no-duplicate-header:
   siblings_only: true
-no-inline-html:
+no-inline-html:
   allowed_elements:
     - img
     - p
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 08e7bdac8..39aba1bc7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,19 +3,21 @@
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

-## [2.2.2] - 2020-12-03
+## [2.2.2] - 2020-12-09

 ### `Added`

-- Added large scale 'stress-test' profile for AWS (using de Barros Damgaard et al. 2018's 137 ancient human genomes)
+- Added large scale 'stress-test' profile for AWS (using de Barros Damgaard et al. 2018's 137 ancient human genomes).
+  - This will now be run automatically for every release. All processed data will be available on the nf-core website:
+  - You can run this yourself using `-profile test_full`

 ### `Fixed`

 - Fixed AWS full test profile.
 - [#587](https://github.com/nf-core/eager/issues/587) - Re-implemented AdapterRemovalFixPrefix for DeDup compatibility of including singletons
-- [#602](https://github.com/nf-core/eager/issues/602) - Added the newly avaliable GATK 3.5 conda package.
+- [#602](https://github.com/nf-core/eager/issues/602) - Added the newly available GATK 3.5 conda package.
 - [#610](https://github.com/nf-core/eager/issues/610) - Create bwa_index channel when specifying circularmapper as mapper
-- Updated template to nf-core/tools 1.12
+- Updated template to nf-core/tools 1.12.1
 - General documentation improvements

 ### `Deprecated`
diff --git a/Dockerfile b/Dockerfile
index e51515584..b9d2d771d 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,4 +1,4 @@
-FROM nfcore/base:1.12
+FROM nfcore/base:1.12.1
 LABEL authors="The nf-core/eager community" \
       description="Docker image containing all software requirements for the nf-core/eager pipeline"

diff --git a/README.md b/README.md
index 0565bdd62..0a9675aed 100644
--- a/README.md
+++ b/README.md
@@ -16,9 +16,10 @@

 ## Introduction

-**nf-core/eager** is a bioinformatics best-practice analysis pipeline for NGS sequencing based ancient DNA (aDNA) data analysis.
+
+**nf-core/eager** is a bioinformatics best-practise analysis pipeline for NGS sequencing based ancient DNA (aDNA) data analysis.

-The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. The pipeline pre-processes raw data from FASTQ inputs, or preprocessed BAM inputs. It can align reads and performs extensive general NGS and aDNA specific quality-control on the results. It comes with docker, singularity or conda containers making installation trivial and results highly reproducible.
+The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible. The pipeline pre-processes raw data from FASTQ inputs, or preprocessed BAM inputs. It can align reads and performs extensive general NGS and aDNA specific quality-control on the results. It comes with docker, singularity or conda containers making installation trivial and results highly reproducible.

 [image: nf-core/eager schematic workflow]
+[image: nf-core/eager metro map]
+
 ## Quick Start

 1. Install [`nextflow`](https://nf-co.re/usage/installation) (version >= 20.04.0)
@@ -113,6 +122,15 @@ See [usage docs](https://nf-co.re/eager/docs/usage.md) for all of the available

 Modifications to the default pipeline are easily made using various options as described in the documentation.

+## Pipeline Summary
+
+By default, the pipeline currently performs the following:
+
+
+
+* Sequencing quality control (`FastQC`)
+* Overall pipeline run summaries (`MultiQC`)
+
 ## Documentation

 The nf-core/eager pipeline comes with documentation about the pipeline: [usage](https://nf-co.re/eager/usage) and [output](https://nf-co.re/eager/output).
@@ -130,9 +148,8 @@ The nf-core/eager pipeline comes with documentation about the pipeline: [usage](

 This pipeline was mostly written by Alexander Peltzer ([apeltzer](https://github.com/apeltzer)) and [James A. Fellows Yates](https://github.com/jfy133), with contributions from [Stephen Clayton](https://github.com/sc13-bioinf), [Thiseas C. Lamnidis](https://github.com/TCLamnidis), [Maxime Borry](https://github.com/maxibor), [Zandra Fagernäs](https://github.com/ZandraFagernas), [Aida Andrades Valtueña](https://github.com/aidaanva) and [Maxime Garcia](https://github.com/MaxUlysse) and the nf-core community.

-If you would like to contribute to this pipeline, please open an issue (or even better, a pull request - please see the [contributing guidelines](.github/CONTRIBUTING.md), and ask to be added to the project - everyone is welcome to contribute here!.
-
-For further information or help, don't hesitate to get in touch on the [Slack `#eager` channel](https://nfcore.slack.com/channels/eager) (you can join with [this invite](https://nf-co.re/join/slack)).
+We thank the following people for their extensive assistance in the development
+of this pipeline:

 ## Authors (alphabetical)

@@ -166,7 +183,29 @@ Those who have provided conceptual guidance, suggestions, bug reports etc.

 If you've contributed and you're missing in here, please let us know and we will add you in of course!

-## Tool References
+## Contributions and Support
+
+If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
+
+For further information or help, don't hesitate to get in touch on the [Slack `#eager` channel](https://nfcore.slack.com/channels/eager) (you can join with [this invite](https://nf-co.re/join/slack)).
+
+## Citations
+
+If you use `nf-core/eager` for your analysis, please cite the `eager` preprint as follows:
+> James A. Fellows Yates, Thiseas Christos Lamnidis, Maxime Borry, Aida Andrades Valtueña, Zandra Fagneräs, Stephen Clayton, Maxime U. Garcia, Judith Neukamm, Alexander Peltzer **Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager** bioRxiv 2020.06.11.145615; [doi: https://doi.org/10.1101/2020.06.11.145615](https://doi.org/10.1101/2020.06.11.145615)
+
+You can cite the eager zenodo record for a specific version using the following [doi: 10.5281/zenodo.3698082](https://zenodo.org/badge/latestdoi/135918251)
+
+You can cite the `nf-core` publication as follows:
+
+> **The nf-core framework for community-curated bioinformatics pipelines.**
+>
+> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
+>
+> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).
+> ReadCube: [Full Access Link](https://rdcu.be/b1GjZ)
+
+In addition, references of tools and data used in this pipeline are as follows:

 * **EAGER v1**, CircularMapper, DeDup* Peltzer, A., Jäger, G., Herbig, A., Seitz, A., Kniep, C., Krause, J., & Nieselt, K. (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17(1), 1–14. [https://doi.org/10.1186/s13059-016-0918-z](https://doi.org/10.1186/s13059-016-0918-z). Download: [https://github.com/apeltzer/EAGER-GUI](https://github.com/apeltzer/EAGER-GUI) and [https://github.com/apeltzer/EAGER-CLI](https://github.com/apeltzer/EAGER-CLI)
 * **FastQC** Download: [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
@@ -205,19 +244,3 @@ This repository uses test data from the following studies:
 * Fellows Yates, J. A. et al. (2017) ‘Central European Woolly Mammoth Population Dynamics: Insights from Late Pleistocene Mitochondrial Genomes’, Scientific reports, 7(1), p. 17714. [doi: 10.1038/s41598-017-17723-1](https://doi.org/10.1038/s41598-017-17723-1).
 * Gamba, C. et al. (2014) ‘Genome flux and stasis in a five millennium transect of European prehistory’, Nature communications, 5, p. 5257. [doi: 10.1038/ncomms6257](https://doi.org/10.1038/ncomms6257).
 * Star, B. et al. (2017) ‘Ancient DNA reveals the Arctic origin of Viking Age cod from Haithabu, Germany’, Proceedings of the National Academy of Sciences of the United States of America, 114(34), pp. 9152–9157. [doi: 10.1073/pnas.1710186114](https://doi.org/10.1073/pnas.1710186114).
-
-## Citation
-
-If you use `nf-core/eager` for your analysis, please cite the `eager` preprint as follows:
-> James A. Fellows Yates, Thiseas Christos Lamnidis, Maxime Borry, Aida Andrades Valtueña, Zandra Fagneräs, Stephen Clayton, Maxime U. Garcia, Judith Neukamm, Alexander Peltzer **Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager** bioRxiv 2020.06.11.145615; [doi: https://doi.org/10.1101/2020.06.11.145615](https://doi.org/10.1101/2020.06.11.145615)
-
-You can cite the eager zenodo record for a specific version using the following [doi: 10.5281/zenodo.3698082](https://zenodo.org/badge/latestdoi/135918251)
-
-You can cite the `nf-core` publication as follows:
-
-> **The nf-core framework for community-curated bioinformatics pipelines.**
->
-> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
->
-> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).
-> ReadCube: [Full Access Link](https://rdcu.be/b1GjZ)
diff --git a/assets/nf-core-eager_logo.png b/assets/nf-core-eager_logo.png
index f464b6b3a..4d301d806 100644
Binary files a/assets/nf-core-eager_logo.png and b/assets/nf-core-eager_logo.png differ
diff --git a/conf/igenomes.config b/conf/igenomes.config
index caeafceb2..31b7ee613 100644
--- a/conf/igenomes.config
+++ b/conf/igenomes.config
@@ -21,7 +21,7 @@ params {
       readme = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt"
       mito_name = "MT"
       macs_gsize = "2.7e9"
-      blacklist = "${baseDir}/assets/blacklists/GRCh37-blacklist.bed"
+      blacklist = "${projectDir}/assets/blacklists/GRCh37-blacklist.bed"
     }
     'GRCh38' {
       fasta = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
@@ -33,7 +33,7 @@ params {
       bed12 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed"
       mito_name = "chrM"
       macs_gsize = "2.7e9"
-      blacklist = "${baseDir}/assets/blacklists/hg38-blacklist.bed"
+      blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed"
     }
     'GRCm38' {
       fasta = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/genome.fa"
@@ -46,7 +46,7 @@ params {
       readme = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/README.txt"
       mito_name = "MT"
       macs_gsize = "1.87e9"
-      blacklist = "${baseDir}/assets/blacklists/GRCm38-blacklist.bed"
+      blacklist = "${projectDir}/assets/blacklists/GRCm38-blacklist.bed"
     }
     'TAIR10' {
       fasta = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa"
@@ -270,7 +270,7 @@ params {
       bed12 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed"
       mito_name = "chrM"
       macs_gsize = "2.7e9"
-      blacklist = "${baseDir}/assets/blacklists/hg38-blacklist.bed"
+      blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed"
     }
     'hg19' {
       fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
@@ -283,7 +283,7 @@ params {
       readme = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/README.txt"
       mito_name = "chrM"
       macs_gsize = "2.7e9"
-      blacklist = "${baseDir}/assets/blacklists/hg19-blacklist.bed"
+      blacklist = "${projectDir}/assets/blacklists/hg19-blacklist.bed"
     }
     'mm10' {
       fasta = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa"
@@ -296,7 +296,7 @@ params {
       readme = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/README.txt"
       mito_name = "chrM"
       macs_gsize = "1.87e9"
-      blacklist = "${baseDir}/assets/blacklists/mm10-blacklist.bed"
+      blacklist = "${projectDir}/assets/blacklists/mm10-blacklist.bed"
     }
     'bosTau8' {
       fasta = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/WholeGenomeFasta/genome.fa"
diff --git a/docs/images/nf-core-eager_logo.png b/docs/images/nf-core-eager_logo.png
index 744089106..4d301d806 100644
Binary files a/docs/images/nf-core-eager_logo.png and b/docs/images/nf-core-eager_logo.png differ
diff --git a/docs/output.md b/docs/output.md
index 8c9240be3..7c316e7db 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -4,82 +4,6 @@

 > _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._

-## Table of contents
-
-
-
-- [nf-core/eager: Output](#nf-coreeager-output)
-  - [:warning: Please read this documentation on the nf-core website: https://nf-co.re/eager/output](#warning-please-read-this-documentation-on-the-nf-core-website-httpsnf-coreeageroutput)
-  - [Table of contents](#table-of-contents)
-  - [Introduction](#introduction)
-  - [Directory Structure](#directory-structure)
-  - [Primary Output Directories](#primary-output-directories)
-  - [Secondary Output Directories](#secondary-output-directories)
-  - [MultiQC Report](#multiqc-report)
-  - [General Stats Table](#general-stats-table)
-  - [Background](#background)
-  - [Table](#table)
-  - [FastQC](#fastqc)
-  - [Background](#background-1)
-  - [Sequence Counts](#sequence-counts)
-  - [Sequence Quality Histograms](#sequence-quality-histograms)
-  - [Per Sequence Quality Scores](#per-sequence-quality-scores)
-  - [Per Base Sequencing Content](#per-base-sequencing-content)
-  - [Per Sequence GC Content](#per-sequence-gc-content)
-  - [Per Base N Content](#per-base-n-content)
-  - [Sequence Duplication Levels](#sequence-duplication-levels)
-  - [Overrepresented sequences](#overrepresented-sequences)
-  - [Adapter Content](#adapter-content)
-  - [FastP](#fastp)
-  - [Background](#background-2)
-  - [GC Content](#gc-content)
-  - [AdapterRemoval](#adapterremoval)
-  - [Background](#background-3)
-  - [Retained and Discarded Reads Plot](#retained-and-discarded-reads-plot)
-  - [Length Distribution Plot](#length-distribution-plot)
-  - [Bowtie2](#bowtie2)
-  - [Background](#background-4)
-  - [Single/Paired-end alignments](#singlepaired-end-alignments)
-  - [MALT](#malt)
-  - [Background](#background-5)
-  - [Metagenomic Mappability](#metagenomic-mappability)
-  - [Taxonomic assignment success](#taxonomic-assignment-success)
-  - [Kraken](#kraken)
-  - [Background](#background-6)
-  - [Top Taxa](#top-taxa)
-  - [Samtools](#samtools)
-  - [Background](#background-7)
-  - [Flagstat Plot](#flagstat-plot)
-  - [DeDup](#dedup)
-  - [Background](#background-8)
-  - [DeDup Plot](#dedup-plot)
-  - [Picard](#picard)
-  - [Background](#background-9)
-  - [Mark Duplicates](#mark-duplicates)
-  - [Preseq](#preseq)
-  - [Background](#background-10)
-  - [Complexity Curve](#complexity-curve)
-  - [DamageProfiler](#damageprofiler)
-  - [Background](#background-11)
-  - [Misincorporation Plots](#misincorporation-plots)
-  - [Length Distribution](#length-distribution)
-  - [QualiMap](#qualimap)
-  - [Background](#background-12)
-  - [Coverage Histogram](#coverage-histogram)
-  - [Cumulative Genome Coverage](#cumulative-genome-coverage)
-  - [GC Content Distribution](#gc-content-distribution)
-  - [Sex.DetERRmine](#sexdeterrmine)
-  - [Background](#background-13)
-  - [Relative Coverage](#relative-coverage)
-  - [Read Counts](#read-counts)
-  - [MultiVCFAnalyzer](#multivcfanalyzer)
-  - [Background](#background-14)
-  - [Summary metrics](#summary-metrics)
-  - [Call statistics barplot](#call-statistics-barplot)
-  - [Output Files](#output-files)
-
-
-
 ## Introduction

 The output of nf-core/eager primarily consists of the following main components: output alignment files (e.g. VCF, BAM or FASTQ files), and summary statistics of the whole run presented in a [`MultiQC`](https://multiqc.info) report. Intermediate files and module-specific statistics files are also retained depending on your particular run configuration.
diff --git a/docs/usage.md b/docs/usage.md
index 78c55cddf..b5094f606 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -4,68 +4,6 @@

 > _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._

-## Table of contents
-
-
-
-- [:warning: Please read this documentation on the nf-core website: https://nf-co.re/eager/usage](#warning-please-read-this-documentation-on-the-nf-core-website-httpsnf-coreeagerusage)
-- [Table of contents](#table-of-contents)
-- [Introduction](#introduction)
-- [Running the pipeline](#running-the-pipeline)
-  - [Quick Start](#quick-start)
-  - [Updating the pipeline](#updating-the-pipeline)
-  - [Reproducibility](#reproducibility)
-  - [Automatic Resubmission](#automatic-resubmission)
-- [Core Nextflow arguments](#core-nextflow-arguments)
-  - [-profile](#-profile)
-  - [-resume](#-resume)
-  - [-c](#-c)
-  - [Running in the background](#running-in-the-background)
-- [Pipeline Options](#pipeline-options)
-  - [Input](#input)
-  - [Input Data Additional Options](#input-data-additional-options)
-  - [Reference Genomes](#reference-genomes)
-  - [Output](#output)
-  - [Other run specific parameters](#other-run-specific-parameters)
-  - [Step skipping parameters](#step-skipping-parameters)
-  - [Complexity Filtering Options](#complexity-filtering-options)
-  - [Adapter Clipping and Merging Options](#adapter-clipping-and-merging-options)
-  - [Read Mapping Parameters](#read-mapping-parameters)
-  - [Removal of Host-Mapped Reads](#removal-of-host-mapped-reads)
-  - [Read Filtering and Conversion Parameters](#read-filtering-and-conversion-parameters)
-  - [Read DeDuplication Parameters](#read-deduplication-parameters)
-  - [Library Complexity Estimation Parameters](#library-complexity-estimation-parameters)
-  - [DNA Damage Assessment Parameters](#dna-damage-assessment-parameters)
-  - [Feature Annotation Statistics](#feature-annotation-statistics)
-  - [BAM Trimming Parameters](#bam-trimming-parameters)
-  - [Genotyping Parameters](#genotyping-parameters)
-  - [Consensus Sequence Generation](#consensus-sequence-generation)
-  - [SNP Table Generation](#snp-table-generation)
-  - [Mitochondrial to Nuclear Ratio](#mitochondrial-to-nuclear-ratio)
-  - [Human Sex Determination](#human-sex-determination)
-  - [Human Nuclear Contamination](#human-nuclear-contamination)
-  - [Metagenomic Screening](#metagenomic-screening)
-  - [Metagenomic Authentication](#metagenomic-authentication)
-  - [Clean up](#clean-up)
-- [Troubleshooting and FAQs](#troubleshooting-and-faqs)
-  - [My pipeline update doesn't seem to do anything](#my-pipeline-update-doesnt-seem-to-do-anything)
-  - [Input files not found](#input-files-not-found)
-  - [I am only getting output for a single sample although I specified multiple with wildcards](#i-am-only-getting-output-for-a-single-sample-although-i-specified-multiple-with-wildcards)
-  - [The pipeline crashes almost immediately with an early pipeline step](#the-pipeline-crashes-almost-immediately-with-an-early-pipeline-step)
-  - [The pipeline has crashed with an error but Nextflow is still running](#the-pipeline-has-crashed-with-an-error-but-nextflow-is-still-running)
-  - [I get a exceeded job memory limit error](#i-get-a-exceeded-job-memory-limit-error)
-  - [I get a file name collision error during merging](#i-get-a-file-name-collision-error-during-merging)
-  - [I specified a module and it didn't produce the expected output](#i-specified-a-module-and-it-didnt-produce-the-expected-output)
-  - [I get a unable to acquire lock](#i-get-a-unable-to-acquire-lock)
-- [Tutorials](#tutorials)
-  - [Tutorial - How to investigate a failed run](#tutorial---how-to-investigate-a-failed-run)
-  - [Tutorial - What are Profiles and How To Use Them](#tutorial---what-are-profiles-and-how-to-use-them)
-  - [Tutorial - How to set up nf-core/eager for human population genetics](#tutorial---how-to-set-up-nf-coreeager-for-human-population-genetics)
-  - [Tutorial - How to set up nf-core/eager for metagenomic screening](#tutorial---how-to-set-up-nf-coreeager-for-metagenomic-screening)
-  - [Tutorial - How to set up nf-core/eager for pathogen genomics](#tutorial---how-to-set-up-nf-coreeager-for-pathogen-genomics)
-
-
-
 ## Introduction

 ## Running the pipeline
@@ -305,50 +243,31 @@ We recommend adding the following line to your environment to limit this (typica
 NXF_OPTS='-Xms1g -Xmx4g'
 ```

-## Pipeline Options
-
-### Input
+## Input Specifications

-#### `--input`
+There are two possible ways of supplying input sequencing data to nf-core/eager. The most efficient but more simplistic is supplying direct paths (with wildcards) to your FASTQ or BAM files, with each file or pair being considered a single library and each one run independently. TSV input requires creation of an extra file by the user and extra metadata, but allows more powerful lane and library merging.

-There are two possible ways of supplying input sequencing data to nf-core/eager.
-The most efficient but more simplistic is supplying direct paths (with
-wildcards) to your FASTQ or BAM files, with each file or pair being considered a
-single library and each one run independently. TSV input requires creation of an
-extra file by the user and extra metadata, but allows more powerful lane and
-library merging.
+### Direct Input Method

-##### Direct Input Method
+This method is where you specify with `--input`, the path locations of FASTQ (optionally gzipped) or BAM file(s). This option is mutually exclusive to the [TSV input method](#tsv-input-method), which is used for more complex input configurations such as lane and library merging.

-This method is where you specify with `--input`, the path locations of FASTQ
-(optionally gzipped) or BAM file(s). This option is mutually exclusive to the
-[TSV input method](#tsv-input-method), which is used for more complex input
-configurations such as lane and library merging.
+When using the direct method of `--input` you can specify one or multiple samples in one or more directories. File names **must be unique**, even if in different directories. -When using the direct method of `--input` you can specify one or multiple -samples in one or more directories files. File names **must be unique**, even if -in different directories. +By default, the pipeline _assumes_ you have paired-end data. If you want to run single-end data you must specify [`--single_end`](#single_end). -By default, the pipeline _assumes_ you have paired-end data. If you want to run -single-end data you must specify [`--single_end`]('#single_end') - -For example, for a single set of FASTQs, or multiple paired-end FASTQ -files in one directory, you can specify: +For example, for a single set of FASTQs, or multiple paired-end FASTQ files in one directory, you can specify: ```bash --input 'path/to/data/sample_*_{1,2}.fastq.gz' ``` -If you have multiple files in different directories, you can use additional -wildcards (`*`) e.g.: +If you have multiple files in different directories, you can use additional wildcards (`*`) e.g.: ```bash --input 'path/to/data/*/sample_*_{1,2}.fastq.gz' ``` -> :warning: It is not possible to run a mixture of single-end and paired-end -> files in one run with the paths `--input` method! Please see the [TSV input -> method](#tsv-input-method) for possibilities. +> :warning: It is not possible to run a mixture of single-end and paired-end files in one run with the paths `--input` method! Please see the [TSV input method](#tsv-input-method) for possibilities. **Please note** the following requirements: @@ -357,37 +276,20 @@ wildcards (`*`) e.g.: 3. The path must have at least one `*` wildcard character 4. When using the pipeline with **paired end data**, the path must use `{1,2}` notation to specify read pairs. -5. 
Files names must be unique, having files with the same name, but in different - directories is _not_ sufficient - - This can happen when a library has been sequenced across two sequencers on - the same lane. Either rename the file, try a symlink with a unique name, or - merge the two FASTQ files prior input. -6. Due to limitations of downstream tools (e.g. FastQC), sample IDs may be - truncated after the first `.` in the name, Ensure file names are unique prior - to this! -7. For input BAM files you should provide a small decoy reference genome with - pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory - parameter `--fasta` in order to avoid long computational time for generating - the index files of the reference genome, even if you do not actual need a - reference genome for any downstream analyses. - -##### TSV Input Method - -Alternatively to the [direct input method](#direct-input-method), you can supply -to `--input` a path to a TSV file that contains paths to FASTQ/BAM files and -additional metadata. This allows for more complex procedures such as merging of -sequencing data across lanes, sequencing runs, sequencing configuration types, -and samples. +5. File names must be unique; having files with the same name but in different directories is _not_ sufficient. + - This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior to input. +6. Due to limitations of downstream tools (e.g. FastQC), sample IDs may be truncated after the first `.` in the name. Ensure file names are unique prior to this! +7. For input BAM files you should provide a small decoy reference genome with pre-made indices, e.g.
the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses. + +### TSV Input Method + +Alternatively to the [direct input method](#direct-input-method), you can supply to `--input` a path to a TSV file that contains paths to FASTQ/BAM files and additional metadata. This allows for more complex procedures such as merging of sequencing data across lanes, sequencing runs, sequencing configuration types, and samples.
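As a rough pre-flight check, the formatting rules this section places on the TSV (real tab separators, no spaces before or after cell values, and file names in `R1`, `R2` and `BAM` that are unique regardless of directory) can be tested with a short script before starting a run. The following is a minimal sketch and not part of nf-core/eager: the function name `check_tsv` is hypothetical, and the column names `R1`, `R2` and `BAM` are taken from the column descriptions in this section, so compare them against the TSV template before relying on it.

```python
import csv
import io
import os

# Columns that hold file paths (names assumed from the column descriptions).
PATH_COLUMNS = ("R1", "R2", "BAM")


def check_tsv(text):
    """Return a list of human-readable problems found in the TSV text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in PATH_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        # Usually a sign that cells are separated by spaces rather than tabs.
        return [f"header: missing column(s) {missing}; are cells tab-separated?"]

    problems = []
    first_seen = {}  # file name -> line on which it first appeared
    for row in reader:
        line = reader.line_num
        for column, value in row.items():
            if column is None or value is None:
                problems.append(f"line {line}: wrong number of cells")
                continue
            cleaned = value.strip()
            if value != cleaned:
                problems.append(f"line {line}: spaces around value in {column}")
            if column in PATH_COLUMNS and cleaned not in ("", "NA"):
                # Only the file name counts, not the directory it lives in.
                name = os.path.basename(cleaned)
                if name in first_seen:
                    problems.append(
                        f"line {line}: file name {name} already used "
                        f"on line {first_seen[name]}"
                    )
                else:
                    first_seen[name] = line
    return problems
```

Calling `check_tsv(open("my_input.tsv").read())` (with a hypothetical file name) returns an empty list when none of these problems are found.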

- Schematic diagram indicating merging points of different types of libraries, given a TSV input. Dashed boxes are optional library-specific processes + Schematic diagram indicating merging points of different types of libraries, given a TSV input. Dashed boxes are optional library-specific processes
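The merging points in the diagram can be pictured as three successive groupings of the TSV rows: lanes of the same library and `SeqType` are combined first, then the `SeqType`s of a library, then the libraries of a sample. The sketch below only illustrates that grouping logic under simplifying assumptions and is not the pipeline's implementation: the sample and library names are hypothetical, and the other metadata columns (such as `UDG_Treatment` and `Strandedness`), which this section says must also match, are ignored.

```python
from collections import defaultdict

# Hypothetical rows; only the columns that drive merging are shown.
ROWS = [
    {"Sample_Name": "SampleA", "Library_ID": "LibA", "Lane": "7", "SeqType": "PE"},
    {"Sample_Name": "SampleA", "Library_ID": "LibA", "Lane": "8", "SeqType": "PE"},
    {"Sample_Name": "SampleA", "Library_ID": "LibA", "Lane": "1", "SeqType": "SE"},
    {"Sample_Name": "SampleA", "Library_ID": "LibB", "Lane": "1", "SeqType": "PE"},
    {"Sample_Name": "SampleB", "Library_ID": "LibC", "Lane": "1", "SeqType": "SE"},
]


def group_rows(rows, keys):
    """Group rows by the given columns; values are the lanes in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["Lane"])
    return dict(groups)


# 1. After adapter clipping: lanes of the same library and SeqType.
lane_merge = group_rows(ROWS, ["Sample_Name", "Library_ID", "SeqType"])
# 2. After mapping: different SeqTypes of the same library.
seqtype_merge = group_rows(ROWS, ["Sample_Name", "Library_ID"])
# 3. After duplicate removal: different libraries of the same sample.
library_merge = group_rows(ROWS, ["Sample_Name"])

for label, groups in [
    ("lane merge", lane_merge),
    ("SeqType merge", seqtype_merge),
    ("library merge", library_merge),
]:
    print(label)
    for key, lanes in groups.items():
        print("  ", key, "<-", len(lanes), "input row(s)")
```

Each key in a later grouping is a prefix of the keys in the earlier one, which is why the number of groups can only shrink from one merging point to the next.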

-The use of the TSV `--input` method is recommended when performing -more complex procedures such as lane or library merging. You do not need to -specify `--single_end`, `--bam`, `--colour_chemistry`, `-udg_type` etc. when -using TSV input - this is defined within the TSV file itself. You can only -supply a single TSV per run (i.e. `--input '*.tsv'` will not work). +The use of the TSV `--input` method is recommended when performing more complex procedures such as lane or library merging. You do not need to specify `--single_end`, `--bam`, `--colour_chemistry`, `--udg_type` etc. when using TSV input - this is defined within the TSV file itself. You can only supply a single TSV per run (i.e. `--input '*.tsv'` will not work). This TSV should look like the following: @@ -399,52 +301,23 @@ This TSV should look like the following: A template can be taken from [here](https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/TSV_template.tsv). -> :warning: Cells **must not** contain spaces before or after strings, as this -> will make the TSV unreadable by Nextflow. Strings containing spaces should be -> wrapped in quotes. +> :warning: Cells **must not** contain spaces before or after strings, as this will make the TSV unreadable by Nextflow. Strings containing spaces should be wrapped in quotes. -When using TSV_input, nf-core/eager will merge FASTQ files of libraries with the -same `Library_ID` but different `Lanes` values after adapter clipping (and -merging), assuming all other metadata columns are the same. If you have the same -`Library_ID` but with different `SeqType`, this will be merged directly after -mapping prior BAM filtering. Finally, it will also merge BAM files with the same -`Sample_ID` but different `Library_ID` after duplicate removal, but prior to -genotyping. Please see caveats to this below. 
+When using TSV_input, nf-core/eager will merge FASTQ files of libraries with the same `Library_ID` but different `Lanes` values after adapter clipping (and merging), assuming all other metadata columns are the same. If you have the same `Library_ID` but with different `SeqType`, this will be merged directly after mapping prior BAM filtering. Finally, it will also merge BAM files with the same `Sample_ID` but different `Library_ID` after duplicate removal, but prior to genotyping. Please see caveats to this below. Column descriptions are as follows: -- **Sample_Name:** A text string containing the name of a given sample of which - there can be multiple libraries. All libraries with the same sample name and - same SeqType will be merged after deduplication. -- **Library_ID:** A text string containing a given library, which there can be - multiple sequencing lanes (with the same SeqType). -- **Lane:** A number indicating which lane the library was sequenced on. Files - from the libraries sequenced on different lanes (and different SeqType) will - be concatenated after read clipping and merging. -- **Colour Chemistry** A number indicating whether the Illumina sequencer the - library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour - chemistry machine. This informs whether poly-G trimming (if turned on) should - be performed. -- **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with - both an R1 [or forward] and R2 [or reverse]) and single end data (only R1 - [forward], or BAM). This will affect lane merging if different per library. -- **Organism:** A text string of the organism name of the sample or 'NA'. This - currently has no functionality and can be set to 'NA', but will affect - lane/library merging if different per library -- **Strandedness:** A text string indicating whether the library type is - 'single' or 'double'. This will affect lane/library merging if different per - library. 
-- **UDG_Treatment:** A text string indicating whether the library was generated - with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library - merging if different per library. -- **R1:** A text string of a file path pointing to a forward or R1 FASTQ file. - This can be used with the R2 column. File names **must be unique**, even if - they are in different directories. -- **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file, - or 'NA' when single end data. This can be used with the R1 column. File names - **must be unique**, even if they are in different directories. -- **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot - be specified at the same time as R1 or R2, both of which should be set to 'NA' +- **Sample_Name:** A text string containing the name of a given sample of which there can be multiple libraries. All libraries with the same sample name and same SeqType will be merged after deduplication. +- **Library_ID:** A text string containing a given library, which there can be multiple sequencing lanes (with the same SeqType). +- **Lane:** A number indicating which lane the library was sequenced on. Files from the libraries sequenced on different lanes (and different SeqType) will be concatenated after read clipping and merging. +- **Colour Chemistry** A number indicating whether the Illumina sequencer the library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour chemistry machine. This informs whether poly-G trimming (if turned on) should be performed. +- **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with both an R1 [or forward] and R2 [or reverse]) and single end data (only R1 [forward], or BAM). This will affect lane merging if different per library. +- **Organism:** A text string of the organism name of the sample or 'NA'. 
This currently has no functionality and can be set to 'NA', but will affect lane/library merging if different per library. +- **Strandedness:** A text string indicating whether the library type is 'single' or 'double'. This will affect lane/library merging if different per library. +- **UDG_Treatment:** A text string indicating whether the library was generated with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library merging if different per library. +- **R1:** A text string of a file path pointing to a forward or R1 FASTQ file. This can be used with the R2 column. File names **must be unique**, even if they are in different directories. +- **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file, or 'NA' when single end data. This can be used with the R1 column. File names **must be unique**, even if they are in different directories. +- **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot be specified at the same time as R1 or R2, both of which should be set to 'NA'. For example, the following TSV table: @@ -457,1689 +330,34 @@ For example, the following TSV table: will have the following effects: -- After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 - _with the same `SeqType`_ (and all other _metadata_ columns) will be - concatenated together for each **Library**. -- After mapping, and prior BAM filtering, BAM files with different - `SeqType` (but with all other metadata columns the same) will be merged - together for each **Library**. -- After duplicate removal, BAM files with `Library_ID`s with the same - `Sample_Name` and the same `UDG_Treatment` will be merged together. -- If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and - half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the - same `Sample_Name`.
+- After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 _with the same `SeqType`_ (and all other _metadata_ columns) will be concatenated together for each **Library**. +- After mapping, and prior BAM filtering, BAM files with different `SeqType` (but with all other metadata columns the same) will be merged together for each **Library**. +- After duplicate removal, BAM files with different `Library_ID`s but with the same `Sample_Name` and the same `UDG_Treatment` will be merged together. +- If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the same `Sample_Name`. Note the following important points and limitations for setting up: - The TSV must use actual tabs (not spaces) between cells. -- *File* names must be unique regardless of file path, due to risk of - over-writing (see: - [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)). - - If it is 'too late' and you already have duplicate file names, a workaround is - to concatenate the FASTQ files together and supply this to a nf-core/eager - run. The only downside is that you will not get independent FASTQC results - for each file. +- *File* names must be unique regardless of file path, due to risk of over-writing (see: [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)). + - If it is 'too late' and you already have duplicate file names, a workaround is to concatenate the FASTQ files together and supply this to a nf-core/eager run. The only downside is that you will not get independent FASTQC results for each file. - Lane IDs must be unique for each sequencing of each library. - - If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can - give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will - still be processed correctly. 
- - This also applies to the SeqType column, i.e. with the example above, if one - run is PE and one run is SE, you need to give fake lane IDs to one of the - runs as well. + - If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly. + - This also applies to the SeqType column, i.e. with the example above, if one run is PE and one run is SE, you need to give fake lane IDs to one of the runs as well. - All _BAM_ files must be specified as `SE` under `SeqType`. - - You should provide a small decoy reference genome with pre-made indices, e.g. - the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in - order to avoid long computational time for generating the index files of the - reference genome, even if you do not actual need a reference genome for any - downstream analyses. -- nf-core/eager will only merge multiple _lanes_ of sequencing runs with the - same single-end or paired-end configuration -- Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files - (unless you use `--run_convertbam`), as only FASTQ files are lane-merged - together. -- Same libraries that are sequenced on different sequencing configurations (i.e - single- and paired-end data), will be merged after mapping and will _always_ - be considered 'paired-end' during downstream processes - - **Important** running DeDup in this context is _not_ recommended, as PE and - SE data at the same position will _not_ be evaluated as duplicates. - Therefore not all duplicates will be removed. - - When you wish to run PE/SE data together `-dedupper markduplicates` is - therefore preferred. - - An error will be thrown if you try to merge both PE and SE and also supply - `--skip_merging`. 
- - If you truly want to mix SE data and PE data but using mate-pair info for PE - mapping, please run FASTQ preprocessing mapping manually and supply BAM - files for downstream processing by nf-core/eager - - If you _regularly_ want to run the situation above, please leave a feature - request on github. -- DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on - each unique library separately after deduplication (but prior same-treated - library merging). -- nf-core/eager functionality such as `--run_trim_bam` will be applied to only - non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries. -- Qualimap is run on each sample, after merging of libraries (i.e. your values - will reflect the values of all libraries combined - after being damage trimmed - etc.). -- Genotyping will be typically performed on each `sample` independently, as - normally all libraries will have been merged together. However, if you have a - mixture of single-stranded and double-stranded libraries, you will normally - need to genotype separately. In this case you **must** give each the SS and DS - libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file - collision` error in steps such as `sexdeterrmine`, and then you will need to - merge these yourself. We will consider changing this behaviour in the future - if there is enough interest. - -#### `--udg_type` - -Defines whether Uracil-DNA glycosylase (UDG) treatment was used to remove DNA -damage on the sequencing libraries. - -Specify `'none'` if no treatment was performed. If you have partial UDG treated -data ([Rohland et al 2016](http://dx.doi.org/10.1098/rstb.2013.0624)), specify -`'half'`. If you have complete UDG treated data ([Briggs et al. -2010](https://doi.org/10.1093/nar/gkp1163)), specify `'full'`. - -When also using PMDtools specifying `'half'` will use a different model for DNA -damage assessment in PMDTools (PMDtools: `--UDGhalf`). 
Specify `'full'` and the -PMDtools DNA damage assessment will use CpG context only (PMDtools: `--CpG`). -Default: `'none'`. - -> **Tip**: You should provide a small decoy reference genome with pre-made indices, e.g. -> the human mtDNA genome, for the mandatory parameter `--fasta` in order to -> avoid long computational time for generating the index files of the reference -> genome, even if you do not actual need a reference genome for any downstream -> analyses. - -#### `--single_stranded` - -Indicates libraries are single stranded. - -Currently only affects MALTExtract, where it will switch on damage patterns -calculation mode to single-stranded, (MaltExtract: `--singleStranded`) and -genotyping with pileupCaller where a different method is used (pileupCaller: -`--singleStrandMode`). Default: false. - -Only required when using the 'Path' method of [`--input`](#--input). - -#### `--single_end` - -Indicates libraries were sequenced with single-end sequencing chemistries (i.e. -only a R1 file is present). If not supplied, input data assumed to be paired-end -by default. - -Only required when using the 'Path' method of [`--input`](#--input). - -#### `--colour_chemistry` - -Specifies which Illumina colour chemistry a library was sequenced with. This -informs whether to perform poly-G trimming (if `--complexity_filter_poly_g` is -also supplied). Only 2 colour chemistry sequencers (e.g. NextSeq or NovaSeq) can -generate uncertain poly-G tails (due to 'G' being indicated via a no-colour -detection). Default is '4' to indicate e.g. HiSeq or MiSeq platforms, which do -not require poly-G trimming. Options: 2, 4. Default: 4 - -Only required when using the 'Path' method of [`--input`](#--input). - -#### `--bam` - -Specifies the input file type to `--input` is in BAM format. This will -automatically also apply `--single_end`. - -Only required when using the 'Path' method of [`--input`](#--input). 
- -### Input Data Additional Options - -#### `--snpcapture_bed` - -Can be used to set a path to a BED file (3/6 column format) of SNP positions of -a reference genome, to calculate SNP captured libraries on-target efficiency. -This should be used for array or in-solution SNP capture protocols such as 390K, -1240K, etc. If supplied, on-target metrics are automatically generated for you -by qualimap. - -#### `--run_convertinputbam` - -Allows you to convert an input BAM file back to FASTQ for downstream processing. -Note this is required if you need to perform AdapterRemoval and/or polyG -clipping. - -If not turned on, BAMs will automatically be sent to post-mapping steps. - -### Reference Genomes - -All nf-core/eager runs require a reference genome in FASTA format to map reads -against to. - -In addition we provide various options for indexing of different types of -reference genomes (based on the tools used in the pipeline). nf-core/eager can -index reference genomes for you (with options to save these for other analysis), -but you can also supply your pre-made indices. - -Supplying pre-made indices saves time in pipeline execution and is especially -advised when running multiple times on the same cluster system, for example. You -can even add a resource [specific profile](#profile) that sets paths to -pre-computed reference genomes, saving time when specifying these. - -> :warning: you must always supply a reference file. If you want to use -> functionality that does not require one, supply a small decoy genome such as -> phiX or the human mtDNA genome. - -#### `--fasta` - -You specify the full path to your reference genome here. The FASTA file can have -any file suffix, such as `.fasta`, `.fna`, `.fa`, `.FastA` etc. You may also -supply a gzipped reference files, which will be unzipped automatically for you. - -For example: - -```bash ---fasta '///my_reference.fasta' -``` - -You need to provide an input FASTA even if you do not do any mapping (e.g. 
-supplying BAM files). You should use a small decoy reference genome with pre-made -indices, e.g. the human mtDNA genome, for the mandatory parameter `--fasta` in -order to avoid long computational time for generating the index files of the -reference genome. - -> If you don't specify appropriate `--bwa_index`, `--fasta_index` parameters -> (see [below](#optional-reference-options)), the pipeline will create these -> indices for you automatically. Note that you can save the indices created for -> you for later by giving the `--save_reference` flag. You must select either a -> `--fasta` or `--genome` - -#### `--genome` (using iGenomes) - -Alternatively, the pipeline config files come bundled with paths to the Illumina -iGenomes reference index files. If running with Docker or AWS, the configuration -is set up to use the [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) -resource. - -There are 31 different species supported in the iGenomes references. To run the -pipeline, you must specify which to use with the `--genome` flag. - -You can find the keys to specify the genomes in the iGenomes config file under -`conf/` on the nf-core/eager [GitHub -repository](https://github.com/nf-core/eager). Common genomes that are supported -are: - -- Human - - `--genome GRCh37` - - `--genome GRCh38` -- Mouse * - - `--genome GRCm38` -- _Drosophila_ * - - `--genome BDGP6` -- _S. cerevisiae_ * - - `--genome 'R64-1-1'` - -> \* Not bundled with nf-core eager by default. - -Note that you can use the same configuration set-up to save sets of reference -files for your own use, even if they are not part of the iGenomes resource. See -the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) -for instructions on where to save such a file. - -Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. 
For example, to give the workflow process `star` 32GB of memory, you could use the following config: - -```nextflow -params { - genomes { - 'GRCh37' { - fasta = '' - } - // Any number of additional genomes, key is used with --genome - } -} -``` - -> You must select either a `--fasta` or `--genome` - -#### `--bwa_index` - -If you want to use pre-existing `bwa index` indices, please supply the -**directory** to the FASTA you also specified in `--fasta` (see above). -nf-core/eager will automagically detect the index files by searching for the -FASTA filename with the corresponding `bwa` index file suffixes. - -> :warning: pre-built indices must currently be built on non-gzipped FASTA files -> due to limitations of `samtools`. However once indices have been built, you -> can re-gzip the FASTA file as nf-core will unzip this particular file for you. - -For example: - -```bash -nextflow run nf-core/eager \ --profile test,docker \ ---input '*{R1,R2}*.fq.gz' ---fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \ ---bwa_index 'results/reference_genome/bwa_index/BWAIndex/' -``` - -> `bwa index` does not give you an option to supply alternative suffixes/names -> for these indices. Thus, the file names generated by this command _must not_ -> be changed, otherwise nf-core/eager will not be able to find them. - -#### `--bt2_index` - -If you want to use pre-existing `bt2 index` indices, please supply the -**directory** to the FASTA you also specified in `--fasta` (see above). -nf-core/eager will automagically detect the index files by searching for the -FASTA filename with the corresponding `bt2` index file suffixes. - -> :warning: pre-built indices must currently be built on non-gzipped FASTA files -> due to limitations of `samtools`. However once indices have been built, you -> can re-gzip the FASTA file as nf-core will unzip this particular file for you. 
- -For example: - -```bash -nextflow run nf-core/eager \ --profile test,docker \ ---input '*{R1,R2}*.fq.gz' ---fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \ ---bwa_index 'results/reference_genome/bt2_index/BT2Index/' -``` - -> `bowtie2-build` does not give you an option to supply alternative -> suffixes/names for these indices. Thus, the file names generated by this -> command _must not_ be changed, otherwise nf-core/eager will not be able to -> find them. - -#### `--fasta_index` - -If you want to use a pre-existing `samtools faidx` index, use this to specify -the required FASTA index file for the selected reference genome. This should be -generated by `samtools faidx` and has a file suffix of `.fai` - -For example: - -```bash ---fasta_index 'Mammoth_MT_Krause.fasta.fai' -``` - -#### `--seq_dict` - -If you want to use a pre-existing `picard CreateSequenceDictionary` dictionary -file, use this to specify the required `.dict` file for the selected reference -genome. - -> :warning: pre-built indices must currently be built on non-gzipped FASTA files -> due to limitations of `samtools`. However once indices have been built, you -> can re-gzip the FASTA file as nf-core will unzip this particular file for you. - -For example: - -```bash ---seq_dict 'Mammoth_MT_Krause.dict' -``` - -#### `--large_ref` - -This parameter is required to be set for large reference genomes. If your -reference genome is larger than 3.5GB, the `samtools index` calls in the -pipeline need to generate `CSI` indices instead of `BAI` indices to compensate -for the size of the reference genome (with samtools: `-c`). This parameter is -not required for smaller references (including the human `hg19` or -`grch37`/`grch38` references), but `>4GB` genomes have been shown to need `CSI` -indices. Default: off - -> modifies SAMtools index command: `-c` - -#### `--save_reference` - -Use this if you do not have pre-made reference FASTA indices for `bwa`, -`samtools` and `picard`. 
If you turn this on, the indices nf-core/eager -generates for you and will be saved in the -`/results/reference_genomes` for you. If not supplied, -nf-core/eager generated index references will be deleted. - -### Output - -#### `--outdir` - -The output directory where the results will be saved. - -#### `-w / -work-dir` - -The output directory where _intermediate_ files will be saved. It is **highly -recommended** that this is the same path as `--outdir`, otherwise you may 'lose' -your intermediate files if you need to re-run a pipeline. By default, if this -flag is not given, the intermediate files will be saved in a `work/` and -`.nextflow/` directory from wherever you have run nf-core/eager from. - -#### `--publish_dir_mode` - -Nextflow mode for 'publishing' final results files i.e. how to move final files -into your `--outdir` from working directories. Options: 'symlink', 'rellink', -'link', 'copy', 'copyNoFollow', 'move'. Default: 'copy'. - -> It is recommended to select `copy` (default) if you plan to regularly delete -> intermediate files from `work/`. - -### Other run specific parameters - -#### `--max_memory` - -Use to set a top-limit for the default memory requirement for each process. -Should be a string in the format integer-unit. eg. `--max_memory '8.GB'`. If not -specified, will be taken from the configuration in the `-profile` flag. - -#### `--max_time` - -Use to set a top-limit for the default time requirement for each process. Should -be a string in the format integer-unit. eg. `--max_time '2.h'`. If not -specified, will be taken from the configuration in the `-profile` flag. - -#### `--max_cpus` - -When _not_ using a institute specific `-profile`, you can use this parameter to -set a top-limit for the default CPU requirement for each **process**. This is -not the maximum number of CPUs that can be used for the whole pipeline, but the -maximum number of CPUs each program can use for each program submission (known -as a process). 
- -Do not set this higher than what is available on your workstation or computing -node can provide. If you're unsure, ask your local IT administrator for details -on compute node capabilities! Should be a string in the format integer-unit. eg. -`--max_cpus 1`. If not specified, will be taken from the configuration in the -`-profile` flag. - -#### `--email` - -Set this parameter to your e-mail address to get a summary e-mail with details -of the run sent to you when the workflow exits. If set in your user config file -(`~/.nextflow/config`) then you don't need to specify this on the command line -for every run. - -Note that this functionality requires either `mail` or `sendmail` to be -installed on your system. - -#### `--email_on_fail` - -Set this parameter to your e-mail address to get a summary e-mail with details -of the run if it fails. Normally would be the same as in `--email`. If set in -your user config file (`~/.nextflow/config`) then you don't need to specify this -on the command line for every run. - -> Note that this functionality requires either `mail` or `sendmail` to be -> installed on your system. - -#### `--plaintext_email` - -Set to receive plain-text e-mails instead of HTML formatted. - -#### `--monochrome_logs` - -Set to disable colourful command line output and live life in monochrome. - -#### `--multiqc_config` - -Specify a path to a custom MultiQC configuration file. - -#### `--custom_config_version` - -Provide git commit id for custom Institutional configs hosted at -`nf-core/configs`. This was implemented for reproducibility purposes. Default is -set to `master`. - -```bash -\#\# Download and use config file with following git commit id ---custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96 -``` - -### Step skipping parameters - -Some of the steps in the pipeline can be executed optionally. If you specify -specific steps to be skipped, there won't be any output related to these -modules. 
- -#### `--skip_fastqc` - -Turns off FastQC pre- and post-Adapter Removal, to speed up the pipeline. Use of -this flag is most common when data has been previously pre-processed and the -post-Adapter Removal mapped reads are being re-mapped to a new reference genome. - -#### `--skip_adapterremoval` - -Turns off adapter trimming and paired-end read merging. Equivalent to setting -both `--skip_collapse` and `--skip_trim`. - -#### `--skip_preseq` - -Turns off the computation of library complexity estimation. - -#### `--skip_deduplication` - -Turns off duplicate removal methods DeDup and MarkDuplicates respectively. No -duplicates will be removed on any data in the pipeline. - -#### `--skip_damage_calculation` - -Turns off the DamageProfiler module to compute DNA damage profiles. - -#### `--skip_qualimap` - -Turns off QualiMap and thus does not compute coverage and other mapping metrics. - -### Complexity Filtering Options - -More details can be seen in the [fastp -documentation](https://github.com/OpenGene/fastp) - -If using TSV input, this is performed per lane separately. - -#### `--complexity_filter_poly_g` - -Performs a poly-G tail removal step at the beginning of the pipeline using -`fastp`, if turned on. This can be useful for trimming poly-G tails from -short-fragments sequenced on two-colour Illumina chemistry such as NextSeqs -(where no-fluorescence is read as a G on two-colour chemistry), which can -inflate reported GC content values. - -#### `--complexity_filter_poly_g_min` - -This option can be used to define the minimum length of a poly-G tail to begin -low complexity trimming. By default, this is set to a value of `10` unless the -user has chosen something specifically using this option. - -> Modifies fastp parameter: `--poly_g_min_len` - -### Adapter Clipping and Merging Options - -These options handle various parts of adapter clipping and read merging steps.
- -More details can be seen in the [AdapterRemoval -documentation](https://adapterremoval.readthedocs.io/en/latest/) - -If using TSV input, this is performed per lane separately. - -> :warning: `--skip_trim` will skip adapter clipping AND quality trimming -> (n, base quality). It is currently not possible to skip one or the other. - -#### `--clip_forward_adaptor` - -Defines the adapter sequence to be used for the forward read. By default, this -is set to `'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'`. - -> Modifies AdapterRemoval parameter: `--adapter1` - -#### `--clip_reverse_adaptor` - -Defines the adapter sequence to be used for the reverse read in paired end -sequencing projects. This is set to `'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'` by -default. - -> Modifies AdapterRemoval parameter: `--adapter2` - -#### `--clip_readlength` - -Defines the minimum read length that is required for reads after merging to be -considered for downstream analysis. Default is `30`. - -Note that performing read length filtering at this step is not reliable for -correct endogenous DNA calculation, when you have a large percentage of very -short reads in your library - such as retrieved in single-stranded library -protocols. When you have very few reads passing this length filter, it will -artificially inflate your endogenous DNA by creating a very small denominator. -In these cases it is recommended to set this to 0, and use -`--bam_filter_minreadlength` instead, to filter out 'un-usable' short reads -after mapping. - -> Modifies AdapterRemoval parameter: `--minlength` - -#### `--clip_min_read_quality` - -Defines the minimum read quality per base that is required for a base to be -kept. Individual bases at the ends of reads falling below this threshold will be -clipped off. Default is set to `20`. - -> Modifies AdapterRemoval parameter: `--minquality` - -#### `--clip_min_adap_overlap` - -Sets the minimum overlap between two reads when read merging is performed.
-Default is set to `1` base overlap. - -> Modifies AdapterRemoval parameter: `--minadapteroverlap` - -#### `--skip_collapse` - -Turns off the paired-end read merging. - -For example: - -```bash ---skip_collapse --input '*_{R1,R2}_*.fastq' -``` - -It is important to use the paired-end wildcard globbing, as `--skip_collapse` -can only be used on paired-end data! - -:warning: If you provide this option together with `--clip_readlength` set to -something (as is by default), you may end up removing single reads from either -the pair1 or pair2 file. These will NOT be mapped when aligning with either -`bwa` or `bowtie`, as both can only accept one (forward) or two (forward and -reverse) FASTQs as input. - -> Modifies AdapterRemoval parameter: `--collapse` - -#### `--skip_trim` - -Turns off adapter AND quality trimming. - -For example: - -```bash ---skip_trim --input '*.fastq' -``` - -:warning: it is not possible to keep quality trimming (n or base quality) on, -_and_ skip adapter trimming. - -:warning: it is not possible to turn off one or the other of quality -trimming or n trimming, i.e. `--trimns` and `--trimqualities` are either both given -or neither. However setting quality in `--clip_min_read_quality` to 0 would -theoretically turn off base quality trimming. - -> Modifies AdapterRemoval parameters: `--trimns --trimqualities --adapter1 --adapter2` - -#### `--preserve5p` - -Turns off quality based trimming at the 5p end of reads when any of the -`--trimns`, `--trimqualities`, or `--trimwindows` options are used. Only the 3p end of -reads will be trimmed. - -This also entirely disables quality based trimming of collapsed reads, since -both ends of these are informative for PCR duplicate filtering. Described -[here](https://github.com/MikkelSchubert/adapterremoval/issues/32#issuecomment-504758137). - -> Modifies AdapterRemoval parameters: `--preserve5p` - -#### `--mergedonly` - -Specify that only merged reads are sent downstream for analysis. - -Singletons (i.e.
reads missing a pair), or un-merged reads (where there wasn't -sufficient overlap) are discarded. - -You may want to use this if you want to ensure only the best quality reads for your -analysis, but with the penalty of potentially losing still valid data (even if -some reads have slightly lower quality). It is highly recommended when using -`--dedupper 'dedup'` (see below). - -### Read Mapping Parameters - -If using TSV input, mapping is performed at the library level, i.e. after lane -merging. - -#### `--mapper` - -Specify which mapping tool to use. Options are BWA aln (`'bwaaln'`), BWA mem -(`'bwamem'`), circularmapper (`'circularmapper'`), or bowtie2 (`'bowtie2'`). BWA -aln is the default and highly suited for short-read ancient DNA. BWA mem can be -quite useful for modern DNA, but is rarely used in projects for ancient DNA. -CircularMapper enhances the mapping procedure to circular references, using the -BWA algorithm but utilizing an extend-remap procedure (see Peltzer et al 2016, -Genome Biology for details). Bowtie2 is similar to BWA aln, and has recently -been suggested to provide slightly better results under certain conditions -([Poullet and Orlando 2020](https://doi.org/10.3389/fevo.2020.00105)), as well -as providing extra functionality (such as FASTQ trimming). Default is `'bwaaln'`. - -More documentation can be seen for each tool under: - -- [BWA aln](http://bio-bwa.sourceforge.net/bwa.shtml#3) -- [BWA mem](http://bio-bwa.sourceforge.net/bwa.shtml#3) -- [CircularMapper](https://circularmapper.readthedocs.io/en/latest/contents/userguide.html) -- [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line) - -#### BWA (default) - -These parameters configure mapping algorithm parameters. - -##### `--bwaalnn` - -Defines how many mismatches from the reference are allowed in a read. By default -set to `0.04` (following recommendations of [Schubert et al.
(2012 _BMC -Genomics_)](https://doi.org/10.1186/1471-2164-13-178)). If you're uncertain what -to set, check out [this](https://apeltzer.shinyapps.io/bwa-mismatches/) Shiny App -for more information on how to set this parameter efficiently. - -> Modifies bwa aln parameter: `-n` - -##### `--bwaalnk` - -Modifies the number of mismatches in the _seed_ during the seeding phase in the -`bwa aln` mapping algorithm. Default is set to `2`. - -> Modifies BWA aln parameter: `-k` - -##### `--bwaalnl` - -Configures the length of the seed used during seeding. Default is set to be -'turned off' at the recommendation of Schubert et al. ([2012 _BMC -Genomics_](https://doi.org/10.1186/1471-2164-13-178)) for ancient DNA with -`1024`. - -Note: Despite being recommended, turning off seeding can result in long -runtimes! - -> Modifies BWA aln parameter: `-l` - -#### CircularMapper - -##### `--circularextension` - -The number of bases to extend the reference genome with. By default this is set -to `500` if not specified otherwise. - -> Modifies circulargenerator and realignsamfile parameter: `-e` - -##### `--circulartarget` - -The chromosome in your FASTA reference that you'd like to be treated as -circular. By default this is set to `MT` but can be configured to match any -other chromosome. - -> Modifies circulargenerator parameter: `-s` - -##### `--circularfilter` - -If you want to filter out reads that don't map to a circular chromosome, turn -this on. By default this option is turned off. - -#### Bowtie2 - -##### `--bt2_alignmode` - -The type of read alignment to use. Options are 'local' or 'end-to-end'. Local -allows partial alignment of a read, with ends of reads possibly -'soft-clipped' (i.e. remain unaligned/ignored), if the soft-clipped alignment -provides the best alignment score. End-to-end requires all nucleotides to be -aligned.
Default is 'local', following [Cahill et al -(2018)](https://doi.org/10.1093/molbev/msy018) and [Poullet and Orlando -2020](https://doi.org/10.3389/fevo.2020.00105). - -> Modifies Bowtie2 parameters: `--very-fast --fast --sensitive --very-sensitive --very-fast-local --fast-local --sensitive-local --very-sensitive-local` - -##### `--bt2_sensitivity` - -The Bowtie2 'preset' to use. Options: 'no-preset', 'very-fast', 'fast', -'sensitive', or 'very-sensitive'. These strings apply to both `--bt2_alignmode` -options. See the Bowtie2 -[manual](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line) -for actual settings. Default is 'sensitive' (following [Poullet and Orlando -(2020)](https://doi.org/10.3389/fevo.2020.00105), when running damaged-data -_without_ UDG treatment). - -> Modifies Bowtie2 parameters: `--very-fast --fast --sensitive --very-sensitive --very-fast-local --fast-local --sensitive-local --very-sensitive-local` - -##### `--bt2n` - -The number of mismatches allowed in the seed during seed-and-extend procedure of -Bowtie2. This will override any values set with `--bt2_sensitivity`. Can either -be 0 or 1. Default: 0 (i.e. use `--bt2_sensitivity` defaults). - -> Modifies Bowtie2 parameters: `-N` - -##### `--bt2l` - -The length of the seed sub-string to use during seeding. This will override any -values set with `--bt2_sensitivity`. Default: 0 (i.e. use `--bt2_sensitivity` -defaults: [20 for local and 22 for -end-to-end](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line)). - -> Modifies Bowtie2 parameters: `-L` - -##### `--bt2_trim5` - -Number of bases to trim at the 5' (left) end of read prior to alignment. May be -useful when left-over sequencing artefacts of in-line barcodes are present. Default: -0. - -> Modifies Bowtie2 parameters: `--trim5` - -##### `--bt2_trim3` - -Number of bases to trim at the 3' (right) end of read prior to alignment. May be -useful when left-over sequencing artefacts of in-line barcodes are present. Default: -0.
- -> Modifies Bowtie2 parameters: `--trim3` - -### Removal of Host-Mapped Reads - -These parameters are used for removing mapped reads from the original input -FASTQ files, usually in the context of uploading the original FASTQ files to a -public read archive (NCBI SRA/EBI ENA/DDBJ SRA). - -These flags will produce FASTQ files almost identical to your input files, -except that reads with the same read ID as one found in the mapped bam file are -either removed or 'masked' (every base replaced with Ns). - -This functionality allows other researchers who wish to re-use -your data to apply their own adapter removal/read merging procedures, while -maintaining anonymity for sample donors - for example with microbiome -research. - -If using TSV input, mapped read removal is performed per library, i.e. after -lane merging. - -#### `--hostremoval_input_fastq` - -Create pre-Adapter Removal FASTQ files without reads that mapped to reference -(e.g. for public upload of privacy sensitive non-host data) - -#### `--hostremoval_mode` - -Read removal mode. Completely remove mapped reads from the file(s) (`'remove'`) -or just replace mapped reads sequence by N (`'replace'`) - -> Modifies extract_map_reads.py parameter: `-m` - -### Read Filtering and Conversion Parameters - -Users can choose to keep/discard/extract certain groups of reads efficiently -in the nf-core/eager pipeline. - -If using TSV input, filtering is performed per library, i.e. after lane merging. - -This module utilises `samtools view` and `filter_bam_fragment_length.py`. - -#### `--run_bam_filtering` - -Turns on the bam filtering module for either mapping quality filtering or -unmapped read treatment. - -> :warning: this is **required** for metagenomic screening! - -#### `--bam_mapping_quality_threshold` - -Specify a mapping quality threshold for mapped reads to be kept for downstream -analysis. By default keeps all reads and is therefore set to `0` (basically -doesn't filter anything).
- -> Modifies samtools view parameter: `-q` - -#### `--bam_filter_minreadlength` - -Specify minimum length of mapped reads. This filtering will apply at the same -time as mapping quality filtering. - -If used _instead_ of minimum length read filtering at AdapterRemoval, this can -be useful to get more realistic endogenous DNA percentages, when most of your -reads are very short (e.g. in single-stranded libraries) and would otherwise be -discarded by AdapterRemoval (thus making an artificially small denominator for a -typical endogenous DNA calculation). Note in this context you should not perform -mapping quality filtering nor discarding of unmapped reads to ensure a correct -denominator of all reads, for the endogenous DNA calculation. - -> Modifies filter_bam_fragment_length.py parameter: `-l` - -#### `--bam_unmapped_type` - -Defines how to proceed with unmapped reads: `'discard'` removes all unmapped -reads, `'keep'` keeps both unmapped and mapped reads in the same BAM file, `'bam'` -keeps unmapped reads as BAM file, `'fastq'` keeps unmapped reads as FASTQ file, -`'both'` keeps both BAM and FASTQ files. Default is `'discard'`. `'keep'` is what -would happen if `--run_bam_filtering` was _not_ supplied. - -Note that in all cases, if `--bam_mapping_quality_threshold` is also supplied, -mapping quality filtering will still occur on the mapped reads. - -:warning: `--bam_unmapped_type 'fastq'` is **required** for metagenomic -screening! - -> Modifies samtools view parameter: `-f4 -F4` - -### Read DeDuplication Parameters - -If using TSV input, deduplication is performed per library, i.e. after lane merging. - -#### `--dedupper` - -Sets the duplicate read removal tool. By default uses `markduplicates` from -Picard. Alternatively an ancient DNA specific read deduplication tool `dedup` -([Peltzer et al. 2016](http://dx.doi.org/10.1186/s13059-016-0918-z)) is offered. - -This utilises both ends of paired-end data to remove duplicates (i.e.
true exact -duplicates, as markduplicates will over-zealously deduplicate anything with the -same starting position even if the ends are different). DeDup should generally -only be used on paired-end data, otherwise suboptimal deduplication can -occur if applied to either single-end or a mix of single-end/paired-end data. - -#### `--dedup_all_merged` - -Sets DeDup to treat all reads as merged reads. This is useful if reads are for -example not prefixed with `M_`, `R_`, or `L_` in all cases. Therefore, this can -be used as a workaround when also using a mixture of paired-end and single-end -data, however this is not recommended (see above). - -> Modifies dedup parameter: `-m` - -### Library Complexity Estimation Parameters - -nf-core/eager uses Preseq on mapped reads as one method to calculate library -complexity. If DeDup is used, Preseq uses the histogram output of DeDup, -otherwise the sorted non-duplicated BAM file is supplied. Furthermore, if -paired-end read collapsing is not performed, the `-P` flag is used. - -#### `--preseq_step_size` - -Can be used to configure the step size of Preseq's `c_curve` method. Can be -useful when only few and thus shallow sequencing results are used for -extrapolation. - -> Modifies preseq c_curve parameter: `-s` - -### DNA Damage Assessment Parameters - -More documentation can be seen in the following links for: - -- [DamageProfiler](https://github.com/Integrative-Transcriptomics/DamageProfiler) -- [PMDTools documentation](https://github.com/pontussk/PMDtools) - -If using TSV input, DamageProfiler is performed per library, i.e. after lane -merging. PMDtools and BAM Trimming are run after library merging of same-named -library BAMs that have the same type of UDG treatment. BAM Trimming is only -performed on non-UDG and half-UDG treated data. - -#### `--damageprofiler_length` - -Specifies the length filter for DamageProfiler. By default set to `100`.
- -> Modifies DamageProfiler parameter: `-l` - -#### `--damageprofiler_threshold` - -Specifies the length of the read start and end to be considered for profile -generation in DamageProfiler. By default set to `15` bases. - -> Modifies DamageProfiler parameter: `-t` - -#### `--damageprofiler_yaxis` - -Specifies what the maximum misincorporation frequency should be displayed as, in -the DamageProfiler damage plot. This is set to `0.30` (i.e. 30%) by default as -this matches the popular [mapDamage2.0](https://ginolhac.github.io/mapDamage) -program. However, the default behaviour of DamageProfiler is to 'autoscale' the -y-axis maximum to zoom in on any _possible_ damage that may occur (e.g. if the -damage is about 10%, the highest value on the y-axis would be set to 0.12). This -'autoscale' behaviour can be turned on by setting the value to `0`. Default: -`0.30`. - -> Modifies DamageProfiler parameter: `-yaxis_damageplot` - -#### `--run_pmdtools` - -Specifies to run PMDTools for damage based read filtering and assessment of DNA -damage in sequencing libraries. By default turned off. - -#### `--pmdtools_range` - -Specifies the range in which to consider DNA damage from the ends of reads. By -default set to `10`. - -> Modifies PMDTools parameter: `--range` - -#### `--pmdtools_threshold` - -Specifies the PMDScore threshold to use in the pipeline when filtering BAM files -for DNA damage. Only reads which surpass this damage score are considered for -downstream DNA analysis. By default set to `3` if not set specifically by the -user. - -> Modifies PMDTools parameter: `--threshold` - -#### `--pmdtools_reference_mask` - -Can be used to set a path to a reference genome mask for PMDTools. - -#### `--pmdtools_max_reads` - -The maximum number of reads used for damage assessment in PMDtools. Can be used -to significantly reduce the amount of time required for damage assessment in -PMDTools. Note that too low a value can also produce incorrect results.
- -> Modifies PMDTools parameter: `-n` - -### Feature Annotation Statistics - -If you're interested in looking at coverage stats for certain features on your -reference such as genes, SNPs etc., you can use the following bedtools module -for this purpose. - -More documentation on bedtools can be seen in the [bedtools -documentation](https://bedtools.readthedocs.io/en/latest/) - -If using TSV input, bedtools is run after library merging of same-named library -BAMs that have the same type of UDG treatment. - -#### `--run_bedtools_coverage` - -Specifies to turn on the bedtools module, producing statistics for breadth (or -percent coverage), and depth (or X fold) coverages. - -#### `--anno_file` - -Specify the path to a GFF/BED containing the feature coordinates (or any -acceptable input for [`bedtools -coverage`](https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html)). -Must be in quotes. - -### BAM Trimming Parameters - -For some library preparation protocols, users might want to clip off damaged -bases before applying genotyping methods. This can be done in nf-core/eager -automatically by turning on the `--run_trim_bam` parameter. - -More documentation can be seen in the [bamUtil -documentation](https://genome.sph.umich.edu/wiki/BamUtil:_trimBam) - -#### `--run_trim_bam` - -Turns on the BAM trimming method. Trims off `[n]` bases from reads in the -deduplicated BAM file. Damage assessment in PMDTools or DamageProfiler remains -untouched, as data is routed through this independently. BAM trimming is -typically performed to reduce errors during genotyping that can be caused by -aDNA damage. - -BAM trimming will only be performed on libraries indicated as `--udg_type -'none'` or `--udg_type 'half'`. Complete UDG treatment ('full') should have -removed all damage. 
The number of bases that will be trimmed off can be set -separately for libraries with `--udg_type` `'none'` and `'half'` (see -`--bamutils_clip_half_udg_left` / `--bamutils_clip_half_udg_right` / -`--bamutils_clip_none_udg_left` / `--bamutils_clip_none_udg_right`). - -Note: additional artefacts such as bar-codes or adapters that could -potentially also be trimmed should be removed prior to mapping. - -> Modifies bamUtil's trimBam parameter: `-L -R` - -#### `--bamutils_clip_half_udg_left` / `--bamutils_clip_half_udg_right` - -Default set to `1` and clips off one base of the left or right side of reads -from libraries whose UDG treatment is set to `half`. Note that reverse reads -will automatically be clipped off at the reverse side with this (automatically -reverses left and right for the reverse read). - -> Modifies bamUtil's trimBam parameter: `-L -R` - -#### `--bamutils_softclip` - -By default, nf-core/eager uses hard clipping and sets clipped bases to `N` with -quality `!` in the BAM output. Turn this on to use soft-clipping instead, -masking reads at the read ends respectively using the CIGAR string. - -> Modifies bamUtil's trimBam parameter: `-c` - -### Genotyping Parameters - -There are options for different genotypers (or genotype likelihood calculators) -to be used. We suggest you read the documentation of each tool to find the ones -that suit your needs.
- -Documentation for each tool: - -- [GATK - UnifiedGenotyper](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.5-0/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php) -- [GATK - HaplotypeCaller](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php) -- [FreeBayes](https://github.com/ekg/freebayes) -- [ANGSD](http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods) -- [sequenceTools pileupCaller](https://github.com/stschiff/sequenceTools) - -If using TSV input, genotyping is performed per sample (i.e. after all types of -libraries are merged), except for pileupCaller which gathers all double-stranded -and single-stranded (same-type merged) libraries respectively. - -#### `--run_genotyping` - -Turns on genotyping to run on all post-dedup and downstream BAMs. For example if -`--run_pmdtools` and `--run_trim_bam` are both supplied, the genotyper will be run -on all three BAM files i.e. post-deduplication, post-pmd and post-trimmed BAM -files. - -#### `--genotyping_tool` - -Specifies which genotyper to use. Current options are: GATK (v3.5) -UnifiedGenotyper, GATK (v4) HaplotypeCaller, FreeBayes, sequenceTools pileupCaller, and ANGSD. -Specify 'ug', 'hc', 'freebayes', 'pileupcaller' or 'angsd' respectively. - -> Note that while UnifiedGenotyper is more suitable for low-coverage ancient DNA -> (HaplotypeCaller does _de novo_ assembly around each variant site), be aware -> that GATK 3.5 is officially deprecated by the Broad Institute. - -#### `--genotyping_source` - -Indicates which BAM file to use for genotyping, depending on what BAM processing -modules you have turned on. Options are: `'raw'` for mapped only, filtered, or -DeDup BAMs (with priority right to left); `'trimmed'` (for base clipped BAMs); -`'pmd'` (for pmdtools output). Default is: `'raw'`.
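For instance, assuming BAM trimming is switched on and you want genotypes called from the trimmed BAMs with UnifiedGenotyper, the relevant flags might be combined as below. This is a sketch built only from the parameters named in this section, with all remaining options left at their defaults.

```bash
## Illustrative flag fragment only; append to your usual nf-core/eager command
--run_trim_bam --run_genotyping --genotyping_tool 'ug' --genotyping_source 'trimmed'
```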
- -#### `--gatk_call_conf` - -If selected, specify a GATK genotyper phred-scaled confidence threshold of a -given SNP/INDEL call. Default: `30` - -> Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter: -> `-stand_call_conf` - -#### `--gatk_ploidy` - -If selected, specify a GATK genotyper ploidy value of your reference organism. -E.g. if you want to allow heterozygous calls from >= diploid organisms. Default: -`2` - -> Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter: `--sample-ploidy` - -#### `--gatk_downsample` - -Maximum depth coverage allowed for genotyping before down-sampling is turned on. -Any position with a coverage higher than this value will be randomly -down-sampled to 250 reads. Default: `250` - -> Modifies GATK UnifiedGenotyper parameter: `-dcov` - -#### `--gatk_dbsnp` - -(Optional) Specify VCF file for output VCF SNP annotation e.g. if you want to -annotate your VCF file with 'rs' SNP IDs. Check GATK documentation for more -information. Gzip not accepted. - -#### `--gatk_hc_out_mode` - -If the GATK genotyper HaplotypeCaller is selected, what type of VCF to create, -i.e. produce calls for every site or just confidence sites. Options: -`'EMIT_VARIANTS_ONLY'`, `'EMIT_ALL_CONFIDENT_SITES'`, `'EMIT_ALL_ACTIVE_SITES'`. -Default: `'EMIT_VARIANTS_ONLY'` - -> Modifies GATK HaplotypeCaller parameter: `-output_mode` - -#### `--gatk_hc_emitrefconf` - -If the GATK HaplotypeCaller is selected, mode for emitting reference confidence -calls. Options: `'NONE'`, `'BP_RESOLUTION'`, `'GVCF'`. Default: `'GVCF'` - -> Modifies GATK HaplotypeCaller parameter: `--emit-ref-confidence` - -#### `--gatk_ug_out_mode` - -If the GATK UnifiedGenotyper is selected, what type of VCF to create, -i.e. produce calls for every site or just confidence sites. Options: -`'EMIT_VARIANTS_ONLY'`, `'EMIT_ALL_CONFIDENT_SITES'`, `'EMIT_ALL_SITES'`. 
-Default: `'EMIT_VARIANTS_ONLY'` - -> Modifies GATK UnifiedGenotyper parameter: `--output_mode` - -#### `--gatk_ug_genotype_model` - -If the GATK UnifiedGenotyper is selected, which likelihood model to follow, i.e. -whether to call SNPs or INDELS etc. Options: `'SNP'`, `'INDEL'`, `'BOTH'`, -`'GENERALPLOIDYSNP'`, `'GENERALPLOIDYINDEL'`. Default: `'SNP'` - -> Modifies GATK UnifiedGenotyper parameter: `--genotype_likelihoods_model` - -#### `--gatk_ug_keep_realign_bam` - -If provided when running GATK's UnifiedGenotyper, this will put the BAMs that -have realigned reads (with GATK's (v3) IndelRealigner) around possible variants -into the output folder, for improved genotyping. - -These BAMs will be stored in the same folder as the corresponding VCF files. - -#### `--gatk_ug_gatk_ug_defaultbasequalities` - -When running GATK's UnifiedGenotyper, specify a value to set base quality -scores, if reads are missing this information. Might be useful if you have -'synthetically' generated reads (e.g. chopping up a reference genome). Default -is set to -1 which is to not set any default quality (turned off). Default: -`-1` - -> Modifies GATK UnifiedGenotyper parameter: `--defaultBaseQualities` - -#### `--freebayes_C` - -Specify minimum required supporting observations to consider a variant. Default: -`1` - -> Modifies freebayes parameter: `-C` - -#### `--freebayes_g` - -Specify to skip over regions of high depth by discarding alignments overlapping -positions where total read depth is greater than the specified value. Not set by -default. - -> Modifies freebayes parameter: `-g` - -#### `--freebayes_p` - -Specify ploidy of sample in FreeBayes. Default is diploid. Default: `2` - -> Modifies freebayes parameter: `-p` - -#### `--pileupcaller_bedfile` - -Specify a SNP panel in the form of a bed file of sites at which to generate -pileup for pileupCaller.
- -#### `--pileupcaller_snpfile` - -Specify a SNP panel in -[EIGENSTRAT](https://github.com/DReichLab/EIG/tree/master/CONVERTF) format, -pileupCaller will call these sites. - -#### `--pileupcaller_method` - -Specify calling method to use. Options: randomHaploid, randomDiploid, -majorityCall. Default: `'randomHaploid'` - -> Modifies pileupCaller parameter: `--randomHaploid --randomDiploid --majorityCall` - -#### `--pileupcaller_transitions_mode` - -Specify if genotypes of transition SNPs should be called, set to missing, or -excluded from the genotypes respectively. Options: `'AllSites'`, -`'TransitionsMissing'`, `'SkipTransitions'`. Default: `'AllSites'` - -> Modifies pileupCaller parameter: `--skipTransitions --transitionsMissing` - -#### `--angsd_glmodel` - -Specify which genotype likelihood model to use. Options: `'samtools'`, `'gatk'`, -`'soapsnp'`, `'syk'`. Default: `'samtools'` - -> Modifies ANGSD parameter: `-GL` - -#### `--angsd_glformat` - -Specifies what type of genotyping likelihood file format will be output. -Options: `'text'`, `'binary'`, `'binary_three'`, `'beagle_binary'`. Default: -`'text'`. - -The options refer to the following descriptions respectively: - -- `text`: text output of all 10 log genotype likelihoods. -- `binary`: binary all 10 log genotype likelihood -- `binary_three`: binary 3 times likelihood -- `beagle_binary`: beagle likelihood file - -See the [ANGSD documentation](http://www.popgen.dk/angsd/) for more information -on which to select for your downstream applications. - -> Modifies ANGSD parameter: `-doGlF` - -#### `--angsd_createfasta` - -Turns on the ANGSD creation of a FASTA file from the BAM file. - -#### `--angsd_fastamethod` - -The type of base calling to be performed when creating the ANGSD FASTA file. -Options: `'random'` or `'common'`. 'common' will output the most common non-N base at -each given position, whereas 'random' will pick one at random. Default: -`'random'`.
- -> Modifies ANGSD parameter: `-doFasta -doCounts` - -### Consensus Sequence Generation - -If using TSV input, consensus generation is performed per sample (i.e. after all -types of libraries are merged). - -#### `--run_vcf2genome` - -Turn on consensus sequence genome creation via VCF2Genome. Only accepts GATK -UnifiedGenotyper VCF files with the `--gatk_ug_out_mode 'EMIT_ALL_SITES'` and -`--gatk_ug_genotype_model 'SNP'` flags. Typically useful for small genomes such -as mitochondria. - -#### `--vcf2genome_outfile` - -The name of your requested output FASTA file. Do not include `.fasta` suffix. - -#### `--vcf2genome_header` - -The name of the FASTA entry you would like in your FASTA file. - -#### `--vcf2genome_minc` - -Minimum depth coverage for a SNP to be made. Else, a SNP will be called as N. -Default: `5` - -> Modifies VCF2Genome parameter: `-minc` - -#### `--vcf2genome_minq` - -Minimum genotyping quality of a call to be made. Else N will be called. -Default: `30` - -> Modifies VCF2Genome parameter: `-minq` - -#### `--vcf2genome_minfreq` - -In the case of two possible alleles, the frequency of the majority allele -required for a call to be made. Else, a SNP will be called as N. Default: `0.8` - -> Modifies VCF2Genome parameter: `-minfreq` - -### SNP Table Generation - -SNP Table Generation here is performed by MultiVCFAnalyzer. The current version -of MultiVCFAnalyzer only accepts GATK UnifiedGenotyper 3.5 VCF files, -and only when the ploidy was set to 2 (this allows MultiVCFAnalyzer to report -frequencies of polymorphic positions). A description of how the tool works can -be seen in the Supplementary Information of [Bos et al. -(2014)](https://doi.org/10.1038/nature13591) under "SNP Calling and Phylogenetic -Analysis". - -More can be seen in the [MultiVCFAnalyzer -documentation](https://github.com/alexherbig/MultiVCFAnalyzer). - -If using TSV input, MultiVCFAnalyzer is performed on all samples gathered -together.
- -#### `--run_multivcfanalyzer` - -Turns on MultiVCFAnalyzer. Will only work when in combination with -UnifiedGenotyper genotyping module (see -[`--genotyping_tool`](#--genotyping_tool)). - -#### `--write_allele_frequencies` - -Specify whether to tell MultiVCFAnalyzer to write within the SNP table the -frequencies of the allele at that position e.g. A (70%). - -#### `--min_genotype_quality` - -The minimal genotyping quality for a SNP to be considered for processing by -MultiVCFAnalyzer. The default threshold is `30`. - -#### `--min_base_coverage` - -The minimal number of reads covering a base for a SNP at that position to be -considered for processing by MultiVCFAnalyzer. The default depth is `5`. - -#### `--min_allele_freq_hom` - -The minimal frequency of a nucleotide for a 'homozygous' SNP to be called. In -other words, e.g. 90% of the reads covering that position must have that SNP to -be called. If the threshold is not reached, and the previous two parameters are -matched, a reference call is made (displayed as . in the SNP table). If the -above two parameters are not met, an 'N' is called. The default allele frequency -is `0.9`. - -#### `--min_allele_freq_het` - -The minimum frequency of a nucleotide for a 'heterozygous' SNP to be called. If -this parameter is set to the same as `--min_allele_freq_hom`, then only -homozygous calls are made. If this value is less than the previous parameter, -then a SNP call will be made. If it is between this and the previous parameter, -it will be displayed as a IUPAC uncertainty call. Default is `0.9`. - -#### `--additional_vcf_files` - -If you wish to add to the table previously created VCF files, specify here a -path with wildcards (in quotes). These VCF files must be created the same way as -your settings for [GATK UnifiedGenotyping](#genotyping-parameters) module above. 
- -#### `--reference_gff_annotations` - -If you wish to report in the SNP table annotation information for the regions -SNPs fall in, provide a file in GFF format (the path must be in quotes). - -#### `--reference_gff_exclude` - -If you wish to exclude SNP regions from consideration by MultiVCFAnalyzer (such -as for problematic regions), provide a file in GFF format (the path must be in -quotes). - -#### `--snp_eff_results` - -If you wish to include results from SNPEff effect analysis, supply the output -from SNPEff in txt format (the path must be in quotes). - -### Mitochondrial to Nuclear Ratio - -If using TSV input, Mitochondrial to Nuclear Ratio calculation is calculated per -deduplicated library (after lane merging) - -#### `--run_mtnucratio` - -Turn on the module to estimate the ratio of mitochondrial to nuclear reads. - -#### `--mtnucratio_header` - -Specify the FASTA entry in the reference file specified as `--fasta`, which acts -as the mitochondrial 'chromosome' to base the ratio calculation on. The tool -only accepts the first section of the header before the first space. The default -chromosome name is based on hs37d5/GrCH37 human reference genome. Default: 'MT' - -### Human Sex Determination - -An optional process for human DNA. It can be used to calculate the relative -coverage of X and Y chromosomes compared to the autosomes (X-/Y-rate). Standard -errors for these measurements are also calculated, assuming a binomial -distribution of reads across the SNPs. - -If using TSV input, SexDetERRmine is performed on all samples gathered together. - -#### `--run_sexdeterrmine` - -Specify to run the optional process of sex determination. - -#### `--sexdeterrmine_bedfile` - -Specify an optional bedfile of the list of SNPs to be used for X-/Y-rate -calculation. Running without this parameter will considerably increase runtime, -and render the resulting error bars untrustworthy. 
Theoretically, any set of -SNPs that are distant enough that two SNPs are unlikely to be covered by the -same read can be used here. The programme was coded with the 1240K panel in -mind. The path must be in quotes. - -### Human Nuclear Contamination - -#### `--run_nuclear_contamination` - -Specify to run the optional processes for human nuclear DNA contamination -estimation. - -#### `--contamination_chrom_name` - -The name of the chromosome X in your bam. `'X'` for hs37d5, `'chrX'` for HG19. -Defaults to `'X'`. - -### Metagenomic Screening - -An increasingly common line of analysis in high-throughput aDNA analysis today -is simultaneously screening off target reads of the host for endogenous -microbial signals - particularly of pathogens. Metagenomic screening is -currently offered via MALT with aDNA specific verification via MaltExtract, or -Kraken2. - -Please note the following: - -- :warning: Metagenomic screening is only performed on _unmapped_ reads from a - mapping step. - - You _must_ supply the `--run_bam_filtering` flag with unmapped reads in - FASTQ format. - - If you wish to run solely MALT (i.e. the HOPS pipeline), you must still - supply a small decoy genome like phiX or human mtDNA `--fasta`. -- MALT database construction functionality is _not_ included within the pipeline - - this should be done independently, **prior** the nf-core/eager run. - - To use `malt-build` from the same version as `malt-run`, load either the - Docker, Singularity or Conda environment. -- MALT can often require very large computing resources depending on your - database. We set a absolute minimum of 16 cores and 128GB of memory (which is - 1/4 of the recommendation from the developer). Please leave an issue on the - [nf-core github](https://github.com/nf-core/eager/issues) if you would like to - see this changed. - -> :warning: Running MALT on a server with less than 128GB of memory should be -> performed at your own risk. 
- -If using TSV input, metagenomic screening is performed on all samples gathered -together. - -#### `--run_metagenomic_screening` - -Turn on the metagenomic screening module. - -#### `--metagenomic_tool` - -Specify which taxonomic classifier to use. There are two options available: - -- `kraken` for [Kraken2](https://ccb.jhu.edu/software/kraken2) -- `malt` for [MALT](https://software-ab.informatik.uni-tuebingen.de/download/malt/welcome.html) - -:warning: **Important** It is very important to run `nextflow clean -f` on your -Nextflow run directory once completed. RMA6 files are VERY large and are -_copied_ from a `work/` directory into the results folder. You should clean the -work directory with the command to ensure non-redundancy and large HDD -footprints! - -#### `--database` - -Specify the path to the _directory_ containing your taxonomic classifier's -database (malt or kraken). - -For Kraken2, it can be either the path to the _directory_ or the path to the -`.tar.gz` compressed directory of the Kraken2 database. - -#### `--metagenomic_min_support_reads` - -Specify the minimum number of reads a given taxon is required to have to be -retained as a positive 'hit'. -For malt, this only applies when `--malt_min_support_mode` is set to 'reads'. -Default: `1`. - -> Modifies MALT or kraken_parse.py parameter: `-sup` and `-c` respectively - -#### `--percent_identity` - -Specify the minimum percent identity (or similarity) a sequence must have to the -reference for it to be retained. Default is `85` - -Only used when `--metagenomic_tool malt` is also supplied. - -> Modifies MALT parameter: `-id` - -#### `--malt_mode` - -Use this to run the program in 'BlastN', 'BlastP', 'BlastX' modes to align DNA -and DNA, protein and protein, or DNA reads against protein references -respectively. Ensure your database matches the mode. Check the -[MALT -manual](http://ab.inf.uni-tuebingen.de/data/software/malt/download/manual.pdf) -for more details. 
Default: `'BlastN'` - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MALT parameter: `-m` - -#### `--malt_alignment_mode` - -Specify what alignment algorithm to use. Options are 'Local' or 'SemiGlobal'. -Local is a BLAST like alignment, but is much slower. Semi-global alignment -aligns reads end-to-end. Default: `'SemiGlobal'` - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MALT parameter: `-at` - -#### `--malt_top_percent` - -Specify the top percent value of the LCA algorithm. From the [MALT -manual](http://ab.inf.uni-tuebingen.de/data/software/malt/download/manual.pdf): -"For each read, only those matches are used for taxonomic placement whose bit -disjointScore is within 10% of the best disjointScore for that read.". Default: -`1`. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MALT parameter: `-top` - -#### `--malt_min_support_mode` - -Specify whether to use a percentage, or raw number of reads as the value used to -decide the minimum support a taxon requires to be retained. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MALT parameter: `-sup -supp` - -#### `--malt_min_support_percent` - -Specify the minimum number of reads (as a percentage of all assigned reads) a -given taxon is required to have to be retained as a positive 'hit' in the RMA6 -file. This only applies when `--malt_min_support_mode` is set to 'percent'. -Default 0.01. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MALT parameter: `-supp` - -#### `--malt_max_queries` - -Specify the maximum number of alignments a read can have. All further alignments -are discarded. Default: `100` - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MALT parameter: `-mq` - -#### `--malt_memory_mode` - -How to load the database into memory. Options are `'load'`, `'page'` or `'map'`. 
-'load' directly loads the entire database into memory prior seed look up, this -is slow but compatible with all servers/file systems. `'page'` and `'map'` -perform a sort of 'chunked' database loading, allowing seed look up prior entire -database loading. Note that Page and Map modes do not work properly not with -many remote file-systems such as GPFS. Default is `'load'`. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MALT parameter: `--memoryMode` - -#### `--malt_sam_output` - -Specify to _also_ produce gzipped SAM files of all alignments and un-aligned -reads in addition to RMA6 files. These are **not** soft-clipped or in 'sparse' -format. Can be useful for downstream analyses due to more common file format. - -:warning: can result in very large run output directories as this is -essentially duplication of the RMA6 files. - -> Modifies MALT parameter `-a -f` - -### Metagenomic Authentication - -#### `--run_maltextract` - -Turn on MaltExtract for MALT aDNA characteristics authentication of metagenomic -output from MALT. - -More can be seen in the [MaltExtract -documentation](https://github.com/rhuebler/) - -Only when `--metagenomic_tool malt` is also supplied - -#### `--maltextract_taxon_list` - -Path to a `.txt` file with taxa of interest you wish to assess for aDNA -characteristics. In `.txt` file should be one taxon per row, and the taxon -should be in a valid [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) name -format. - -Only when `--metagenomic_tool malt` is also supplied. - -#### `--maltextract_ncbifiles` - -Path to directory containing containing the NCBI resource tree and taxonomy -table files (ncbi.tre and ncbi.map; available at the [HOPS -repository](https://github.com/rhuebler/HOPS/Resources)). - -Only when `--metagenomic_tool malt` is also supplied. - -#### `--maltextract_filter` - -Specify which MaltExtract filter to use. This is used to specify what types of -characteristics to scan for. 
The default will output statistics on all -alignments, and then a second set with just reads with one C to T mismatch in -the first 5 bases. Further details on other parameters can be seen in the [HOPS -documentation](https://github.com/rhuebler/HOPS/#maltextract-parameters). -Options: `'def_anc'`, `'ancient'`, `'default'`, `'crawl'`, `'scan'`, `'srna'`, -'assignment'. Default: `'def_anc'`. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MaltExtract parameter: `-f` - -#### `--maltextract_toppercent` - -Specify frequency of top alignments for each read to be considered for each node. -Default is 0.01, i.e. 1% of all reads (where 1 would correspond to 100%). - -> :warning: this parameter follows the same concept as `--malt_top_percent` but -> uses a different notation i.e. integer (MALT) versus float (MALTExtract) - -Default: `0.01`. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MaltExtract parameter: `-a` - -#### `--maltextract_destackingoff` - -Turn off destacking. If left on, a read that overlaps with another read will be -removed (leaving a depth coverage of 1). - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MaltExtract parameter: `--destackingOff` - -#### `--maltextract_downsamplingoff` - -Turn off downsampling. By default, downsampling is on and will randomly select -10,000 reads if the number of reads on a node exceeds this number. This is to -speed up processing, under the assumption at 10,000 reads the species is a 'true -positive'. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MaltExtract parameter: `--downSampOff` - -#### `--maltextract_duplicateremovaloff` - -Turn off duplicate removal. By default, reads that are an exact copy (i.e. same -start, stop coordinate and exact sequence match) will be removed as it is -considered a PCR duplicate. - -Only when `--metagenomic_tool malt` is also supplied. 
- -> Modifies MaltExtract parameter: `--dupRemOff` - -#### `--maltextract_matches` - -Export alignments of hits for each node in BLAST format. By default turned off. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MaltExtract parameter: `--matches` - -#### `--maltextract_megansummary` - -Export 'minimal' summary files (i.e. without alignments) that can be loaded into -[MEGAN6](https://doi.org/10.1371/journal.pcbi.1004957). By default turned off. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MaltExtract parameter: `--meganSummary` - -#### `--maltextract_percentidentity` - -Minimum percent identity alignments are required to have to be reported. Higher -values allows fewer mismatches between read and reference sequence, but -therefore will provide greater confidence in the hit. Lower values allow more -mismatches, which can account for damage and divergence of a related -strain/species to the reference. Recommended to set same as MALT parameter or -higher. Default: `85.0`. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MaltExtract parameter: `--minPI` - -#### `maltextract_topalignment` - -Use the best alignment of each read for every statistic, except for those -concerning read distribution and coverage. Default: off. - -Only when `--metagenomic_tool malt` is also supplied. - -> Modifies MaltExtract parameter: `--useTopAlignment` - -### Clean up + - You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses. 
+- nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration +- Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files (unless you use `--run_convertbam`), as only FASTQ files are lane-merged together. +- Same libraries that are sequenced on different sequencing configurations (i.e single- and paired-end data), will be merged after mapping and will _always_ be considered 'paired-end' during downstream processes + - **Important** running DeDup in this context is _not_ recommended, as PE and SE data at the same position will _not_ be evaluated as duplicates. Therefore not all duplicates will be removed. + - When you wish to run PE/SE data together `-dedupper markduplicates` is therefore preferred. + - An error will be thrown if you try to merge both PE and SE and also supply `--skip_merging`. + - If you truly want to mix SE data and PE data but using mate-pair info for PE mapping, please run FASTQ preprocessing mapping manually and supply BAM files for downstream processing by nf-core/eager + - If you _regularly_ want to run the situation above, please leave a feature request on github. +- DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior same-treated library merging). +- nf-core/eager functionality such as `--run_trim_bam` will be applied to only non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries. - Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.). +- Genotyping will be typically performed on each `sample` independently, as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype separately. 
In this case you **must** give each the SS and DS libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file collision` error in steps such as `sexdeterrmine`, and then you will need to merge these yourself. We will consider changing this behaviour in the future if there is enough interest. + +## Clean up Once a run has completed, you will have _lots_ of (some very large) intermediate files in your output directory. These are stored within the directory named diff --git a/environment.yml b/environment.yml index a29487ea0..e0e05c545 100644 --- a/environment.yml +++ b/environment.yml @@ -1,3 +1,5 @@ +# You can use this file to create a conda environment for this pipeline: +# conda env create -f environment.yml name: nf-core-eager-2.2.2 channels: - conda-forge diff --git a/nextflow_schema.json b/nextflow_schema.json index 47d42f133..6c14dc9c9 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -9,7 +9,7 @@ "title": "Input/output options", "type": "object", "fa_icon": "fas fa-terminal", - "description": "Define where the pipeline should find input data and save output data.", + "description": "Define where the pipeline should find input data, and additional metadata.", "required": [ "input" ], @@ -19,14 +19,14 @@ "default": "null", "description": "Either paths or URLs to FASTQ/BAM data (must be surrounded with quotes). For paired end data, the path must use '{1,2}' notation to specify read pairs. Alternatively, a path to a TSV file (ending .tsv) containing file paths and sequencing/sample metadata. Allows for merging of multiple lanes/libraries/samples. Please see documentation for template.", "fa_icon": "fas fa-dna", - "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager.\nThe most efficient but more simplistic is supplying direct paths (with\nwildcards) to your FASTQ or BAM files, with each file or pair being considered a\nsingle library and each one run independently. 
TSV input requires creation of an\nextra file by the user and extra metadata, but allows more powerful lane and\nlibrary merging.\n\n##### Direct Input Method\n\nThis method is where you specify with `--input`, the path locations of FASTQ\n(optionally gzipped) or BAM file(s). This option is mutually exclusive to the\n[TSV input method](#tsv-input-method), which is used for more complex input\nconfigurations such as lane and library merging.\n\nWhen using the direct method of `--input` you can specify one or multiple\nsamples in one or more directories files. File names **must be unique**, even if\nin different directories. \n\nBy default, the pipeline _assumes_ you have paired-end data. If you want to run\nsingle-end data you must specify [`--single_end`]('#single_end')\n\nFor example, for a single set of FASTQs, or multiple paired-end FASTQ\nfiles in one directory, you can specify:\n\n```bash\n--input 'path/to/data/sample_*_{1,2}.fastq.gz'\n```\n\nIf you have multiple files in different directories, you can use additional\nwildcards (`*`) e.g.:\n\n```bash\n--input 'path/to/data/*/sample_*_{1,2}.fastq.gz'\n```\n\n> :warning: It is not possible to run a mixture of single-end and paired-end\n> files in one run with the paths `--input` method! Please see the [TSV input\n> method](#tsv-input-method) for possibilities.\n\n**Please note** the following requirements:\n\n1. Valid file extensions: `.fastq.gz`, `.fastq`, `.fq.gz`, `.fq`, `.bam`.\n2. The path **must** be enclosed in quotes\n3. The path must have at least one `*` wildcard character\n4. When using the pipeline with **paired end data**, the path must use `{1,2}`\n notation to specify read pairs.\n5. Files names must be unique, having files with the same name, but in different\n directories is _not_ sufficient\n - This can happen when a library has been sequenced across two sequencers on\n the same lane. Either rename the file, try a symlink with a unique name, or\n merge the two FASTQ files prior input.\n6. 
Due to limitations of downstream tools (e.g. FastQC), sample IDs may be\n truncated after the first `.` in the name, Ensure file names are unique prior\n to this!\n7. For input BAM files you should provide a small decoy reference genome with\n pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory\n parameter `--fasta` in order to avoid long computational time for generating\n the index files of the reference genome, even if you do not actual need a\n reference genome for any downstream analyses.\n\n##### TSV Input Method\n\nAlternatively to the [direct input method](#direct-input-method), you can supply\nto `--input` a path to a TSV file that contains paths to FASTQ/BAM files and\nadditional metadata. This allows for more complex procedures such as merging of\nsequencing data across lanes, sequencing runs, sequencing configuration types,\nand samples.\n\n


\n\nThe use of the TSV `--input` method is recommended when performing\nmore complex procedures such as lane or library merging. You do not need to\nspecify `--single_end`, `--bam`, `--colour_chemistry`, `-udg_type` etc. when\nusing TSV input - this is defined within the TSV file itself. You can only\nsupply a single TSV per run (i.e. `--input '*.tsv'` will not work).\n\nThis TSV should look like the following:\n\n| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |\n|-------------|------------|------|------------------|--------|----------|--------------|---------------|----|----|-----|\n| JK2782 | JK2782 | 1 | 4 | PE | Mammoth | double | full | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz) | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz) | NA |\n| JK2802 | JK2802 | 2 | 2 | SE | Mammoth | double | full | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz) | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz) | NA |\n\nA template can be taken 
from\n[here](https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/TSV_template.tsv).\n\n> :warning: Cells **must not** contain spaces before or after strings, as this\n> will make the TSV unreadable by nextflow. Strings containing spaces should be\n> wrapped in quotes.\n\nWhen using TSV_input, nf-core/eager will merge FASTQ files of libraries with the\nsame `Library_ID` but different `Lanes` values after adapter clipping (and\nmerging), assuming all other metadata columns are the same. If you have the same\n`Library_ID` but with different `SeqType`, this will be merged directly after\nmapping prior BAM filtering. Finally, it will also merge BAM files with the same\n`Sample_ID` but different `Library_ID` after duplicate removal, but prior to\ngenotyping. Please see caveats to this below.\n\nColumn descriptions are as follows:\n\n- **Sample_Name:** A text string containing the name of a given sample of which\n there can be multiple libraries. All libraries with the same sample name and\n same SeqType will be merged after deduplication.\n- **Library_ID:** A text string containing a given library, which there can be\n multiple sequencing lanes (with the same SeqType).\n- **Lane:** A number indicating which lane the library was sequenced on. Files\n from the libraries sequenced on different lanes (and different SeqType) will\n be concatenated after read clipping and merging.\n- **Colour Chemistry** A number indicating whether the Illumina sequencer the\n library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour\n chemistry machine. This informs whether poly-G trimming (if turned on) should\n be performed.\n- **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with\n both an R1 [or forward] and R2 [or reverse]) and single end data (only R1\n [forward], or BAM). This will affect lane merging if different per library.\n- **Organism:** A text string of the organism name of the sample or 'NA'. 
This\n currently has no functionality and can be set to 'NA', but will affect\n lane/library merging if different per library\n- **Strandedness:** A text string indicating whether the library type is\n 'single' or 'double'. This will affect lane/library merging if different per\n library.\n- **UDG_Treatment:** A text string indicating whether the library was generated\n with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library\n merging if different per library.\n- **R1:** A text string of a file path pointing to a forward or R1 FASTQ file.\n This can be used with the R2 column. File names **must be unique**, even if\n they are in different directories.\n- **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file,\n or 'NA' when single end data. This can be used with the R1 column. File names\n **must be unique**, even if they are in different directories.\n- **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot\n be specified at the same time as R1 or R2, both of which should be set to 'NA'\n\nFor example, the following TSV table:\n\n| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |\n|-------------|------------|------|------------------|---------|----------|--------------|---------------|----------------------------------------------------------------|----------------------------------------------------------------|-----|\n| JK2782 | JK2782 | 7 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L007_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2782 | JK2782 | 8 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2802 | JK2802 | 7 | 4 | PE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L007_R1_001.fastq.gz.tengrand.fq.gz | 
data/JK2802_AGAATAACCTACCA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2802 | JK2802 | 8 | 4 | SE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz | NA | NA |\n\nwill have the following effects:\n\n- After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8\n _with the same `SeqType`_ (and all other _metadata_ columns) will be\n concatenated together for each **Library**.\n- After mapping, and prior BAM filtering, BAM files with different\n `SeqType` (but with all other metadata columns the same) will be merged\n together for each **Library**.\n- After duplicate removal, BAM files with `Library_ID`s with the same\n `Sample_Name` and the same `UDG_Treatment` will be merged together.\n- If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and\n half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the\n same `Sample_Name`.\n\nNote the following important points and limitations for setting up:\n\n- The TSV must use actual tabs (not spaces) between cells.\n- *File* names must be unique regardless of file path, due to risk of\n over-writing (see:\n [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)).\n - If it is 'too late' and you already have duplicate file names, a workaround is\n to concatenate the FASTQ files together and supply this to a nf-core/eager\n run. The only downside is that you will not get independent FASTQC results\n for each file.\n- Lane IDs must be unique for each sequencing of each library.\n - If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can\n give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will\n still be processed correctly.\n - This also applies to the SeqType column, i.e. 
with the example above, if one\n run is PE and one run is SE, you need to give fake lane IDs to one of the\n runs as well.\n- All _BAM_ files must be specified as `SE` under `SeqType`.\n - You should provide a small decoy reference genome with pre-made indices, e.g.\n the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in\n order to avoid long computational time for generating the index files of the\n reference genome, even if you do not actual need a reference genome for any\n downstream analyses.\n- nf-core/eager will only merge multiple _lanes_ of sequencing runs with the\n same single-end or paired-end configuration\n- Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files\n (unless you use `--run_convertbam`), as only FASTQ files are lane-merged\n together.\n- Same libraries that are sequenced on different sequencing configurations (i.e\n single- and paired-end data), will be merged after mapping and will _always_\n be considered 'paired-end' during downstream processes\n - **Important** running DeDup in this context is _not_ recommended, as PE and\n SE data at the same position will _not_ be evaluated as duplicates.\n Therefore not all duplicates will be removed.\n - When you wish to run PE/SE data together `-dedupper markduplicates` is\n therefore preferred.\n - An error will be thrown if you try to merge both PE and SE and also supply\n `--skip_merging`.\n - If you truly want to mix SE data and PE data but using mate-pair info for PE\n mapping, please run FASTQ preprocessing mapping manually and supply BAM\n files for downstream processing by nf-core/eager\n - If you _regularly_ want to run the situation above, please leave a feature\n request on github.\n- DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on\n each unique library separately after deduplication (but prior same-treated\n library merging).\n- nf-core/eager functionality such as `--run_trim_bam` will be applied to only\n non-UDG 
(UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.\n- Qualimap is run on each sample, after merging of libraries (i.e. your values\n will reflect the values of all libraries combined - after being damage trimmed\n etc.).\n- Genotyping will be typically performed on each `sample` independently, as\n normally all libraries will have been merged together. However, if you have a\n mixture of single-stranded and double-stranded libraries, you will normally\n need to genotype separately. In this case you **must** give each the SS and DS\n libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file\n collision` error in steps such as `sexdeterrmine`, and then you will need to\n merge these yourself. We will consider changing this behaviour in the future\n if there is enough interest." + "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager. The most efficient but more simplistic is supplying direct paths (with wildcards) to your FASTQ or BAM files, with each file or pair being considered a single library and each one run independently (e.g. for paired-end data: `--input '///*_{R1,R2}_*.fq.gz'`). TSV input requires creation of an extra file by the user (`--input '///eager_data.tsv'`) and extra metadata, but allows more powerful lane and library merging. Please see [usage docs](https://nf-co.re/eager/docs/usage#input-specifications) for detailed instructions and specifications." }, "udg_type": { "type": "string", "default": "none", "description": "Specifies whether you have UDG treated libraries. Set to 'half' for partial treatment, or 'full' for UDG. If not set, libraries are assumed to have no UDG treatment ('none'). Not required for TSV input.", "fa_icon": "fas fa-vial", - "help_text": "Defines whether Uracil-DNA glycosylase (UDG) treatment was used to remove DNA\ndamage on the sequencing libraries.\n\nSpecify `'none'` if no treatment was performed. 
If you have partial UDG treated\ndata ([Rohland et al 2016](http://dx.doi.org/10.1098/rstb.2013.0624)), specify\n`'half'`. If you have complete UDG treated data ([Briggs et al.\n2010](https://doi.org/10.1093/nar/gkp1163)), specify `'full'`. \n\nWhen also using PMDtools specifying `'half'` will use a different model for DNA\ndamage assessment in PMDTools (PMDtools: `--UDGhalf`). Specify `'full'` and the\nPMDtools DNA damage assessment will use CpG context only (PMDtools: `--CpG`).\nDefault: `'none'`.\n\n> **Tip**: You should provide a small decoy reference genome with pre-made indices, e.g.\n> the human mtDNA genome, for the mandatory parameter `--fasta` in order to\n> avoid long computational time for generating the index files of the reference\n> genome, even if you do not actual need a reference genome for any downstream\n> analyses.", + "help_text": "Defines whether Uracil-DNA glycosylase (UDG) treatment was used to remove DNA\ndamage on the sequencing libraries.\n\nSpecify `'none'` if no treatment was performed. If you have partial UDG treated\ndata ([Rohland et al 2016](http://dx.doi.org/10.1098/rstb.2013.0624)), specify\n`'half'`. If you have complete UDG treated data ([Briggs et al.\n2010](https://doi.org/10.1093/nar/gkp1163)), specify `'full'`. \n\nWhen also using PMDtools specifying `'half'` will use a different model for DNA\ndamage assessment in PMDTools (PMDtools: `--UDGhalf`). 
Specify `'full'` and the\nPMDtools DNA damage assessment will use CpG context only (PMDtools: `--CpG`).\nDefault: `'none'`.\n\n> **Tip**: You should provide a small decoy reference genome with pre-made indices, e.g.\n> the human mtDNA genome, for the mandatory parameter `--fasta` in order to\n> avoid long computational time for generating the index files of the reference\n> genome, even if you do not actually need a reference genome for any downstream\n> analyses.", "enum": [ "none", "half", @@ -59,11 +59,6 @@ "help_text": "Specifies the input file type to `--input` is in BAM format. This will automatically also apply `--single_end`.\n\nOnly required when using the 'Path' method of `--input`.\n" } }, - "fa_icon": "fas fa-terminal", - "description": "Define where the pipeline should find input data, and additional metadata.", - "required": [ - "input" - ], "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager.\nThe most efficient but more simplistic is supplying direct paths (with\nwildcards) to your FASTQ or BAM files, with each file or pair being considered a\nsingle library and each one run independently. TSV input requires creation of an\nextra file by the user and extra metadata, but allows more powerful lane and\nlibrary merging." }, "input_data_additional_options": { @@ -155,7 +150,6 @@ "help_text": "Use this if you do not have pre-made reference FASTA indices for `bwa`, `samtools` and `picard`. If you turn this on, the indices nf-core/eager generates for you and will be saved in the `/results/reference_genomes` for you. 
If not supplied, nf-core/eager generated index references will be deleted.\n\n> modifies SAMtools index command: `-c`" } }, - "fa_icon": "fas fa-dna", "description": "Specify locations of references and optionally, additional pre-made indices", "help_text": "All nf-core/eager runs require a reference genome in FASTA format to map reads\nagainst to.\n\nIn addition we provide various options for indexing of different types of\nreference genomes (based on the tools used in the pipeline). nf-core/eager can\nindex reference genomes for you (with options to save these for other analysis),\nbut you can also supply your pre-made indices.\n\nSupplying pre-made indices saves time in pipeline execution and is especially\nadvised when running multiple times on the same cluster system for example. You\ncan even add a resource [specific profile](#profile) that sets paths to\npre-computed reference genomes, saving time when specifying these.\n\n> :warning: you must always supply a reference file. If you want to use\n functionality that does not require one, supply a small decoy genome such as\n phiX or the human mtDNA genome." }, @@ -311,7 +305,7 @@ "title": "Institutional config options", "type": "object", "fa_icon": "fas fa-university", - "description": "Parameters used to describe centralised config profiles. These should not be edited.", + "description": "Parameters used to describe centralised config profiles. These generally should not be edited.", "help_text": "The centralised nf-core configuration profiles use a handful of pipeline parameters to describe themselves. This information is then printed to the Nextflow log when you run a pipeline. You should not need to change these values when you run a pipeline.", "properties": { "custom_config_version": { @@ -370,10 +364,7 @@ "description": "Path to the AWS CLI tool", "fa_icon": "fab fa-aws" } - }, - "fa_icon": "fas fa-university", - "description": "Parameters used to describe centralised config profiles. 
These generally should not be edited.", - "help_text": "The centralised nf-core configuration profiles use a handful of pipeline parameters to describe themselves. This information is then printed to the Nextflow log when you run a pipeline. You should not need to change these values when you run a pipeline." + } }, "skip_steps": { "title": "Skip steps", @@ -1584,4 +1575,4 @@ "$ref": "#/definitions/metagenomic_authentication" } ] -} +} \ No newline at end of file