Adding shortread deduplication feature with fastp #439

maxibor · 2024-01-25T15:49:49Z

This PR adds optional deduplication of short reads with fastp.

PR checklist

This comment contains a description of changes (with reason).
Make sure your code lints (`nf-core lint`).
Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
`CHANGELOG.md` is updated.

github-actions · 2024-01-25T15:52:01Z

`nf-core lint` overall result: Failed ❌

Posted for pipeline commit 7e3f119

✅ 183 tests passed
❔   1 tests were ignored
❌  11 tests failed

❌ Test failures:

files_exist - File must be removed: lib/nfcore_external_java_deps.jar
nextflow_config - Config default value incorrect: params.igenomes_base is set as s3://ngi-igenomes/igenomes in nextflow_schema.json but is s3://ngi-igenomes/igenomes/ in nextflow.config.
files_unchanged - .github/workflows/branch.yml does not match the template
files_unchanged - .github/workflows/linting_comment.yml does not match the template
files_unchanged - .github/workflows/linting.yml does not match the template
files_unchanged - assets/email_template.html does not match the template
files_unchanged - assets/email_template.txt does not match the template
files_unchanged - assets/nf-core-taxprofiler_logo_light.png does not match the template
files_unchanged - docs/images/nf-core-taxprofiler_logo_light.png does not match the template
files_unchanged - docs/images/nf-core-taxprofiler_logo_dark.png does not match the template
files_unchanged - pyproject.toml does not match the template

❔ Tests ignored:

files_unchanged - File does not exist: lib/nfcore_external_java_deps.jar

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: assets/nf-core-taxprofiler_logo_light.png
files_exist - File found: conf/modules.config
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/images/nf-core-taxprofiler_logo_light.png
files_exist - File found: docs/images/nf-core-taxprofiler_logo_dark.png
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: lib/NfcoreTemplate.groovy
files_exist - File found: lib/Utils.groovy
files_exist - File found: lib/WorkflowMain.groovy
files_exist - File found: main.nf
files_exist - File found: assets/multiqc_config.yml
files_exist - File found: conf/base.config
files_exist - File found: conf/igenomes.config
files_exist - File found: .github/workflows/awstest.yml
files_exist - File found: .github/workflows/awsfulltest.yml
files_exist - File found: lib/WorkflowTaxprofiler.groovy
files_exist - File found: modules.json
files_exist - File found: pyproject.toml
files_exist - File not found check: Singularity
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: docs/images/nf-core-taxprofiler_logo.png
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: .travis.yml
nextflow_config - Config variable found: manifest.name
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: manifest.homePage
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: params.validationShowHiddenParams
nextflow_config - Config variable found: params.validationSchemaIgnoreParams
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config manifest.name began with nf-core/
nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config manifest.version ends in dev: 1.1.5dev
nextflow_config - Config params.custom_config_version is set to master
nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Lines for loading custom profiles found
nextflow_config - nextflow.config contains configuration profile test
nextflow_config - Config default value correct: params.preprocessing_qc_tool
nextflow_config - Config default value correct: params.shortread_qc_tool
nextflow_config - Config default value correct: params.shortread_qc_minlength
nextflow_config - Config default value correct: params.shortread_complexityfilter_tool
nextflow_config - Config default value correct: params.shortread_complexityfilter_entropy
nextflow_config - Config default value correct: params.shortread_complexityfilter_bbduk_windowsize
nextflow_config - Config default value correct: params.shortread_complexityfilter_fastp_threshold
nextflow_config - Config default value correct: params.shortread_complexityfilter_prinseqplusplus_mode
nextflow_config - Config default value correct: params.shortread_complexityfilter_prinseqplusplus_dustscore
nextflow_config - Config default value correct: params.longread_qc_qualityfilter_minlength
nextflow_config - Config default value correct: params.longread_qc_qualityfilter_keeppercent
nextflow_config - Config default value correct: params.longread_qc_qualityfilter_targetbases
nextflow_config - Config default value correct: params.diamond_output_format
nextflow_config - Config default value correct: params.kaiju_taxon_rank
nextflow_config - Config default value correct: params.krakenuniq_ram_chunk_size
nextflow_config - Config default value correct: params.krakenuniq_batch_size
nextflow_config - Config default value correct: params.malt_mode
nextflow_config - Config default value correct: params.kmcp_mode
nextflow_config - Config default value correct: params.ganon_report_type
nextflow_config - Config default value correct: params.ganon_report_toppercentile
nextflow_config - Config default value correct: params.ganon_report_mincount
nextflow_config - Config default value correct: params.ganon_report_maxcount
nextflow_config - Config default value correct: params.standardisation_taxpasta_format
nextflow_config - Config default value correct: params.custom_config_version
nextflow_config - Config default value correct: params.custom_config_base
nextflow_config - Config default value correct: params.max_cpus
nextflow_config - Config default value correct: params.max_memory
nextflow_config - Config default value correct: params.max_time
nextflow_config - Config default value correct: params.publish_dir_mode
nextflow_config - Config default value correct: params.max_multiqc_email_size
nextflow_config - Config default value correct: params.validate_params
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - CODE_OF_CONDUCT.md matches the template
files_unchanged - LICENSE matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/CONTRIBUTING.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/bug_report.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
files_unchanged - assets/sendmail_template.txt matches the template
files_unchanged - docs/README.md matches the template
files_unchanged - lib/NfcoreTemplate.groovy matches the template
files_unchanged - .gitignore matches the template
files_unchanged - .prettierignore matches the template
actions_ci - '.github/workflows/ci.yml' is triggered on expected events
actions_ci - '.github/workflows/ci.yml' checks minimum NF version
actions_awstest - '.github/workflows/awstest.yml' is triggered correctly
actions_awsfulltest - .github/workflows/awsfulltest.yml is triggered correctly
actions_awsfulltest - .github/workflows/awsfulltest.yml does not use -profile test
readme - README Nextflow minimum version badge matched config. Badge: 23.04.0, Config: 23.04.0
readme - README Zenodo placeholder was replaced with DOI.
pipeline_todos - No TODO strings found
pipeline_name_conventions - Name adheres to nf-core convention
template_strings - Did not find any Jinja template strings (242 files)
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
system_exit - No System.exit calls found
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: release-announcements.yml
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: fix-linting.yml
actions_schema_validation - Workflow validation passed: awsfulltest.yml
actions_schema_validation - Workflow validation passed: ci.yml
actions_schema_validation - Workflow validation passed: clean-up.yml
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: awstest.yml
merge_markers - No merge markers found in pipeline files
modules_json - Only installed modules found in modules.json
multiqc_config - 'assets/multiqc_config.yml' contains report_section_order
multiqc_config - 'assets/multiqc_config.yml' contains export_plots
multiqc_config - 'assets/multiqc_config.yml' contains report_comment
multiqc_config - 'assets/multiqc_config.yml' follows the ordering scheme of the minimally required plugins.
multiqc_config - 'assets/multiqc_config.yml' contains a matching 'report_comment'.
multiqc_config - 'assets/multiqc_config.yml' contains 'export_plots: true'.
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'

Run details

nf-core/tools version 2.12
Run at 2024-02-01 10:33:30

maxibor · 2024-01-25T16:49:24Z

An easy PR for you @jfy133 to offset #548 😉

jfy133

Indeed this is much easier 😬

Doc improvements but otherwise LGTM. @sofstam @LilyAnderssonLee should we add this to 1.1.5 or to bouncy basenji? It's not a bug fix exactly... but it's not a fully fledged new functionality addition?

We do potentially have the centrifuge fix coming up (if I can fix it...) so there likely will be a 1.1.5.

sofstam · 2024-01-26T07:58:46Z

I would say it is 1.1.5.

sofstam · 2024-01-26T07:59:19Z

Getting back with a review later today :)

Midnighter · 2024-01-28T10:01:47Z

I'm somewhat concerned with this feature. If I remember correctly, FastQC and fastp use the first 50 and 75 bp, respectively, to judge read duplication. Using longer sequences would drive up memory requirements and take longer. So the first question is, are we truly only removing identical reads with this?

My second question comes from my inexperience with sequencing: If you have a dominant species in your metagenomic sample, how unlikely is it to have an identical read?

jfy133 · 2024-01-28T10:31:46Z

> Using longer sequences would drive up memory requirements and take longer. So the first question is, are we truly only removing identical reads with this?

Does it really? The README at least seems to imply it's some condensed hash of the whole read: https://github.com/OpenGene/fastp#duplication-rate-evaluation. That said, it's opt-in, so it's still up to the user to decide if it's a suitable algorithm.

> My second question comes from my inexperience with sequencing: If you have a dominant species in your metagenomic sample, how unlikely is it to have an identical read?

An absolutely exact duplicate is quite unlikely, as:

- Fragmentation protocols should be random (with a slight preference for breakages around GCs, IIRC), so in combination with (relatively) longer reads an exact duplicate is unlikely due to sequence diversity.
- Exact duplicates are much more likely from lab-based amplicons, as they use the same priming sequence, and given the number of amplification cycles they are also very likely to be copies from artificial duplication rather than naturally occurring. At least in Illumina short-read protocols, that is.
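
To make "removing only identical reads" concrete, here is a minimal Python sketch of exact-match deduplication that keeps a fixed-size hash of each whole read sequence instead of the sequence itself. This is only an illustration of the general idea and is not fastp's actual implementation; the function name `dedup_exact` and the toy reads are made up for the example.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

# A read as (name, sequence, quality); a hypothetical in-memory stand-in for a FASTQ record.
Read = Tuple[str, str, str]


def dedup_exact(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield the first read for every distinct full-length sequence."""
    seen = set()
    for name, seq, qual in reads:
        # Keep a 16-byte digest per distinct sequence, so memory per entry
        # does not grow with read length.
        digest = hashlib.blake2b(seq.encode("ascii"), digest_size=16).digest()
        if digest in seen:
            continue  # same sequence seen before: drop it
        seen.add(digest)
        yield name, seq, qual


reads = [
    ("r1", "ACGTACGT", "IIIIIIII"),
    ("r2", "ACGTACGT", "IIIIIIII"),  # exact copy of r1
    ("r3", "ACGTACGA", "IIIIIIII"),  # differs from r1 in the last base, so it is kept
]
kept = [name for name, _, _ in dedup_exact(reads)]
print(kept)
```

In this sketch a single mismatching base is enough for a read to be kept, which is the behaviour the question above is asking about; a real tool may trade some accuracy for memory in ways not shown here.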

Midnighter · 2024-01-29T18:33:22Z

Thank you for your response, sounds good to me then. 👍🏼

LilyAnderssonLee and others added 7 commits October 11, 2023 08:49

Merge pull request nf-core#388 from nf-core/dev

4add6c9

Patch release V1.1.1

Merge pull request nf-core#399 from nf-core/dev

baede5b

Fix CHANGELOG.md for patch release v1.1.1

Merge pull request nf-core#411 from nf-core/dev

3d4eda2

1.1.2 release

Merge pull request nf-core#426 from nf-core/dev

df4e001

1.1.3 Patch release

Merge pull request nf-core#435 from nf-core/dev

21e6af8

1.1.4 Patch release

add fastp dedup

6c752a9

Merge branch 'fastp_dedup' into dev

837a239

maxibor requested review from sofstam and jfy133 and removed request for sofstam January 25, 2024 16:48

update changelog

2570c0f

jfy133 reviewed Jan 26, 2024

CHANGELOG.md (outdated, resolved)

nextflow_schema.json (outdated, resolved)

nextflow_schema.json (outdated, resolved)

maxibor and others added 3 commits January 26, 2024 09:09

Update CHANGELOG.md

e6fdf57

Co-authored-by: James A. Fellows Yates <[email protected]>

Update nextflow_schema.json

6867467

Co-authored-by: James A. Fellows Yates <[email protected]>

Update nextflow_schema.json

47edb12

Co-authored-by: James A. Fellows Yates <[email protected]>

sofstam approved these changes Jan 26, 2024

sofstam and others added 2 commits January 26, 2024 17:04

Merge branch 'dev' into dedup

1fed057

Prettier

ea35f93

jfy133 reviewed Feb 1, 2024

nextflow_schema.json (outdated, resolved)

Apply suggestions from code review

7e3f119

jfy133 merged commit 0792d91 into nf-core:dev Feb 1, 2024
8 of 9 checks passed
