Dada2 MergePairs: consensus between merging & concatenating reads #803

Open · wants to merge 15 commits into dev
Conversation

@weber8thomas (Author) commented Nov 20, 2024

Authors

  • @weber8thomas - Thomas Weber (EMBL Heidelberg) - Technical implementation, making the method parameters configurable from the pipeline config, testing, and ensuring compliance with nf-core requirements.
  • @nhenry50 - Nicolas Henry (CNRS - ABiMS) - Concept & Methods development, algorithm design, benchmarking
  • @lplanat - Laurine Planat (EMBL Heidelberg) - Testing, feedback, benchmarking
  • @FloraVincent - Flora Vincent (EMBL Heidelberg) - Conceptualisation, project supervision

Context & description

This pull request enhances the DADA2::mergePairs() step of the pipeline by enabling conditional merging or concatenation of sequences based on the overlap between forward and reverse reads. Previously, the pipeline only allowed one of the two methods (merging or concatenating paired-end reads), selected via the --concatenate_reads parameter. This enhancement introduces a "consensus" method that lets the pipeline dynamically choose the appropriate method, merging or concatenation, for each read pair, thereby improving sequence assembly accuracy and downstream analysis outcomes.

The core of the enhancement is conditional logic that assesses the overlap between paired-end reads and decides whether to merge or concatenate them. The decision is based on a specified overlap threshold, so that only reads with adequate overlap are merged, while the others are concatenated with a defined spacer.

Enhancement Highlights:

  • Dual Invocation of mergePairs:
    • Merging: Invoked with justConcatenate = FALSE to attempt merging where possible.
    • Concatenation: Invoked with justConcatenate = TRUE to concatenate reads where merging isn't feasible.
  • Overlap Threshold Calculation:
    • Calculates a minimum overlap threshold (min_overlap_obs) based on accepted mergers.
    • (By default) Utilizes the 0.1th percentile (quantile(min_overlap_obs, 0.001)) to determine a stringent cutoff; this can be customised through asv_percentile_cutoff (later renamed mergepairs_consensus_percentile_cutoff). This ensures that only read pairs with sufficiently high overlap are merged into consensus sequences, while those with insufficient overlap are concatenated, thereby maintaining sequence accuracy and integrity.
  • Conditional Replacement:
    • Iterates through each sample's mergers.
    • Replaces non-accepted mergers with concatenated sequences if the overlap falls below the threshold.
    • Filters out any remaining non-accepted, non-concatenated sequences to maintain data integrity (the logic is sketched below).
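
For illustration, the consensus logic can be sketched in R roughly as follows. This is a simplified, single-sample sketch rather than the exact pipeline code; dadaF, filtF, dadaR and filtR are placeholder objects (the pipeline iterates over samples), and it assumes both mergePairs() calls enumerate the same read pairings in the same order.

    library(dada2)

    # Attempt merging, keeping rejected pairings for inspection
    merged <- mergePairs(dadaF, filtF, dadaR, filtR,
                         minOverlap = 12, maxMismatch = 0,
                         returnRejects = TRUE, verbose = TRUE)
    # Concatenate the same pairings (DADA2 inserts a 10-N spacer)
    concat <- mergePairs(dadaF, filtF, dadaR, filtR,
                         justConcatenate = TRUE, verbose = TRUE)

    # Stringent overlap cutoff derived from the accepted mergers
    overlap         <- merged$nmatch + merged$nmismatch
    min_overlap_obs <- overlap[merged$accept]
    cutoff          <- quantile(min_overlap_obs, 0.001)  # 0.1th percentile

    # Rejected pairings with overlap below the cutoff become concatenations;
    # remaining non-accepted, non-concatenated pairings are dropped
    to_concat <- !merged$accept & overlap < cutoff
    merged[to_concat, ] <- concat[to_concat, ]
    merged <- merged[merged$accept | to_concat, ]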

Parameters changed or introduced

  • mergepairs_strategy (formerly concatenate_reads; now expects a string value)
    • Options:
      • "merge" (former false): merge overlapping read pairs (already existing behaviour)
      • "concatenate" (former true): concatenate read pairs instead of merging them (already existing behaviour)
      • "consensus": enables conditional merging or concatenation based on the overlap between reads (new)

By default, when mergepairs_strategy is set to merge or concatenate, fixed generic values are still used for the paired-read alignment (see below):

match = 1, mismatch = -64, gap = -64, minOverlap = 12, maxMismatch = 0

However, when mergepairs_strategy is set to consensus, it is now possible to tweak those values in order to adjust the alignment between the reads of each pair. This can be done by modifying the following newly introduced parameters (an example invocation follows the list):

  • mergepairs_consensus_minoverlap
    • Description: Sets the minimum required overlap length for merging paired-end reads.
    • Default Value: 12
    • Usage: Determines the threshold below which reads will not be merged and may be concatenated instead.
  • mergepairs_consensus_maxmismatch
    • Description: Defines the maximum allowed mismatches during the merging process.
    • Default Value: 0
    • Usage: Controls the stringency of the merging criteria by limiting the number of mismatches permitted between overlapping regions.
  • mergepairs_consensus_gap
    • Description: Specifies the gap penalty used during the alignment process in merging.
    • Default Value: -4
    • Usage: Influences the alignment algorithm's handling of gaps, affecting the quality of merged sequences.
  • mergepairs_consensus_match
    • Description: Sets the match score for the alignment algorithm.
    • Default Value: 1
    • Usage: Determines the scoring for matching bases during the alignment, impacting the alignment sensitivity.
  • mergepairs_consensus_mismatch
    • Description: Sets the mismatch penalty for the alignment algorithm.
    • Default Value: -2
    • Usage: Determines the penalty for mismatched bases during alignment, affecting the alignment specificity.
  • mergepairs_consensus_percentile_cutoff
    • Description: Sets the percentile cutoff determining the minimum observed overlap in the dataset.
    • Default Value: 0.001
    • Usage: This ensures that only read pairs with high overlap are merged into consensus sequences. Those with insufficient overlap are concatenated.
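
For example, a consensus run could be launched as follows. This is an illustrative command: the test profile stands in for real input, and the consensus options are shown with their default values (the alignment scores such as mergepairs_consensus_match, mergepairs_consensus_mismatch and mergepairs_consensus_gap can be set the same way).

    nextflow run nf-core/ampliseq -profile test,docker --outdir <OUTDIR> \
        --mergepairs_strategy consensus \
        --mergepairs_consensus_minoverlap 12 \
        --mergepairs_consensus_maxmismatch 0 \
        --mergepairs_consensus_percentile_cutoff 0.001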

This approach is particularly beneficial for datasets containing both prokaryotic and eukaryotic sequences, where amplicon lengths vary and the extent of overlap therefore differs between read pairs.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/ampliseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@d4straub (Collaborator) left a comment

Thanks, that looks much cleaner!

(1) When I run

nextflow run weber8thomas/ampliseq -r consensus-pr -profile test,singularity --outdir results_consensus-pr -resume --skip_qiime

I get the warning

WARN: Access to undefined parameter `concatenate_reads` -- Initialise it to a default value eg. `params.concatenate_reads = some_value

(2) when appending --asv_concatenate_reads I get the error

ERROR ~ Validation of pipeline parameters failed!

 -- Check '.nextflow.log' file for details
The following invalid input values have been detected:

* --asv_concatenate_reads (true): Expected any of [[false, true, consensus]]

(3) appending --concatenate_reads "consensus" results in

ERROR ~ Unknown config attribute `params.match` -- check config file: /home/daniel/.nextflow/assets/weber8thomas/ampliseq/nextflow.config

I think all problems should be solved by my comments.
Edit: that's not true, all occurrences of concatenate_reads have to be replaced by asv_concatenate_reads to solve (1)

…r definition in the process , update CHANGELOG, fix asv_concatenate_reads definition, fix default asv_match & asv_mismatch values
@weber8thomas (Author) commented Nov 20, 2024


Thanks for the feedback @d4straub! I implemented your suggestions and the execution worked on my side both with and without --asv_concatenate_reads consensus.

EDIT: I just noticed an issue with how the match & mismatch values should be defined; here is the original code from @nhenry50 below:

        ext.args2 = [
            'minOverlap = 12, maxMismatch = 0, propagateCol = character(0), gap = -64, homo_gap = NULL, endsfree = TRUE, vec = FALSE',
            params.concatenate_reads == "consensus" ? "returnRejects = TRUE, match = 5, mismatch = -6" :
                params.concatenate_reads == "concatenate" ? "justConcatenate = TRUE, returnRejects = FALSE, match = 1, mismatch = -64" :
                "justConcatenate = FALSE, returnRejects = FALSE, match = 1, mismatch = -64"
        ].join(',').replaceAll('(,)*$', "")

How could I define default values of match = 1, mismatch = -64 if consensus is not used, and match = 5, mismatch = -6 if it is used?

@d4straub (Collaborator) commented Nov 21, 2024

> How could I define default values of match = 1, mismatch = -64 if consensus is not used, and match = 5, mismatch = -6 if it is used?

That might be an issue. I can currently only think of using two params (introducing one more), which will complicate things, or removing the params and fixing the values in the config. That way the values are still modifiable using a config, but less accessible for the standard user, because working with configs requires more knowledge. Therefore, if those values do not need to be changed all the time and the values you propose here are usually fine, I'd say remove the params and put the values in the config (e.g. mismatch = -64 instead of mismatch = ${params.asv_mismatch}, and remove --asv_mismatch altogether).

@weber8thomas (Author)


Actually, match = 1, mismatch = -64 should stay fixed when consensus is not used, but we would like to give users the possibility to adjust these values when consensus is enabled. Would the last commit be okay in that regard? The user would only be able to play with parameters like asv_match, asv_mismatch, ... when --asv_concatenate_reads consensus is set.
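
For illustration, one way to express this behaviour in conf/modules.config is sketched below, using the parameter names adopted later in this thread (mergepairs_strategy and the mergepairs_consensus_* options); this is an assumed, simplified snippet rather than the exact code merged in the PR.

    // Hypothetical sketch: tunable alignment values only for the consensus strategy,
    // fixed generic values otherwise
    ext.args2 = params.mergepairs_strategy == "consensus"
        ? "returnRejects = TRUE, minOverlap = ${params.mergepairs_consensus_minoverlap}, maxMismatch = ${params.mergepairs_consensus_maxmismatch}, match = ${params.mergepairs_consensus_match}, mismatch = ${params.mergepairs_consensus_mismatch}, gap = ${params.mergepairs_consensus_gap}"
        : params.mergepairs_strategy == "concatenate"
            ? "justConcatenate = TRUE, minOverlap = 12, maxMismatch = 0, match = 1, mismatch = -64, gap = -64"
            : "justConcatenate = FALSE, minOverlap = 12, maxMismatch = 0, match = 1, mismatch = -64, gap = -64"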

@d4straub (Collaborator)

> Would the last commit be okay in that regard?

Yes, but the documentation should be adjusted accordingly. And the parameter names are also not really fitting any more.

@weber8thomas (Author) commented Nov 21, 2024


Will work on this! Which prefix would you suggest for the parameter names: asv_consensus_, asv_mergepairs_consensus_, mergepairs_consensus_, or denoising_consensus_?

@d4straub (Collaborator)

mergepairs_consensus_ sounds good I think, that's precise. Could you also rename the asv_ prefix? mergepairs_ seems better suited, since it only affects PE reads!?
And asv_concatenate_reads, now that it has 3 choices, also seems imperfect. Maybe just mergepairs, or mergepairs_type, or such? Not sure, I'm sure you'll find an intuitive name.

…sensus_" and change "asv_concatenate_reads" to "mergepairs_strategy"
@weber8thomas (Author)


Thanks for the feedback! :)

I updated all the newly introduced parameters with the prefix mergepairs_consensus_ and changed asv_concatenate_reads to mergepairs_strategy; I think it captures the idea properly.

@d4straub (Collaborator) left a comment

I like --mergepairs_strategy!

Ok, almost there, I still found a few points. If you don't mind, those can also be added by going to the "Files changed" tab, clicking "Add suggestion to batch", and then committing them.

After those additions it should work as expected and I can test again.

The last piece would be to fix the formatting here with @nf-core-bot fix linting

@nf-core-bot (Member)

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@d4straub (Collaborator)

@nf-core-bot fix linting

@d4straub (Collaborator) left a comment

Dear @weber8thomas, that PR seems ready to be merged now. The only missing piece to perfection is testing. You mention that you did benchmarking; is there any data that you can share? The test data here would only be for technically testing that everything works fine. Do you think any paired-end data would be sufficient? If yes, then I would simply use --mergepairs_strategy consensus in an already existing test run.

@weber8thomas (Author)


Hi @d4straub, sorry for the late reply (and happy new year 🥳). We generated additional benchmarks in order to validate and justify the custom values we are using for the DADA2 alignment (gap, mismatch, match), in particular using a more robust grid search. I'll share this soon.
Regarding the test dataset, any PE data should be sufficient to technically validate that the function is working fine.
Just out of curiosity, would there be any light datasets containing both eukaryotes and prokaryotes that we could leverage in the ampliseq test-datasets repo?
Thanks!

@d4straub (Collaborator) commented Jan 9, 2025

> I'll share this soon.

Great, looking forward!

> Regarding the test dataset, any PE data should be sufficient to technically validate that the function is working fine.

Ok, I'll do that then in a separate PR.

> Just out of curiosity, would there be any light datasets containing both eukaryotes and prokaryotes that we could leverage in the ampliseq test-datasets repo?

I am actually not sure, several people added test datasets and I never looked at them in that way.

@d4straub (Collaborator)

> Regarding the test dataset, any PE data should be sufficient to technically validate that the function is working fine.

Alright, I'm going to activate --mergepairs_strategy consensus in at least one test profile in a separate PR

> Just out of curiosity, would there be any light datasets containing both eukaryotes and prokaryotes that we could leverage in the ampliseq test-datasets repo?

I am not sure, I don't think so.

I would prefer to merge that PR now so that I can work towards a release (whatever little time I have for that). Any objections?
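
Following up on the note above about activating the strategy in a test profile: switching an existing test profile to the consensus strategy would only require setting the parameter in its params block, along the lines of this hypothetical excerpt (the actual change was left for a separate PR).

    // conf/test.config (hypothetical excerpt)
    params {
        mergepairs_strategy = "consensus"
    }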

@weber8thomas (Author)


Hi @d4straub, I'm just waiting for validation from @nhenry50 and @FloraVincent, will update you asap!

@weber8thomas (Author)

I just updated the values according to the best performance in our benchmark, and Flora & Nicolas validated the merge. You're good to go! :)

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, $args2, verbose=TRUE)
}

saveRDS(mergers, "${meta.run}.mergers.rds")
Collaborator review comment on the diff:
Suggested change
saveRDS(mergers, "${meta.run}.mergers.rds")
saveRDS(mergers, "${prefix}.mergers.rds")

Better late than never: this seems to me like a breaking change under certain circumstances, was this intentional?
