-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dada2 MergePairs : consensus between merging & concatenating reads #803
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that looks much cleaner!
(1) When I run
nextflow run weber8thomas/ampliseq -r consensus-pr -profile test,singularity --outdir results_consensus-pr -resume --skip_qiime
the warning
WARN: Access to undefined parameter `concatenate_reads` -- Initialise it to a default value eg. `params.concatenate_reads = some_value
(2) when appending --asv_concatenate_reads
I get the warning
ERROR ~ Validation of pipeline parameters failed!
-- Check '.nextflow.log' file for details
The following invalid input values have been detected:
* --asv_concatenate_reads (true): Expected any of [[false, true, consensus]]
(3) appending --concatenate_reads "consensus"
results into
ERROR ~ Unknown config attribute `params.match` -- check config file: /home/daniel/.nextflow/assets/weber8thomas/ampliseq/nextflow.config
I think all problems should be solved by my comments.
Edit: thats not true, all occurrences of concatenate_reads
have to be replaced by asv_concatenate_reads
to solve (1)
…r definition in the process , update CHANGELOG, fix asv_concatenate_reads definition, fix default asv_match & asv_mismatch values
Thanks for the feedback @d4straub ! I implemented your suggestions and the execution worked on my side with and without EDIT : just noticed I have an issue with how match & mismatch values should be defined, here is the original code from @nhenry50 below: ext.args2 = [
'minOverlap = 12, maxMismatch = 0, propagateCol = character(0), gap = -64, homo_gap = NULL, endsfree = TRUE, vec = FALSE',
params.concatenate_reads == "consensus" ? "returnRejects = TRUE, match = 5, mismatch = -6" :
params.concatenate_reads == "concatenate" ? "justConcatenate = TRUE, returnRejects = FALSE, match = 1, mismatch = -64" :
"justConcatenate = FALSE, returnRejects = FALSE, match = 1, mismatch = -64"
].join(',').replaceAll('(,)*$', "") How could I define default values to match = 1, mismatch = -64 if consensus is not used, and match = 5, mismatch = -6 if used ? |
That might be an issue. I can currently only think of using two params (introducing one more), which will complicate things, or remove the params and fix the values in the config. That way the values are still modifiable using a config, but less accessible for the standard user because working with configs requires more knowledge. Therefore, if those values do not need to be changed all the time and the values you propose here are usually fine, I'd say remove the params and put the values in the config (e.g. |
…eters when enabled
Actually, match = 1, mismatch = -64 should be fixed when consensus is not used but we would like to give the possibility to adjust values when consensus is enabled. Would the last commit be okay in that regard? The user would only be able to play with parameters like |
Yes, but the documentation should be adjusted accordingly. And the parameter names are also not really fitting any more. |
Will work on this! Which prefix would you suggest regarding the parameter names : asv_consensus_, asv_mergepairs_consensus_, mergepairs_consensus_, denoising_consensus_ ? |
|
…sensus_" and change "asv_concatenate_reads" to "mergepairs_strategy"
Thanks for the feedback! :) I updated all the different parameters introduced with prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like --mergepairs_strategy
!
Ok, almost there, I still found a few points. If you dont mind those can be also added by going into the "Files changed" tab and click on "Add suggestion to batch" and then committing those.
After that additions it should work as expected and I can test again.
The last piece would be to fix the formatting here with @nf-core-bot fix linting
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.0.2. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
@nf-core-bot fix linting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear @weber8thomas , that PR seems ready to be merged now. Only missing piece to perfection is testing. You mention that you did benchmarking, is there any data that you can share? The test data here would be only for testing technically if everything is fine. Do you think any paired-end data would be sufficient? If yes, then I would simply use --mergepairs_strategy consensus
in any already existing test run.
Hi @d4straub , sorry for the late reply (and happy new year 🥳 ) We generated additional benchmarks in order to validate and justify custom values we are using for Dada2 alignemnt (gap, mismatch, match), especially using a more robust grid search. I'll share this soon. |
Great, looking forward!
Ok, I'll do that then in a separate PR.
I am actually not sure, several people added test datasets and I never looked at them in that way. |
Alright, I'm going to activate
I am not sure, I dont think so. I would prefer to merge that PR now so that I can work towards a release (whatever little time I have for that). Any objections? |
Hi @d4straub , I'm just waiting for validation from @nhenry50 and @FloraVincent, will update you asap! |
I just updated the values accordingly to the best performance in our benchmark, Flora & Nicolas validated the merge. You're good to go! :) |
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, $args2, verbose=TRUE) | ||
} | ||
|
||
saveRDS(mergers, "${meta.run}.mergers.rds") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
saveRDS(mergers, "${meta.run}.mergers.rds") | |
saveRDS(mergers, "${prefix}.mergers.rds") |
better late than never, this seems to me as a breaking change under certain circumstances, was this intentional?
Authors
Context & description
This pull request introduces an enhancement to the
DADA2::mergePairs()
process within the pipeline, enabling conditional merge or concatenation of sequences based on the overlap between forward and reverse reads. Previously, the pipeline allowed only to use either one of the two methods (merging or concatenating) of paired-end reads, via the--concatenate_reads
parameter. This enhancement introduces the "consensus" method, allowing the pipeline to dynamically determine the appropriate method—merging or concatenation—thereby improving sequence assembly accuracy and downstream analysis outcomes.The core enhancement revolves around incorporating conditional logic to assess the overlap between paired-end reads and decide whether to merge or concatenate them. This decision is based on a specified overlap threshold, ensuring that only reads with adequate overlap are merged, while others are concatenated with a defined spacer.
Enhancement Highlights:
justConcatenate = FALSE
to attempt merging where possible.justConcatenate = TRUE
to concatenate reads where merging isn't feasible.min_overlap_obs
) based on accepted mergers.quantile(min_overlap_obs, 0.001)
) to determine a stringent cutoff (can be customised throughasv_percentile_cutoff
). This ensures that only read pairs with exceptionally high overlap are merged into consensus sequences, while those with insufficient overlap are concatenated, thereby maintaining sequence accuracy and integrity.Parameters changed or introduced
mergepairs_strategy
(formerconcatenate_reads
; expect now a string as value)By default, when setting
mergepairs_strategy
tomerge
orconcatenate
, fixed generic values will still be used regarding the paired-reads alignment (see below):However, by setting
mergepairs_strategy
toconsensus
, we offer now the possibility to tweak those values in order to adjust the alignment done between the reads of the same pair. This can be done my modifying the following newly introduced parameters:mergepairs_consensus_minoverlap
mergepairs_consensus_maxmismatch
mergepairs_consensus_gap
mergepairs_consensus_match
mergepairs_consensus_mismatch
mergepairs_consensus_percentile_cutoff
This approach is particularly beneficial for datasets containing both prokaryotic and eukaryotic sequences, where lengths vary, leading to differing overlap extents.
PR checklist
nf-core pipelines lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).