Adding shortread deduplication feature with fastp #439
Patch release V1.1.1
Fix CHANGELOG.md for patch release v1.1.1
1.1.2 release
1.1.3 Patch release
1.1.4 Patch release
Indeed this is much easier 😬
Doc improvements but otherwise LGTM. @sofstam @LilyAnderssonLee should we add this to 1.1.5 or to bouncy basenji? It's not a bug fix exactly... but it's not a fully fledged new functionality addition either?
We do potentially have the centrifuge fix coming up (if I can fix it...) so there will likely be a 1.1.5
I would say it is 1.1.5.
Getting back with a review later today :)
Co-authored-by: James A. Fellows Yates <[email protected]>
I'm somewhat concerned about this feature. If I remember correctly, FastQC and fastp use the first 50 and 75 bp, respectively, to judge read duplication. Using longer sequences would drive up memory requirements and take longer. So the first question is: are we truly only removing identical reads with this? My second question comes from my inexperience with sequencing: if you have a dominant species in your metagenomic sample, how unlikely is it to have an identical read?
Does it really? The README at least seems to imply it's some condensed hash of the whole read: https://github.com/OpenGene/fastp#duplication-rate-evaluation. That said, it's opt-in, so it's still up to the user to decide whether it's a suitable algorithm.
An absolutely exact duplicate is quite unlikely, as
Thank you for your response, sounds good to me then. 👍🏼
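For illustration, the difference between judging duplicates on a read prefix and on the whole read can be sketched in a few lines of Python. This is a toy sketch and not fastp's actual algorithm: the function name `dedup_reads`, the `prefix_len` parameter, and the choice of BLAKE2 as the hash are all assumptions made for the example.

```python
import hashlib


def dedup_reads(reads, prefix_len=None):
    """Keep the first read seen for each hashed key.

    reads: iterable of (name, sequence) tuples (hypothetical input format).
    prefix_len: if None, the whole sequence is hashed; otherwise only the
    first prefix_len bases are hashed.
    """
    seen = set()
    kept = []
    for name, seq in reads:
        key_seq = seq if prefix_len is None else seq[:prefix_len]
        # Store a short digest instead of the sequence to bound memory use.
        digest = hashlib.blake2b(key_seq.encode(), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append((name, seq))
    return kept


reads = [
    ("r1", "ACGTACGTAA"),
    ("r2", "ACGTACGTAA"),  # exact duplicate of r1
    ("r3", "ACGTACGTCC"),  # shares the first 8 bases with r1, tail differs
]

whole_read = dedup_reads(reads)                  # hashes the full sequence
prefix_only = dedup_reads(reads, prefix_len=8)   # hashes the first 8 bases
```

Hashing the whole read drops only `r2`, while hashing an 8-base prefix also drops `r3`, which is the over-removal the concern above is about.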
This PR adds deduplication of reads with fastp
PR checklist
- `nf-core lint`
- `nextflow run . -profile test,docker --outdir <OUTDIR>`
- `nextflow run . -profile debug,test,docker --outdir <OUTDIR>`
- `CHANGELOG.md` is updated.