Add modules outlined in the pipeline proposal #8

Open
2 of 17 tasks
kedhammar opened this issue Mar 19, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@kedhammar

kedhammar commented Mar 19, 2024

Functionalities and modules

Mentioned in the pipeline proposal

primaryQC_pipeline_proposal.pdf

Pipeline proposal Slack thread

Standard QC

  • FastQC
  • SeqKit histograms
    • I assume this refers to seqkit watch
    • Not currently available as a module, as far as I can see
  • SeqFu

Duplication + Complexity

  • Preseq complexity
    • Which subtool?
  • BBtools Clumpify
  • UMI detection (stretch goal)

Adapter and Artifact detection

  • Fastp
  • BBtools
    • BBDuk
    • Testformat2
    • (RQCFilter2 is a corresponding subworkflow using multiple BBtools)
    • (For PacBio: Removesmartbell, Icecreamfinder)

Contamination detection

  • FastQ screen
  • Sylph
  • Kraken2
  • Mapping to reference

Mentioned in the pipeline Slack channel

@kedhammar kedhammar added the enhancement New feature or request label Mar 19, 2024
@kedhammar
Author

#6 is a draft PR to start adding modules.

@kedhammar kedhammar mentioned this issue Mar 19, 2024
11 tasks
@mahesh-panchal
Member

For WGS data intended for assembly, consider GenomeScope (https://github.com/nf-core/modules/blob/master/modules/nf-core/genomescope2/main.nf). Its database is built using Meryl (also on nf-core).

But there is also a container-only version that's a little bit faster and has extra tools that might be useful (https://github.com/nf-core/modules/blob/master/modules/nf-core/genescopefk/main.nf).
The databases for Merquryfk/KATGC, Merquryfk/KATCOMP, Merqury/Ploidyplot, and GeneScopefk are built using FastK.
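
For readers unfamiliar with what such a k-mer "database" boils down to, here is a minimal Python sketch of a k-mer count histogram, the kind of summary that GenomeScope-style tools take as input. It is only an illustration: it is not how Meryl or FastK work internally, and the function names (`canonical`, `kmer_histogram`) and the choice to count canonical k-mers are assumptions made for this example.

```python
from collections import Counter

# Translation table for reverse-complementing DNA (illustrative helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k: int) -> dict:
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    histogram = Counter(counts.values())
    return dict(sorted(histogram.items()))


if __name__ == "__main__":
    toy_reads = ["ACGTA", "ACGTT"]  # made-up reads, not real data
    for multiplicity, n_kmers in kmer_histogram(toy_reads, k=3).items():
        print(multiplicity, n_kmers)
```

On real WGS data this would be far too slow, which is why dedicated counters such as Meryl or FastK are used; the sketch only shows the shape of the data they produce.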

@remiolsen
Member

remiolsen commented Mar 19, 2024

Preseq complexity (which subtool?).

I've used preseq lc_extrap before and there's a module for it in nf-core (https://nf-co.re/modules/preseq_lcextrap). However, it is very prone to failing, or rather to refusing to give a complexity estimate.

Another option would be Picard (https://gatk.broadinstitute.org/hc/en-us/articles/360037591931-EstimateLibraryComplexity-Picard). I've never used it: for the application where I worry about library complexity (HiC), the tool I use (pairtools) implemented its own complexity estimate, so I have had no need. There's no nf-core module for it as far as I can see.

@kedhammar
Author

@remiolsen any idea why preseq lc_extrap tends to refuse?

@remiolsen
Member

I'm fairly certain this is the error I used to see most often. Quoting the preseq manual:

Q: When running lc_extrap, I receive the error
ERROR: too many iterations, poor sample

A: Most commonly this is due to the presence of defects in the approximation which cause the estimates to be unstable. Setting the step size larger (with the flag -s) will help to avoid the defects. The default step size is 1M reads or 0.05% of the input sample size rounded up to the nearest million, whichever is larger. A consequence of this action will be a reduction in the observed smoothness of the curve.

And setting the step size with the -s flag was a bit hit or miss as to whether it actually worked.
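
To make the step-size rule from the manual concrete, here is a rough Python sketch that computes the default step as described in the quoted text and lists a few progressively larger `-s` values one might try after the "too many iterations" error. It only builds the command lines and does not run preseq. The helper names, the multipliers, and the output file naming are hypothetical choices for illustration, and the input is assumed to be in a format preseq reads directly (a BAM file would need an additional flag).

```python
MILLION = 1_000_000


def default_step_size(n_reads: int) -> int:
    """Larger of 1M reads and 0.05% of the sample size rounded up to the nearest million."""
    # 0.05% of n is n / 2000; integer ceiling division gives whole millions.
    rounded_up = -(-n_reads // (2000 * MILLION)) * MILLION
    return max(MILLION, rounded_up)


def candidate_commands(input_path: str, n_reads: int, multipliers=(1, 2, 5, 10)):
    """Build preseq lc_extrap command lines with increasing step sizes (not executed)."""
    base = default_step_size(n_reads)
    commands = []
    for multiplier in multipliers:  # multipliers are arbitrary, illustrative values
        step = base * multiplier
        commands.append([
            "preseq", "lc_extrap",
            "-s", str(step),                    # step size flag mentioned in the manual
            "-o", f"lc_extrap.s{step}.txt",     # hypothetical output naming scheme
            input_path,
        ])
    return commands


if __name__ == "__main__":
    for command in candidate_commands("reads.bed", n_reads=10_000_000):
        print(" ".join(command))
```

In a pipeline, a retry strategy like this would amount to re-running the module with the next step size whenever the previous attempt fails, bearing in mind that larger steps give a coarser curve and, as noted above, are not guaranteed to help.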

@kedhammar
Author

Closed #6 due to being too broad and unspecific. Feel free to start new PRs addressing more specific implementations.
