Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require a minimum sequence length of 50bp after ITSx #718

Merged
merged 1 commit into from
Mar 26, 2024

Conversation

d4straub
Copy link
Collaborator

@d4straub d4straub commented Mar 25, 2024

Sequences below 50bp cannot be used by DADA2 and most likely also other kmer-based classification methods fail with too short sequences (see https://github.com/benjjneb/dada2/issues/601). This can be usually easily prevented with --min_len_asv, but not when ITSx is used (because this is after --min_len_asv is appplied). Therefore, another length filtering step with 50bp min length is added here. The minimum value of 50 is not represented by a parameter but can be adjusted via a config, if required.

Closes #704.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/ampliseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@d4straub d4straub requested a review from jtangrot March 25, 2024 15:26
Copy link

github-actions bot commented Mar 25, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit f15c9dd

+| ✅ 191 tests passed       |+
#| ❔   6 tests were ignored |#
!| ❗   1 tests had warnings |!

❗ Test warnings:

  • readme - README did not have a Nextflow minimum version badge.

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.13.1
  • Run at 2024-03-25 15:29:11

Copy link
Member

@erikrikarddaniel erikrikarddaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@d4straub
Copy link
Collaborator Author

Thanks!

@d4straub d4straub merged commit c828ef4 into nf-core:dev Mar 26, 2024
17 checks passed
@d4straub d4straub deleted the fix-itsx-short-sequences branch March 26, 2024 08:01
Copy link
Contributor

@jtangrot jtangrot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks! My only concern is that the length filtering is generally applied, and not only when running DADA2 for taxonomies. E.g. sintax by default discards sequences shorter than 32, not 50, but this can be changed with a parameter and I think there is no requirement on sequence length (not going into details on how short a sequence can be to achieve sensible results). So for sintax a sequence length of 50 is not technically needed. That being said, I guess it's easy to change the filtering cutoff in ampliseq with a custom config if one thinks 50 is too much.

@d4straub
Copy link
Collaborator Author

Thanks for your review! Unfortunately I seem to have merged the PR just before your comment.
I actually do not know what methods require a minimal length and which not, just DADA2 seems certain.
I considered 2 solutions: (1) adding a filter after ITSx to control its output for all downstream processes, or (2) adding a length filter before DADA2.
My impression was that short sequences are generally not advisable to classify (and didnt know that sintax has a build-in filter) and therefore I decided for option (1). Would you have preferred option (2)?

@jtangrot
Copy link
Contributor

Yes, I realized I was too slow when posting my comment :D I'm fine with this solution too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants