decontaminating GTDB itself? #121

Open
ctb opened this issue Jul 8, 2020 · 1 comment

Comments

@ctb
Member

ctb commented Jul 8, 2020

it would be nice to be able to decontaminate GTDB itself, but one problem we face there is that charcoal doesn't work well when the exact genome in question is already in the reference database. this is because the first filter uses gather to search the reference database, and in that situation gather returns exactly one match: the query genome itself.
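To illustrate why an exact match crowds out everything else, here is a minimal sketch of a greedy set-cover search over plain Python sets of hashes. It is a toy stand-in for gather, not sourmash's or charcoal's actual code; `toy_gather`, the genome names, and the hash values are all made up for the example.

```python
def toy_gather(query, database):
    """Greedy cover of the query's hashes (toy stand-in for gather)."""
    remaining = set(query)
    matches = []
    while remaining:
        # Pick the reference covering the most still-unassigned hashes.
        name, overlap = max(
            ((n, len(remaining & hashes)) for n, hashes in database.items()),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if overlap == 0:
            break
        matches.append(name)
        remaining -= database[name]
    return matches


# Hypothetical data: the query genome is itself present in the database.
query = {1, 2, 3, 4, 5, 6}
database = {
    "query_genome": {1, 2, 3, 4, 5, 6},
    "host": {1, 2, 3, 4, 10},
    "contaminant": {5, 6, 11},
}

with_self = toy_gather(query, database)
without_self = toy_gather(
    query, {n: h for n, h in database.items() if n != "query_genome"}
)
print(with_self)     # only the exact match is reported
print(without_self)  # the other overlapping references become visible
```

In this toy setup the exact match covers every hash in the first round, so nothing is left for any other reference to claim.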

so the question is, how can we deal with this? two ideas --

  • allow the initial gather to be a search, instead, and then tell just_taxonomy.py to ignore exact matches internally.
  • find ways to mask or temporarily eliminate specific signatures from the databases.
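A rough sketch of both ideas, again using plain Python sets of hashes rather than real signatures. The function names (`search_ignoring_exact`, `mask`), the threshold, and the data are hypothetical, and nothing here reflects how `just_taxonomy.py` or sourmash actually implement this.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_ignoring_exact(query, database, threshold=0.1):
    """Idea 1: search instead of gather, skipping exact matches."""
    results = []
    for name, hashes in database.items():
        if hashes == query:
            continue  # exact match: assumed to be the query genome itself
        similarity = jaccard(query, hashes)
        if similarity >= threshold:
            results.append((name, similarity))
    # Best matches first.
    return sorted(results, key=lambda r: r[1], reverse=True)


def mask(database, names):
    """Idea 2: temporarily hide specific entries from the database."""
    return {n: h for n, h in database.items() if n not in names}


# Hypothetical data: the query genome is itself present in the database.
query = {1, 2, 3, 4, 5, 6}
database = {
    "query_genome": {1, 2, 3, 4, 5, 6},
    "host": {1, 2, 3, 4, 10},
    "contaminant": {5, 6, 11},
}

hits = search_ignoring_exact(query, database)
masked = mask(database, {"query_genome"})
print([name for name, _ in hits])
print(sorted(masked))
```

The first approach leaves the database untouched and filters at query time; the second needs a way to name the entries to hide, but then any downstream search runs unchanged on the masked view.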

on GTDB 25k, which is really nice and non-redundant, it should be straightforward (if medium expensive) to do the search, so maybe we should start there.

ref sourmash-bio/sourmash#849

@ctb
Member Author

ctb commented Jul 9, 2020

search against GTDB per #122 doesn't seem that bad, perhaps because it's not that redundant.
