[Question] Other uses of k-mer sketches created by skani? #23
Hi @jolespin,

Great question: what you're describing is almost exactly what kraken does. Kraken sketches each genome into k-mers (they don't call it sketching, but it's the same idea) and checks each read from a sample against their database of k-mers. They combine all of their "sketches" into one large database, though.

The key difference between "aligning" samples (e.g. using kraken) versus genomes (e.g. using skani) is that reads are much shorter than genomes, so different parameters and methods are necessary. That's also why the outputs differ between the algorithms.

To answer your question: unfortunately, skani cannot be used for abundance calculations on short reads, due to how it's designed. It might be possible to hack it to do abundances for long reads, but that's not something I've explored yet.

I recently designed a method called sylph for profiling metagenomic samples and giving abundances. It also uses the "sketching" idea for fast abundance profiling against arbitrary, custom databases of fasta files, which seems important to you. See the associated manuscript in the README for info.

Thanks,
Jim
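To make the "check each read against a database of k-mers" idea concrete, here is a toy sketch in Python. It is not kraken's, skani's, or sylph's actual algorithm: the function names, the tiny sequences, and the "vote for the genome sharing the most k-mers" rule are all assumptions made for illustration, and real tools use far more compact data structures and statistical corrections.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes, k):
    """Combine all genomes into one database: k-mer -> set of genome names."""
    index = {}
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index


def assign_reads(reads, index, k):
    """Count reads per genome; each read votes for the genome sharing the most k-mers."""
    counts = Counter()
    for read in reads:
        hits = Counter()
        for km in kmers(read, k):
            for name in index.get(km, ()):
                hits[name] += 1
        if hits:  # reads with no matching k-mer stay unassigned
            best, _ = hits.most_common(1)[0]
            counts[best] += 1
    return counts


# Hypothetical toy data, far shorter than real genomes or reads.
genomes = {"genome_A": "ACGTACGTTAGC", "genome_B": "TTTTGGGGCCCC"}
reads = ["ACGTACG", "GGGGCC", "CGTTAGC"]

index = build_index(genomes, k=4)
counts = assign_reads(reads, index, k=4)
print(dict(counts))
```

Dividing each count by the total number of assigned reads gives a crude read-level proportion, which is not the same thing as a properly normalized abundance estimate.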
Thank you. I forgot about Kraken ever since I started testing out Metaphlan4. I forget that these tools have different ways in which they profile.
Good to know when I get more long read datasets in the future.
This is great and definitely along the lines of what I'm looking for. And yes, definitely interested in building custom databases. It would be nice to have a gut, oral, soil, or marine-specific database built from MAGs. A few follow-up questions:
In this usage, 2a. Can you create a database from existing genome sketches? Or should you always create a sketch database from genomes? The reason why I'm asking is I'm wondering if it will be useful to create all the sketches you may need then mix and match when creating curated databases. 2b. Just so I'm understanding correctly, this is the recommended usage if you're operating on a per sample basis:
I found the answer to (4) here: https://github.com/bluenote-1577/sylph/wiki/sylph-cookbook#profiling-small-genomes-such-as-viruses
Since the output .sylsp uses the basename, I'm unable to run multiple samples at the same time since all the files have the same basename and the sample name is encoded in the directory (e.g., S1) in this case. Not a huge deal because I can run them individually for each sample (or symlink if I need to run in batch) but was wondering if this is currently possible or will be included in future versions?

Apologies for all the questions but this and skani came at a very convenient time in my research. Thanks again for your help and creating these incredible tools.
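The symlink workaround mentioned above could look roughly like the following. This is only a sketch under assumptions: the directory layout (one directory per sample, each holding a read file with the same name), the function name, and the example file name are hypothetical and not part of sylph.

```python
import os
from pathlib import Path


def link_with_sample_names(sample_dirs, filename, out_dir):
    """Create symlinks named <sample>_<filename> so every basename is unique.

    Assumes each directory in sample_dirs is named after its sample
    (e.g. S1) and contains a read file called `filename`.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    links = []
    for sample_dir in sample_dirs:
        sample_dir = Path(sample_dir)
        src = sample_dir / filename
        dst = out / f"{sample_dir.name}_{filename}"
        # Skip links that already exist (including dangling ones).
        if not (dst.is_symlink() or dst.exists()):
            os.symlink(src.resolve(), dst)
        links.append(dst)
    return links


# Hypothetical usage:
# links = link_with_sample_names(["data/S1", "data/S2"], "reads_1.fastq.gz", "linked")
```

The linked files can then be sketched together in one batch, since each basename now carries its sample name.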
Great to know it's of use. Check https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases for some pre-built databases.

Answering questions:

1: No, the sketch format is different and not interchangeable.
2: Yes, when sketching reads, the output is a .sylsp file for each (pair of) reads. For genomes, by contrast, all genomes get combined into one database sketch.
2a: You can't combine sketches to create a database, but you can use multiple databases.
2b: Sounds correct. Make sure you use the -1 and -2 options for sketching paired-end reads.
3: The operation is on the whole genome unless the
6: You're correct in that this is a problem. This is a very reasonable feature request and a problem that I actually ran into as well. I'll definitely incorporate an option for specifying sample names in the next version.

Thanks for the questions. Let me know if you found any documentation insufficient or confusing, or if you find any command line usage confusing. I'm very interested in user feedback. I may repost this issue to the sylph repo or add a FAQ sometime in the future with your questions if you don't mind!
I've been able to build a pretty good wrapper around this and will include it in the version 2 manuscript I'm writing right now as the

The documentation and your cookbooks are really helpful, I feel like you covered pretty much all of my questions. A couple of features that aren't essential but I feel could make this software more scalable are the following:
The reason for 3,4,5 is the scenario where

Anyways, feel free to take these with a grain of salt as the software is already great. I saw that it's up on biorxiv, good luck on your publication! I'll definitely be sharing this tool within my circle.
Thanks for the feedback! I'll look to incorporate some of these ideas in future releases.
I'm wondering if there are other uses for the k-mer sketches for each genome. For example, let's say you had 1000 genomes and built an index/sketch for each one. Do you know of any methods to determine the relative abundance of a sample by aligning to these sketches instead of the full genomes? I may be thinking of the problem incorrectly but that's why I thought I would reach out to ask if this was a possibility using skani sketches or if there are any tools you know of that can do this with custom databases?