moving toward sourmash taxonomy for taxonomy reporting and manipulation from sourmash gather results #1515
Comments
I'm on board! Is the idea that this would be available for sourmash 4.2? #1481 |
do you envision that the taxonomy spreadsheet format would be changing much, or would those be independent changes? |
This is something we started talking about. The spreadsheet for NCBI currently has fields (synonymous with) One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once. |
a few quick thoughts based on thinking while running - feel free to reject
|
re @taylorreiter comments --
I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need. re: @ctb comments --
yep!
yes! Though I think we were talking about dropping the We somewhat decided to focus on the
I think what you're saying is |
good!
absolutely, especially since we won't be using LCA methods in the same way :) note that (b/c of semantic versioning) we won't be removing the lca commands completely until v6 at the earliest. But we can deprecate them for v5.
k!
yep! |
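The idea discussed above (read each lineage spreadsheet separately, require certain named columns, ignore anything extra) could be sketched roughly as follows. This is only an illustration: the `ident` column name, the rank column names, and the `load_lineages` helper are assumptions made for the example, not an agreed-upon format or existing sourmash code.

```python
import csv
import io

# Hypothetical required columns: the thread only says "certain columns" must exist.
REQUIRED_COLUMNS = ("ident", "superkingdom", "phylum", "class", "order",
                    "family", "genus", "species")


def load_lineages(handles, required=REQUIRED_COLUMNS):
    """Merge lineage rows from several CSV file objects, keyed by identifier."""
    ident_col, *rank_cols = required
    lineages = {}
    for handle in handles:
        reader = csv.DictReader(handle)
        missing = [c for c in required if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"lineage csv is missing columns: {missing}")
        for row in reader:
            # Keep only the required rank columns; extra columns are ignored.
            lineages[row[ident_col]] = tuple(row[c] for c in rank_cols)
    return lineages


# Two made-up spreadsheets from different taxonomies; the first has an extra column.
gtdb = io.StringIO(
    "ident,superkingdom,phylum,class,order,family,genus,species,notes\n"
    "GCF_1,d__Bacteria,p__A,c__A,o__A,f__A,g__A,s__A,extra\n"
)
ncbi = io.StringIO(
    "ident,superkingdom,phylum,class,order,family,genus,species\n"
    "GCA_2,Eukaryota,Ascomycota,c,o,f,g,s\n"
)
lineages = load_lineages([gtdb, ncbi])
```

Because each file is validated by column name rather than by position, spreadsheets from GTDB and NCBI can carry different extra columns and still be combined in one pass.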
Note - when processing lineages, we should try to ignore assembly version info ( |
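One possible way to ignore version info when matching identifiers, assuming the version is a trailing `.N` suffix on an assembly accession (that assumption, the helper name `strip_version`, and the example accessions are all illustrative, not taken from sourmash):

```python
def strip_version(accession):
    """Drop a trailing '.N' version suffix if one is present."""
    base, dot, version = accession.rpartition(".")
    # Only strip when the part after the last dot is purely numeric.
    if base and dot and version.isdigit():
        return base
    return accession


versioned = strip_version("GCF_000005845.2")
unversioned = strip_version("GCF_000005845")
```

With this, a gather result for one assembly version and a lineage row for another version of the same assembly would resolve to the same key.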
running into exactly this with |
Organization question, mainly for @ctb, but also everyone: How do we want to split functionality between thoughts? |
suggest converse - copy or move functions over to To my understanding, It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the |
wonderful. I didn't want to suggest this because I was worried about backwards compatibility, but ofc, can just reference the functions in
good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon. |
ref dib-lab/charcoal#174:
|
good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon.
maybe: copy on write? import from lca until you need to change, then when you need to change, copy. e.g. the tree/LCA stuff is unlikely to need changes, but the taxonomy loading stuff is ...questionable :)
|
preserving from slack bluegenes:feet: titus:speech_balloon: bluegenes:feet: titus:speech_balloon: |
@luizirber and @bluegenes and I have been getting more excited about a sourmash taxonomy command, and potentially tackling pieces of it in a DIB lab hackathon. We had a conversation about this and wanted to summarize the main points here, as well as continue brainstorming.

Goal: command line interface that takes one or multiple sourmash gather csvs and a lineage csv and provides taxonomic rank summarization and downstream formatting for ingestion by popular taxonomy visualization tools.

Relevant Issues:
Relevant Repos:
assembly_stats.txt. Alternative to 2018-ncbi-lineages that parses the genbank assembly_stats.txt for the assembly accession and taxon id.
/home/irber/sourmash_databases/outputs/lca/lineages
What needs to be included in sourmash taxonomy?

Command line interface, inputs and outputs: What should the command line interface look like? What should the format of the inputs and outputs be? What functionality should be included?
sourmash gather csv files
+ column 1: dataset identification in database (e.g. unique identifier from SBT [NOT MD5]; assembly accession)
+ formatting of subsequent columns needs to be standardized, or parsed from standardized column names (e.g. superkingdom)
convert/liftover <- convert between taxonomies
summarize; similar to lca summarize
cami format output <- (from https://github.com/luizirber/2020-cami)
krona format output
newick format output?

Think about for the future
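To make the summarize and krona items in the list above more concrete, here is a rough sketch of rank summarization over gather rows followed by a tab-separated dump in the "quantity, then lineage levels" layout that Krona's text importer accepts. The function names, the use of the `name` and `f_unique_to_query` columns, and the choice to take the first whitespace-delimited token of `name` as the identifier are assumptions for illustration, not a proposed implementation.

```python
import csv
import io
from collections import defaultdict

RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def summarize_at_rank(gather_rows, lineages, rank,
                      ident_col="name", fraction_col="f_unique_to_query"):
    """Sum the per-match fraction for every lineage truncated at `rank`."""
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for row in gather_rows:
        # Assumption: the identifier is the first token of the match name.
        ident = row[ident_col].split(" ")[0]
        lineage = lineages.get(ident)
        if lineage is None:
            continue  # identifiers without a lineage are skipped in this sketch
        totals[lineage[:depth]] += float(row[fraction_col])
    return dict(totals)


def write_krona_tsv(summary, out):
    """Write one line per lineage: fraction, then each rank, tab-separated."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for lineage, fraction in sorted(summary.items(), key=lambda kv: -kv[1]):
        writer.writerow([f"{fraction:.3f}", *lineage])


# Made-up gather results and lineages, only to exercise the functions.
gather_csv = io.StringIO(
    "name,f_unique_to_query\n"
    "acc1 Escherichia coli,0.5\n"
    "acc2 Salmonella enterica,0.25\n"
    "acc3 Bacillus subtilis,0.125\n"
)
lineages = {
    "acc1": ("Bacteria", "Proteobacteria", "c1", "o1", "f1", "g1", "s1"),
    "acc2": ("Bacteria", "Proteobacteria", "c1", "o1", "f1", "g2", "s2"),
    "acc3": ("Bacteria", "Firmicutes", "c2", "o2", "f2", "g3", "s3"),
}
summary = summarize_at_rank(csv.DictReader(gather_csv), lineages, "phylum")
krona_out = io.StringIO()
write_krona_tsv(summary, krona_out)
```

Keeping summarization separate from output formatting means cami or newick writers could reuse the same `summary` dictionary.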