[MRG] add taxonomy subcommand #1543
Codecov Report
@@ Coverage Diff @@
## latest #1543 +/- ##
==========================================
+ Coverage 81.05% 81.34% +0.29%
==========================================
Files 102 109 +7
Lines 10314 10749 +435
Branches 1172 1260 +88
==========================================
+ Hits 8360 8744 +384
- Misses 1748 1781 +33
- Partials 206 224 +18
…into add-taxonomy
yes, I think so
ok, will do, but I think we need to clarify / extend docs on how abundances are used!
Ran through some real-ish data - here are the results! I'll make a specific checklist in the PR review.
(content from https://hackmd.io/KctUsXsLTWGdCuqXS9x4nw)
taxonomy PR tryout!
SRR606249 metagenome ('podar')
run a prefetch of SRR606249 against GTDB genomic reps:
This finds 780 matches.
run a gather of SRR606249 against GTDB genomic reps:
This finds 84 matches.
run
Looking good! A few small things to fix --
- add output messages indicating which file(s) are being produced, for `tax annotate`
- add output messages indicating which file(s) are being produced, for `tax genome`
- add output messages indicating which file(s) are being produced, for `tax metagenome`
- create a "good next issue" to detect if -g and -t were switched and provide helpful error output
- in `tax genome`, please add quotes around the signature name in "Is query {name} identical?"
- add two more columns, query_md5 and query_filename, to the `csv_summary` output from `tax metagenome`
- add two more columns, query_md5 and query_filename, to the `csv_summary` output from `tax genome`
Hot take: the docs here ( |
🎉
huh, tests aren't passing tho?
They keep getting canceled by the server. I've hit rerun - hopefully they'll run this time.
This PR adds the top-level framework for a `sourmash tax` subcommand, and focuses on reading in gather results files for summarization or classification.
- `sourmash tax metagenome` - reads in one or more metagenome gather results files and taxonomies; summarizes at all ranks. Reports this as `csv_summary`, `krona` format at a chosen rank, or `lineage summary` at a chosen rank, or any combination of these outputs.
- `sourmash tax genome` - reads in one or more genome gather results files and taxonomies; classifies to the "best" match given a desired threshold or rank. Reports this as `csv`, `krona` format, or both.
- `sourmash tax annotate` - reads in one or more gather results files and taxonomies and adds a `lineage` column to each gather csv.
For all, writing to stdout is supported if only one output type is specified.
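As a rough illustration of what "summarizes at all ranks" could mean, here is a minimal sketch that sums a per-match fraction over every lineage prefix. It is not the code in this PR: the function name `summarize_at_ranks`, the `(match_id, fraction)` row shape, the `RANKS` list, and the toy lineages are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed rank order for illustration; the real list may differ.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def summarize_at_ranks(gather_rows, taxonomy):
    """Sum the matched fraction for each lineage prefix at every rank.

    gather_rows: iterable of (match_id, fraction) pairs (hypothetical shape).
    taxonomy: dict mapping match_id -> tuple of names, ordered as in RANKS.
    """
    totals = {rank: defaultdict(float) for rank in RANKS}
    for match_id, fraction in gather_rows:
        lineage = taxonomy[match_id]
        # Walk from the top rank down to the deepest rank this lineage has.
        for depth, rank in enumerate(RANKS[: len(lineage)], start=1):
            totals[rank][lineage[:depth]] += fraction
    return totals


# Toy data, made up for the example.
rows = [("g1", 0.5), ("g2", 0.25), ("g3", 0.125)]
tax = {
    "g1": ("d__Bacteria", "p__Proteobacteria"),
    "g2": ("d__Bacteria", "p__Firmicutes"),
    "g3": ("d__Archaea", "p__Halobacteriota"),
}
summary = summarize_at_ranks(rows, tax)
```

Keying each total by the full lineage prefix, rather than by the name at that rank alone, keeps identically named taxa under different parents separate.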
@bluegenes
to do
checklist:
- `summarize` or `classify`, without needing to separately collect information about the query name.
- `summarize` if desired. I will work on implementing this, unless there are thoughts otherwise.
- `range_limited_float_type`
- `cli/utils.py` functions typically tested?
- `sourmash lca index` #1574
- `available_ranks` from `load_taxonomy_assignments`, then we can check that the desired classification rank is available within the lineage data (currently we just assume it's available). This would also enable classification at strain if/when available, with all the caveats that might bring about strains available in the database.
e.g. add this check to `classify` after loading `tax_assign`:
choices, choices:
- combine
- `summarize` --> `metagenome`; `classify` --> `genome`; `label` --> `annotate` (but alias old names :)
suggestions for 6/18 taxonomy hackathon:
design choices:
- `summarize` and `classify`, we offer several optional outputs. I enabled this by using an `--output-base` parameter and internally adding standard extensions, e.g. `.summarized.csv` or `.krona.csv`. I am happy to change this if there's a better design / one more aligned with our other commands. The goal was to enable multiple output formats at once, and allow us to (relatively easily) enable additional output formats for visualization.
- `classify` needs to make some decisions to report the best annotation. I've currently enabled doing this at either a desired rank or given a desired `threshold`. The benefit of the threshold is that we can go up a rank if there's not sufficient support for an annotation at that rank.
@ctb @taylorreiter @luizirber thoughts?
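To make the threshold idea concrete, here is a minimal sketch of "go up a rank if there's not sufficient support". It is an assumption-laden illustration, not the PR's implementation: `classify_with_threshold`, the shape of `summary` (rank -> {lineage tuple: fraction}), and the toy lineages are all hypothetical.

```python
def classify_with_threshold(summary, ranks, threshold):
    """Return (rank, lineage, fraction) for the most specific rank whose
    best-supported lineage meets the threshold, or None if no rank does.

    summary: dict mapping rank -> {lineage tuple: summed fraction} (assumed shape).
    ranks: rank names ordered from least to most specific.
    """
    for rank in reversed(ranks):  # start at the most specific rank
        candidates = summary.get(rank)
        if not candidates:
            continue
        lineage, fraction = max(candidates.items(), key=lambda kv: kv[1])
        if fraction >= threshold:
            return rank, lineage, fraction
        # Not enough support here: fall through and try the parent rank.
    return None


# Toy data, made up for the example.
ranks = ["family", "genus", "species"]
summary = {
    "family": {("f__A",): 0.75},
    "genus": {("f__A", "g__B"): 0.5, ("f__A", "g__C"): 0.25},
    "species": {
        ("f__A", "g__B", "s__B x"): 0.25,
        ("f__A", "g__B", "s__B y"): 0.25,
        ("f__A", "g__C", "s__C z"): 0.25,
    },
}
result = classify_with_threshold(summary, ranks, threshold=0.5)
```

In this toy case no single species reaches 0.5, so the sketch reports the genus instead; a fixed-rank mode would simply look at one rank and skip the loop.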