Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add taxonomy subcommand #1543

Merged
merged 145 commits into from
Jun 23, 2021
Merged

[MRG] add taxonomy subcommand #1543

merged 145 commits into from
Jun 23, 2021

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented May 21, 2021

This PR adds the top-level framework for a sourmash tax subcommand, and focuses on reading in gather results files for summarization or classification.

  • sourmash tax metagenome - reads in one or more metagenome gather results files and taxonomies; summarizes at all ranks. Reports this as csv_summary, krona format at a chosen rank, or lineage summary at chosen rank, or any combination of these outputs.
  • sourmash tax genome - reads in one or more genome gather results files and taxonomies; classifies to "best" match given desired threshold or rank. Reports this as csv, krona format, or both.
  • sourmash tax annotate reads in one or more gather results files and taxonomies and adds a lineage column to each gather csv.

for all, writing to stdout is supported if only one output type is specified.

@bluegenes to do checklist:

  • with [MRG] add query info to gather CSV output #1565, we should be able to read in more than one gather file at once for either summarize or classify, without needing to separately collect information about the query name.
    • This also means that we can output the combined lineage/sample summary file directly from summarize if desired. I will work on implementing this, unless there are thoughts otherwise.
  • find appropriate location for argparse threshold input range_limited_float_type
    • this is implicitly tested - should it be tested directly? Where are cli/utils.py functions typically tested?
  • add function docstrings
  • integrate [WIP] properly track duplicate signatures in sourmash lca index #1574
  • integrate hannah's summarize testing [MRG] add test to confirm failure when summarizing on empty gather #1560
  • properly restrict function args with *
  • basic documentation for new commands
  • tutorial / example usage -- see https://github.com/bluegenes/2021-sourmash-taxonomy-hackathon/blob/main/tax_demo.ipynb
  • gather results - check for essential columns (and test bad gather inputs)
  • taxonomy - return available_ranks and check
    • if we can return available_ranks from load_taxonomy_assignments, then we can check that the desired classification rank is available within the lineage data (currently we just assume it's available). This would also enable classification at strain if/when available, with all the caveats that might bring about strains available in the database.
      e.g. add this check to classify after loading tax_assign:
      if args.rank not in available_ranks:
            notify(f"No taxonomic information at rank {args.rank}: cannot classify at this rank")
      

choices, choices:

  • remove combine
  • rename summarize --> metagenome; classify --> genome; label --> annotate (but alias old names :)

suggestions for 6/18 taxonomy hackathon:

design choices:

  • for summarize and classify, we offer several optional outputs. I enabled this by using an --output-base parameter and internally adding standard extensions, e.g. .summarized.csv or .krona.csv. I am happy to change this if there's a better design / one more aligned with our other commands. The goal was to enable multiple output formats at once, and allow us to (relatively easily) enable additional output formats for visualization.
  • classify needs to make some decisions to report the best annotation. I've currently enabled doing this at either a desired rank or given a desired threshold. The benefit of the threshold is that we can go up a rank if there's not sufficient support for an annotation at that rank.

@ctb @taylorreiter @luizirber thoughts?

@codecov
Copy link

codecov bot commented May 21, 2021

Codecov Report

Merging #1543 (891d951) into latest (ff75ec0) will increase coverage by 0.29%.
The diff coverage is 88.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1543      +/-   ##
==========================================
+ Coverage   81.05%   81.34%   +0.29%     
==========================================
  Files         102      109       +7     
  Lines       10314    10749     +435     
  Branches     1172     1260      +88     
==========================================
+ Hits         8360     8744     +384     
- Misses       1748     1781      +33     
- Partials      206      224      +18     
Flag Coverage Δ
python 89.17% <88.26%> (-0.06%) ⬇️
rust 66.47% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/lca/command_index.py 89.84% <63.15%> (-2.63%) ⬇️
src/sourmash/cli/utils.py 90.00% <66.66%> (-10.00%) ⬇️
src/sourmash/tax/__main__.py 73.43% <73.43%> (ø)
src/sourmash/cli/tax/summarize.py 87.50% <87.50%> (ø)
src/sourmash/cli/tax/classify.py 88.88% <88.88%> (ø)
src/sourmash/tax/tax_utils.py 99.47% <99.47%> (ø)
src/sourmash/__init__.py 100.00% <100.00%> (ø)
src/sourmash/cli/__init__.py 95.74% <100.00%> (+0.04%) ⬆️
src/sourmash/cli/lca/index.py 100.00% <100.00%> (ø)
src/sourmash/cli/tax/__init__.py 100.00% <100.00%> (ø)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff75ec0...891d951. Read the comment docs.

doc/command-line.md Outdated Show resolved Hide resolved
@ctb
Copy link
Contributor

ctb commented Jun 21, 2021 via email

@bluegenes
Copy link
Contributor Author

yes, I think so

ok, will do, but I think we need to clarify / extend docs on how abundances are used!

@bluegenes bluegenes requested a review from ctb June 21, 2021 23:49
@ctb
Copy link
Contributor

ctb commented Jun 22, 2021

Ran through some real-ish data - here are the results! I'll make a specific checklist in the PR review.

(content from https://hackmd.io/KctUsXsLTWGdCuqXS9x4nw)

taxonomy PR tryout!

SRR606249 metagenome ('podar')

run a prefetch of SRR606249 against GTDB genomic reps:

% sourmash prefetch SRR606249-k31.sig  gtdb-r202.genomic-reps.k31.zip -o SRR606249-k31.x.gtdb.prefetch.csv --save-matches SRR606249-k31.x.gtdb.prefetch.zip

This finds 780 matches.

run a gather of SRR606249 against GTDB genomic reps:

% sourmash gather SRR606249-k31.sig SRR606249-k31.x.gtdb.prefetch.zip -o SRR606249-k31.x.gtdb.gather.csv

This finds 84 matches.

run tax annotate

% sourmash tax annotate -g SRR606249-k31.x.gtdb.gather.csv  -t gtdb-rs202.taxonomy.v2.csv 

which gives:

== This is sourmash version 4.1.3.dev7+g0814bcc5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 84 gather results.
of 84, missed 0 lineage assignments.
loaded results from 1 gather CSVs

CTB: I think some output message here would be nice - what file(s) is it outputting? Looks like it's SRR606249-k31.x.gtdb.gather.with-lineages.csv.

run tax metagenome

% sourmash tax metagenome -g SRR606249-k31.x.gtdb.gather.with-lineages.csv -t gtdb-rs202.taxonomy.v2.csv -o BASE

works great :)

CTB: I think some output message here would be nice - what file(s) is it outputting? (In this case, BASE.summarize.csv)

'Aciduliprofundum' query

run a prefetch of podar-ref/1.fa.sig against GTDB

% sourmash prefetch podar-ref/1.fa.sig gtdb-r202.genomic-reps.k31.zip -o 1.x.gtdb.prefetch.csv --save-matches 1.x.gtdb.prefetch.zip -k 31

1 match.

run a gather

% sourmash gather podar-ref/1.fa.sig 1.x.gtdb.prefetch.zip  -o 1.x.gtdb.gather.csv

1 match.

run tax annotate incorrectly

By mistake, I ran

% sourmash tax annotate -t 1.x.gtdb.gather.csv -g gtdb-rs202.taxonomy.v2.csv 

and got the error message ERROR: No taxonomic identifiers found.

CTB: let's make an issue to detect if -g and -t were switched and provide helpful error output :). it's a "good next issue" I think.

run tax annotate correctly

% sourmash tax annotate -g 1.x.gtdb.gather.csv -t gtdb-rs202.taxonomy.v2.csv 

works!

run tax genome

% sourmash tax genome -g 1.x.gtdb.gather.csv -t gtdb-rs202.taxonomy.v2.csv 

gives:

== This is sourmash version 4.1.3.dev7+g0814bcc5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 gather results.
of 1, missed 0 lineage assignments.
loaded results from 1 gather CSVs
WARNING: 100% match! Is query CP001941.1 Aciduliprofundum boonei T469, complete genome identical to its database match, GCF_000025665?
query_name,status,rank,fraction,lineage
"CP001941.1 Aciduliprofundum boonei T469, complete genome",match,species,1.000,d__Archaea;p__Thermoplasmatota;c__Thermoplasmata;o__Aciduliprofundales;f__Aciduliprofundaceae;g__Aciduliprofundum;s__Aciduliprofundum boonei

CTB: please add quotes around the signature name after "Is query ... identical?"

Looking at the output, I see:

query_name,status,rank,fraction,lineage
"CP001941.1 Aciduliprofundum boonei T469, complete genome",match,species,1.000,d__Archaea;p__Thermoplasmatota;c__Thermoplasmata;o__Aciduliprofundales;f__Aciduliprofundaceae;g__Aciduliprofundum;s__Aciduliprofundum boonei

CTB: I think for both metagenome and genome we should add two more columns, query_md5 and query_filename, to the output.

Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! A few small things to fix --

  • add output messages indicating which file(s) are being produced, for tax annotate
  • add output messages indicating which file(s) are being produced, for tax genome
  • add output messages indicating which file(s) are being produced, for tax metagenome
  • create a "good next issue" to detect if -g and -t were switched and provide helpful error output
  • in tax genome, please add quotes around the signature name in "Is query {name} identical?"
  • add two more columns, query_md5 and query_filename, to the csv_summary output from tax metagenome
  • add two more columns, query_md5 and query_filename, to the csv_summary output from tax genome

@ctb
Copy link
Contributor

ctb commented Jun 22, 2021

yes, I think so

ok, will do, but I think we need to clarify / extend docs on how abundances are used!

Hot take: the docs here (classifying-signatures.md) are good enough, and what we really need is someone to do some real-world validation and then feed that experience back into the docs.

@bluegenes bluegenes requested a review from ctb June 22, 2021 23:15
Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@ctb
Copy link
Contributor

ctb commented Jun 23, 2021

huh, tests aren't passing tho?

@bluegenes
Copy link
Contributor Author

huh, tests aren't passing tho?

They keep getting canceled by server. I've hit rerun - hopefully they'll run this time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants