[MRG] add taxonomy subcommand #1543

bluegenes · 2021-05-21T19:05:10Z

This PR adds the top-level framework for a sourmash tax subcommand, and focuses on reading in gather results files for summarization or classification.

sourmash tax metagenome - reads in one or more metagenome gather results files and taxonomies; summarizes at all ranks. Reports results as csv_summary, krona format at a chosen rank, lineage summary at a chosen rank, or any combination of these outputs.
sourmash tax genome - reads in one or more genome gather results files and taxonomies; classifies to the "best" match given a desired threshold or rank. Reports results as csv, krona format, or both.
sourmash tax annotate - reads in one or more gather results files and taxonomies; adds a lineage column to each gather csv.

for all, writing to stdout is supported if only one output type is specified.
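
A minimal sketch of what the annotate step amounts to, not the code in this PR. It assumes the gather CSV has a `name` column whose first space-delimited token is the identifier, and that taxonomy has already been loaded into a plain dict; `annotate_gather_rows` and `write_rows` are hypothetical names used only for illustration.

```python
import csv
import io


def annotate_gather_rows(gather_csv_text, lineages):
    """Add a 'lineage' column to each gather row.

    `lineages` maps identifier -> semicolon-separated lineage string.
    Returns (rows, n_missed), where n_missed counts rows with no lineage.
    """
    rows = []
    n_missed = 0
    for row in csv.DictReader(io.StringIO(gather_csv_text)):
        # Assumption: the identifier is the first space-delimited token of 'name'.
        ident = row["name"].split(" ")[0]
        lineage = lineages.get(ident)
        if lineage is None:
            n_missed += 1
            lineage = ""
        row["lineage"] = lineage
        rows.append(row)
    return rows, n_missed


def write_rows(rows):
    """Serialize annotated rows back to CSV text (header taken from the first row)."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The missed count mirrors the "missed N lineage assignments" message that the commands print; how identifiers are actually normalized (e.g. version suffixes) is left out of this sketch.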

@bluegenes to do checklist:

with [MRG] add query info to gather CSV output #1565, we should be able to read in more than one gather file at once for either summarize or classify, without needing to separately collect information about the query name.
- This also means that we can output the combined lineage/sample summary file directly from summarize if desired. I will work on implementing this, unless there are thoughts otherwise.
find appropriate location for argparse threshold input range_limited_float_type
- this is implicitly tested - should it be tested directly? Where are cli/utils.py functions typically tested?
add function docstrings
integrate [WIP] properly track duplicate signatures in sourmash lca index #1574
integrate hannah's summarize testing [MRG] add test to confirm failure when summarizing on empty gather #1560
properly restrict function args with *
basic documentation for new commands
tutorial / example usage -- see https://github.com/bluegenes/2021-sourmash-taxonomy-hackathon/blob/main/tax_demo.ipynb
gather results - check for essential columns (and test bad gather inputs)
taxonomy - return available_ranks and check
- if we can return available_ranks from load_taxonomy_assignments, then we can check that the desired classification rank is available within the lineage data (currently we just assume it's available). This would also enable classification at strain if/when available, with all the caveats that might bring about strains available in the database.
  e.g. add this check to classify after loading tax_assign:
```
if args.rank not in available_ranks:
    notify(f"No taxonomic information at rank {args.rank}: cannot classify at this rank")
```
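
One way the available_ranks idea could look, as a rough sketch. It assumes taxonomy assignments are held as a dict of identifier -> list of (rank, name) pairs; that layout and the helper names `collect_available_ranks` and `check_rank` are assumptions for illustration, not what load_taxonomy_assignments returns today.

```python
def collect_available_ranks(tax_assign):
    """Return the set of ranks that have a non-empty name in at least one lineage.

    `tax_assign` is assumed to map ident -> list of (rank, name) tuples.
    """
    available = set()
    for lineage in tax_assign.values():
        for rank, name in lineage:
            if name:  # skip ranks that are present but blank
                available.add(rank)
    return available


def check_rank(requested_rank, available_ranks):
    """Raise ValueError if the requested rank is absent from the taxonomy."""
    if requested_rank not in available_ranks:
        raise ValueError(
            f"No taxonomic information at rank {requested_rank}: "
            "cannot classify at this rank"
        )
```

With this shape, strain-level classification would become possible as soon as any lineage in the taxonomy file carries a non-empty strain entry.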

choices, choices:

remove combine
rename summarize --> metagenome; classify --> genome; label --> annotate (but alias old names :)

suggestions for 6/18 taxonomy hackathon:

improve tax documentation? or write some recipes?
CAMI Output!? see https://github.com/luizirber/2020-cami/blob/master/scripts/gather_to_opal.py

design choices:

for summarize and classify, we offer several optional outputs. I enabled this by using an --output-base parameter and internally adding standard extensions, e.g. .summarized.csv or .krona.csv. I am happy to change this if there's a better design / one more aligned with our other commands. The goal was to enable multiple output formats at once, and allow us to (relatively easily) enable additional output formats for visualization.
classify needs to decide which annotation to report as the "best" one. I've currently enabled doing this either at a desired rank or with a desired threshold. The benefit of the threshold is that we can go up a rank when there is not sufficient support for an annotation at the lower rank.
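
To make the threshold behavior concrete, here is a small sketch of the "go up a rank" logic. It is an illustration under assumptions, not the implementation in this PR: matches are taken to be (lineage, fraction) pairs with full lineages ordered from superkingdom to species, and `classify_with_threshold` is a hypothetical name.

```python
from collections import defaultdict

# Assumed rank order, most general first.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def classify_with_threshold(matches, threshold):
    """Return (rank, lineage, fraction) for the most specific rank with enough support.

    `matches` is a list of (lineage_tuple, fraction) pairs. Starting at species,
    fractions are summed per lineage; if the best lineage falls below `threshold`,
    the same is tried one rank up. Returns None if no rank reaches the threshold.
    """
    if not matches:
        return None
    for depth in range(len(RANKS), 0, -1):
        totals = defaultdict(float)
        for lineage, fraction in matches:
            # Truncate the lineage to the current rank and accumulate support.
            totals[tuple(lineage[:depth])] += fraction
        best_lineage, best_fraction = max(totals.items(), key=lambda kv: kv[1])
        if best_fraction >= threshold:
            return RANKS[depth - 1], best_lineage, best_fraction
    return None
```

For example, two species in the same genus with fractions 0.05 and 0.07 would not pass a 0.1 threshold at species, but their combined 0.12 would pass at genus.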

@ctb @taylorreiter @luizirber thoughts?

codecov · 2021-05-21T19:11:11Z

Codecov Report

Merging #1543 (891d951) into latest (ff75ec0) will increase coverage by 0.29%.
The diff coverage is 88.26%.

@@            Coverage Diff             @@
##           latest    #1543      +/-   ##
==========================================
+ Coverage   81.05%   81.34%   +0.29%     
==========================================
  Files         102      109       +7     
  Lines       10314    10749     +435     
  Branches     1172     1260      +88     
==========================================
+ Hits         8360     8744     +384     
- Misses       1748     1781      +33     
- Partials      206      224      +18

Flag	Coverage Δ
python	`89.17% <88.26%> (-0.06%)`	⬇️
rust	`66.47% <ø> (ø)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
src/sourmash/lca/command_index.py	`89.84% <63.15%> (-2.63%)`	⬇️
src/sourmash/cli/utils.py	`90.00% <66.66%> (-10.00%)`	⬇️
src/sourmash/tax/__main__.py	`73.43% <73.43%> (ø)`
src/sourmash/cli/tax/summarize.py	`87.50% <87.50%> (ø)`
src/sourmash/cli/tax/classify.py	`88.88% <88.88%> (ø)`
src/sourmash/tax/tax_utils.py	`99.47% <99.47%> (ø)`
src/sourmash/__init__.py	`100.00% <100.00%> (ø)`
src/sourmash/cli/__init__.py	`95.74% <100.00%> (+0.04%)`	⬆️
src/sourmash/cli/lca/index.py	`100.00% <100.00%> (ø)`
src/sourmash/cli/tax/__init__.py	`100.00% <100.00%> (ø)`
... and 10 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff75ec0...891d951.

ctb · 2021-06-21T23:39:29Z

yes, I think so

bluegenes · 2021-06-21T23:44:27Z

> yes, I think so

ok, will do, but I think we need to clarify / extend docs on how abundances are used!

ctb · 2021-06-22T15:20:25Z

Ran through some real-ish data - here are the results! I'll make a specific checklist in the PR review.

(content from https://hackmd.io/KctUsXsLTWGdCuqXS9x4nw)

taxonomy PR tryout!

SRR606249 metagenome ('podar')

run a prefetch of SRR606249 against GTDB genomic reps:

% sourmash prefetch SRR606249-k31.sig  gtdb-r202.genomic-reps.k31.zip -o SRR606249-k31.x.gtdb.prefetch.csv --save-matches SRR606249-k31.x.gtdb.prefetch.zip

This finds 780 matches.

run a gather of SRR606249 against GTDB genomic reps:

% sourmash gather SRR606249-k31.sig SRR606249-k31.x.gtdb.prefetch.zip -o SRR606249-k31.x.gtdb.gather.csv

This finds 84 matches.

run `tax annotate`

% sourmash tax annotate -g SRR606249-k31.x.gtdb.gather.csv  -t gtdb-rs202.taxonomy.v2.csv

which gives:

== This is sourmash version 4.1.3.dev7+g0814bcc5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 84 gather results.
of 84, missed 0 lineage assignments.
loaded results from 1 gather CSVs

CTB: I think some output message here would be nice - what file(s) is it outputting? Looks like it's SRR606249-k31.x.gtdb.gather.with-lineages.csv.

run `tax metagenome`

% sourmash tax metagenome -g SRR606249-k31.x.gtdb.gather.with-lineages.csv -t gtdb-rs202.taxonomy.v2.csv -o BASE

works great :)

CTB: I think some output message here would be nice - what file(s) is it outputting? (In this case, BASE.summarize.csv)

'Aciduliprofundum' query

run a prefetch of podar-ref/1.fa.sig against GTDB

% sourmash prefetch podar-ref/1.fa.sig gtdb-r202.genomic-reps.k31.zip -o 1.x.gtdb.prefetch.csv --save-matches 1.x.gtdb.prefetch.zip -k 31

1 match.

run a gather

% sourmash gather podar-ref/1.fa.sig 1.x.gtdb.prefetch.zip  -o 1.x.gtdb.gather.csv

1 match.

run `tax annotate` incorrectly

By mistake, I ran

% sourmash tax annotate -t 1.x.gtdb.gather.csv -g gtdb-rs202.taxonomy.v2.csv

and got the error message ERROR: No taxonomic identifiers found.

CTB: let's make an issue to detect if -g and -t were switched and provide helpful error output :). it's a "good next issue" I think.
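
A possible starting point for that issue, sketched under assumptions: it guesses the kind of CSV from its header. The gather columns used here (`query_name`, `name`, `f_unique_to_query`) are mentioned in this PR; the taxonomy column names are an assumption about the lineage file layout, and `guess_csv_kind` and `switched_inputs_message` are hypothetical helpers.

```python
# Columns taken as indicative of a gather CSV (named elsewhere in this PR).
GATHER_COLUMNS = {"query_name", "name", "f_unique_to_query"}
# Assumed rank columns in a taxonomy CSV; adjust to the real file layout.
TAXONOMY_COLUMNS = {"superkingdom", "phylum", "class", "order", "family", "genus", "species"}


def guess_csv_kind(header):
    """Return 'gather', 'taxonomy', or 'unknown' based on the column names."""
    columns = set(header)
    if GATHER_COLUMNS <= columns:
        return "gather"
    if TAXONOMY_COLUMNS <= columns:
        return "taxonomy"
    return "unknown"


def switched_inputs_message(gather_header, taxonomy_header):
    """Return a helpful error string if -g and -t look swapped, else None."""
    looks_swapped = (
        guess_csv_kind(gather_header) == "taxonomy"
        and guess_csv_kind(taxonomy_header) == "gather"
    )
    if looks_swapped:
        return (
            "ERROR: the file given to -g looks like a taxonomy CSV and the file "
            "given to -t looks like a gather CSV. Were -g and -t switched?"
        )
    return None
```

Checking both files before reporting keeps the message from firing on inputs that are simply malformed.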

run `tax annotate` correctly

% sourmash tax annotate -g 1.x.gtdb.gather.csv -t gtdb-rs202.taxonomy.v2.csv

works!

run `tax genome`

% sourmash tax genome -g 1.x.gtdb.gather.csv -t gtdb-rs202.taxonomy.v2.csv

gives:

== This is sourmash version 4.1.3.dev7+g0814bcc5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 gather results.
of 1, missed 0 lineage assignments.
loaded results from 1 gather CSVs
WARNING: 100% match! Is query CP001941.1 Aciduliprofundum boonei T469, complete genome identical to its database match, GCF_000025665?
query_name,status,rank,fraction,lineage
"CP001941.1 Aciduliprofundum boonei T469, complete genome",match,species,1.000,d__Archaea;p__Thermoplasmatota;c__Thermoplasmata;o__Aciduliprofundales;f__Aciduliprofundaceae;g__Aciduliprofundum;s__Aciduliprofundum boonei

CTB: please add quotes around the signature name after "Is query ... identical?"

Looking at the output, I see:

query_name,status,rank,fraction,lineage
"CP001941.1 Aciduliprofundum boonei T469, complete genome",match,species,1.000,d__Archaea;p__Thermoplasmatota;c__Thermoplasmata;o__Aciduliprofundales;f__Aciduliprofundaceae;g__Aciduliprofundum;s__Aciduliprofundum boonei

CTB: I think for both metagenome and genome we should add two more columns, query_md5 and query_filename, to the output.

ctb

Looking good! A few small things to fix --

add output messages indicating which file(s) are being produced, for tax annotate
add output messages indicating which file(s) are being produced, for tax genome
add output messages indicating which file(s) are being produced, for tax metagenome
create a "good next issue" to detect if -g and -t were switched and provide helpful error output
in tax genome, please add quotes around the signature name in "Is query {name} identical?"
add two more columns, query_md5 and query_filename, to the csv_summary output from tax metagenome
add two more columns, query_md5 and query_filename, to the csv_summary output from tax genome
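
For the output-message items, a rough sketch of how filenames could be derived from --output-base and reported. The extensions come from the PR description (.summarized.csv, .krona.csv); which format maps to which extension, the `output_dir` parameter, and the helper names are assumptions for illustration only.

```python
import os
import sys

# Assumed mapping from output format to the standard extension added to the base.
OUTPUT_EXTENSIONS = {
    "csv_summary": ".summarized.csv",
    "krona": ".krona.csv",
}


def output_paths(output_base, formats, output_dir=""):
    """Build one output path per requested format from a shared base name."""
    paths = {}
    for fmt in formats:
        filename = output_base + OUTPUT_EXTENSIONS[fmt]
        paths[fmt] = os.path.join(output_dir, filename)
    return paths


def announce_outputs(paths, stream=sys.stderr):
    """Tell the user which file each output format is being written to."""
    for fmt, path in paths.items():
        print(f"saving '{fmt}' output to '{path}'", file=stream)
```

Printing to stderr keeps the messages out of the way when a single output type is written to stdout.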

ctb · 2021-06-22T15:47:04Z

> > yes, I think so
>
> ok, will do, but I think we need to clarify / extend docs on how abundances are used!

Hot take: the docs here (classifying-signatures.md) are good enough, and what we really need is someone to do some real-world validation and then feed that experience back into the docs.

ctb

🎉

ctb · 2021-06-23T00:10:54Z

huh, tests aren't passing tho?

bluegenes · 2021-06-23T13:03:28Z

> huh, tests aren't passing tho?

They keep getting canceled by the server. I've hit rerun; hopefully they'll run this time.

ctb and others added 3 commits May 21, 2021 07:47

provide a kind of ridiculous upgrade to lca index to better deal with…

db32c56

… identifiers/taxonomy

update load_taxonomy_assignments to be more flexible and pay attentio…

2c305ed

…n to CLI

init structure for taxonomy subcommand

1908205

bluegenes and others added 7 commits May 21, 2021 17:37

more init

494dbbc

syntax and add tax to init

ac3a553

Merge branch 'latest' into add-taxonomy

1966afc

Merge branch 'latest' into update/lca_index

85ef9fa

Merge branch 'latest' into add-taxonomy

ce939e5

init tax tests

208a9b3

Merge branch 'latest' into update/lca_index

7e822c6

bluegenes mentioned this pull request May 25, 2021

[WIP] improve identifier & taxonomy parsing for lca index #1542

Closed

bluegenes and others added 10 commits May 25, 2021 13:21

Merge branch 'update/lca_index' of https://github.com/dib-lab/sourmash …

a5bc6f0

…into add-taxonomy

working tax summarize command

33c55f3

fix main

850e4e4

init tests for new tax_utils

0029d51

add ascending taxlist

ea2456a

init classify cmd

1075b68

init tax cli testing

88b643b

Merge branch 'latest' into add-taxonomy

822d664

fix filename

a26052d

Merge branch 'add-taxonomy' of https://github.com/dib-lab/sourmash in…

d2fef53

…to add-taxonomy

luizirber reviewed May 28, 2021

src/sourmash/cli/tax/classify.py (outdated)

bluegenes and others added 4 commits May 28, 2021 09:51

change to function for classify threshold

1448985

add header

06daba4

enable single gather result for summarize; mult for classify

bd26822

add util script to take output of tax and format for krona viz (#1559)

7b5fc72

hehouts mentioned this pull request May 28, 2021

[MRG] add test to confirm failure when summarizing on empty gather #1560

Merged

bluegenes and others added 2 commits May 28, 2021 12:06

Merge branch 'latest' into add-taxonomy

63be068

get summarized working for summary and krona output

2c5f864

bluegenes and others added 4 commits June 21, 2021 15:57

two-sample lineage summary example

27260dc

rename commands

4f33026

upd function names too

18294fd

Merge branch 'latest' into add-taxonomy

b6ef1f2

bluegenes commented Jun 21, 2021

doc/command-line.md (outdated)

minor doc upds

731be75

use f_unique_to_query

2760786

bluegenes requested a review from ctb June 21, 2021 23:49

ctb requested changes Jun 22, 2021

bluegenes added 2 commits June 22, 2021 13:36

add output dir; add file output notifications

d68e3d0

add cols to summary outputs; adjust accordingly

d5d7e51

bluegenes mentioned this pull request Jun 22, 2021

for sourmash tax, detect if -g and -t were switched and provide helpful error output #1626

Open

Merge branch 'latest' into add-taxonomy

c81cab8

bluegenes requested a review from ctb June 22, 2021 23:15

ctb approved these changes Jun 23, 2021

ctb added 2 commits June 23, 2021 06:20

trigger GitHub actions

d5c51d4

Merge branch 'add-taxonomy' of github.com:dib-lab/sourmash into add-t…

ab2a4c8

…axonomy

bluegenes merged commit d473199 into latest Jun 23, 2021

bluegenes deleted the add-taxonomy branch June 23, 2021 13:58

This was referenced Jun 24, 2021

Draft release notes for v4.2.0 #1604

Closed

moving toward sourmash taxonomy for taxonomy reporting and manipulation from sourmash gather results #1515

Closed

implement a sourmash classify command? #1099

Closed

ctb mentioned this pull request Sep 23, 2021

revamp load_taxonomy_assignments and command_index.py #1191

Closed

ctb mentioned this pull request Dec 30, 2021

update charcoal with functions from sourmash.tax dib-lab/charcoal#204

Open
