[MRG] add taxonomy subcommand (#1543)

* provide a kind of ridiculous upgrade to lca index to better deal with identifiers/taxonomy * update load_taxonomy_assignments to be more flexible and pay attention to CLI * init structure for taxonomy subcommand * more init * syntax and add tax to init * init tax tests * working tax summarize command * fix main * init tests for new tax_utils * add ascending taxlist * init classify cmd * init tax cli testing * fix filename * change to function for classify threshold * add header * enable single gather result for summarize; mult for classify * add util script to take output of tax and format for krona viz (#1559) * get summarized working for summary and krona output * init test krona output * add write_summary function * get classify working again, both summary and krona output * test write_classification * init classify cli tests * init tests for load_taxonomy_assignments * enable force for getting past duplicated entries in taxonomy csv * handle and test missing taxonomy info * classify: handle, test empty gather results, gather results from csv * split identifiers by default * standardize spacing * comments * init add tax to docs * [MRG] add a function to take multiple sourmash tax summarize csvs and output a single "abundance" df (#1562) * add a function to take multiple sourmash tax summarize csvs and output a single 'abundance' df * add import dep. * rework format_tax_to_frac for easier testing and use; add tests * better name and docstring for agg_sumgather_csvs_by_lineage * init combine command * debugging code to help track down SBT duplicates/loss problem * fix, I think * remove unnecessary code * add test for duplicate signatures in SBT creation * see what happens when you run twice * add missing signatures, oops * initial refactoring that passes many tests * factor filename generation out of actual writing * refactor and cleanup * add more sigs to test, add note of concern L) * fix --append tests, too * refactor out save_exact in favor if save(..., overwrite=True) * fix some storage stuff in the tests * make test less confusing? * Update src/sourmash/sbt_storage.py Co-authored-by: Luiz Irber <[email protected]> * define list_sbts() on base Storage class * properly record duplicate signature names * move threshold arg parsing into cli/utils * init changes for multiquery input * use namedtuple for summarized gather results * init update for mult files * adjust for namedtuple output * mods for namedtuple * upd utils * add --from-file to summarize * working multifile summarize * --from-csv to --from-file * somewhat working classify again * updated classify * finish fixing combine test * make taxonomy_csv required * cleanup * more cleanup * use load_pathlist_from_file * test check_and_load_gather_csvs * properly restrict kwargs with * * allow lineage summary table output from summarize * require rank for krona, lineage summary output formats * add test for lineage summary output with format_lineage * add docstrings * punt cami to separate PR * raise ValueError on empty gather results * move notify * cleanup * remove tax combine; add tax label * verson of load_taxonomy that strictly uses headers * use new tax fn; enable mult taxonomy inputs * init tax docs * add classification status to classify output * add multi db gather test csv * fix typo * whoops, actually fix * handle accession in lineage csv header * fix line width * return available ranks from load_taxonomy_csv * [MRG] add test to confirm failure when summarizing on empty gather (#1560) * added summarize on empty gather.csv test * added empty taxonomy csv test * fix typo * Update test_tax.py * trouble shoot tests * troubleshooting empty gather test * fixed, maybe? (#1596) * trying to pull * cleaned up test_summarize_empty_gather_tax... * cleaned test_summarize_empty_gather * fixed test comments * removed comment * Update tests/test_tax.py Co-authored-by: Tessa Pierce Ward <[email protected]> * Update tests/test_tax.py Co-authored-by: Tessa Pierce Ward <[email protected]> * updated empty-gather test Co-authored-by: Tessa Pierce Ward <[email protected]> Co-authored-by: C. Titus Brown <[email protected]> * init standardize errs * add good valueerror for empty lineage csv file * better catch errs in __main__; test all cmds: empty gather, lineage files * check available ranks, bad gather headers, empty gather, etc * emit one (and only one) warning per 100% match * add all functions to __all__ ...is this desired? * change cli for better arg parsing; add examples to summarize docs * add mixed strain classify example and assoc data * doc formatting * more doc formatting * more tiny reformatting * Apply suggestions from code review Co-authored-by: C. Titus Brown <[email protected]> * add usage info for each subcommand; upd docs * minor cleanup and more informative warning * add checks for duplicated queries, better output for duplicated filenames; label --> annotate * better catch/print for valueerrors; pyflakes fixes * summary --> csv_summary * upd docs * two-sample lineage summary example * rename commands * upd function names too * minor doc upds * use f_unique_to_query * add output dir; add file output notifications * add cols to summary outputs; adjust accordingly * trigger GitHub actions Co-authored-by: C. Titus Brown <[email protected]> Co-authored-by: Taylor Reiter <[email protected]> Co-authored-by: Luiz Irber <[email protected]> Co-authored-by: Hannah Eve Houts <[email protected]>
sourmash-bio · Jun 23, 2021 · d473199 · d473199
1 parent c04f137
commit d473199
Show file tree

Hide file tree

Showing 21 changed files with 3,557 additions and 16 deletions.
diff --git a/doc/command-line.md b/doc/command-line.md
@@ -70,16 +70,26 @@ There are seven main subcommands: `sketch`, `compare`, `plot`,
 * `prefetch` selects signatures of interest from a very large collection of signatures, for later processing.
 
 There are also a number of commands that work with taxonomic
-information; these are grouped under the `sourmash lca`
-subcommand. See [the LCA tutorial](tutorials-lca.md) for a
-walkthrough of these commands.
+information; these are grouped under the `sourmash tax` and
+`sourmash lca` subcommands.
+
+`sourmash tax` commands:
+
+* `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
+* `tax genome`     - summarize single-genome gather results and report most likely classification.
+* `tax annotate`   - annotate gather results with lineage information (no summarization or classification).
+
+`sourmash lca` commands:
 
 * `lca classify` classifies many signatures against an LCA database.
 * `lca summarize` summarizes the content of metagenomes using an LCA database.
 * `lca index` creates a database for use with LCA subcommands.
 * `lca rankinfo` summarizes the content of a database.
 * `lca compare_csv` compares lineage spreadsheets, e.g. those output by `lca classify`.
 
+> See [the LCA tutorial](tutorials-lca.md) for a
+walkthrough of some of these commands.
+
 Finally, there are a number of utility and information commands:
 
 * `info` shows version and software information.
@@ -411,7 +421,270 @@ This combination of commands ensures that the more time- and
 memory-intensive `gather` step is run only on a small set of relevant
 signatures, rather than all the signatures in the database.
 
-## `sourmash lca` subcommands for taxonomic classification
+## `sourmash tax` subcommands for integrating taxonomic information into gather results
+
+The sourmash `tax` or `taxonomy` commands integrate taxonomic
+ information into the results of `sourmash gather`. All `tax` commands
+ require a properly formatted `taxonomy` csv file that corresponds to
+ the database used for `gather`. For supported databases (e.g. GTDB, NCBI),
+ we provide these files, but they can also be generated for user-generated
+ databases. For more information, see [databases](databases.md).
+
+These commands rely upon the fact that `gather` provides both the total
+ fraction of the query matched to each database matched, as well as a
+ non-overlapping `f_unique_to_query` which is the fraction of the query
+ uniquely matched to each reference genome. The `f_unique_to_query` for
+ any reference match will always be between (0% of query matched) and 1
+ (100% of query matched), and for a query matched to multiple references,
+ the `f_unique_to_query` will sum to at most 1 (100% of query matched).
+ We use this property to aggregate gather matches at the desired
+ taxonomic rank. For example, if the gather results for a metagenome
+ include results for 30 different strains of a given species, we can sum
+ the fraction uniquely matched to each strain to obtain the fraction
+ uniquely matched to this species. Note that this summarization can
+ also take into account abundance weighting; see
+ [Classifying Signatures](classifying-signatures.html) for more
+ information.
+
+As with all reference-based analysis, results can be affected by the
+ completeness of the reference database. However, summarizing taxonomic
+ results from `gather` minimizes issues associated with increasing size
+ and redundancy of reference databases.
+
+For more on how `gather` works and can be used to classify signatures, see
+ [classifying-signatures](classifying-signatures.html)
+
+
+### `sourmash tax metagenome` - summarize metagenome content from `gather` results
+
+`sourmash tax metagenome` summarizes gather results for each query by
+ taxonomic lineage.
+
+example command to summarize a single `gather csv`, where the query was gathered
+ against `gtdb-rs202` representative species database:
+
+```
+sourmash tax metagenome
+    --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+```
+
+There are three possible output formats, `csv_summary`, `lineage_summary`, and
+ `krona`.
+
+#### `csv_summary` output format
+
+`csv_summary` is the default output format. This outputs a `csv` with lineage
+ summarization for each taxonomic rank. This output currently consists of four
+ columns, `query_name,rank,fraction,lineage`, where `fraction` is the  fraction
+ of the query matched to the reported rank and lineage.
+
+example `csv_summary` output from the command above:
+
+```
+query_name,rank,fraction,lineage
+HSMA33MX,superkingdom,0.131,d__Bacteria
+HSMA33MX,phylum,0.073,d__Bacteria;p__Bacteroidota
+HSMA33MX,phylum,0.058,d__Bacteria;p__Proteobacteria
+.
+.
+.
+HSMA33MX,species,0.058,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;
+o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
+HSMA33MX,species,0.057,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
+o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri
+HSMA33MX,species,0.016,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
+o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus
+```
+
+#### `krona` output format
+
+`krona` format is a tab-separated list of these results at a specific rank.
+ The first column, `fraction` is the fraction of the query matched to the
+ reported rank and lineage. The remaining columns are `superkingdom`, `phylum`,
+ ... etc down to the rank used for summarization. This output can be used
+ directly for summary visualization.
+
+To generate `krona`, we add `--output-format krona` to the command above, and
+ need to specify a rank to summarize. Here's the command for reporting `krona`
+ summary at `species` level:
+
+```
+sourmash tax metagenome
+    --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --output-format krona --rank species
+```
+
+example krona output from this command:
+
+```
+fraction	superkingdom	phylum	class	order	family	genus	species
+0.05815279361459521	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae	Escherichia	Escherichia coli
+0.05701254275940707	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Prevotellaceae	Prevotella	Prevotella copri
+0.015637726014008795	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides	Bacteroides vulgatus
+```
+
+#### `lineage_summary` output format
+
+The lineage summary format is most useful when comparing across metagenome queries.
+ Each row is a lineage at the desired reporting rank. The columns are each query
+ used for gather, with the fraction match reported for each lineage. This format
+ is commonly used as input for many external multi-sample visualization tools.
+
+To generate `lineage_summary`, we add `--output-format lineage_summary` to the summarize
+ command, and need to specify a rank to summarize. Here's the command for reporting
+ `lineage_summary` for two queries (HSMA33MX, PSM6XBW3) summary at `species` level.
+
+```
+sourmash tax metagenome
+    --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
+    --gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --output-format krona --rank species
+```
+
+example `lineage_summary`:
+
+```
+lineage HSMA33MX   PSM6XBW3
+d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus    0.015637726014008795    0.015642822225843248
+d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri 0.05701254275940707     0.05703112269838684
+d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli       0.05815279361459521     0.05817174515235457
+```
+
+To produce multiple output types from the same command, add the types into the
+ `--output-format` argument, e.g. `--output-format summary krona lineage_summary`
+
+
+### `sourmash tax genome` - classify a genome using `gather` results
+
+`sourmash tax genome` reports likely classification for each query,
+ based on `gather` matches. By default, classification requires at least 10% of
+ the query to be matched. Thus, if 10% of the query was matched to a species, the
+ species-level classification can be reported. However, if 7% of the query was
+ matched to one species, and an additional 5% matched to a different species in
+ the same genus, the genus-level classification will be reported.
+
+Optionally, `genome` can instead report classifications at a desired `rank`,
+ regardless of match threshold (`--rank` argument, e.g. `--rank species`).
+
+Note that these thresholds and strategies are under active testing.
+
+To illustrate the utility of `genome`, let's consider a signature consisting
+ of two different Shewanella strains, `Shewanella baltica OS185 strain=OS185`
+ and `Shewanella baltica OS223 strain=OS223`. For simplicity, we gave this query
+ the name "Sb47+63".
+
+When we gather this signature against the `gtdb-rs202` representatives database,
+we see 66% matches to one strain, and 33% to the other:
+
+abbreviated gather_csv:
+
+```
+f_match,f_unique_weighted,name,query_name
+0.664,1.0,0.664,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",Sb47+63
+0.656,0.511,0.335,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",Sb47+63
+```
+
+> Here, `f_match` shows that independently, both strains match ~65% percent of
+ this mixed query. The `f_unique_weighted` column has the results of gather-style
+ decomposition. As the OS223 strain had a slightly higher `f_match` (66%), it
+ was the first match. The remaining 33% of the query matched to strain OS185.
+
+Here, we use this gather csv to classify our "Sb47+63" mixed-strain query.
+
+Example command to classify this query from the `gather` csv, using
+the default classification threshold (0.1).
+
+```
+sourmash tax genome
+    --gather-csv 47+63_x_gtdb-rs202.gather.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+```
+
+There are two possible output formats, `csv_summary` and `krona`.
+
+#### `csv_summary` output format
+
+`csv_summary` is the default output format. This outputs a `csv` with lineage
+ summarization for each taxonomic rank. This output currently consists of four
+ columns, `query_name,rank,fraction,lineage`, where `fraction` is the  fraction
+ of the query matched to the reported rank and lineage. The `status` column
+ provides additional information on the classification. The `status` options are:
+
+  - `match` - this query was classified
+  - `nomatch`- this query could not be classified
+  - `below_threshold` - this query was classified at the specified rank,
+     but the query fraction matched was below the containment threshold
+
+Here is the `csv_summary` output from classifying this mixed-strain Shewanella query to
+species level:
+
+```
+query_name,status,rank,fraction,lineage
+"NC_009665.1 Shewanella baltica OS185, complete genome",match,species,1.000,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Shewanellaceae;g__Shewanella;s__Shewanella baltica
+```
+>Here, we see that the match percentages to both strains have been aggregated,
+and we have 100% species-level `Shewanella baltica` annotation.
+
+#### `krona` output format
+
+`krona` format is a tab-separated list of these results at a specific rank.
+  The first column, `fraction` is the fraction of the query matched to the
+  reported rank and lineage. The remaining columns are `superkingdom`, `phylum`,
+  ... etc down to the rank used for summarization. This output can be used
+  directly for `krona` visualization.
+
+To generate `krona`, we must classify by `--rank` instead of using the
+ classification threshold. For the command, we add `--output-format krona`
+  and `--rank <RANK>` to the command above. Here's the command for producing
+  `krona` output for `species`-level classifications:
+
+```
+sourmash tax genome
+    --gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --output-format krona --rank species
+```
+> Note that specifying `--rank` forces classification by rank rather than
+by the containment threshold.
+
+Here is the `krona`-formatted output for this command:
+
+```
+fraction        superkingdom    phylum  class   order   family  genus   species
+Sb47+63   1.0     d__Bacteria     p__Proteobacteria       c__Gammaproteobacteria  o__Enterobacterales     f__Shewanellaceae       g__Shewanella   s__Shewanella baltica
+```
+
+To produce multiple output types from the same command, add the types into the
+ `--output-format` argument, e.g. `--output-format csv_summary krona`.
+ **Note that specifying the classification rank with `--rank`,
+ (e.g. `--rank species`), as needed for `krona` output, forces classification
+ by `rank` rather than by containment threshold.** If the query
+ classification at this rank does not meet the containment threshold
+ (default=0.1), the `status` column will contain `below_threshold`.
+
+
+### `sourmash tax annotate` - annotates gather output with taxonomy
+
+`sourmash tax annotate` adds a column with taxonomic lineage information
+ for each database match to gather output. Do not summarize or classify.
+ Note that this is not required for either `summarize` or `classify`.
+
+By default, `annotate` uses the name of each input gather csv to write an updated
+ version with lineages information. For example, annotating `sample1.gather.csv`
+ would produce `sample1.gather.with-lineages.csv`
+
+```
+sourmash tax annotate
+    --gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+```
+> This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`
+
+
+## `sourmash lca` subcommands for in-memory taxonomy integration
 
 These commands use LCA databases (created with `lca index`, below, or
 prepared databases such as

diff --git a/src/sourmash/__init__.py b/src/sourmash/__init__.py
@@ -111,6 +111,7 @@ def search_sbt_index(*args, **kwargs):
 
 from .sbtmh import create_sbt_index
 from . import lca
+from . import tax
 from . import sbt
 from . import sbtmh
 from . import sbt_storage

diff --git a/src/sourmash/cli/__init__.py b/src/sourmash/cli/__init__.py
@@ -37,6 +37,7 @@
 from . import sig as signature
 from . import sketch
 from . import storage
+from . import tax
 
 
 class SourmashParser(ArgumentParser):
@@ -92,6 +93,7 @@ def parse_args(self, args=None, namespace=None):
 
 def get_parser():
     module_descs = {
+        'tax': 'Integrate taxonomy information based on "gather" results',
         'lca': 'Taxonomic operations',
         'sketch': 'Create signatures',
         'sig': 'Manipulate signature files',

diff --git a/src/sourmash/cli/lca/index.py b/src/sourmash/cli/lca/index.py
@@ -41,7 +41,11 @@ def subparser(subparsers):
     )
     subparser.add_argument(
         '--split-identifiers', action='store_true',
-        help='split names in signatures on whitspace and period'
+        help='split names in signatures on whitespace'
+    )
+    subparser.add_argument(
+        '--keep-identifier-versions', action='store_true',
+        help='do not remove accession versions'
     )
     subparser.add_argument('-f', '--force', action='store_true')
     subparser.add_argument(
@@ -51,6 +55,10 @@ def subparser(subparsers):
         '--require-taxonomy', action='store_true',
         help='ignore signatures with no taxonomy entry'
     )
+    subparser.add_argument(
+        '--fail-on-missing-taxonomy', action='store_true',
+        help='fail quickly if taxonomy is not available for an identifier',
+    )
 
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)

diff --git a/src/sourmash/cli/tax/__init__.py b/src/sourmash/cli/tax/__init__.py
@@ -0,0 +1,32 @@
+"""Define the command line interface for sourmash tax
+
+The top level CLI is defined in ../__init__.py. This module defines the CLI for
+`sourmash tax` operations.
+"""
+
+from . import metagenome
+from . import genome
+from . import annotate
+from ..utils import command_list
+from argparse import SUPPRESS, RawDescriptionHelpFormatter
+import os
+import sys
+
+
+def subparser(subparsers):
+    subparser = subparsers.add_parser('tax', formatter_class=RawDescriptionHelpFormatter, usage=SUPPRESS, aliases=['taxonomy'])
+    desc = 'Operations\n'
+    clidir = os.path.dirname(__file__)
+    ops = command_list(clidir)
+    for subcmd in ops:
+        docstring = getattr(sys.modules[__name__], subcmd).__doc__
+        helpstring = 'sourmash tax {op:s} --help'.format(op=subcmd)
+        desc += '        {hs:33s} {ds:s}\n'.format(hs=helpstring, ds=docstring)
+    s = subparser.add_subparsers(
+        title="Integrate taxonomy information based on 'gather' results", dest='subcmd', metavar='subcmd', help=SUPPRESS,
+        description=desc
+    )
+    for subcmd in ops:
+        getattr(sys.modules[__name__], subcmd).subparser(s)
+    subparser._action_groups.reverse()
+    subparser._optionals.title = 'Options'