sourmash-bio · bluegenes · Jun 23, 2021 · May 21, 2021 · May 21, 2021 · May 21, 2021
diff --git a/doc/command-line.md b/doc/command-line.md
@@ -70,16 +70,26 @@ There are seven main subcommands: `sketch`, `compare`, `plot`,
 * `prefetch` selects signatures of interest from a very large collection of signatures, for later processing.
 
 There are also a number of commands that work with taxonomic
-information; these are grouped under the `sourmash lca`
-subcommand. See [the LCA tutorial](tutorials-lca.md) for a
-walkthrough of these commands.
+information; these are grouped under the `sourmash tax` and
+`sourmash lca` subcommands.
+
+`sourmash tax` commands:
+
+* `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
+* `tax genome`     - summarize single-genome gather results and report most likely classification.
+* `tax annotate`   - annotate gather results with lineage information (no summarization or classification).
+
+`sourmash lca` commands:
 
 * `lca classify` classifies many signatures against an LCA database.
 * `lca summarize` summarizes the content of metagenomes using an LCA database.
 * `lca index` creates a database for use with LCA subcommands.
 * `lca rankinfo` summarizes the content of a database.
 * `lca compare_csv` compares lineage spreadsheets, e.g. those output by `lca classify`.
 
+> See [the LCA tutorial](tutorials-lca.md) for a
+walkthrough of some of these commands.
+
 Finally, there are a number of utility and information commands:
 
 * `info` shows version and software information.
@@ -411,7 +421,268 @@ This combination of commands ensures that the more time- and
 memory-intensive `gather` step is run only on a small set of relevant
 signatures, rather than all the signatures in the database.
 
-## `sourmash lca` subcommands for taxonomic classification
+## `sourmash tax` subcommands for integrating taxonomic information
+
+The sourmash `tax` or `taxonomy` commands integrate taxonomic
+ information into the results of `sourmash gather`. All `tax` commands
+ require a properly formatted `taxonomy` csv file that corresponds to
+ the database used for `gather`. For supported databases (e.g. GTDB, NCBI),
+ we provide these files, but they can also be generated for user-generated
+ databases. For more information, see [databases](databases.md).
+
+These commands rely upon the fact that `gather` provides both the total
+ fraction of the query matched to each database matched, as well as a
+ non-overlapping `f_unique_weighted` which is the fraction of the query
+ (weighted by abundance, if tracked) uniquely matched to each reference
+ genome. The `f_unique_weighted` for any reference match will always be
+ between (0% of query matched) and 1 (100% of query matched), and for a
+ query matched to multiple references, the `f_unique_weighted` will sum
+ to at most 1 (100% of query matched). We use this property to aggregate
+ gather matches at the desired taxonomic rank. For example, if the gather
+ results for a metagenome include results for 30 different strains of a
+ given species, we can sum the fraction uniquely matched to each strain
+ to obtain the fraction uniquely matched to this species.
+
+As with all reference-based analysis, results can be affected by the
+ completeness of the reference database. However, summarizing taxonomic
+ results from `gather` minimizes issues associated with increasing size
+ and redundancy of reference databases.
+
+For more on how `gather` works and can be used to classify signatures, see
+ [classifying-signatures](classifying-signatures.html)
+
+
+### `sourmash tax metagenome` - summarize metagenome content from `gather` results
+
+`sourmash tax metagenome` summarizes gather results for each query by
+ taxonomic lineage.
+
+example command to summarize a single `gather csv`, where the query was gathered
+ against `gtdb-rs202` representative species database:
+
+```
+sourmash tax metagenome
+    --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+```
+
+There are three possible output formats, `csv_summary`, `lineage_summary`, and
+ `krona`.
+
+#### `csv_summary` output format
+
+`csv_summary` is the default output format. This outputs a `csv` with lineage
+ summarization for each taxonomic rank. This output currently consists of four
+ columns, `query_name,rank,fraction,lineage`, where `fraction` is the  fraction
+ of the query matched to the reported rank and lineage.
+
+example `csv_summary` output from the command above:
+
+```
+query_name,rank,fraction,lineage
+HSMA33MX,superkingdom,0.131,d__Bacteria
+HSMA33MX,phylum,0.073,d__Bacteria;p__Bacteroidota
+HSMA33MX,phylum,0.058,d__Bacteria;p__Proteobacteria
+.
+.
+.
+HSMA33MX,species,0.058,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;
+o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
+HSMA33MX,species,0.057,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
+o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri
+HSMA33MX,species,0.016,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
+o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus
+```
+
+#### `krona` output format
+
+`krona` format is a tab-separated list of these results at a specific rank.
+ The first column, `fraction` is the fraction of the query matched to the
+ reported rank and lineage. The remaining columns are `superkingdom`, `phylum`,
+ ... etc down to the rank used for summarization. This output can be used
+ directly for summary visualization.
+
+To generate `krona`, we add `--output-format krona` to the command above, and
+ need to specify a rank to summarize. Here's the command for reporting `krona`
+ summary at `species` level:
+
+```
+sourmash tax metagenome
+    --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --output-format krona --rank species
+```
+
+example krona output from this command:
+
+```
+fraction	superkingdom	phylum	class	order	family	genus	species
+0.05815279361459521	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae	Escherichia	Escherichia coli
+0.05701254275940707	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Prevotellaceae	Prevotella	Prevotella copri
+0.015637726014008795	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides	Bacteroides vulgatus
+```
+
+#### `lineage_summary` output format
+
+The lineage summary format is most useful when comparing across metagenome queries.
+ Each row is a lineage at the desired reporting rank. The columns are each query
+ used for gather, with the fraction match reported for each lineage. This format
+ is commonly used as input for many external multi-sample visualization tools.
+
+To generate `lineage_summary`, we add `--output-format lineage_summary` to the summarize
+ command, and need to specify a rank to summarize. Here's the command for reporting
+ `lineage_summary` for two queries (HSMA33MX, PSM6XBW3) summary at `species` level.
+
+```
+sourmash tax metagenome
+    --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
+    --gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --output-format krona --rank species
+```
+
+example `lineage_summary`:
+
+```
+lineage HSMA33MX   PSM6XBW3
+d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus    0.015637726014008795    0.015642822225843248
+d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri 0.05701254275940707     0.05703112269838684
+d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli       0.05815279361459521     0.05817174515235457
+```
+
+To produce multiple output types from the same command, add the types into the
+ `--output-format` argument, e.g. `--output-format summary krona lineage_summary`
+
+
+### `sourmash tax genome` - classify a genome using `gather` results
+
+`sourmash tax genome` reports likely classification for each query,
+ based on `gather` matches. By default, classification requires at least 10% of
+ the query to be matched. Thus, if 10% of the query was matched to a species, the
+ species-level classification can be reported. However, if 7% of the query was
+ matched to one species, and an additional 5% matched to a different species in
+ the same genus, the genus-level classification will be reported.
+
+Optionally, `genome` can instead report classifications at a desired `rank`,
+ regardless of match threshold (`--rank` argument, e.g. `--rank species`).
+
+Note that these thresholds and strategies are under active testing.
+
+To illustrate the utility of `genome`, let's consider a signature consisting
+ of two different Shewanella strains, `Shewanella baltica OS185 strain=OS185`
+ and `Shewanella baltica OS223 strain=OS223`. For simplicity, we gave this query
+ the name "Sb47+63".
+
+When we gather this signature against the `gtdb-rs202` representatives database,
+we see 66% matches to one strain, and 33% to the other:
+
+abbreviated gather_csv:
+
+```
+f_match,f_unique_weighted,name,query_name
+0.664,1.0,0.664,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",Sb47+63
+0.656,0.511,0.335,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",Sb47+63
+```
+
+> Here, `f_match` shows that independently, both strains match ~65% percent of
+ this mixed query. The `f_unique_weighted` column has the results of gather-style
+ decomposition. As the OS223 strain had a slightly higher `f_match` (66%), it
+ was the first match. The remaining 33% of the query matched to strain OS185.
+
+Here, we use this gather csv to classify our "Sb47+63" mixed-strain query.
+
+Example command to classify this query from the `gather` csv, using
+the default classification threshold (0.1).
+
+```
+sourmash tax genome
+    --gather-csv 47+63_x_gtdb-rs202.gather.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+```
+
+There are two possible output formats, `csv_summary` and `krona`.
+
+#### `csv_summary` output format
+
+`csv_summary` is the default output format. This outputs a `csv` with lineage
+ summarization for each taxonomic rank. This output currently consists of four
+ columns, `query_name,rank,fraction,lineage`, where `fraction` is the  fraction
+ of the query matched to the reported rank and lineage. The `status` column
+ provides additional information on the classification. The `status` options are:
+
+  - `match` - this query was classified
+  - `nomatch`- this query could not be classified
+  - `below_threshold` - this query was classified at the specified rank,
+     but the query fraction matched was below the containment threshold
+
+Here is the `csv_summary` output from classifying this mixed-strain Shewanella query to
+species level:
+
+```
+query_name,status,rank,fraction,lineage
+"NC_009665.1 Shewanella baltica OS185, complete genome",match,species,1.000,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Shewanellaceae;g__Shewanella;s__Shewanella baltica
+```
+>Here, we see that the match percentages to both strains have been aggregated,
+and we have 100% species-level `Shewanella baltica` annotation.
+
+#### `krona` output format
+
+`krona` format is a tab-separated list of these results at a specific rank.
+  The first column, `fraction` is the fraction of the query matched to the
+  reported rank and lineage. The remaining columns are `superkingdom`, `phylum`,
+  ... etc down to the rank used for summarization. This output can be used
+  directly for `krona` visualization.
+
+To generate `krona`, we must classify by `--rank` instead of using the
+ classification threshold. For the command, we add `--output-format krona`
+  and `--rank <RANK>` to the command above. Here's the command for producing
+  `krona` output for `species`-level classifications:
+
+```
+sourmash tax genome
+    --gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --output-format krona --rank species
+```
+
+Here is the `krona`-formatted output for this command:
+
+```
+fraction        superkingdom    phylum  class   order   family  genus   species
+Sb47+63   1.0     d__Bacteria     p__Proteobacteria       c__Gammaproteobacteria  o__Enterobacterales     f__Shewanellaceae       g__Shewanella   s__Shewanella baltica
+```
+
+Note here that specifying `--rank` forces classification by rank rather than
+by the containment threshold.
+
+To produce multiple output types from the same command, add the types into the
+ `--output-format` argument, e.g. `--output-format csv_summary krona`.
+ **Note that specifying the classification rank with `--rank`,
+ (e.g. `--rank species`), as needed for `krona` output, forces classification
+ by `rank` rather than by containment threshold.** If the query
+ classification at this rank does not meet the containment threshold
+ (default=0.1), the `status` column will contain `below_threshold`.
+
+
+### `sourmash tax annotate` - annotates gather output with taxonomy
+
+`sourmash tax annotate` adds a column with taxonomic lineage information
+ for each database match to gather output. Do not summarize or classify.
+ Note that this is not required for either `summarize` or `classify`.
+
+By default, `annotate` uses the name of each input gather csv to write an updated
+ version with lineages information. For example, annotating `sample1.gather.csv`
+ would produce `sample1.gather.with-lineages.csv`
+
+```
+sourmash tax annotate
+    --gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
+    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+```
+> This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`
+
+
+## `sourmash lca` subcommands for in-memory taxonomy integration
 
 These commands use LCA databases (created with `lca index`, below, or
 prepared databases such as

diff --git a/src/sourmash/__init__.py b/src/sourmash/__init__.py
@@ -111,6 +111,7 @@ def search_sbt_index(*args, **kwargs):
 
 from .sbtmh import create_sbt_index
 from . import lca
+from . import tax
 from . import sbt
 from . import sbtmh
 from . import sbt_storage

diff --git a/src/sourmash/cli/__init__.py b/src/sourmash/cli/__init__.py
@@ -37,6 +37,7 @@
 from . import sig as signature
 from . import sketch
 from . import storage
+from . import tax
 
 
 class SourmashParser(ArgumentParser):
@@ -92,6 +93,7 @@ def parse_args(self, args=None, namespace=None):
 
 def get_parser():
     module_descs = {
+        'tax': 'Integrate taxonomy information based on "gather" results',
         'lca': 'Taxonomic operations',
         'sketch': 'Create signatures',
         'sig': 'Manipulate signature files',

diff --git a/src/sourmash/cli/lca/index.py b/src/sourmash/cli/lca/index.py
@@ -41,7 +41,11 @@ def subparser(subparsers):
     )
     subparser.add_argument(
         '--split-identifiers', action='store_true',
-        help='split names in signatures on whitspace and period'
+        help='split names in signatures on whitespace'
+    )
+    subparser.add_argument(
+        '--keep-identifier-versions', action='store_true',
+        help='do not remove accession versions'
     )
     subparser.add_argument('-f', '--force', action='store_true')
     subparser.add_argument(
@@ -51,6 +55,10 @@ def subparser(subparsers):
         '--require-taxonomy', action='store_true',
         help='ignore signatures with no taxonomy entry'
     )
+    subparser.add_argument(
+        '--fail-on-missing-taxonomy', action='store_true',
+        help='fail quickly if taxonomy is not available for an identifier',
+    )
 
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)

diff --git a/src/sourmash/cli/tax/__init__.py b/src/sourmash/cli/tax/__init__.py
@@ -0,0 +1,32 @@
+"""Define the command line interface for sourmash tax
+
+The top level CLI is defined in ../__init__.py. This module defines the CLI for
+`sourmash tax` operations.
+"""
+
+from . import metagenome
+from . import genome
+from . import annotate
+from ..utils import command_list
+from argparse import SUPPRESS, RawDescriptionHelpFormatter
+import os
+import sys
+
+
+def subparser(subparsers):
+    subparser = subparsers.add_parser('tax', formatter_class=RawDescriptionHelpFormatter, usage=SUPPRESS, aliases=['taxonomy'])
+    desc = 'Operations\n'
+    clidir = os.path.dirname(__file__)
+    ops = command_list(clidir)
+    for subcmd in ops:
+        docstring = getattr(sys.modules[__name__], subcmd).__doc__
+        helpstring = 'sourmash tax {op:s} --help'.format(op=subcmd)
+        desc += '        {hs:33s} {ds:s}\n'.format(hs=helpstring, ds=docstring)
+    s = subparser.add_subparsers(
+        title="Integrate taxonomy information based on 'gather' results", dest='subcmd', metavar='subcmd', help=SUPPRESS,
+        description=desc
+    )
+    for subcmd in ops:
+        getattr(sys.modules[__name__], subcmd).subparser(s)
+    subparser._action_groups.reverse()
+    subparser._optionals.title = 'Options'