Skip to content

Commit

Permalink
[MRG] add taxonomy subcommand (#1543)
Browse files Browse the repository at this point in the history
* provide a kind of ridiculous upgrade to lca index to better deal with identifiers/taxonomy

* update load_taxonomy_assignments to be more flexible and pay attention to CLI

* init structure for taxonomy subcommand

* more init

* syntax and add tax to init

* init tax tests

* working tax summarize command

* fix main

* init tests for new tax_utils

* add ascending taxlist

* init classify cmd

* init tax cli testing

* fix filename

* change to function for classify threshold

* add header

* enable single gather result for summarize; mult for classify

* add util script to take output of tax and format for krona viz (#1559)

* get summarized working for summary and krona output

* init test krona output

* add write_summary function

* get classify working again, both summary and krona output

* test write_classification

* init classify cli tests

* init tests for load_taxonomy_assignments

* enable force for getting past duplicated entries in taxonomy csv

* handle and test missing taxonomy info

* classify: handle, test empty gather results, gather results from csv

* split identifiers by default

* standardize spacing

* comments

* init add tax to docs

* [MRG] add a function to take multiple sourmash tax summarize csvs and output a single "abundance" df (#1562)

* add a function to take multiple sourmash tax summarize csvs and output a single 'abundance' df

* add import dep.

* rework format_tax_to_frac for easier testing and use; add tests

* better name and docstring for agg_sumgather_csvs_by_lineage

* init combine command

* debugging code to help track down SBT duplicates/loss problem

* fix, I think

* remove unnecessary code

* add test for duplicate signatures in SBT creation

* see what happens when you run twice

* add missing signatures, oops

* initial refactoring that passes many tests

* factor filename generation out of actual writing

* refactor and cleanup

* add more sigs to test, add note of concern L)

* fix --append tests, too

* refactor out save_exact in favor if save(..., overwrite=True)

* fix some storage stuff in the tests

* make test less confusing?

* Update src/sourmash/sbt_storage.py

Co-authored-by: Luiz Irber <[email protected]>

* define list_sbts() on base Storage class

* properly record duplicate signature names

* move threshold arg parsing into cli/utils

* init changes for multiquery input

* use namedtuple for summarized gather results

* init update for mult files

* adjust for namedtuple output

* mods for namedtuple

* upd utils

* add --from-file to summarize

* working multifile summarize

* --from-csv to --from-file

* somewhat working classify again

* updated classify

* finish fixing combine test

* make taxonomy_csv required

* cleanup

* more cleanup

* use load_pathlist_from_file

* test check_and_load_gather_csvs

* properly restrict kwargs with *

* allow lineage summary table output from summarize

* require rank for krona, lineage summary output formats

* add test for lineage summary output with format_lineage

* add docstrings

* punt cami to separate PR

* raise ValueError on empty gather results

* move notify

* cleanup

* remove tax combine; add tax label

* verson of load_taxonomy that strictly uses headers

* use new tax fn; enable mult taxonomy inputs

* init tax docs

* add classification status to classify output

* add multi db gather test csv

* fix typo

* whoops, actually fix

* handle accession in lineage csv header

* fix line width

* return available ranks from load_taxonomy_csv

* [MRG] add test to confirm failure when summarizing on empty gather (#1560)

* added summarize on empty gather.csv test

* added empty taxonomy csv test

* fix typo

* Update test_tax.py

* trouble shoot tests

* troubleshooting empty gather test

* fixed, maybe? (#1596)

* trying to pull

* cleaned up test_summarize_empty_gather_tax...

* cleaned test_summarize_empty_gather

* fixed test comments

* removed comment

* Update tests/test_tax.py

Co-authored-by: Tessa Pierce Ward <[email protected]>

* Update tests/test_tax.py

Co-authored-by: Tessa Pierce Ward <[email protected]>

* updated empty-gather test

Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: C. Titus Brown <[email protected]>

* init standardize errs

* add good valueerror for empty lineage csv file

* better catch errs in __main__; test all cmds: empty gather, lineage files

* check available ranks, bad gather headers, empty gather, etc

* emit one (and only one) warning per 100% match

* add all functions to __all__ ...is this desired?

* change cli for better arg parsing; add examples to summarize docs

* add mixed strain classify example and assoc data

* doc formatting

* more doc formatting

* more tiny reformatting

* Apply suggestions from code review

Co-authored-by: C. Titus Brown <[email protected]>

* add usage info for each subcommand; upd docs

* minor cleanup and more informative warning

* add checks for duplicated queries, better output for duplicated filenames; label --> annotate

* better catch/print for valueerrors; pyflakes fixes

* summary --> csv_summary

* upd docs

* two-sample lineage summary example

* rename commands

* upd function names too

* minor doc upds

* use f_unique_to_query

* add output dir; add file output notifications

* add cols to summary outputs; adjust accordingly

* trigger GitHub actions

Co-authored-by: C. Titus Brown <[email protected]>
Co-authored-by: Taylor Reiter <[email protected]>
Co-authored-by: Luiz Irber <[email protected]>
Co-authored-by: Hannah Eve Houts <[email protected]>
  • Loading branch information
5 people authored Jun 23, 2021
1 parent c04f137 commit d473199
Show file tree
Hide file tree
Showing 21 changed files with 3,557 additions and 16 deletions.
281 changes: 277 additions & 4 deletions doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,16 +70,26 @@ There are seven main subcommands: `sketch`, `compare`, `plot`,
* `prefetch` selects signatures of interest from a very large collection of signatures, for later processing.

There are also a number of commands that work with taxonomic
information; these are grouped under the `sourmash lca`
subcommand. See [the LCA tutorial](tutorials-lca.md) for a
walkthrough of these commands.
information; these are grouped under the `sourmash tax` and
`sourmash lca` subcommands.

`sourmash tax` commands:

* `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
* `tax genome` - summarize single-genome gather results and report most likely classification.
* `tax annotate` - annotate gather results with lineage information (no summarization or classification).

`sourmash lca` commands:

* `lca classify` classifies many signatures against an LCA database.
* `lca summarize` summarizes the content of metagenomes using an LCA database.
* `lca index` creates a database for use with LCA subcommands.
* `lca rankinfo` summarizes the content of a database.
* `lca compare_csv` compares lineage spreadsheets, e.g. those output by `lca classify`.

> See [the LCA tutorial](tutorials-lca.md) for a
walkthrough of some of these commands.

Finally, there are a number of utility and information commands:

* `info` shows version and software information.
Expand Down Expand Up @@ -411,7 +421,270 @@ This combination of commands ensures that the more time- and
memory-intensive `gather` step is run only on a small set of relevant
signatures, rather than all the signatures in the database.

## `sourmash lca` subcommands for taxonomic classification
## `sourmash tax` subcommands for integrating taxonomic information into gather results

The sourmash `tax` or `taxonomy` commands integrate taxonomic
information into the results of `sourmash gather`. All `tax` commands
require a properly formatted `taxonomy` csv file that corresponds to
the database used for `gather`. For supported databases (e.g. GTDB, NCBI),
we provide these files, but they can also be generated for user-generated
databases. For more information, see [databases](databases.md).

These commands rely upon the fact that `gather` provides both the total
fraction of the query matched to each database matched, as well as a
non-overlapping `f_unique_to_query` which is the fraction of the query
uniquely matched to each reference genome. The `f_unique_to_query` for
any reference match will always be between (0% of query matched) and 1
(100% of query matched), and for a query matched to multiple references,
the `f_unique_to_query` will sum to at most 1 (100% of query matched).
We use this property to aggregate gather matches at the desired
taxonomic rank. For example, if the gather results for a metagenome
include results for 30 different strains of a given species, we can sum
the fraction uniquely matched to each strain to obtain the fraction
uniquely matched to this species. Note that this summarization can
also take into account abundance weighting; see
[Classifying Signatures](classifying-signatures.html) for more
information.

As with all reference-based analysis, results can be affected by the
completeness of the reference database. However, summarizing taxonomic
results from `gather` minimizes issues associated with increasing size
and redundancy of reference databases.

For more on how `gather` works and can be used to classify signatures, see
[classifying-signatures](classifying-signatures.html)


### `sourmash tax metagenome` - summarize metagenome content from `gather` results

`sourmash tax metagenome` summarizes gather results for each query by
taxonomic lineage.

example command to summarize a single `gather csv`, where the query was gathered
against `gtdb-rs202` representative species database:

```
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
```

There are three possible output formats, `csv_summary`, `lineage_summary`, and
`krona`.

#### `csv_summary` output format

`csv_summary` is the default output format. This outputs a `csv` with lineage
summarization for each taxonomic rank. This output currently consists of four
columns, `query_name,rank,fraction,lineage`, where `fraction` is the fraction
of the query matched to the reported rank and lineage.

example `csv_summary` output from the command above:

```
query_name,rank,fraction,lineage
HSMA33MX,superkingdom,0.131,d__Bacteria
HSMA33MX,phylum,0.073,d__Bacteria;p__Bacteroidota
HSMA33MX,phylum,0.058,d__Bacteria;p__Proteobacteria
.
.
.
HSMA33MX,species,0.058,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;
o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
HSMA33MX,species,0.057,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri
HSMA33MX,species,0.016,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus
```

#### `krona` output format

`krona` format is a tab-separated list of these results at a specific rank.
The first column, `fraction` is the fraction of the query matched to the
reported rank and lineage. The remaining columns are `superkingdom`, `phylum`,
... etc down to the rank used for summarization. This output can be used
directly for summary visualization.

To generate `krona`, we add `--output-format krona` to the command above, and
need to specify a rank to summarize. Here's the command for reporting `krona`
summary at `species` level:

```
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```

example krona output from this command:

```
fraction superkingdom phylum class order family genus species
0.05815279361459521 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli
0.05701254275940707 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella Prevotella copri
0.015637726014008795 Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides vulgatus
```

#### `lineage_summary` output format

The lineage summary format is most useful when comparing across metagenome queries.
Each row is a lineage at the desired reporting rank. The columns are each query
used for gather, with the fraction match reported for each lineage. This format
is commonly used as input for many external multi-sample visualization tools.

To generate `lineage_summary`, we add `--output-format lineage_summary` to the summarize
command, and need to specify a rank to summarize. Here's the command for reporting
`lineage_summary` for two queries (HSMA33MX, PSM6XBW3) summary at `species` level.

```
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```

example `lineage_summary`:

```
lineage HSMA33MX PSM6XBW3
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus 0.015637726014008795 0.015642822225843248
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri 0.05701254275940707 0.05703112269838684
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli 0.05815279361459521 0.05817174515235457
```

To produce multiple output types from the same command, add the types into the
`--output-format` argument, e.g. `--output-format summary krona lineage_summary`


### `sourmash tax genome` - classify a genome using `gather` results

`sourmash tax genome` reports likely classification for each query,
based on `gather` matches. By default, classification requires at least 10% of
the query to be matched. Thus, if 10% of the query was matched to a species, the
species-level classification can be reported. However, if 7% of the query was
matched to one species, and an additional 5% matched to a different species in
the same genus, the genus-level classification will be reported.

Optionally, `genome` can instead report classifications at a desired `rank`,
regardless of match threshold (`--rank` argument, e.g. `--rank species`).

Note that these thresholds and strategies are under active testing.

To illustrate the utility of `genome`, let's consider a signature consisting
of two different Shewanella strains, `Shewanella baltica OS185 strain=OS185`
and `Shewanella baltica OS223 strain=OS223`. For simplicity, we gave this query
the name "Sb47+63".

When we gather this signature against the `gtdb-rs202` representatives database,
we see 66% matches to one strain, and 33% to the other:

abbreviated gather_csv:

```
f_match,f_unique_weighted,name,query_name
0.664,1.0,0.664,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",Sb47+63
0.656,0.511,0.335,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",Sb47+63
```

> Here, `f_match` shows that independently, both strains match ~65% percent of
this mixed query. The `f_unique_weighted` column has the results of gather-style
decomposition. As the OS223 strain had a slightly higher `f_match` (66%), it
was the first match. The remaining 33% of the query matched to strain OS185.

Here, we use this gather csv to classify our "Sb47+63" mixed-strain query.

Example command to classify this query from the `gather` csv, using
the default classification threshold (0.1).

```
sourmash tax genome
--gather-csv 47+63_x_gtdb-rs202.gather.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
```

There are two possible output formats, `csv_summary` and `krona`.

#### `csv_summary` output format

`csv_summary` is the default output format. This outputs a `csv` with lineage
summarization for each taxonomic rank. This output currently consists of four
columns, `query_name,rank,fraction,lineage`, where `fraction` is the fraction
of the query matched to the reported rank and lineage. The `status` column
provides additional information on the classification. The `status` options are:

- `match` - this query was classified
- `nomatch`- this query could not be classified
- `below_threshold` - this query was classified at the specified rank,
but the query fraction matched was below the containment threshold

Here is the `csv_summary` output from classifying this mixed-strain Shewanella query to
species level:

```
query_name,status,rank,fraction,lineage
"NC_009665.1 Shewanella baltica OS185, complete genome",match,species,1.000,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Shewanellaceae;g__Shewanella;s__Shewanella baltica
```
>Here, we see that the match percentages to both strains have been aggregated,
and we have 100% species-level `Shewanella baltica` annotation.

#### `krona` output format

`krona` format is a tab-separated list of these results at a specific rank.
The first column, `fraction` is the fraction of the query matched to the
reported rank and lineage. The remaining columns are `superkingdom`, `phylum`,
... etc down to the rank used for summarization. This output can be used
directly for `krona` visualization.

To generate `krona`, we must classify by `--rank` instead of using the
classification threshold. For the command, we add `--output-format krona`
and `--rank <RANK>` to the command above. Here's the command for producing
`krona` output for `species`-level classifications:

```
sourmash tax genome
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```
> Note that specifying `--rank` forces classification by rank rather than
by the containment threshold.

Here is the `krona`-formatted output for this command:

```
fraction superkingdom phylum class order family genus species
Sb47+63 1.0 d__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Shewanellaceae g__Shewanella s__Shewanella baltica
```

To produce multiple output types from the same command, add the types into the
`--output-format` argument, e.g. `--output-format csv_summary krona`.
**Note that specifying the classification rank with `--rank`,
(e.g. `--rank species`), as needed for `krona` output, forces classification
by `rank` rather than by containment threshold.** If the query
classification at this rank does not meet the containment threshold
(default=0.1), the `status` column will contain `below_threshold`.


### `sourmash tax annotate` - annotates gather output with taxonomy

`sourmash tax annotate` adds a column with taxonomic lineage information
for each database match to gather output. Do not summarize or classify.
Note that this is not required for either `summarize` or `classify`.

By default, `annotate` uses the name of each input gather csv to write an updated
version with lineages information. For example, annotating `sample1.gather.csv`
would produce `sample1.gather.with-lineages.csv`

```
sourmash tax annotate
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
```
> This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`

## `sourmash lca` subcommands for in-memory taxonomy integration

These commands use LCA databases (created with `lca index`, below, or
prepared databases such as
Expand Down
1 change: 1 addition & 0 deletions src/sourmash/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,7 @@ def search_sbt_index(*args, **kwargs):

from .sbtmh import create_sbt_index
from . import lca
from . import tax
from . import sbt
from . import sbtmh
from . import sbt_storage
Expand Down
2 changes: 2 additions & 0 deletions src/sourmash/cli/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@
from . import sig as signature
from . import sketch
from . import storage
from . import tax


class SourmashParser(ArgumentParser):
Expand Down Expand Up @@ -92,6 +93,7 @@ def parse_args(self, args=None, namespace=None):

def get_parser():
module_descs = {
'tax': 'Integrate taxonomy information based on "gather" results',
'lca': 'Taxonomic operations',
'sketch': 'Create signatures',
'sig': 'Manipulate signature files',
Expand Down
10 changes: 9 additions & 1 deletion src/sourmash/cli/lca/index.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,11 @@ def subparser(subparsers):
)
subparser.add_argument(
'--split-identifiers', action='store_true',
help='split names in signatures on whitspace and period'
help='split names in signatures on whitespace'
)
subparser.add_argument(
'--keep-identifier-versions', action='store_true',
help='do not remove accession versions'
)
subparser.add_argument('-f', '--force', action='store_true')
subparser.add_argument(
Expand All @@ -51,6 +55,10 @@ def subparser(subparsers):
'--require-taxonomy', action='store_true',
help='ignore signatures with no taxonomy entry'
)
subparser.add_argument(
'--fail-on-missing-taxonomy', action='store_true',
help='fail quickly if taxonomy is not available for an identifier',
)

add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
Expand Down
32 changes: 32 additions & 0 deletions src/sourmash/cli/tax/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
"""Define the command line interface for sourmash tax
The top level CLI is defined in ../__init__.py. This module defines the CLI for
`sourmash tax` operations.
"""

from . import metagenome
from . import genome
from . import annotate
from ..utils import command_list
from argparse import SUPPRESS, RawDescriptionHelpFormatter
import os
import sys


def subparser(subparsers):
subparser = subparsers.add_parser('tax', formatter_class=RawDescriptionHelpFormatter, usage=SUPPRESS, aliases=['taxonomy'])
desc = 'Operations\n'
clidir = os.path.dirname(__file__)
ops = command_list(clidir)
for subcmd in ops:
docstring = getattr(sys.modules[__name__], subcmd).__doc__
helpstring = 'sourmash tax {op:s} --help'.format(op=subcmd)
desc += ' {hs:33s} {ds:s}\n'.format(hs=helpstring, ds=docstring)
s = subparser.add_subparsers(
title="Integrate taxonomy information based on 'gather' results", dest='subcmd', metavar='subcmd', help=SUPPRESS,
description=desc
)
for subcmd in ops:
getattr(sys.modules[__name__], subcmd).subparser(s)
subparser._action_groups.reverse()
subparser._optionals.title = 'Options'
Loading

0 comments on commit d473199

Please sign in to comment.