[MRG] add taxonomy subcommand #1543

Merged
merged 145 commits into from
Jun 23, 2021
Changes from 138 commits
Commits
db32c56
provide a kind of ridiculous upgrade to lca index to better deal with…
ctb May 21, 2021
2c305ed
update load_taxonomy_assignments to be more flexible and pay attentio…
ctb May 21, 2021
1908205
init structure for taxonomy subcommand
bluegenes May 21, 2021
494dbbc
more init
bluegenes May 22, 2021
ac3a553
syntax and add tax to init
bluegenes May 22, 2021
1966afc
Merge branch 'latest' into add-taxonomy
bluegenes May 22, 2021
85ef9fa
Merge branch 'latest' into update/lca_index
ctb May 22, 2021
ce939e5
Merge branch 'latest' into add-taxonomy
bluegenes May 24, 2021
208a9b3
init tax tests
bluegenes May 25, 2021
7e822c6
Merge branch 'latest' into update/lca_index
bluegenes May 25, 2021
a5bc6f0
Merge branch 'update/lca_index' of https://github.com/dib-lab/sourmas…
bluegenes May 25, 2021
33c55f3
working tax summarize command
bluegenes May 25, 2021
850e4e4
fix main
bluegenes May 26, 2021
0029d51
init tests for new tax_utils
bluegenes May 27, 2021
ea2456a
add ascending taxlist
bluegenes May 27, 2021
1075b68
init classify cmd
bluegenes May 28, 2021
88b643b
init tax cli testing
bluegenes May 28, 2021
822d664
Merge branch 'latest' into add-taxonomy
bluegenes May 28, 2021
a26052d
fix filename
bluegenes May 28, 2021
d2fef53
Merge branch 'add-taxonomy' of https://github.com/dib-lab/sourmash in…
bluegenes May 28, 2021
1448985
change to function for classify threshold
bluegenes May 28, 2021
06daba4
add header
bluegenes May 28, 2021
bd26822
enable single gather result for summarize; mult for classify
bluegenes May 28, 2021
7b5fc72
add util script to take output of tax and format for krona viz (#1559)
taylorreiter May 28, 2021
63be068
Merge branch 'latest' into add-taxonomy
bluegenes May 28, 2021
2c5f864
get summarized working for summary and krona output
bluegenes May 28, 2021
c868e06
init test krona output
bluegenes May 28, 2021
bb04930
add write_summary function
bluegenes May 28, 2021
4fd6de0
get classify working again, both summary and krona output
bluegenes May 29, 2021
8d6f321
test write_classification
bluegenes May 29, 2021
147fed9
init classify cli tests
bluegenes May 29, 2021
b1a40a3
init tests for load_taxonomy_assignments
bluegenes May 31, 2021
6410f37
enable force for getting past duplicated entries in taxonomy csv
bluegenes May 31, 2021
454ca3a
handle and test missing taxonomy info
bluegenes May 31, 2021
65e7d5d
classify: handle, test empty gather results, gather results from csv
bluegenes May 31, 2021
eb980cd
split identifiers by default
bluegenes May 31, 2021
f83f741
standardize spacing
bluegenes May 31, 2021
28c5107
comments
bluegenes May 31, 2021
29b82e2
init add tax to docs
bluegenes Jun 1, 2021
0dbf6fb
[MRG] add a function to take multiple sourmash tax summarize csvs and…
taylorreiter Jun 1, 2021
277ab35
Merge branch 'add-taxonomy' of https://github.com/dib-lab/sourmash in…
bluegenes Jun 1, 2021
1625743
rework format_tax_to_frac for easier testing and use; add tests
bluegenes Jun 2, 2021
cbc4020
better name and docstring for agg_sumgather_csvs_by_lineage
bluegenes Jun 2, 2021
fc3de6d
init combine command
bluegenes Jun 2, 2021
b042aa4
Merge branch 'latest' into add-taxonomy
bluegenes Jun 2, 2021
6c117dd
Merge branch 'latest' into add-taxonomy
bluegenes Jun 2, 2021
721421d
debugging code to help track down SBT duplicates/loss problem
ctb Jun 4, 2021
0dd9fa3
fix, I think
ctb Jun 4, 2021
f4a5e2e
remove unnecessary code
ctb Jun 4, 2021
7c44fc7
add test for duplicate signatures in SBT creation
ctb Jun 4, 2021
9f7848f
Merge branch 'debug/sbt_dups_tests' into debug/sbt_dups
ctb Jun 4, 2021
61d88d2
see what happens when you run twice
ctb Jun 4, 2021
1074d62
add missing signatures, oops
ctb Jun 4, 2021
9fd5076
initial refactoring that passes many tests
ctb Jun 4, 2021
b9a63bb
factor filename generation out of actual writing
ctb Jun 4, 2021
aee6cf6
refactor and cleanup
ctb Jun 4, 2021
e4c69e3
add more sigs to test, add note of concern L)
ctb Jun 4, 2021
c909b00
fix --append tests, too
ctb Jun 4, 2021
cf1b74a
refactor out save_exact in favor if save(..., overwrite=True)
ctb Jun 4, 2021
15f02cd
fix some storage stuff in the tests
ctb Jun 4, 2021
578ba43
make test less confusing?
ctb Jun 4, 2021
b46b966
Update src/sourmash/sbt_storage.py
ctb Jun 4, 2021
5a61c0d
define list_sbts() on base Storage class
ctb Jun 4, 2021
2af87b6
Merge branch 'debug/sbt_dups' of github.com:dib-lab/sourmash into deb…
ctb Jun 4, 2021
6bf3b0a
Merge branch 'latest' of https://github.com/dib-lab/sourmash into upd…
ctb Jun 5, 2021
f362774
Merge branch 'debug/sbt_dups' into update/lca_index
ctb Jun 5, 2021
3b1063c
properly record duplicate signature names
ctb Jun 6, 2021
e9e275f
Merge branch 'latest' into add-taxonomy
bluegenes Jun 9, 2021
e9f4e46
move threshold arg parsing into cli/utils
bluegenes Jun 11, 2021
3ba3919
Merge branch 'add-taxonomy' of https://github.com/dib-lab/sourmash in…
bluegenes Jun 11, 2021
c796df3
init changes for multiquery input
bluegenes Jun 12, 2021
d50956b
use namedtuple for summarized gather results
bluegenes Jun 13, 2021
57d034e
init update for mult files
bluegenes Jun 13, 2021
6389a9d
adjust for namedtuple output
bluegenes Jun 13, 2021
b184baf
mods for namedtuple
bluegenes Jun 13, 2021
69e5d59
upd utils
bluegenes Jun 14, 2021
d4ee27d
add --from-file to summarize
bluegenes Jun 14, 2021
229c2da
working multifile summarize
bluegenes Jun 14, 2021
599f394
--from-csv to --from-file
bluegenes Jun 14, 2021
6e220f4
somewhat working classify again
bluegenes Jun 14, 2021
45a8dcd
updated classify
bluegenes Jun 14, 2021
25db3cb
finish fixing combine test
bluegenes Jun 14, 2021
3d33c13
make taxonomy_csv required
bluegenes Jun 14, 2021
bdb1628
cleanup
bluegenes Jun 14, 2021
165d750
more cleanup
bluegenes Jun 14, 2021
9bfc1b9
Merge branch 'latest' into add-taxonomy
bluegenes Jun 14, 2021
b79dc8b
use load_pathlist_from_file
bluegenes Jun 14, 2021
a1e5d87
test check_and_load_gather_csvs
bluegenes Jun 14, 2021
4f586b1
properly restrict kwargs with *
bluegenes Jun 14, 2021
0dfaba9
allow lineage summary table output from summarize
bluegenes Jun 14, 2021
c2cd08d
Merge branch 'latest' into add-taxonomy
bluegenes Jun 14, 2021
dce5940
fix merge
ctb Jun 15, 2021
443e122
require rank for krona, lineage summary output formats
bluegenes Jun 15, 2021
ed655b5
add test for lineage summary output with format_lineage
bluegenes Jun 15, 2021
98d3d39
add docstrings
bluegenes Jun 15, 2021
92f796a
punt cami to separate PR
bluegenes Jun 15, 2021
cba072e
raise ValueError on empty gather results
bluegenes Jun 15, 2021
9c5ff01
move notify
bluegenes Jun 15, 2021
13f5f00
cleanup
bluegenes Jun 16, 2021
891d951
remove tax combine; add tax label
bluegenes Jun 16, 2021
3e330b5
verson of load_taxonomy that strictly uses headers
bluegenes Jun 17, 2021
16d5d7f
use new tax fn; enable mult taxonomy inputs
bluegenes Jun 17, 2021
f884010
Merge branch 'latest' into add-taxonomy
bluegenes Jun 17, 2021
f46ab7d
init tax docs
bluegenes Jun 17, 2021
8a41e94
add classification status to classify output
bluegenes Jun 17, 2021
89e448b
add multi db gather test csv
bluegenes Jun 17, 2021
86a53f8
fix typo
bluegenes Jun 17, 2021
a9366bd
whoops, actually fix
bluegenes Jun 17, 2021
7044ae5
handle accession in lineage csv header
bluegenes Jun 17, 2021
24bea0f
fix line width
bluegenes Jun 17, 2021
3cbac9f
return available ranks from load_taxonomy_csv
bluegenes Jun 17, 2021
bee47fd
[MRG] add test to confirm failure when summarizing on empty gather (#…
hehouts Jun 17, 2021
7d41b87
init standardize errs
bluegenes Jun 17, 2021
b0501c2
Merge branch 'add-taxonomy' of https://github.com/sourmash-bio/sourma…
bluegenes Jun 17, 2021
643a62c
add good valueerror for empty lineage csv file
bluegenes Jun 17, 2021
43e3609
better catch errs in __main__; test all cmds: empty gather, lineage f…
bluegenes Jun 17, 2021
4b1ded4
check available ranks, bad gather headers, empty gather, etc
bluegenes Jun 18, 2021
4fdabb7
emit one (and only one) warning per 100% match
bluegenes Jun 18, 2021
0e6df72
Merge branch 'latest' into add-taxonomy
bluegenes Jun 18, 2021
e495d1d
add all functions to __all__ ...is this desired?
bluegenes Jun 18, 2021
ac748dc
Merge branch 'latest' into add-taxonomy
bluegenes Jun 18, 2021
e8f0b68
change cli for better arg parsing; add examples to summarize docs
bluegenes Jun 18, 2021
4c34f3c
Merge branch 'add-taxonomy' of https://github.com/sourmash-bio/sourma…
bluegenes Jun 18, 2021
42093c6
add mixed strain classify example and assoc data
bluegenes Jun 18, 2021
46f2af0
doc formatting
bluegenes Jun 18, 2021
a9cd0ff
more doc formatting
bluegenes Jun 18, 2021
1ee0f0a
more tiny reformatting
bluegenes Jun 18, 2021
d161431
Apply suggestions from code review
bluegenes Jun 20, 2021
2795c8d
add usage info for each subcommand; upd docs
bluegenes Jun 20, 2021
f2d0473
minor cleanup and more informative warning
bluegenes Jun 20, 2021
e677908
add checks for duplicated queries, better output for duplicated filen…
bluegenes Jun 21, 2021
a35bfa9
better catch/print for valueerrors; pyflakes fixes
bluegenes Jun 21, 2021
0dc997e
summary --> csv_summary
bluegenes Jun 21, 2021
c5ccf67
upd docs
bluegenes Jun 21, 2021
27260dc
two-sample lineage summary example
bluegenes Jun 21, 2021
4f33026
rename commands
bluegenes Jun 21, 2021
18294fd
upd function names too
bluegenes Jun 21, 2021
b6ef1f2
Merge branch 'latest' into add-taxonomy
bluegenes Jun 21, 2021
731be75
minor doc upds
bluegenes Jun 21, 2021
2760786
use f_unique_to_query
bluegenes Jun 21, 2021
d68e3d0
add output dir; add file output notifications
bluegenes Jun 22, 2021
d5d7e51
add cols to summary outputs; adjust accordingly
bluegenes Jun 22, 2021
c81cab8
Merge branch 'latest' into add-taxonomy
bluegenes Jun 22, 2021
d5c51d4
trigger GitHub actions
ctb Jun 23, 2021
ab2a4c8
Merge branch 'add-taxonomy' of github.com:dib-lab/sourmash into add-t…
ctb Jun 23, 2021
279 changes: 275 additions & 4 deletions doc/command-line.md
Expand Up @@ -70,16 +70,26 @@ There are seven main subcommands: `sketch`, `compare`, `plot`,
* `prefetch` selects signatures of interest from a very large collection of signatures, for later processing.

There are also a number of commands that work with taxonomic
information; these are grouped under the `sourmash lca`
subcommand. See [the LCA tutorial](tutorials-lca.md) for a
walkthrough of these commands.
information; these are grouped under the `sourmash tax` and
`sourmash lca` subcommands.

`sourmash tax` commands:

* `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
* `tax genome` - summarize single-genome gather results and report most likely classification.
* `tax annotate` - annotate gather results with lineage information (no summarization or classification).

`sourmash lca` commands:

* `lca classify` classifies many signatures against an LCA database.
* `lca summarize` summarizes the content of metagenomes using an LCA database.
* `lca index` creates a database for use with LCA subcommands.
* `lca rankinfo` summarizes the content of a database.
* `lca compare_csv` compares lineage spreadsheets, e.g. those output by `lca classify`.

> See [the LCA tutorial](tutorials-lca.md) for a
walkthrough of some of these commands.

Finally, there are a number of utility and information commands:

* `info` shows version and software information.
Expand Down Expand Up @@ -411,7 +421,268 @@ This combination of commands ensures that the more time- and
memory-intensive `gather` step is run only on a small set of relevant
signatures, rather than all the signatures in the database.

## `sourmash lca` subcommands for taxonomic classification
## `sourmash tax` subcommands for integrating taxonomic information

The sourmash `tax` or `taxonomy` commands integrate taxonomic
information into the results of `sourmash gather`. All `tax` commands
require a properly formatted `taxonomy` csv file that corresponds to
the database used for `gather`. For supported databases (e.g. GTDB, NCBI),
we provide these files, but they can also be generated for user-generated
databases. For more information, see [databases](databases.md).

These commands rely upon the fact that `gather` provides both the total
fraction of the query matched to each database match, as well as a
non-overlapping `f_unique_weighted`, which is the fraction of the query
(weighted by abundance, if tracked) uniquely matched to each reference
genome. The `f_unique_weighted` for any reference match will always be
between 0 (0% of query matched) and 1 (100% of query matched), and for a
query matched to multiple references, the `f_unique_weighted` will sum
to at most 1 (100% of query matched). We use this property to aggregate
gather matches at the desired taxonomic rank. For example, if the gather
results for a metagenome include results for 30 different strains of a
given species, we can sum the fraction uniquely matched to each strain
to obtain the fraction uniquely matched to this species.
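
The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the sourmash implementation: the `RANKS` list, the `summarize_at_rank` helper, and the example fractions are all assumptions made for the sketch.

```python
from collections import defaultdict

# Assumed rank order, taken from the ranks named in this document.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def summarize_at_rank(matches, rank):
    """Sum the unique fraction of every match sharing a lineage up to `rank`.

    `matches` is a list of (lineage, fraction) pairs; each lineage is a
    semicolon-separated string ordered from superkingdom downward.
    """
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for lineage, fraction in matches:
        prefix = ";".join(lineage.split(";")[:depth])
        totals[prefix] += fraction
    return dict(totals)


ECOLI = ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
         "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
         "s__Escherichia coli")
PCOPRI = ("d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;"
          "f__Bacteroidaceae;g__Prevotella;s__Prevotella copri")

# Made-up fractions: two E. coli strains and one P. copri genome.
matches = [(ECOLI, 0.03), (ECOLI, 0.02), (PCOPRI, 0.05)]

species = summarize_at_rank(matches, "species")
kingdom = summarize_at_rank(matches, "superkingdom")
```

Because the per-match fractions do not overlap, the species-level total for the two strains is simply their sum, and the superkingdom-level total is the sum over all three matches.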

As with all reference-based analysis, results can be affected by the
completeness of the reference database. However, summarizing taxonomic
results from `gather` minimizes issues associated with increasing size
and redundancy of reference databases.

For more on how `gather` works and can be used to classify signatures, see
[classifying-signatures](classifying-signatures.html).


### `sourmash tax metagenome` - summarize metagenome content from `gather` results

`sourmash tax metagenome` summarizes gather results for each query by
taxonomic lineage.

Example command to summarize a single `gather` csv, where the query was gathered
against the `gtdb-rs202` representative species database:

```
sourmash tax metagenome \
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
```

There are three possible output formats, `csv_summary`, `lineage_summary`, and
`krona`.

#### `csv_summary` output format

`csv_summary` is the default output format. This outputs a `csv` with lineage
summarization for each taxonomic rank. This output currently consists of four
columns, `query_name,rank,fraction,lineage`, where `fraction` is the fraction
of the query matched to the reported rank and lineage.

example `csv_summary` output from the command above:

```
query_name,rank,fraction,lineage
HSMA33MX,superkingdom,0.131,d__Bacteria
HSMA33MX,phylum,0.073,d__Bacteria;p__Bacteroidota
HSMA33MX,phylum,0.058,d__Bacteria;p__Proteobacteria
.
.
.
HSMA33MX,species,0.058,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;
o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
HSMA33MX,species,0.057,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri
HSMA33MX,species,0.016,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus
```
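
As a small downstream example, the `csv_summary` columns can be read with Python's standard `csv` module. The `rows_at_rank` helper below is hypothetical and only assumes the four column names shown above; the rows are copied from the example output.

```python
import csv
import io


def rows_at_rank(csv_text, rank):
    """Return (lineage, fraction) pairs for the rows reported at `rank`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["lineage"], float(row["fraction"]))
        for row in reader
        if row["rank"] == rank
    ]


example = """query_name,rank,fraction,lineage
HSMA33MX,superkingdom,0.131,d__Bacteria
HSMA33MX,phylum,0.073,d__Bacteria;p__Bacteroidota
HSMA33MX,phylum,0.058,d__Bacteria;p__Proteobacteria
"""

phyla = rows_at_rank(example, "phylum")
phylum_total = sum(fraction for _, fraction in phyla)
```

Here the two phylum-level fractions add up to the superkingdom-level fraction, as expected when every match belongs to one of the two phyla.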

#### `krona` output format

`krona` format is a tab-separated list of these results at a specific rank.
The first column, `fraction` is the fraction of the query matched to the
reported rank and lineage. The remaining columns are `superkingdom`, `phylum`,
... etc down to the rank used for summarization. This output can be used
directly for summary visualization.

To generate `krona`, we add `--output-format krona` to the command above, and
need to specify a rank to summarize. Here's the command for reporting `krona`
summary at `species` level:

```
sourmash tax metagenome \
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```

example krona output from this command:

```
fraction superkingdom phylum class order family genus species
0.05815279361459521 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli
0.05701254275940707 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella Prevotella copri
0.015637726014008795 Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides vulgatus
```
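
To make the layout concrete, here is a rough sketch of how a table in this shape could be written from a dictionary of summarized fractions. The `krona_table` helper, the `RANKS` list, and the sorting by descending fraction are assumptions for illustration, not the code used by `sourmash tax`.

```python
import csv
import io

# Assumed rank order, matching the header of the example output.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def krona_table(summary, rank):
    """Format {lineage: fraction} as tab-separated text, one row per lineage."""
    depth = RANKS.index(rank) + 1
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["fraction"] + RANKS[:depth])
    # Largest fraction first (an arbitrary choice for this sketch).
    for lineage, fraction in sorted(summary.items(), key=lambda kv: -kv[1]):
        names = lineage.split(";")
        names += [""] * (depth - len(names))  # pad lineages that stop early
        writer.writerow([fraction] + names[:depth])
    return out.getvalue()


summary = {
    "Bacteria;Proteobacteria": 0.058,
    "Bacteria;Bacteroidetes": 0.073,
}
table = krona_table(summary, "phylum")
```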

#### `lineage_summary` output format

The lineage summary format is most useful when comparing across metagenome queries.
Each row is a lineage at the desired reporting rank. The columns are each query
used for gather, with the fraction match reported for each lineage. This format
is commonly used as input for many external multi-sample visualization tools.

To generate `lineage_summary`, we add `--output-format lineage_summary` to the `metagenome`
command, and need to specify a rank to summarize. Here's the command for reporting a
`lineage_summary` for two queries (HSMA33MX, PSM6XBW3) at `species` level:

```
sourmash tax metagenome \
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--output-format lineage_summary --rank species
```

example `lineage_summary`:

```
lineage HSMA33MX PSM6XBW3
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus 0.015637726014008795 0.015642822225843248
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri 0.05701254275940707 0.05703112269838684
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli 0.05815279361459521 0.05817174515235457
```
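
The lineage-by-query layout can be sketched as a pivot over per-query summaries. The `lineage_summary_table` helper is hypothetical, and filling a missing lineage with `0.0` is an assumption of this sketch rather than documented behavior.

```python
def lineage_summary_table(per_query):
    """Pivot {query: {lineage: fraction}} into a header and lineage rows."""
    queries = sorted(per_query)
    lineages = sorted({lin for fracs in per_query.values() for lin in fracs})
    header = ["lineage"] + queries
    rows = [
        # Assumption: a lineage absent from a query is reported as 0.0.
        [lin] + [per_query[q].get(lin, 0.0) for q in queries]
        for lin in lineages
    ]
    return header, rows


# Shortened, made-up lineages and fractions for two queries.
per_query = {
    "HSMA33MX": {"g__Escherichia;s__Escherichia coli": 0.058},
    "PSM6XBW3": {
        "g__Escherichia;s__Escherichia coli": 0.058,
        "g__Prevotella;s__Prevotella copri": 0.057,
    },
}
header, rows = lineage_summary_table(per_query)
```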

To produce multiple output types from the same command, add the types into the
`--output-format` argument, e.g. `--output-format csv_summary krona lineage_summary`.


### `sourmash tax genome` - classify a genome using `gather` results

`sourmash tax genome` reports likely classification for each query,
based on `gather` matches. By default, classification requires at least 10% of
the query to be matched. Thus, if 10% of the query was matched to a species, the
species-level classification can be reported. However, if 7% of the query was
matched to one species, and an additional 5% matched to a different species in
the same genus, the genus-level classification will be reported.
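
The threshold rule in this paragraph can be illustrated with a short sketch that walks from the most specific rank upward and stops at the first rank where the best lineage reaches the threshold. The `classify` helper, the `RANKS` list, and the placeholder lineages are assumptions for illustration; the actual strategy is described above as still under testing.

```python
# Assumed rank order, from most general to most specific.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def classify(matches, threshold=0.1):
    """Return (rank, lineage, fraction) at the most specific rank whose best
    lineage reaches `threshold`, or None if no rank qualifies.

    `matches` is a list of (lineage, fraction) pairs with non-overlapping
    fractions and semicolon-separated lineages.
    """
    for depth in range(len(RANKS), 0, -1):
        totals = {}
        for lineage, fraction in matches:
            names = lineage.split(";")
            if len(names) < depth:
                continue  # this match has no name at this rank
            prefix = ";".join(names[:depth])
            totals[prefix] = totals.get(prefix, 0.0) + fraction
        if not totals:
            continue
        best = max(totals, key=totals.get)
        if totals[best] >= threshold:
            return RANKS[depth - 1], best, totals[best]
    return None


# Placeholder lineages: 7% to one species, 5% to another in the same genus.
GENUS = "d__D;p__P;c__C;o__O;f__F;g__G"
matches = [(GENUS + ";s__G one", 0.07), (GENUS + ";s__G two", 0.05)]

rank, lineage, fraction = classify(matches)
```

Neither species reaches 10% on its own, so the sketch falls back to the genus, where the two fractions sum to 12%.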

Optionally, `genome` can instead report classifications at a desired `rank`,
regardless of match threshold (`--rank` argument, e.g. `--rank species`).

Note that these thresholds and strategies are under active testing.

To illustrate the utility of `genome`, let's consider a signature consisting
of two different Shewanella strains, `Shewanella baltica OS185 strain=OS185`
and `Shewanella baltica OS223 strain=OS223`. For simplicity, we gave this query
the name "Sb47+63".

When we gather this signature against the `gtdb-rs202` representatives database,
we see 66% matches to one strain, and 33% to the other:

abbreviated gather_csv:

```
f_match,f_unique_weighted,name,query_name
0.664,0.664,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",Sb47+63
0.656,0.335,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",Sb47+63
```

> Here, `f_match` shows that independently, both strains match ~65% of
this mixed query. The `f_unique_weighted` column has the results of gather-style
decomposition. As the OS223 strain had a slightly higher `f_match` (66%), it
was the first match. The remaining 33% of the query matched to strain OS185.

Here, we use this gather csv to classify our "Sb47+63" mixed-strain query.

Example command to classify this query from the `gather` csv, using
the default classification threshold (0.1).

```
sourmash tax genome \
--gather-csv 47+63_x_gtdb-rs202.gather.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
```

There are two possible output formats, `csv_summary` and `krona`.

#### `csv_summary` output format

`csv_summary` is the default output format. This outputs a `csv` with lineage
summarization for each taxonomic rank. This output currently consists of five
columns, `query_name,status,rank,fraction,lineage`, where `fraction` is the fraction
of the query matched to the reported rank and lineage. The `status` column
provides additional information on the classification. The `status` options are:

- `match` - this query was classified
- `nomatch` - this query could not be classified
- `below_threshold` - this query was classified at the specified rank,
but the query fraction matched was below the containment threshold
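
One way to read these three values is as a function of the matched fraction at the reported rank. The `classification_status` helper below is a hypothetical sketch of that reading, assuming the default threshold of 0.1; it is not taken from the sourmash source.

```python
def classification_status(fraction, threshold=0.1):
    """Map the query fraction matched at the reported rank to a status string."""
    if not fraction:  # None or 0.0: nothing matched at all
        return "nomatch"
    if fraction < threshold:
        return "below_threshold"
    return "match"


statuses = [classification_status(f) for f in (None, 0.0, 0.05, 0.1, 1.0)]
```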

Here is the `csv_summary` output from classifying this mixed-strain Shewanella query to
species level:

```
query_name,status,rank,fraction,lineage
"NC_009665.1 Shewanella baltica OS185, complete genome",match,species,1.000,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Shewanellaceae;g__Shewanella;s__Shewanella baltica
```
>Here, we see that the match percentages to both strains have been aggregated,
and we have 100% species-level `Shewanella baltica` annotation.

#### `krona` output format

`krona` format is a tab-separated list of these results at a specific rank.
The first column, `fraction` is the fraction of the query matched to the
reported rank and lineage. The remaining columns are `superkingdom`, `phylum`,
... etc down to the rank used for summarization. This output can be used
directly for `krona` visualization.

To generate `krona`, we must classify by `--rank` instead of using the
classification threshold. For the command, we add `--output-format krona`
and `--rank <RANK>` to the command above. Here's the command for producing
`krona` output for `species`-level classifications:

```
sourmash tax genome \
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```

Here is the `krona`-formatted output for this command:

```
fraction superkingdom phylum class order family genus species
Sb47+63 1.0 d__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Shewanellaceae g__Shewanella s__Shewanella baltica
```

Note here that specifying `--rank` forces classification by rank rather than
by the containment threshold.

To produce multiple output types from the same command, add the types into the
`--output-format` argument, e.g. `--output-format csv_summary krona`.
**Note that specifying the classification rank with `--rank`,
(e.g. `--rank species`), as needed for `krona` output, forces classification
by `rank` rather than by containment threshold.** If the query
classification at this rank does not meet the containment threshold
(default=0.1), the `status` column will contain `below_threshold`.


### `sourmash tax annotate` - annotates gather output with taxonomy

`sourmash tax annotate` adds a column with taxonomic lineage information
for each database match to gather output. It does not summarize or classify.
Note that this is not required for either `tax metagenome` or `tax genome`.

By default, `annotate` uses the name of each input gather csv to write an updated
version with lineage information. For example, annotating `sample1.gather.csv`
would produce `sample1.gather.with-lineages.csv`.

```
sourmash tax annotate \
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
```
> This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`
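
The naming pattern in the two examples above amounts to inserting `.with-lineages` before the final extension. A minimal sketch, using a hypothetical `with_lineages_name` helper that is consistent with both examples but not taken from the sourmash source:

```python
import os


def with_lineages_name(gather_csv):
    """Insert '.with-lineages' before the final extension of the input name."""
    base, ext = os.path.splitext(gather_csv)
    return base + ".with-lineages" + ext


name_a = with_lineages_name("sample1.gather.csv")
name_b = with_lineages_name("Sb47+63_gather_x_gtdbrs202_k31.csv")
```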


## `sourmash lca` subcommands for in-memory taxonomy integration

These commands use LCA databases (created with `lca index`, below, or
prepared databases such as
Expand Down
1 change: 1 addition & 0 deletions src/sourmash/__init__.py
Expand Up @@ -111,6 +111,7 @@ def search_sbt_index(*args, **kwargs):

from .sbtmh import create_sbt_index
from . import lca
from . import tax
from . import sbt
from . import sbtmh
from . import sbt_storage
Expand Down
2 changes: 2 additions & 0 deletions src/sourmash/cli/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@
from . import sig as signature
from . import sketch
from . import storage
from . import tax


class SourmashParser(ArgumentParser):
Expand Down Expand Up @@ -92,6 +93,7 @@ def parse_args(self, args=None, namespace=None):

def get_parser():
module_descs = {
'tax': 'Integrate taxonomy information based on "gather" results',
'lca': 'Taxonomic operations',
'sketch': 'Create signatures',
'sig': 'Manipulate signature files',
Expand Down
10 changes: 9 additions & 1 deletion src/sourmash/cli/lca/index.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,11 @@ def subparser(subparsers):
)
subparser.add_argument(
'--split-identifiers', action='store_true',
help='split names in signatures on whitspace and period'
help='split names in signatures on whitespace'
)
subparser.add_argument(
'--keep-identifier-versions', action='store_true',
help='do not remove accession versions'
)
subparser.add_argument('-f', '--force', action='store_true')
subparser.add_argument(
Expand All @@ -51,6 +55,10 @@ def subparser(subparsers):
'--require-taxonomy', action='store_true',
help='ignore signatures with no taxonomy entry'
)
subparser.add_argument(
'--fail-on-missing-taxonomy', action='store_true',
help='fail quickly if taxonomy is not available for an identifier',
)

add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
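
Going only by the help strings above, the two identifier options might combine roughly as in the sketch below. The `normalize_identifier` function and its exact behavior are assumptions for illustration; this diff excerpt does not show how `lca index` applies the flags.

```python
def normalize_identifier(name, split_identifiers=False, keep_identifier_versions=False):
    """Hypothetical reading of --split-identifiers and --keep-identifier-versions."""
    ident = name
    if split_identifiers:
        # Keep only the first whitespace-delimited token of the signature name.
        ident = ident.split(None, 1)[0]
    if not keep_identifier_versions:
        # Drop an accession version suffix such as ".1" (assumed behavior).
        ident = ident.split(".", 1)[0]
    return ident


name = "GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1"
short = normalize_identifier(name, split_identifiers=True)
versioned = normalize_identifier(name, split_identifiers=True,
                                 keep_identifier_versions=True)
```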
Expand Down
32 changes: 32 additions & 0 deletions src/sourmash/cli/tax/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
"""Define the command line interface for sourmash tax

The top level CLI is defined in ../__init__.py. This module defines the CLI for
`sourmash tax` operations.
"""

from . import metagenome
from . import genome
from . import annotate
from ..utils import command_list
from argparse import SUPPRESS, RawDescriptionHelpFormatter
import os
import sys


def subparser(subparsers):
    subparser = subparsers.add_parser('tax', formatter_class=RawDescriptionHelpFormatter, usage=SUPPRESS, aliases=['taxonomy'])
    desc = 'Operations\n'
    clidir = os.path.dirname(__file__)
    ops = command_list(clidir)
    for subcmd in ops:
        docstring = getattr(sys.modules[__name__], subcmd).__doc__
        helpstring = 'sourmash tax {op:s} --help'.format(op=subcmd)
        desc += ' {hs:33s} {ds:s}\n'.format(hs=helpstring, ds=docstring)
    s = subparser.add_subparsers(
        title="Integrate taxonomy information based on 'gather' results", dest='subcmd', metavar='subcmd', help=SUPPRESS,
        description=desc
    )
    for subcmd in ops:
        getattr(sys.modules[__name__], subcmd).subparser(s)
    subparser._action_groups.reverse()
    subparser._optionals.title = 'Options'
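
For readers unfamiliar with `aliases` in `add_parser`, the following standalone sketch shows the effect with plain `argparse`. The `demo` program name, the `cmd` destination, and the `--rank` option are placeholders chosen for the example and are not part of the sourmash CLI definition above.

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
subparsers = parser.add_subparsers(dest="cmd")

# Register 'tax' with 'taxonomy' as an alternative spelling.
tax = subparsers.add_parser("tax", aliases=["taxonomy"])
tax.add_argument("--rank")

args_short = parser.parse_args(["tax", "--rank", "species"])
args_long = parser.parse_args(["taxonomy", "--rank", "species"])
```

Both spellings reach the same subparser and options; `argparse` records whichever name was typed in the `dest` attribute.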