use new load_file_as_signatures function more broadly (#1059)
* deploy new load_signatures API a bit more broadly
* pay attention to do_raise properly in load_signatures
* explicitly test new sig rename behavior of failing on non-existent files
* refactor sourmash_args.load_dbs_and_sigs
* refactor traverse into load_database code
* cleanup & simplify
* comment strange code
* replace load_signatures with new load fn in load_query_signatures
* add query signature loading from databases for search, gather
* add to compare
* change sourmash index to use new loading function
* change sourmash lca index to use new loading function
* special case stdin sig loading
* amend lca classify and summarize to use new load_file_as_signatures
* add --from-file to sourmash index
* adjust cli for index to fix tests
* add --query-from-file to lca classify
* add --query-from-file to lca summarize
* add --from-file to lca index
* upgrade multigather with --query-from-file
* add comments to categorize
* add top-level fn load_file_as_index
* more properly test Index.select method on SBTs and LCAs
* make _load_database an internal function, remove error/sys.exit output
* update sig export to allow --md5
* rename some things to make activity of traverse_find_sigs clearer
* fix typo in sourmash_args; add test for index --from-file
* add tests for lca summarize --query-from-file
* add tests for lca classify --query-from-file
* rename test functions to remove traverse
* trap certain errors
* reset traverse_yield_all to false by default
* add tests for --from-file and --force to lca index
* test & fix --traverse-directories for lca summarize
* fix index --traverse-dir -f
* add multigather tests for --query-from-file
* add --from-file to compare + tests
* add test for compare --traverse-dir -f
* fix and test bad ksize for lca db load
* check failed gather load
* add explicit test for no-such-file in load_file_as_index
* add tests for sourmash sig describe on SBT, LCA, and dir
* add a test for sig describe on stdin
* remove errant comment
* add load_file_as_signatures at top-level sourmash
* whups, remove leftover test setup stuff
* fix IOError vs OSError by choosing OSError
* fix quotes expected in error message
* detect file-like fp with 'read' as well
* update command line docs
* update documentation a bit
* improve docstrings, defaults
* make sure md5 selector is unique in collection
ctb authored Jul 2, 2020
1 parent 760daf6 commit 63a07db
Showing 27 changed files with 1,114 additions and 212 deletions.
14 changes: 6 additions & 8 deletions doc/api-example.md
@@ -422,24 +422,24 @@ checks.)
Now, save the tree:

```
>>> filename = tree.save(tempdir + '/test.sbt.json')
>>> filename = tree.save(tempdir + '/test.sbt.zip')
```

### Loading and search SBTs
### Loading and searching SBTs

How do we load the SBT and search it with a DNA sequence,
from within Python?

The SBT filename is `test.sbt.json`, as above:
The SBT filename is `test.sbt.zip`, as above:
```
>>> SBT_filename = tempdir + '/test.sbt.json'
>>> SBT_filename = tempdir + '/test.sbt.zip'
```

and with it we can load the SBT:
```
>>> tree = sourmash.load_sbt_index(SBT_filename)
>>> tree = sourmash.load_file_as_index(SBT_filename)
```

@@ -465,9 +465,7 @@ and create a scaled signature:
Now do a search --

```
>>> threshold = 0.1
>>> for found_sig, similarity in sourmash.search_sbt_index(tree, query_sig, threshold):
>>> for similarity, found_sig, filename in tree.search(query_sig, threshold=0.1):
... print(query_sig.name())
... print(found_sig.name())
... print(similarity)
27 changes: 26 additions & 1 deletion doc/command-line.md
@@ -86,6 +86,12 @@ Finally, there are a number of utility and information commands:
* `categorize` is an experimental command to categorize many signatures.
* `watch` is an experimental command to classify a stream of sequencing data.

Please use the command line option `--help` to get more detailed usage
information for each command.

Note that as of sourmash v3.4, most commands will load signatures from
indexed databases (the SBT and LCA formats) as well as from signature files.

### `sourmash compute`

The `compute` subcommand computes and saves signatures for
@@ -118,7 +124,7 @@ Optional arguments:
### `sourmash compare`


The `compare` subcommand compares one or more signature files
The `compare` subcommand compares one or more signatures
(created with `compute`) using estimated [Jaccard index][3] or
(if signatures are computed with `--track-abundance`) the [angular
similarity](https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity).
@@ -142,6 +148,8 @@ Options:
--ksize -- do the comparisons at this k-mer size.
--containment -- compute containment instead of similarity.
C(i, j) = size(i intersection j) / size(i).
--from-file -- append the list of files in this text file to the input
signatures
```

**Note:** compare by default produces a symmetric similarity matrix that can be used as an input to clustering. With `--containment`, however, this matrix is no longer symmetric and cannot formally be used for clustering.
@@ -316,6 +324,11 @@ species level assignments would not be reported.
(This is the approach that Kraken and other lowest common ancestor
implementations use, we believe.)

Note: you can specify a list of files to load signatures from in a
text file passed to `sourmash lca classify` with the
`--query-from-file` flag; these files will be appended to the `--query`
input.

### `sourmash lca summarize`

`sourmash lca summarize` produces a Kraken-style summary of the
@@ -416,6 +429,11 @@ genome is present only once; when weighted by abundance, the Bacterial genome
is only 41.8% of the metagenome content, while the Archaeal genome is
58.1% of the metagenome content.

Note: you can specify a list of files to load signatures from in a
text file passed to `sourmash lca summarize` with the
`--query-from-file` flag; these files will be appended to the `--query`
input.

### `sourmash lca gather`

The `sourmash lca gather` command finds all non-overlapping
@@ -466,6 +484,9 @@ genomes (or building off of NCBI taxonomies more generally), please
see
[the NCBI lineage repository](https://github.com/dib-lab/2018-ncbi-lineages).

You can use `--from-file` to pass `lca index` a text file containing a
list of files to index.

### `sourmash lca rankinfo`

The `sourmash lca rankinfo` command displays k-mer specificity
@@ -508,6 +529,10 @@ such as `search`, `gather`, and `compare`.

Note, you can use `sourmash sig` as shorthand for all of these commands.

Most commands will load signatures automatically from indexed databases
(SBT and LCA formats) as well as from signature files, and you can load
signatures from stdin using `-` on the command line.
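As an illustration of the `-` convention described above, here is a minimal sketch; it is not sourmash's own code. The helper name `open_signature_source` and its return shape are hypothetical. The only behaviours taken from this commit are that `-` means stdin, that file-like inputs are recognised by a `read` attribute, and that missing files surface as `OSError`.

```python
import sys


def open_signature_source(source):
    """Return (file_object, should_close) for a path, '-', or a file-like object.

    Hypothetical helper for illustration only; not part of sourmash.
    """
    if hasattr(source, 'read'):
        # Already file-like: detected by the presence of a 'read' attribute.
        return source, False
    if source == '-':
        # '-' on the command line means "read from standard input".
        return sys.stdin, False
    # Otherwise treat it as a filesystem path. A missing file raises
    # FileNotFoundError, which is a subclass of OSError.
    return open(source, 'rt'), True
```

The `should_close` flag lets the caller avoid closing stdin or a handle it does not own.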

### `sourmash signature cat`

Concatenate signature files.
2 changes: 2 additions & 0 deletions sourmash/__init__.py
@@ -46,3 +46,5 @@
from . import sig
from . import cli
from . import commands
from .sourmash_args import load_file_as_index
from .sourmash_args import load_file_as_signatures
7 changes: 6 additions & 1 deletion sourmash/cli/compare.py
@@ -6,7 +6,8 @@
def subparser(subparsers):
subparser = subparsers.add_parser('compare')
subparser.add_argument(
'signatures', nargs='+', help='list of signatures to compare'
'signatures', nargs='*', help='list of signatures to compare',
default=[]
)
subparser.add_argument(
'-q', '--quiet', action='store_true', help='suppress non-error output'
@@ -30,6 +31,10 @@ def subparser(subparsers):
'--traverse-directory', action='store_true',
help='compare all signatures underneath directories'
)
subparser.add_argument(
'--from-file',
help='a file containing a list of signatures file to compare'
)
subparser.add_argument(
'-f', '--force', action='store_true',
help='continue past errors in file loading'
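Relaxing the positional `signatures` argument from `nargs='+'` to `nargs='*'` means argparse no longer rejects an empty list, so `--from-file` can be the only source of inputs. The sketch below shows one way the two sources could be combined. It is an assumption-laden illustration, not the code in `sourmash/commands.py`: the function names, the one-path-per-line file format, and the `ValueError` are all hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='compare-sketch')
    # nargs='*' allows zero positional signatures; default=[] keeps the
    # attribute a list in that case.
    parser.add_argument('signatures', nargs='*', default=[])
    parser.add_argument('--from-file')
    return parser


def gather_inputs(args):
    """Combine positional signature paths with those listed in --from-file."""
    inputs = list(args.signatures)
    if args.from_file:
        # Assumption: the text file holds one path per line; blanks are skipped.
        with open(args.from_file, 'rt') as fp:
            inputs.extend(line.strip() for line in fp if line.strip())
    if not inputs:
        # argparse no longer enforces "at least one", so check explicitly.
        raise ValueError('no signatures provided')
    return inputs
```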
4 changes: 4 additions & 0 deletions sourmash/cli/gather.py
@@ -51,6 +51,10 @@ def subparser(subparsers):
'--ignore-abundance', action='store_true',
help='do NOT use k-mer abundances if present'
)
subparser.add_argument(
'--md5', default=None,
help='select the signature with this md5 as query'
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)

4 changes: 4 additions & 0 deletions sourmash/cli/index.py
@@ -37,6 +37,10 @@ def subparser(subparsers):
'signatures', nargs='+',
help='signatures to load into SBT'
)
subparser.add_argument(
'--from-file',
help='a file containing a list of signatures file to load'
)
subparser.add_argument(
'-q', '--quiet', action='store_true',
help='suppress non-error output'
8 changes: 6 additions & 2 deletions sourmash/cli/lca/classify.py
@@ -3,8 +3,12 @@

def subparser(subparsers):
subparser = subparsers.add_parser('classify')
subparser.add_argument('--db', nargs='+', action='append')
subparser.add_argument('--query', nargs='+', action='append')
subparser.add_argument('--db', nargs='+', action='append',
help='databases to use to classify')
subparser.add_argument('--query', nargs='*', default=[], action='append',
help='query signatures to classify')
subparser.add_argument('--query-from-file',
help='file containing list of signature files to query')
subparser.add_argument('--threshold', metavar='T', type=int, default=5)
subparser.add_argument(
'-q', '--quiet', action='store_true',
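A note on the `--query` definition in this hunk: combining `nargs='*'` with `action='append'` makes argparse produce a list of lists, one inner list per `--query` occurrence, so the consuming code has to flatten it. The sketch below demonstrates that standard argparse behaviour. The `flatten_queries` helper is hypothetical and is not claimed to match how sourmash does the flattening.

```python
import argparse
from itertools import chain

parser = argparse.ArgumentParser(prog='classify-sketch')
# Same shape as the --query argument in the hunk above.
parser.add_argument('--query', nargs='*', default=[], action='append')


def flatten_queries(nested):
    """Flatten [['a.sig', 'b.sig'], ['c.sig']] into ['a.sig', 'b.sig', 'c.sig']."""
    return list(chain.from_iterable(nested))


args = parser.parse_args(['--query', 'a.sig', 'b.sig', '--query', 'c.sig'])
print(args.query)                   # one inner list per --query occurrence
print(flatten_queries(args.query))  # a single flat list of paths
```

With no `--query` at all, `args.query` is the empty default list and flattening yields an empty list, which is what lets `--query-from-file` supply every query.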
4 changes: 4 additions & 0 deletions sourmash/cli/lca/index.py
@@ -11,6 +11,10 @@ def subparser(subparsers):
'signatures', nargs='+',
help='one or more sourmash signatures'
)
subparser.add_argument(
'--from-file',
help='a file containing a list of signatures file to load'
)
subparser.add_argument(
'--scaled', metavar='S', default=10000, type=float
)
4 changes: 3 additions & 1 deletion sourmash/cli/lca/summarize.py
@@ -5,8 +5,10 @@ def subparser(subparsers):
subparser = subparsers.add_parser('summarize')
subparser.add_argument('--db', nargs='+', action='append',
help='one or more LCA databases to use')
subparser.add_argument('--query', nargs='+', action='append',
subparser.add_argument('--query', nargs='*', default=[], action='append',
help='one or more signature files to use as queries')
subparser.add_argument('--query-from-file',
help='file containing list of signature files to query')
subparser.add_argument('--threshold', metavar='T', type=int, default=5,
help='minimum number of hashes to require for a match')
subparser.add_argument(
6 changes: 5 additions & 1 deletion sourmash/cli/multigather.py
@@ -6,9 +6,13 @@
def subparser(subparsers):
subparser = subparsers.add_parser('multigather')
subparser.add_argument(
'--query', nargs='+', action='append',
'--query', nargs='*', default=[], action='append',
help='query signature'
)
subparser.add_argument(
'--query-from-file',
help='file containing list of signature files to query'
)
subparser.add_argument(
'--db', nargs='+', action='append',
help='signatures/SBTs to search',
4 changes: 4 additions & 0 deletions sourmash/cli/search.py
@@ -53,6 +53,10 @@ def subparser(subparsers):
'-o', '--output', metavar='FILE',
help='output CSV containing matches to this file'
)
subparser.add_argument(
'--md5', default=None,
help='select the signature with this md5 as query'
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
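The new `--md5` option selects a single query signature, and the commit message notes that the selector must be unique within the collection. A minimal sketch of that uniqueness check follows. It is hypothetical throughout: signatures are modelled as plain `(md5, payload)` tuples rather than sourmash objects, and matching on an md5 prefix (rather than the full digest) is an assumption, not something this diff confirms.

```python
def select_by_md5(signatures, md5_selector):
    """Return the single (md5, payload) pair matching md5_selector.

    Illustrative only: 'signatures' is an iterable of (md5, payload) tuples,
    and prefix matching is an assumption.
    """
    selector = md5_selector.lower()
    matches = [(md5, payload) for md5, payload in signatures
               if md5.lower().startswith(selector)]
    if not matches:
        raise ValueError(f"no signature matches md5 '{md5_selector}'")
    if len(matches) > 1:
        # Refuse to guess when the selector is ambiguous.
        raise ValueError(f"md5 '{md5_selector}' matches {len(matches)} signatures")
    return matches[0]
```

Raising on ambiguity rather than returning the first hit keeps the choice of query explicit.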

4 changes: 4 additions & 0 deletions sourmash/cli/sig/export.py
@@ -16,6 +16,10 @@ def subparser(subparsers):
'-o', '--output', metavar='FILE',
help='output signature to this file (default stdout)'
)
subparser.add_argument(
'--md5', default=None,
help='select the signature with this md5 as query'
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
