use new load_file_as_signatures function more broadly #1059

Merged
merged 57 commits into from
Jul 2, 2020
Changes from 55 commits
Commits
7757953
deploy new load_signatures API a bit more broadly
ctb Jun 28, 2020
633964a
pay attention to do_raise properly in load_signatures
ctb Jun 28, 2020
7e53388
explicitly test new sig rename behavior of failing on non-existent files
ctb Jun 28, 2020
bdf7663
refactor sourmash_args.load_dbs_and_sigs
ctb Jun 28, 2020
ff700f8
refactor traverse into load_database code
ctb Jun 28, 2020
3cfe79d
cleanup & simplify
ctb Jun 28, 2020
019713f
comment strange code
ctb Jun 28, 2020
9828650
replace load_signatures with new load fn in load_query_signatures
ctb Jun 28, 2020
4d27af7
add query signature loading from databases for search, gather
ctb Jun 28, 2020
a00d6f0
add to compare
ctb Jun 28, 2020
7ab25c1
change sourmash index to use new loading function
ctb Jun 28, 2020
93f8b4e
change sourmash lca index to use new loading function
ctb Jun 28, 2020
0c416ab
special case stdin sig loading
ctb Jun 28, 2020
d4ae150
amend lca classify and summarize to use new load_file_as_signatures
ctb Jun 28, 2020
22504e0
add --from-file to sourmash index
ctb Jun 29, 2020
e27ce1a
adjust cli for index to fix tests
ctb Jun 29, 2020
af6f135
add --query-from-file to lca classify
ctb Jun 29, 2020
6af833f
add --query-from-file to lca summarize
ctb Jun 29, 2020
9a0345f
add --from-file to lca index
ctb Jun 29, 2020
bfa137f
upgrade multigather with --query-from-file
ctb Jun 29, 2020
24dff3f
add comments to categorize
ctb Jun 29, 2020
3d7ce9b
add top-level fn load_file_as_index
ctb Jun 29, 2020
2b656f5
more properly test Index.select method on SBTs and LCAs
ctb Jun 29, 2020
d167a38
Merge branch 'master' of github.com:dib-lab/sourmash into update_load…
ctb Jun 29, 2020
4fad937
make _load_database an internal function, remove error/sys.exit output
ctb Jun 29, 2020
ed4c039
update sig export to allow --md5
ctb Jun 29, 2020
88af7b6
rename some things to make activity of traverse_find_sigs clearer
ctb Jun 29, 2020
e30e785
fix typo in sourmash_args; add test for index --from-file
ctb Jun 29, 2020
d2ea905
add tests for lca summarize --query-from-file
ctb Jun 30, 2020
04d8795
add tests for lca classify --query-from-file
ctb Jun 30, 2020
5be0b3b
rename test functions to remove traverse
ctb Jun 30, 2020
7f1fe50
trap certain errors
ctb Jun 30, 2020
5c089f0
reset traverse_yield_all to false by default
ctb Jun 30, 2020
57cc248
add tests for --from-file and --force to lca index
ctb Jun 30, 2020
2a3ceb4
test & fix --traverse-directories for lca summarize
ctb Jun 30, 2020
7910269
fix index --traverse-dir -f
ctb Jun 30, 2020
fc5ca79
add multigather tests for --query-from-file
ctb Jun 30, 2020
a883f31
add --from-file to compare + tests
ctb Jun 30, 2020
1a4e69c
add test for compare --traverse-dir -f
ctb Jun 30, 2020
520db89
fix and test bad ksize for lca db load
ctb Jun 30, 2020
16fe798
check failed gather load
ctb Jun 30, 2020
5ef3e86
add explicit test for no-such-file in load_file_as_index
ctb Jun 30, 2020
4fcdc32
add tests for sourmash sig describe on SBT, LCA, and dir
ctb Jun 30, 2020
052b4f1
add a test for sig describe on stdin
ctb Jun 30, 2020
670f68f
remove errant comment
ctb Jun 30, 2020
8b0c39d
add load_file_as_signatures at top-level sourmash
ctb Jun 30, 2020
9e24c6a
whups, remove leftover test setup stuff
ctb Jun 30, 2020
74128e6
fix IOError vs OSError by choosing OSError
ctb Jun 30, 2020
c200c63
fix quotes expected in error message
ctb Jun 30, 2020
659727a
detect file-like fp with 'read' as well
ctb Jun 30, 2020
3b8ef34
update command line docs
ctb Jun 30, 2020
f0e8d76
update documentation a bit
ctb Jul 1, 2020
56ed85f
Merge branch 'master' into update_load_sigs
ctb Jul 1, 2020
49fbe4c
improve docstrings, defaults
ctb Jul 1, 2020
a724430
Merge branch 'update_load_sigs' of github.com:dib-lab/sourmash into u…
ctb Jul 1, 2020
11e519a
make sure md5 selector is unique in collection
ctb Jul 1, 2020
8e565ba
Merge branch 'master' into update_load_sigs
luizirber Jul 2, 2020
14 changes: 6 additions & 8 deletions doc/api-example.md
@@ -422,24 +422,24 @@ checks.)
 Now, save the tree:

 ```
->>> filename = tree.save(tempdir + '/test.sbt.json')
+>>> filename = tree.save(tempdir + '/test.sbt.zip')

 ```

-### Loading and search SBTs
+### Loading and searching SBTs

 How do we load the SBT and search it with a DNA sequence,
 from within Python?

-The SBT filename is `test.sbt.json`, as above:
+The SBT filename is `test.sbt.zip`, as above:
 ```
->>> SBT_filename = tempdir + '/test.sbt.json'
+>>> SBT_filename = tempdir + '/test.sbt.zip'

 ```

 and with it we can load the SBT:
 ```
->>> tree = sourmash.load_sbt_index(SBT_filename)
+>>> tree = sourmash.load_file_as_index(SBT_filename)

 ```

@@ -465,9 +465,7 @@ and create a scaled signature:
 Now do a search --

 ```
->>> threshold = 0.1
-
->>> for found_sig, similarity in sourmash.search_sbt_index(tree, query_sig, threshold):
+>>> for similarity, found_sig, filename in tree.search(query_sig, threshold=0.1):
 ...     print(query_sig.name())
 ...     print(found_sig.name())
 ...     print(similarity)
27 changes: 26 additions & 1 deletion doc/command-line.md
@@ -86,6 +86,12 @@ Finally, there are a number of utility and information commands:
 * `categorize` is an experimental command to categorize many signatures.
 * `watch` is an experimental command to classify a stream of sequencing data.

+Please use the command line option `--help` to get more detailed usage
+information for each command.
+
+Note that as of sourmash v3.4, most commands will load signatures from
+indexed databases (the SBT and LCA formats) as well as from signature files.
+
 ### `sourmash compute`

 The `compute` subcommand computes and saves signatures for
@@ -118,7 +124,7 @@ Optional arguments:
 ### `sourmash compare`


-The `compare` subcommand compares one or more signature files
+The `compare` subcommand compares one or more signatures
 (created with `compute`) using estimated [Jaccard index][3] or
 (if signatures are computed with `--track-abundance`) the [angular
 similarity](https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity).
@@ -142,6 +148,8 @@ Options:
   --ksize -- do the comparisons at this k-mer size.
   --containment -- compute containment instead of similarity.
        C(i, j) = size(i intersection j) / size(i).
+  --from-file -- append the list of files in this text file to the input
+       signatures
 ```

 **Note:** compare by default produces a symmetric similarity matrix that can be used as an input to clustering. With `--containment`, however, this matrix is no longer symmetric and cannot formally be used for clustering.
@@ -316,6 +324,11 @@ species level assignments would not be reported.
 (This is the approach that Kraken and other lowest common ancestor
 implementations use, we believe.)

+Note: you can specify a list of files to load signatures from in a
+text file passed to `sourmash lca classify` with the
+`--query-from-file` flag; these files will be appended to the `--query`
+input.
+
 ### `sourmash lca summarize`

 `sourmash lca summarize` produces a Kraken-style summary of the
@@ -416,6 +429,11 @@ genome is present only once; when weighted by abundance, the Bacterial genome
 is only 41.8% of the metagenome content, while the Archaeal genome is
 58.1% of the metagenome content.

+Note: you can specify a list of files to load signatures from in a
+text file passed to `sourmash lca summarize` with the
+`--query-from-file` flag; these files will be appended to the `--query`
+input.
+
 ### `sourmash lca gather`

 The `sourmash lca gather` command finds all non-overlapping
@@ -466,6 +484,9 @@ genomes (or building off of NCBI taxonomies more generally), please
 see
 [the NCBI lineage repository](https://github.com/dib-lab/2018-ncbi-lineages).

+You can use `--from-file` to pass `lca index` a text file containing a
+list of files to index.
+
 ### `sourmash lca rankinfo`

 The `sourmash lca rankinfo` command displays k-mer specificity
@@ -508,6 +529,10 @@ such as `search`, `gather`, and `compare`.

 Note, you can use `sourmash sig` as shorthand for all of these commands.

+Most commands will load signatures automatically from indexed databases
+(SBT and LCA formats) as well as from signature files, and you can load
+signatures from stdin using `-` on the command line.
+
 ### `sourmash signature cat`

 Concatenate signature files.
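The `--from-file` and `--query-from-file` notes in `doc/command-line.md` say that the paths listed in a text file are appended to the inputs given on the command line. A minimal sketch of that append step follows. It is an illustration only, not sourmash's implementation: the helper name `extend_with_file_list` is hypothetical, and the one-path-per-line format with blank lines skipped is an assumption.

```python
def extend_with_file_list(paths, list_file):
    """Return `paths` plus every non-blank line of `list_file`.

    Hypothetical helper; the file format (one path per line) is assumed.
    """
    combined = list(paths)  # copy so the caller's list is not modified
    with open(list_file, "rt") as fp:
        for line in fp:
            line = line.strip()
            if line:  # assumption: blank lines are ignored
                combined.append(line)
    return combined
```

Command-line inputs come first and the listed files follow, which matches the "appended to the `--query` input" wording in the documentation.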
2 changes: 2 additions & 0 deletions sourmash/__init__.py
@@ -46,3 +46,5 @@
 from . import sig
 from . import cli
 from . import commands
+from .sourmash_args import load_file_as_index
+from .sourmash_args import load_file_as_signatures
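The documentation changes in this PR state that most commands now load signatures from indexed databases as well as signature files, and from stdin via `-`. The sketch below shows one general way to structure such "try each format in turn" loading. It does not reproduce `load_file_as_signatures`: every function name, the two toy formats (a JSON list and plain lines), and the fallback order are assumptions made for illustration.

```python
import json
import sys


def read_text(path, stdin=None):
    """Read all text from `path`; '-' means standard input."""
    if path == "-":
        stream = stdin if stdin is not None else sys.stdin
        return stream.read()
    with open(path, "rt") as fp:
        return fp.read()


def parse_json_list(text):
    """Hypothetical loader: accept only a JSON list."""
    data = json.loads(text)  # raises ValueError (JSONDecodeError) on bad input
    if not isinstance(data, list):
        raise ValueError("expected a JSON list")
    return data


def parse_plain_lines(text):
    """Hypothetical fallback loader: one record per non-blank line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Order matters: stricter formats are tried before permissive ones.
LOADERS = [("json", parse_json_list), ("lines", parse_plain_lines)]


def load_any(path, loaders=LOADERS, stdin=None):
    """Return (format_name, records) from the first loader that succeeds."""
    text = read_text(path, stdin)
    for name, loader in loaders:
        try:
            return name, loader(text)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"no loader could read {path!r}")
```

Passing a `stdin` stream explicitly keeps the function testable without touching the real standard input.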
7 changes: 6 additions & 1 deletion sourmash/cli/compare.py
@@ -6,7 +6,8 @@
 def subparser(subparsers):
     subparser = subparsers.add_parser('compare')
     subparser.add_argument(
-        'signatures', nargs='+', help='list of signatures to compare'
+        'signatures', nargs='*', help='list of signatures to compare',
+        default=[]
     )
     subparser.add_argument(
         '-q', '--quiet', action='store_true', help='suppress non-error output'
@@ -30,6 +31,10 @@ def subparser(subparsers):
         '--traverse-directory', action='store_true',
         help='compare all signatures underneath directories'
     )
+    subparser.add_argument(
+        '--from-file',
+        help='a file containing a list of signatures file to compare'
+    )
     subparser.add_argument(
         '-f', '--force', action='store_true',
         help='continue past errors in file loading'
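The change from `nargs='+'` to `nargs='*'` with `default=[]` in `sourmash/cli/compare.py` matters once `--from-file` exists: a positional declared with `nargs='+'` makes argparse reject an invocation that supplies no positional signatures at all. The standalone sketch below reproduces only the two arguments involved. The parser name `compare-sketch` and the file names are placeholders, not part of sourmash.

```python
import argparse


def build_parser():
    """Standalone parser mirroring the two arguments touched by this diff."""
    parser = argparse.ArgumentParser(prog="compare-sketch")
    parser.add_argument("signatures", nargs="*", default=[],
                        help="list of signatures to compare")
    parser.add_argument("--from-file",
                        help="a file containing a list of signature files")
    return parser


parser = build_parser()

# Only a list file: the positional falls back to its default, an empty list.
only_list = parser.parse_args(["--from-file", "siglist.txt"])

# Positionals and a list file can be combined.
both = parser.parse_args(["a.sig", "b.sig", "--from-file", "siglist.txt"])
```

With `nargs='*'`, validating that at least one signature was supplied overall becomes the caller's job rather than argparse's.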
4 changes: 4 additions & 0 deletions sourmash/cli/gather.py
@@ -51,6 +51,10 @@ def subparser(subparsers):
         '--ignore-abundance', action='store_true',
         help='do NOT use k-mer abundances if present'
     )
+    subparser.add_argument(
+        '--md5', default=None,
+        help='select the signature with this md5 as query'
+    )
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)

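The new `--md5` option selects a single query signature, and one commit in this PR is titled "make sure md5 selector is unique in collection". A minimal sketch of a unique selection of that kind is shown below. The `Sketch` class and `select_by_md5` function are hypothetical stand-ins, and matching on a case-insensitive md5 prefix is an assumption; this diff does not show how sourmash performs the match.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Sketch:
    """Hypothetical stand-in for a signature object."""
    name: str
    payload: bytes

    def md5sum(self):
        return hashlib.md5(self.payload).hexdigest()


def select_by_md5(sketches, md5_prefix):
    """Return the single sketch whose md5 starts with `md5_prefix`.

    Raises ValueError when nothing matches or when the prefix is ambiguous.
    """
    prefix = md5_prefix.lower()  # assumption: matching ignores case
    matches = [s for s in sketches if s.md5sum().startswith(prefix)]
    if not matches:
        raise ValueError(f"no signature matches md5 {md5_prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"md5 {md5_prefix!r} matches {len(matches)} signatures")
    return matches[0]
```

Failing loudly on an ambiguous selector avoids silently running a search with the wrong query.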
4 changes: 4 additions & 0 deletions sourmash/cli/index.py
@@ -37,6 +37,10 @@ def subparser(subparsers):
         'signatures', nargs='+',
         help='signatures to load into SBT'
     )
+    subparser.add_argument(
+        '--from-file',
+        help='a file containing a list of signatures file to load'
+    )
     subparser.add_argument(
         '-q', '--quiet', action='store_true',
         help='suppress non-error output'
8 changes: 6 additions & 2 deletions sourmash/cli/lca/classify.py
@@ -3,8 +3,12 @@

 def subparser(subparsers):
     subparser = subparsers.add_parser('classify')
-    subparser.add_argument('--db', nargs='+', action='append')
-    subparser.add_argument('--query', nargs='+', action='append')
+    subparser.add_argument('--db', nargs='+', action='append',
+                           help='databases to use to classify')
+    subparser.add_argument('--query', nargs='*', default=[], action='append',
+                           help='query signatures to classify')
+    subparser.add_argument('--query-from-file',
+                           help='file containing list of signature files to query')
     subparser.add_argument('--threshold', metavar='T', type=int, default=5)
     subparser.add_argument(
         '-q', '--quiet', action='store_true',
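Because `--query` combines `nargs='*'` with `action='append'`, argparse collects one list per occurrence of the flag, so the parsed value is a list of lists that has to be flattened before use. The sketch below demonstrates this with a standalone parser. The `flatten` helper and the file names are hypothetical, and how sourmash itself flattens the value is not shown in this diff.

```python
import argparse
from itertools import chain

parser = argparse.ArgumentParser(prog="classify-sketch")
parser.add_argument("--query", nargs="*", default=[], action="append",
                    help="query signatures to classify")
parser.add_argument("--query-from-file",
                    help="file containing list of signature files to query")


def flatten(groups):
    """Collapse argparse's per-flag lists into one flat list (hypothetical helper)."""
    return list(chain.from_iterable(groups))


args = parser.parse_args(["--query", "a.sig", "b.sig", "--query", "c.sig"])
grouped = args.query        # one inner list per --query occurrence
queries = flatten(grouped)  # a single flat list of paths

# With no --query at all, the default empty list keeps later code simple.
no_query = parser.parse_args(["--query-from-file", "list.txt"]).query
```

The `default=[]` means code that iterates over `args.query` does not need a separate `None` check when only `--query-from-file` is given.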
4 changes: 4 additions & 0 deletions sourmash/cli/lca/index.py
@@ -11,6 +11,10 @@ def subparser(subparsers):
         'signatures', nargs='+',
         help='one or more sourmash signatures'
     )
+    subparser.add_argument(
+        '--from-file',
+        help='a file containing a list of signatures file to load'
+    )
     subparser.add_argument(
         '--scaled', metavar='S', default=10000, type=float
     )
4 changes: 3 additions & 1 deletion sourmash/cli/lca/summarize.py
@@ -5,8 +5,10 @@ def subparser(subparsers):
     subparser = subparsers.add_parser('summarize')
     subparser.add_argument('--db', nargs='+', action='append',
                            help='one or more LCA databases to use')
-    subparser.add_argument('--query', nargs='+', action='append',
+    subparser.add_argument('--query', nargs='*', default=[], action='append',
                            help='one or more signature files to use as queries')
+    subparser.add_argument('--query-from-file',
+                           help='file containing list of signature files to query')
     subparser.add_argument('--threshold', metavar='T', type=int, default=5,
                            help='minimum number of hashes to require for a match')
     subparser.add_argument(
6 changes: 5 additions & 1 deletion sourmash/cli/multigather.py
@@ -6,9 +6,13 @@
 def subparser(subparsers):
     subparser = subparsers.add_parser('multigather')
     subparser.add_argument(
-        '--query', nargs='+', action='append',
+        '--query', nargs='*', default=[], action='append',
         help='query signature'
     )
+    subparser.add_argument(
+        '--query-from-file',
+        help='file containing list of signature files to query'
+    )
     subparser.add_argument(
         '--db', nargs='+', action='append',
         help='signatures/SBTs to search',
4 changes: 4 additions & 0 deletions sourmash/cli/search.py
@@ -53,6 +53,10 @@ def subparser(subparsers):
         '-o', '--output', metavar='FILE',
         help='output CSV containing matches to this file'
     )
+    subparser.add_argument(
+        '--md5', default=None,
+        help='select the signature with this md5 as query'
+    )
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)

4 changes: 4 additions & 0 deletions sourmash/cli/sig/export.py
@@ -16,6 +16,10 @@ def subparser(subparsers):
         '-o', '--output', metavar='FILE',
         help='output signature to this file (default stdout)'
     )
+    subparser.add_argument(
+        '--md5', default=None,
+        help='select the signature with this md5 as query'
+    )
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)
