[MRG] Adjust Index.find search protocol to support selective collection of matches #1477

Merged on Apr 23, 2021 (121 commits)

Commits
4c09f5b
begin refactoring 'categorize'
ctb Mar 12, 2021
af6fd84
have the 'find' function for SBTs return signatures
ctb Mar 12, 2021
8a92936
fix majority of tests
ctb Mar 12, 2021
c4adabf
Merge branch 'latest' of github.com:dib-lab/sourmash into fix/sbt_find
ctb Mar 12, 2021
cdb4159
comment & then fix test
ctb Mar 12, 2021
a414624
torture the tests into working
ctb Mar 12, 2021
6f7d368
split find and _find_nodes to take different kinds of functions
ctb Mar 13, 2021
7b2f624
Merge branch 'fix/sbt_find' into refactor/categorize
ctb Mar 13, 2021
b5ab6d7
redo 'find' on index
ctb Mar 13, 2021
ed7d52b
refactor lca_db to use new find
ctb Mar 13, 2021
aec730e
refactor SBT to use new find
ctb Mar 13, 2021
590b3d6
comment/cleanup
ctb Mar 13, 2021
eb7d661
refactor out common code
ctb Mar 13, 2021
0639c3e
fix up gather
ctb Mar 13, 2021
a65c79b
use 'passes' properly
ctb Mar 13, 2021
02794ee
attempted cleanup
ctb Mar 13, 2021
f94e909
minor fixes
ctb Mar 13, 2021
c3a65ac
get a start on correct downsampling
ctb Mar 13, 2021
9054cb8
adjust tree downsampling for regular minhashes, too
ctb Mar 13, 2021
db740ec
remove now-unused search functions in sbtmh
ctb Mar 13, 2021
03a5e60
refactor categorize to use new find
ctb Mar 13, 2021
b3718dd
cleanup and removal
ctb Mar 13, 2021
e8e4702
remove redundant code in lca_db
ctb Mar 13, 2021
b40963c
remove redundant code in SBT
ctb Mar 13, 2021
055bd60
add notes
ctb Mar 13, 2021
2329009
remove more unused code
ctb Mar 13, 2021
e6d90f6
refactor most of the test_sbt tests
ctb Mar 13, 2021
2baa8c3
fix one minor issue
ctb Mar 13, 2021
0ec99ea
fix jaccard calculation in sbt
ctb Mar 13, 2021
c583a37
check for compatibility of search fn and query signature
ctb Mar 13, 2021
d565e67
switch tests over to jaccard similarity, not containment
ctb Mar 13, 2021
8eb43f7
fix test
ctb Mar 13, 2021
5c75e39
remove test for unimplemented LCA_Database.find method
ctb Mar 13, 2021
83ee16b
document threshold change; update test
ctb Mar 14, 2021
7bfa0e1
refuse to run abund signatures
ctb Mar 14, 2021
2c28568
flatten sigs internally for gather
ctb Mar 14, 2021
9adae36
reinflate abundances for saving
ctb Mar 14, 2021
c979b17
fix problem where sbt indices coudl be created with abund signatures
ctb Mar 14, 2021
0bf34cd
more
ctb Mar 15, 2021
3844b02
split flat and abund search
ctb Mar 16, 2021
f6fe0de
make ignore_abundance work again for categorize
ctb Mar 16, 2021
863e4de
turn off best-only, since it triggers on self-hits.
ctb Mar 16, 2021
731df73
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Mar 16, 2021
21e8867
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Mar 20, 2021
80c14c2
add test: 'sourmash index' flattens sigs
ctb Mar 20, 2021
138bd16
add note about something to test
ctb Mar 20, 2021
d438f9c
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 3, 2021
e406a99
fix typo; still broken tho
ctb Apr 3, 2021
182ad62
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 4, 2021
74c925d
location is now a property
ctb Apr 4, 2021
87811a4
move search code into search.py
ctb Apr 4, 2021
45b1f5e
remove redundant scaled checking code
ctb Apr 4, 2021
7b76751
best-only now works properly for two tests
ctb Apr 4, 2021
2248b06
'fix' tests by removing v1 and v2 SBT compatibility
ctb Apr 4, 2021
0aa4bd2
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 9, 2021
66dc4a7
simplify (?) downsampling code
ctb Apr 9, 2021
b7a3ba2
require keyword args in MinHash.downsample(...)
ctb Apr 9, 2021
7d3885e
fix bug with downsample
ctb Apr 9, 2021
c686662
require keyword args in MinHash.downsample(...)
ctb Apr 9, 2021
39d13cc
fix test to use proper downsampling, reverse order to match scaled
ctb Apr 9, 2021
86e1f41
add test for revealed bug
ctb Apr 9, 2021
78aa70c
remove unnecessary comment
ctb Apr 9, 2021
d4b291a
Merge branch 'fix/downsample_kwargs' into refactor/index_find
ctb Apr 9, 2021
cb712c0
flatten subject MinHash, too
ctb Apr 9, 2021
ba7352e
add testme comment
ctb Apr 9, 2021
31d08e0
clean up sbt find
ctb Apr 9, 2021
9feda90
clean up lca find
ctb Apr 9, 2021
9b9d518
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 10, 2021
36cc35e
add IndexSearchResult namedtuple for search and gather results
ctb Apr 10, 2021
a6cd259
add more tests for Index classes
ctb Apr 10, 2021
54126ae
add tests for subj & query num downsampling
ctb Apr 10, 2021
16c464e
tests for Index.search_abund
ctb Apr 10, 2021
2e0bc9d
refactor a bit
ctb Apr 10, 2021
87ffe00
refactor make_jaccard_search_query; start tests
ctb Apr 10, 2021
1a4cfd4
even more tests
ctb Apr 10, 2021
184e541
test collect, best_only
ctb Apr 10, 2021
ebd5aac
more search tests
ctb Apr 10, 2021
430cb2e
remove unnec space
ctb Apr 10, 2021
b218540
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 11, 2021
cc2ec29
add minor comment
ctb Apr 11, 2021
c2b4eda
deal with status == None on SystemExit
ctb Apr 11, 2021
1bda989
upgrade and simplify categorize
ctb Apr 11, 2021
a7f5306
restore test
ctb Apr 11, 2021
2db2586
merge
ctb Apr 11, 2021
8c84397
fix abundance search in SBT for categorize
ctb Apr 13, 2021
1c6a539
code cleanup and refactoring; check for proper error messages
ctb Apr 13, 2021
8af9187
add explicit test for incompatible num
ctb Apr 14, 2021
379743d
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 14, 2021
5b4b5ed
refactor MinHash.downsample
ctb Apr 14, 2021
1e70d07
deal with status == None on SystemExit
ctb Apr 11, 2021
495f0bf
fix test
ctb Apr 14, 2021
1660df5
fix comment mispelling
ctb Apr 14, 2021
77f6e0a
properly pass kwargs; fix search_sbt_index
ctb Apr 14, 2021
72639bd
add simple tests for SBT load and search API
ctb Apr 14, 2021
e916214
Merge branch 'refactor/minhash_downsample' into refactor/index_find
ctb Apr 14, 2021
a735445
Merge branch 'fix/sys_exit_none' into refactor/index_find
ctb Apr 14, 2021
922db44
Merge branch 'fix/search_sbt_index' into refactor/index_find
ctb Apr 14, 2021
5b8d83c
allow arbitrary kwargs for LCA_DAtabase.find
ctb Apr 14, 2021
8adc01c
add testing of passthru-kwargs
ctb Apr 15, 2021
f70af9c
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 15, 2021
b07c61d
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 15, 2021
d9c07ce
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 16, 2021
5b308bc
re-enable test
ctb Apr 16, 2021
02c04d6
add notes to update docstrings
ctb Apr 16, 2021
db52ee7
docstring updates
ctb Apr 16, 2021
c50dcdb
fix test
ctb Apr 16, 2021
e4cfe97
Merge branch 'latest' into refactor/index_find
luizirber Apr 16, 2021
11b7486
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 16, 2021
637723b
Merge branch 'latest' into refactor/index_find
ctb Apr 17, 2021
7759314
better tests for gather --save-unassigned
ctb Apr 18, 2021
8376ce5
Merge branch 'refactor/index_find' of github.com:dib-lab/sourmash int…
ctb Apr 18, 2021
593a907
remove unnecessary check-me comment
ctb Apr 19, 2021
4132162
clear out docstring
ctb Apr 19, 2021
23166df
SBT search doesn't work on v1 and v2 SBTs b/c no min_n_below
ctb Apr 19, 2021
3a5901e
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 20, 2021
57467cd
fix my dumb mistake with gather
ctb Apr 21, 2021
35abcf5
have the JaccardSearch.collect function take the matching signature
ctb Apr 21, 2021
d71ef15
adjust search protocol to permit ignoring after finding
ctb Apr 21, 2021
cd04714
Merge branch 'latest' of github.com:dib-lab/sourmash into add/collect…
ctb Apr 22, 2021
ceda626
comment me
ctb Apr 22, 2021
5309bcb
update comments, threshold setting
ctb Apr 22, 2021

4 changes: 2 additions & 2 deletions src/sourmash/index.py
@@ -125,8 +125,8 @@ def prepare_query(query_mh, subj_mh):
             if search_fn.passes(score):
                 # note: here we yield the original signature, not the
                 # downsampled minhash.
-                search_fn.collect(score)
-                yield subj, score
+                if search_fn.collect(score, subj):
+                    yield subj, score

     def search_abund(self, query, *, threshold=None, **kwargs):
         """Return set of matches with angular similarity above 'threshold'.
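The hunk above turns `collect` into a veto as well as a notification: a match is yielded only when `passes(score)` and `collect(score, subj)` both return true. The following is a minimal sketch of that control flow, assuming plain Python sets in place of signatures. `ToySearch`, `SkipExact`, `linear_find`, and `jaccard` are hypothetical stand-ins for illustration, not sourmash code.

```python
def jaccard(a, b):
    "Jaccard similarity of two sets (0.0 when both are empty)."
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ToySearch:
    "Hypothetical stand-in for a search object with passes/collect."
    def __init__(self, threshold):
        self.threshold = threshold

    def passes(self, score):
        # same check as in the diff: a falsy score never passes
        return bool(score and score >= self.threshold)

    def collect(self, score, match):
        # default behaviour: keep every match that passed
        return True


class SkipExact(ToySearch):
    "Hypothetical subclass that vetoes matches identical to the query."
    def __init__(self, threshold, query):
        super().__init__(threshold)
        self.query = query

    def collect(self, score, match):
        return match != self.query


def linear_find(search_fn, query, subjects):
    "Yield (subject, score) only if passes() and collect() both agree."
    for subj in subjects:
        score = jaccard(query, subj)
        if search_fn.passes(score):
            if search_fn.collect(score, subj):
                yield subj, score


query = {1, 2, 3, 4}
subjects = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]

everything = list(linear_find(ToySearch(0.1), query, subjects))
no_self_hit = list(linear_find(SkipExact(0.1, query), query, subjects))
```

With the default `collect`, both overlapping subjects come back; with `SkipExact`, the self-hit is scored and passes the threshold but is not yielded.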
9 changes: 7 additions & 2 deletions src/sourmash/lca/lca_db.py
@@ -462,9 +462,14 @@ def find(self, search_fn, query, **kwargs):

             score = search_fn.score_fn(query_size, shared_size, subj_size,
                                        total_size)
+
+            # note to self: even with JaccardSearchBestOnly, this will
+            # still iterate over & score all signatures. We should come
+            # up with a protocol by which the JaccardSearch object can
+            # signal that it is done, or something.
             if search_fn.passes(score):
-                search_fn.collect(score)
-                yield subj, score
+                if search_fn.collect(score, subj):
+                    yield subj, score

     @cached_property
     def lid_to_idx(self):
9 changes: 6 additions & 3 deletions src/sourmash/sbt.py
@@ -436,9 +436,12 @@ def node_search(node, *args, **kwargs):

             if search_fn.passes(score):
                 if is_leaf:     # terminal node? keep.
-                    results[node.data] = score
-                    search_fn.collect(score)
-                return True
+                    if search_fn.collect(score, node.data):
+                        results[node.data] = score
+                        return True
+                else:           # it's a good internal node, keep.
+                    return True
+
             return False

         # & execute!
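In the tree case above, `passes` is applied to every node so that subtrees can be pruned, while `collect` is consulted only at leaves. A rough sketch of that split follows, assuming a toy tree whose internal nodes store the exact union of their children (a real SBT stores Bloom filters and works from estimates). `Node`, `IgnoreNames`, and `tree_find` are hypothetical names introduced here for illustration.

```python
class Node:
    "Hypothetical tree node: leaves hold one set, internal nodes the union."
    def __init__(self, name, items=None, children=()):
        self.name = name
        self.children = list(children)
        if self.children:
            items = set().union(*(c.items for c in self.children))
        self.items = set(items)

    @property
    def is_leaf(self):
        return not self.children


class IgnoreNames:
    "Hypothetical search object: fixed threshold, vetoes named leaves."
    def __init__(self, threshold, ignore):
        self.threshold = threshold
        self.ignore = set(ignore)

    def passes(self, score):
        return bool(score and score >= self.threshold)

    def collect(self, score, leaf):
        return leaf.name not in self.ignore


def tree_find(root, query, search_fn):
    "Breadth-first search that descends only into passing internal nodes."
    results, visited, queue = {}, [], [root]
    while queue:
        node = queue.pop(0)
        visited.append(node.name)
        # containment of the query in this node; for an internal node this
        # is an upper bound on what any leaf beneath it can score
        score = len(query & node.items) / len(query)
        if not search_fn.passes(score):
            continue                      # prune this whole subtree
        if node.is_leaf:
            if search_fn.collect(score, node):
                results[node.name] = score
        else:
            queue.extend(node.children)   # good internal node, keep going
    return results, visited


a = Node("a", {1, 2, 3, 4})
b = Node("b", {1, 2, 3, 9})
c = Node("c", {7, 8})
d = Node("d", {8, 9})
root = Node("root", children=[Node("left", children=[a, b]),
                              Node("right", children=[c, d])])

results, visited = tree_find(root, {1, 2, 3, 4}, IgnoreNames(0.5, ["a"]))
```

Here the `right` subtree is pruned without visiting `c` or `d`, leaf `a` passes the threshold but is vetoed by `collect`, and only `b` ends up in `results`.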
20 changes: 14 additions & 6 deletions src/sourmash/search.py
@@ -69,7 +69,8 @@ def make_gather_query(query_mh, threshold_bp):
     if threshold > 1.0:
         return None

-    search_obj = JaccardSearchBestOnly(SearchType.CONTAINMENT, threshold=threshold)
+    search_obj = JaccardSearchBestOnly(SearchType.CONTAINMENT,
+                                       threshold=threshold)

     return search_obj

@@ -111,14 +112,20 @@ def check_is_compatible(self, sig):
             raise TypeError("this search cannot be done with an abund signature")

     def passes(self, score):
-        "Return True if this score meets or exceeds the threshold."
+        """Return True if this score meets or exceeds the threshold.
+
+        Note: this can be used whenever a score or estimate is available
+        (e.g. internal nodes on an SBT). `collect(...)`, below, decides
+        whether a particular signature should be collected, and/or can
+        update the threshold (used for BestOnly behavior).
+        """
         if score and score >= self.threshold:
             return True
         return False

-    def collect(self, score):
-        "Is this a potential match?"
-        pass
+    def collect(self, score, match_sig):
+        "Return True if this match should be collected."
+        return True

     def score_jaccard(self, query_size, shared_size, subject_size, total_size):
         "Calculate Jaccard similarity."
@@ -142,9 +149,10 @@ def score_max_containment(self, query_size, shared_size, subject_size,

 class JaccardSearchBestOnly(JaccardSearch):
     "A subclass of JaccardSearch that implements best-only."
-    def collect(self, score):
+    def collect(self, score, match):
         "Raise the threshold to the best match found so far."
         self.threshold = max(self.threshold, score)
+        return True


 # generic SearchResult tuple.
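The best-only `collect` above never rejects a match; it only raises the threshold that later `passes` calls are checked against. A small sketch of the effect on a stream of scores, assuming a hypothetical `BestOnly` class that mirrors the two methods in the diff (the names and scores are made up for illustration):

```python
class BestOnly:
    "Hypothetical best-only search object: collect() raises the threshold."
    def __init__(self, threshold):
        self.threshold = threshold

    def passes(self, score):
        return bool(score and score >= self.threshold)

    def collect(self, score, match):
        # remember the best score so far; weaker matches stop passing
        self.threshold = max(self.threshold, score)
        return True


search = BestOnly(threshold=0.1)
kept = []
for name, score in [("x", 0.4), ("y", 0.3), ("z", 0.9), ("w", 0.9)]:
    if search.passes(score) and search.collect(score, name):
        kept.append((name, score))
```

In this sketch `y` is dropped because `x` already raised the threshold, while `x` itself stays in `kept` even though `z` later beats it, and the tie `w` still passes because the comparison is `>=`. A caller that wants a single best match would still need to pick the top entry from what is returned.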
126 changes: 126 additions & 0 deletions tests/test_index.py
@@ -14,6 +14,7 @@
 from sourmash.sbt import SBT, GraphFactory, Leaf
 from sourmash.sbtmh import SigLeaf
 from sourmash import sourmash_args
+from sourmash.search import JaccardSearch, SearchType

 import sourmash_tst_utils as utils

@@ -1081,3 +1082,128 @@ def test_multi_index_load_from_pathlist_3_zipfile(c):

     mi = MultiIndex.load_from_pathlist(file_list)
     assert len(mi) == 7
+
+##
+## test a slightly outre version of JaccardSearch - this is a test of the
+## JaccardSearch 'collect' protocol, in particular...
+##
+
+class JaccardSearchBestOnly_ButIgnore(JaccardSearch):
+    "A class that ignores certain results, but still does all the pruning."
+    def __init__(self, ignore_list):
+        super().__init__(SearchType.JACCARD, threshold=0.1)
+        self.ignore_list = ignore_list
+
+    # a collect function that _ignores_ things in the ignore_list
+    def collect(self, score, match):
+        print('in collect; current threshold:', self.threshold)
+        for q in self.ignore_list:
+            print('ZZZ', match, match.similarity(q))
+            if match.similarity(q) == 1.0:
+                print('yes, found.')
+                return False
+
+        # update threshold if not perfect match, which could help prune.
+        self.threshold = score
+        return True
+
+
+def test_linear_index_gather_ignore():
+    sig2 = utils.get_test_data('2.fa.sig')
+    sig47 = utils.get_test_data('47.fa.sig')
+    sig63 = utils.get_test_data('63.fa.sig')
+
+    ss2 = sourmash.load_one_signature(sig2, ksize=31)
+    ss47 = sourmash.load_one_signature(sig47, ksize=31)
+    ss63 = sourmash.load_one_signature(sig63, ksize=31)
+
+    # construct an index...
+    lidx = LinearIndex([ss2, ss47, ss63])
+
+    # ...now search with something that should ignore sig47, the exact match.
+    search_fn = JaccardSearchBestOnly_ButIgnore([ss47])
+
+    results = list(lidx.find(search_fn, ss47))
+    results = [ ss for (ss, score) in results ]
+
+    def is_found(ss, xx):
+        for q in xx:
+            print(ss, ss.similarity(q))
+            if ss.similarity(q) == 1.0:
+                return True
+        return False
+
+    assert not is_found(ss47, results)
+    assert not is_found(ss2, results)
+    assert is_found(ss63, results)
+
+
+def test_lca_index_gather_ignore():
+    from sourmash.lca import LCA_Database
+
+    sig2 = utils.get_test_data('2.fa.sig')
+    sig47 = utils.get_test_data('47.fa.sig')
+    sig63 = utils.get_test_data('63.fa.sig')
+
+    ss2 = sourmash.load_one_signature(sig2, ksize=31)
+    ss47 = sourmash.load_one_signature(sig47, ksize=31)
+    ss63 = sourmash.load_one_signature(sig63, ksize=31)
+
+    # construct an index...
+    db = LCA_Database(ksize=31, scaled=1000)
+    db.insert(ss2)
+    db.insert(ss47)
+    db.insert(ss63)
+
+    # ...now search with something that should ignore sig47, the exact match.
+    search_fn = JaccardSearchBestOnly_ButIgnore([ss47])
+
+    results = list(db.find(search_fn, ss47))
+    results = [ ss for (ss, score) in results ]
+
+    def is_found(ss, xx):
+        for q in xx:
+            print(ss, ss.similarity(q))
+            if ss.similarity(q) == 1.0:
+                return True
+        return False
+
+    assert not is_found(ss47, results)
+    assert not is_found(ss2, results)
+    assert is_found(ss63, results)
+
+
+def test_sbt_index_gather_ignore():
+    sig2 = utils.get_test_data('2.fa.sig')
+    sig47 = utils.get_test_data('47.fa.sig')
+    sig63 = utils.get_test_data('63.fa.sig')
+
+    ss2 = sourmash.load_one_signature(sig2, ksize=31)
+    ss47 = sourmash.load_one_signature(sig47, ksize=31)
+    ss63 = sourmash.load_one_signature(sig63, ksize=31)
+
+    # construct an index...
+    factory = GraphFactory(5, 100, 3)
+    db = SBT(factory, d=2)
+
+    db.insert(ss2)
+    db.insert(ss47)
+    db.insert(ss63)
+
+    # ...now search with something that should ignore sig47, the exact match.
+    print(f'\n** trying to ignore {ss47}')
+    search_fn = JaccardSearchBestOnly_ButIgnore([ss47])
+
+    results = list(db.find(search_fn, ss47))
+    results = [ ss for (ss, score) in results ]
+
+    def is_found(ss, xx):
+        for q in xx:
+            print('is found?', ss, ss.similarity(q))
+            if ss.similarity(q) == 1.0:
+                return True
+        return False
+
+    assert not is_found(ss47, results)
+    assert not is_found(ss2, results)
+    assert is_found(ss63, results)
4 changes: 2 additions & 2 deletions tests/test_search.py
@@ -118,13 +118,13 @@ def test_score_jaccard_max_containment_zero_query_size():

 def test_collect():
     search_obj = make_jaccard_search_query(threshold=0)
-    search_obj.collect(1.0)
+    search_obj.collect(1.0, None)
     assert search_obj.threshold == 0


 def test_collect_best_only():
     search_obj = make_jaccard_search_query(threshold=0, best_only=True)
-    search_obj.collect(1.0)
+    search_obj.collect(1.0, None)
     assert search_obj.threshold == 1.0
