[MRG] add max_containment to MinHash class. #1346
Codecov Report
@@ Coverage Diff @@
## latest #1346 +/- ##
==========================================
+ Coverage 88.88% 89.15% +0.27%
==========================================
Files 123 123
Lines 18321 18593 +272
Branches 1410 1432 +22
==========================================
+ Hits 16284 16577 +293
+ Misses 1800 1780 -20
+ Partials 237 236 -1
src/sourmash/index.py (outdated)
ignore_abundance = kwargs.get('ignore_abundance', False)

# configure search - containment? ignore abundance?
if do_containment:
    query_match = lambda x: query.contained_by(x, downsample=True)
elif do_max_containment:
    query_match = lambda x: query.max_containment(x, downsample=True)
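A minimal, self-contained sketch of the dispatch in this diff, using plain Python sets in place of MinHash objects. The name `make_query_match` and the set-based helpers are hypothetical and not part of sourmash. The guard against passing both flags is an assumption based on the TODO item in the PR description, not code confirmed by this diff.

```python
from typing import Callable, FrozenSet

Hashes = FrozenSet[int]


def jaccard(a: Hashes, b: Hashes) -> float:
    """Shared hashes divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: Hashes, b: Hashes) -> float:
    """Fraction of a's hashes that are also present in b."""
    return len(a & b) / len(a) if a else 0.0


def max_containment(a: Hashes, b: Hashes) -> float:
    """Shared hashes divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def make_query_match(query: Hashes,
                     do_containment: bool = False,
                     do_max_containment: bool = False) -> Callable[[Hashes], float]:
    """Pick the scoring function once, then apply it to every candidate."""
    # Assumed guard: the two modes are treated as mutually exclusive here.
    if do_containment and do_max_containment:
        raise ValueError("choose only one of containment / max containment")
    if do_containment:
        return lambda x: containment(query, x)
    if do_max_containment:
        return lambda x: max_containment(query, x)
    return lambda x: jaccard(query, x)


query = frozenset({1, 2, 3, 4})
match = frozenset({3, 4, 5})
for flags in ({}, {"do_containment": True}, {"do_max_containment": True}):
    print(flags, make_query_match(query, **flags)(match))
```

Selecting the function up front keeps the per-candidate loop free of mode checks, which appears to be the intent of the `query_match` lambdas in the diff.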
witness!search
search --containment
search --max-containment
@bluegenes this might be ready to try out. I wouldn't trust the SBT code just yet, but everything else should work I think.
Thanks @ctb!! Will take it for a spin
I somewhat forgot that the some mini results:
** note, none of these show up with the default threshold (
jaccard:
would it be straightforward to add
Totes. Great idea.
Yeah,
Specifically, imagine that you do a search with
Of course, this is a problem for the current output, too, in that the actual meaning of the
🤔 Hmm, I wonder if we could include an extra column that is something like "search key" to indicate what the search was? It'll be redundant (since it would have to be in every row for this columnar data output) but at least it'd be there.
Might also be a good opportunity to support JSON or YAML output from search, gather, and prefetch - then those files could contain far more information, including full command-line parameters, threshold, etc. etc. See #448.
We could also turn off CSV output from search entirely, and tell people to use prefetch for programmatic foo...
Curious what you mean by "percentage threshold"? It just takes a floating point number that is the lowest similarity etc to report. OH! You mean the thresholding is actually just wrong for
hi @bluegenes this PR is ready for review (and merge?)! Other than code review (sorry for all the mess...) the two remaining things are --
I added the query information into the search CSV output in cbd2503. I couldn't add all of the info that @bluegenes might want in, however, because
Perhaps another reason to consider getting rid of "regular" MinHash per #1354
(I suppose I could put zeros in for all those numbers when running on regular minhashes...)
NA's?
On Thu, Mar 11, 2021 at 05:01:40PM -0800, Tessa Pierce wrote:
>(I suppose I could put zeros in for all those numbers when running on regular minhashes...)
NA's?
<sigh>
fieldnames = ['similarity', 'name', 'filename', 'md5',
              'query_filename', 'query_name', 'query_md5']
It would be nice to modify similarity to containment / max_containment for csv output. Thoughts?
semantic versioning prevents us from removing the similarity header before 5.0. we could add new columns, I 'spose. I don't like the idea that column headers change depending on command line arguments, though. Not sure how to think about it.
(suggest we punt this to a new issue and discuss it there.)
punted to #1390
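One way to picture the "add new columns, keep the similarity header" option from this thread is the sketch below. The first seven column names come from the `fieldnames` list in the diff. The extra `search_type` column, the function name `write_search_csv`, and the sample row values are hypothetical and only illustrate the "search key" idea floated earlier. This is not what the PR or #1390 settled on.

```python
import csv
import io

# Existing columns, as listed in the diff under review.
FIELDNAMES = ['similarity', 'name', 'filename', 'md5',
              'query_filename', 'query_name', 'query_md5']

# Hypothetical additive column recording which measure 'similarity' holds.
EXTRA_FIELDNAMES = ['search_type']


def write_search_csv(rows, search_type, fp):
    """Write rows with a fixed header; only the search_type value varies."""
    writer = csv.DictWriter(fp, fieldnames=FIELDNAMES + EXTRA_FIELDNAMES)
    writer.writeheader()
    for row in rows:
        # The header never depends on command-line flags; the mode is data.
        writer.writerow({**row, 'search_type': search_type})


# Made-up example row, for illustration only.
rows = [{
    'similarity': 0.75, 'name': 'match A', 'filename': 'db.sig',
    'md5': 'abcdef0123456789', 'query_filename': 'query.sig',
    'query_name': 'query', 'query_md5': '01234567',
}]
buf = io.StringIO()
write_search_csv(rows, 'max_containment', buf)
print(buf.getvalue())
```

Because the header is the same for every search mode, downstream parsers that read only the `similarity` column keep working, at the cost of a value that is redundant within each file.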
Co-authored-by: Tessa Pierce <[email protected]>
query=query,
query_filename=query.filename,
query_name=query.name,
query_md5=query.md5sum()[:8]
is there a reason to truncate one md5 and not the other?
...and now that I think about it, it's for semantic versioning, eh?
err, nope, didn't notice I wasn't truncating the md5sum for the match. Hrm.
punted to #1390
the rest lgtm!
OK, I think I addressed everything. Will wait for tests to pass before merging.
Add max_containment() and --max-containment per #1343.
Fixes #1247
Fixes #1343
Main features:
MinHash.max_containment(other, downsample=False)
SourmashSignature.max_containment(other, downsample=False)
--max-containment flag to sourmash search
do_max_containment flag to search for Index classes (LinearIndex, SBT, LCA_Database)
In addition, this PR indulges in misc cleanup:
Index code to use named arguments, now that legacy support for Python 2.7 was removed.
result caching in sbtmh.py.
test_signature.py - there were two test_str functions.
test__minhash.py -- there were two test_mh_len functions.
SBT.search(...) properly fails when threshold=None (this code did not have test coverage)
test_compare_containment_abund_flatten test function in test_sourmash.py.
TODO:
compare --max-containment along with tests
max_containment does not allow non-scaled signatures
if do_containment and do_max_containment conditions
if threshold is None: in sbt.py
Checklist
make test Did it pass the tests?
make coverage Is the new code covered?
without a major version increment. Changing file formats also requires a major version number increment.
changes were made?
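For readers unfamiliar with the `downsample` argument in the signatures listed under "Main features", here is a rough sketch of how max containment and downsampling could interact for scaled sketches. `ToySketch`, `HASH_SPACE`, and the integer-division cutoff are simplifications invented for this example, not sourmash's actual classes or rounding. Raising on mismatched `scaled` values when `downsample=False` is likewise an assumption of this toy model.

```python
from dataclasses import dataclass

HASH_SPACE = 2 ** 64  # assumed 64-bit hash space


@dataclass(frozen=True)
class ToySketch:
    """Toy scaled sketch: keeps only hashes below HASH_SPACE // scaled."""
    scaled: int
    hashes: frozenset

    def downsample(self, scaled: int) -> "ToySketch":
        # Moving to a larger scaled value can only discard hashes.
        if scaled < self.scaled:
            raise ValueError("cannot downsample to a smaller scaled value")
        max_hash = HASH_SPACE // scaled
        return ToySketch(scaled, frozenset(h for h in self.hashes if h < max_hash))


def max_containment(a: ToySketch, b: ToySketch, downsample: bool = False) -> float:
    """Shared hashes divided by the size of the smaller sketch."""
    if a.scaled != b.scaled:
        if not downsample:
            raise ValueError("mismatched scaled values; pass downsample=True")
        # Bring both sketches to the coarser (larger) scaled value first.
        common = max(a.scaled, b.scaled)
        a, b = a.downsample(common), b.downsample(common)
    smaller = min(len(a.hashes), len(b.hashes))
    if smaller == 0:
        return 0.0
    return len(a.hashes & b.hashes) / smaller


a = ToySketch(2, frozenset({10, 20, 2 ** 62 + 5, 2 ** 62 + 6}))
b = ToySketch(4, frozenset({10, 30, 40}))
print(max_containment(a, b, downsample=True))
```

In this example, downsampling `a` to `scaled=4` drops the two hashes at or above `2**62`, leaving `{10, 20}`. One hash is shared and the smaller sketch has two hashes, so the score is 1/2. The measure is symmetric: swapping `a` and `b` gives the same value, unlike plain containment.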