Add feature so that sourmash gather ignores perfect matches #433

Closed
ctb opened this issue Mar 9, 2018 · 7 comments

Comments

@ctb
Contributor

ctb commented Mar 9, 2018

See #432 for rationale; basically, if you run gather on a signature that is present in a database, it will always report itself. sourmash search already ignores signatures that are identical to the query.

@ctb
Contributor Author

ctb commented May 3, 2020

This would definitely require modifying the Index.gather interface.

Thinking about this, I'm not sure it's possible with the current SBT gather implementation without abandoning some of the optimizations. LCA gather would require modifications, too, but could be easier.

@ctb
Contributor Author

ctb commented Jul 18, 2020

ref #849 and motivation here, dib-lab/charcoal#121

@ctb
Contributor Author

ctb commented Apr 23, 2021

underlying support for this added in #1477.

@ctb
Contributor Author

ctb commented May 8, 2021

this is easily supported at the underlying algorithmic level by #1370.

@ctb
Contributor Author

ctb commented Jun 25, 2021

@bluegenes added generic functionality to support this in #1623; thinking about adding command-line options now :)

@ctb
Contributor Author

ctb commented Mar 12, 2022

I... think --exclude-db-pattern supports this now? Yep:

% sourmash gather podar-ref/1.fa.sig podar-ref.zip --exclude CP001941.1

== This is sourmash version 4.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: CP001941.1 Aciduliprofundum bo... (k=31, DNA)
loaded 1 databases.

Starting prefetch sweep across databases.
Found 0 signatures via prefetch; now doing gather.
found less than 50.0 kbp in common. => exiting

found 0 matches total;
the recovered matches hit 0.0% of the query (unweighted)

🎉

@bluegenes does this meet your needs for workflows, or do you think we should put in a command-line switch '--exclude-self' that does this automagically?

I imagine the simplest way to implement --exclude-self would be to do it based on exclude pattern matching against the full md5sum. But I can see some edge cases where it would require making use of different internal mechanisms than the pattern matching: for example, if the query and database have different scaleds or track_abunds, then the md5sum of the query would be different from the md5sum of the signature from the database. A better way to do this might be in the search_fn.collect(...) method added in #1370, which could downsample the two sketches to make them comparable and then ask if they are identical with ==.
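To make the edge case concrete, here is a rough, sourmash-free sketch of the "downsample, then compare with ==" idea. Everything in it (ToySketch, make_sketch, downsample, fingerprint, is_self_match) is a hypothetical stand-in and not the sourmash API or the search_fn.collect(...) code; it also ignores abundance tracking. It only illustrates why a fingerprint over the raw hashes differs between scaled values while the downsampled sketches compare equal.

```python
import hashlib
from typing import FrozenSet, Iterable, NamedTuple

MAX_HASH = 2**64  # toy hash space; real sketches use their own hash function


class ToySketch(NamedTuple):
    """Hypothetical stand-in for a scaled sketch: a scaled value plus kept hashes."""
    scaled: int
    hashes: FrozenSet[int]


def hash_kmer(kmer: str) -> int:
    # Toy 64-bit hash derived from md5; only for illustration.
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def make_sketch(kmers: Iterable[str], scaled: int) -> ToySketch:
    # Keep only hashes that fall below MAX_HASH / scaled.
    cutoff = MAX_HASH // scaled
    return ToySketch(scaled, frozenset(h for h in map(hash_kmer, kmers) if h < cutoff))


def downsample(sketch: ToySketch, scaled: int) -> ToySketch:
    # Downsampling can only discard hashes, so scaled may not decrease.
    if scaled < sketch.scaled:
        raise ValueError("cannot downsample to a smaller scaled value")
    cutoff = MAX_HASH // scaled
    return ToySketch(scaled, frozenset(h for h in sketch.hashes if h < cutoff))


def fingerprint(sketch: ToySketch) -> str:
    # Toy analogue of an md5sum over the sketch contents.
    data = ",".join(map(str, sorted(sketch.hashes)))
    return hashlib.md5(data.encode()).hexdigest()


def is_self_match(query: ToySketch, match: ToySketch) -> bool:
    # Bring both sketches to the coarser scaled value, then compare directly.
    common = max(query.scaled, match.scaled)
    return downsample(query, common) == downsample(match, common)


kmers = [f"kmer{i}" for i in range(500)]
query = make_sketch(kmers, scaled=2)      # query sketched at a finer resolution
in_db = make_sketch(kmers, scaled=4)      # same content stored at a coarser one

print(fingerprint(query) == fingerprint(in_db))  # fingerprints disagree
print(is_self_match(query, in_db))               # downsampled sketches agree
```

The point of the sketch is only that an md5-based pattern would miss this self match, while a comparison after downsampling would catch it.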

If we can't trust the identifiers (which is always a questionable proposition...) we'd need to use md5sum / sketch identity 🤔 .

On the flip side, this is maybe niche enough that, rather than adding a dedicated command-line option, it's simpler to do whatever exclusion makes sense for a workflow based on that workflow's special needs, and to make sure we provide the command-line options to support that. Those options are now available with picklists and/or --exclude-db-pattern.
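As one possible workflow-side approach, a helper could build the gather command from the query's own identifier. This is a minimal sketch with a hypothetical function name (gather_excluding_self) and a placeholder query name; it assumes the value given to --exclude-db-pattern is treated as a regular expression, which is why the accession is escaped, and it assumes the accession is the first whitespace-delimited token of the signature name.

```python
import re
import shlex
from typing import List


def gather_excluding_self(query_sig: str, database: str, query_name: str) -> List[str]:
    """Build a 'sourmash gather' command that excludes the query's own accession.

    Hypothetical helper: assumes the accession is the first token of the name.
    """
    tokens = query_name.split()
    if not tokens:
        raise ValueError("query name is empty; no identifier to exclude")
    # Escape regex metacharacters such as '.' so the accession matches literally
    # (assumption: the pattern is interpreted as a regular expression).
    pattern = re.escape(tokens[0])
    return ["sourmash", "gather", query_sig, database,
            "--exclude-db-pattern", pattern]


cmd = gather_excluding_self(
    "podar-ref/1.fa.sig",
    "podar-ref.zip",
    "CP001941.1 placeholder genome name",  # placeholder, not the real full name
)
print(shlex.join(cmd))  # shell-quoted form, e.g. for a workflow rule
```

This only works when the identifiers can be trusted; otherwise the exclusion would have to be based on sketch identity instead.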

@ctb
Contributor Author

ctb commented Aug 3, 2022

I'm going to close this now that we have picklists and --exclude. If it comes up again, we can implement whatever makes sense there!

@ctb ctb closed this as completed Aug 3, 2022