[MRG] add manifests to support fast `Index.select(...)` and lazy loading #1590
Codecov Report
@@ Coverage Diff @@
## add/picklist_selectors #1590 +/- ##
==========================================================
- Coverage 89.07% 87.53% -1.54%
==========================================================
Files 76 77 +1
Lines 6735 6933 +198
Branches 1209 1251 +42
==========================================================
+ Hits 5999 6069 +70
- Misses 519 637 +118
- Partials 217 227 +10
Ready for review @luizirber @bluegenes
looking good, really excited for this!
two things I'm thinking of maybe adding, either to this or a new PR -
update: added in #1633

am I right in assuming that with manifests, we can start doing terrible things like running a gather against all available databases (rather than only using databases with the right ksize/scaled) without measurably increasing search time, because only relevant databases/sigs within databases will actually be loaded?
On Wed, Jun 23, 2021 at 04:21:34PM -0700, Tessa Pierce Ward wrote:
> am I right in assuming that with manifests, we can start doing terrible things like running a gather against all available databases (rather than only using databases with the right ksize/scaled) without measurably increasing search time, because only relevant databases/sigs within databases will actually be loaded?

indeed you are correct :)
… permissions in zip (#1633)
* w/zip, compress manifests, and set good default file permissions
* [MRG] alias `--nucleotide`, `--no-nucleotide` for moltype args. (#1632)
* add --nucleotide to moltype args
* test --nucleotide; update bad moltypes error msg
Co-authored-by: Tessa Pierce Ward <[email protected]>

🎉
This PR adds manifests to sourmash, in order to support rapid `Index.select(...)` (including with picklists) on very large databases. The core idea is that manifests contain the signature metadata, including (potentially) the storage location for the raw signature, and can be repeatedly selected upon before actually loading the data.

In particular, this enables cool stuff like rapid extraction of signatures from arbitrarily large collections per #1365.
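
The "select on metadata first, load later" idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not sourmash's implementation: `ManifestRow`, `Manifest`, their fields, and `fake_loader` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass(frozen=True)
class ManifestRow:
    # Hypothetical metadata fields; the real manifest columns may differ.
    md5: str
    ksize: int
    moltype: str
    internal_location: str  # where the raw signature lives in the collection


class Manifest:
    """An in-memory list of metadata rows that can be filtered repeatedly."""

    def __init__(self, rows: List[ManifestRow]):
        self.rows = list(rows)

    def select(self, ksize: Optional[int] = None,
               moltype: Optional[str] = None) -> "Manifest":
        # Filtering touches only metadata; no signature data is read here.
        kept = [
            r for r in self.rows
            if (ksize is None or r.ksize == ksize)
            and (moltype is None or r.moltype == moltype)
        ]
        return Manifest(kept)

    def load(self, loader: Callable[[str], bytes]) -> Iterator[bytes]:
        # Only rows that survived selection are actually loaded.
        for r in self.rows:
            yield loader(r.internal_location)


manifest = Manifest([
    ManifestRow("aaa", 21, "DNA", "sigs/a.sig"),
    ManifestRow("bbb", 31, "DNA", "sigs/b.sig"),
    ManifestRow("ccc", 31, "protein", "sigs/c.sig"),
])

loaded_paths = []  # records which locations were actually read


def fake_loader(path: str) -> bytes:
    # Stand-in for reading a signature from storage.
    loaded_paths.append(path)
    return b"signature data for " + path.encode()


# Selection can be chained; loading happens once, at the end.
selected = manifest.select(ksize=31).select(moltype="DNA")
data = list(selected.load(fake_loader))
```

Here two selections narrow three rows down to one, and the loader is called only for that one location.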
This PR:
- adds a `CollectionManifest` class that stores manifest "rows" containing signature metadata, and can be serialized easily (currently to CSV);
- updates `SignaturePicklist` to do matching on manifest metadata rather than on signature objects only, via `matches_manifest_row(...)`;
- refactors the `MultiIndex` class into an in-memory class that uses manifests;
- adds a `sourmash sig manifest` function to create manifests;
- adds manifest support to `SBT` (via [EXP] add manifests to SBTs #1597) and `ZipFileLinearCollection` on both write and read.

One important part of this PR is the addition of `Index._signatures_with_internal()`, which must be defined by `Index` classes that support manifest creation; it yields internal locations for signatures in a way that is not currently supported by any other mechanism. This is something that we'll need to revisit in the future, and perhaps refactor into `Storage` classes.

TODO:
- `signatures_with_internal`
- `ZipFileLinearIndex` with and without manifests

a demonstration
Below, we run a prefetch of a signature against 48,000 signatures in a zipfile collection, which yields 13 matches (for this query). We then use the picklist functionality in `sourmash sig extract` with the `match_md5` column from the prefetch results to extract just the relevant signatures.

With this PR, the `sourmash sig extract` step takes about 2 seconds. With #1588 (which supports picklists but not any special zipfile interaction), it takes a few minutes.
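
The reason a manifest makes picklist extraction from a zipfile fast can be illustrated with the standard library alone. The sketch below is not the demonstration run itself and does not use sourmash's code: the `MANIFEST.csv` member name, the two-column layout, and both helper functions are assumptions made for the example, and the MD5 of the raw payload is only a stand-in for a signature's real md5.

```python
import csv
import hashlib
import io
import zipfile


def build_zip_with_manifest(sigs):
    """Write fake signature files plus a CSV manifest into an in-memory zip."""
    buf = io.BytesIO()
    rows = io.StringIO()
    writer = csv.DictWriter(rows, fieldnames=["internal_location", "md5"])
    writer.writeheader()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in sigs.items():
            zf.writestr(name, payload)
            # Stand-in digest: hash of the raw bytes (an assumption).
            writer.writerow({
                "internal_location": name,
                "md5": hashlib.md5(payload).hexdigest(),
            })
        # Hypothetical manifest file name inside the zip.
        zf.writestr("MANIFEST.csv", rows.getvalue())
    buf.seek(0)
    return buf


def extract_with_picklist(zip_file, wanted_md5s):
    """Read the manifest, then read only the members whose md5 is wanted."""
    matches = []
    with zipfile.ZipFile(zip_file) as zf:
        manifest_text = zf.read("MANIFEST.csv").decode("utf-8")
        for row in csv.DictReader(io.StringIO(manifest_text)):
            if row["md5"] in wanted_md5s:
                location = row["internal_location"]
                # Only matching members are ever decompressed.
                matches.append((location, zf.read(location)))
    return matches


sigs = {
    "signatures/a.sig": b"sig-a",
    "signatures/b.sig": b"sig-b",
    "signatures/c.sig": b"sig-c",
}
zip_buf = build_zip_with_manifest(sigs)

# The picklist plays the role of the match_md5 column from prefetch results.
picklist = {hashlib.md5(b"sig-b").hexdigest()}
matches = extract_with_picklist(zip_buf, picklist)
```

Without the manifest, every member would have to be opened and parsed to learn its md5; with it, the cost scales with the number of matches rather than with the size of the collection.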