[MRG] add manifests to support fast Index.select(...) and lazy loading #1590

Merged: ctb merged 124 commits into latest from add/picklist_zf_manifests on Jun 24, 2021

@ctb ctb commented Jun 13, 2021

This PR adds manifests to sourmash to support rapid Index.select(...) (including with picklists) on very large databases. The core idea is that a manifest contains the signature metadata, including (potentially) the storage location of the raw signature, so selection can be applied repeatedly to the manifest before any signature data is actually loaded.

In particular, this enables cool stuff like rapid extraction of signatures from arbitrarily large collections per #1365.

This PR:

  • creates a CollectionManifest class that stores manifest "rows" containing signature metadata, and can be serialized easily (currently to CSV);
  • extends SignaturePicklist to match on manifest metadata, rather than only on signature objects, via matches_manifest_row(...);
  • refactors the MultiIndex class into an in-memory class that uses manifests;
  • provides a sourmash sig manifest command to create manifests;
  • provides manifest functionality for SBT (via [EXP] add manifests to SBTs #1597) and ZipFileLinearCollection on both write and read.
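
To make the manifest idea concrete, here is a minimal sketch, not the actual CollectionManifest implementation. The class name ToyManifest, its column names, and the example rows are all assumptions made up for illustration; the only points taken from the description above are that rows hold metadata (not sketches), that selection can be chained, and that the rows serialize to CSV.

```python
import csv
import io


class ToyManifest:
    """Hypothetical stand-in for a manifest: metadata rows only, no sketch data."""

    # Column names are assumptions for this sketch, not sourmash's real schema.
    FIELDS = ["internal_location", "md5", "ksize", "moltype", "name"]

    def __init__(self, rows):
        self.rows = list(rows)

    def select(self, ksize=None, moltype=None):
        """Return a new manifest with only the rows matching the criteria."""
        def keep(row):
            if ksize is not None and int(row["ksize"]) != ksize:
                return False
            if moltype is not None and row["moltype"] != moltype:
                return False
            return True

        return ToyManifest(row for row in self.rows if keep(row))

    def to_csv(self):
        """Serialize the rows to CSV text."""
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=self.FIELDS)
        writer.writeheader()
        writer.writerows(self.rows)
        return buf.getvalue()

    @classmethod
    def from_csv(cls, text):
        """Rebuild a manifest from CSV text (all values come back as strings)."""
        return cls(csv.DictReader(io.StringIO(text)))


# Made-up example rows.
ROWS = [
    {"internal_location": "signatures/a.sig", "md5": "aaaa1111", "ksize": 31,
     "moltype": "DNA", "name": "genome A"},
    {"internal_location": "signatures/b.sig", "md5": "bbbb2222", "ksize": 21,
     "moltype": "DNA", "name": "genome B"},
    {"internal_location": "signatures/c.sig", "md5": "cccc3333", "ksize": 31,
     "moltype": "protein", "name": "genome C"},
]

manifest = ToyManifest(ROWS)
# Selection can be chained; nothing but metadata has been touched so far.
selected = manifest.select(ksize=31).select(moltype="DNA")
round_tripped = ToyManifest.from_csv(manifest.to_csv())
```

Because every select() call returns another manifest, filters compose cheaply, and only the internal_location values of the surviving rows would ever need to be opened.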

One important part of this PR is the addition of Index._signatures_with_internal(), which must be defined by Index classes that support manifest creation; it yields the internal locations of signatures, something no other mechanism currently supports. This is something that we'll need to revisit in the future, and perhaps refactor into the Storage classes.
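
A rough sketch of what "yielding internal locations" could look like for a zip-backed collection. This is an illustration only: the free function below, the (record, location) tuple shape, and the toy JSON records are assumptions, not the signature or return type of the real Index._signatures_with_internal().

```python
import io
import json
import zipfile


def signatures_with_internal(zf):
    """Yield (record, internal_location) for each .sig member of an open zip.

    Hypothetical helper: the record is a toy JSON dict, not a real signature.
    """
    for info in zf.infolist():
        if info.filename.endswith(".sig"):
            yield json.loads(zf.read(info)), info.filename


def manifest_rows(zf):
    """Build metadata rows that remember where each record lives in the zip."""
    return [
        {"internal_location": location, "md5": record["md5"], "ksize": record["ksize"]}
        for record, location in signatures_with_internal(zf)
    ]


# Build a small in-memory archive with made-up contents.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("signatures/a.sig", json.dumps({"md5": "aaaa1111", "ksize": 31}))
    zf.writestr("signatures/b.sig", json.dumps({"md5": "bbbb2222", "ksize": 21}))
    zf.writestr("README.txt", "not a signature")

with zipfile.ZipFile(archive) as zf:
    rows = manifest_rows(zf)
```

The manifest has to be generated by something that knows the storage layout, which is why this responsibility might sit more naturally with the Storage classes.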

TODO:

  • revisit signatures_with_internal
  • write tests for ZipFileLinearIndex with and without manifests
  • write tests for manifest generation on a variety of saved indices

a demonstration

Below, we run a prefetch of a signature against 48,000 signatures in a zipfile collection, which yields 13 matches (for this query). We then use the picklist functionality in sourmash sig extract with the match_md5 column from the prefetch results to extract just the relevant signatures.

With this PR, the sourmash sig extract step takes about 2 seconds. With #1588 (which supports picklists but has no special zipfile interaction), it takes a few minutes.

% sourmash prefetch podar-ref/63.fa.sig gtdb-r202.genomic-reps.k31.zip -o 63.prefetch.csv
...
(takes a few minutes, yields a prefetch.csv with 13 results)
...
% sourmash sig extract --picklist 63.prefetch.csv:match_md5:md5prefix8 \
         gtdb-r202.genomic-reps.k31.zip -o /tmp/abc.zip
picking column 'match_md5' of type 'md5prefix8' from '63.prefetch.csv'
loaded 13 distinct values into picklist.
found manifest when loading gtdb-r202.genomic-reps.k31.zip
found manifest when loading gtdb-r202.genomic-reps.k31.zip
.signatures() found manifest!
loaded 13 sigs from 'gtdb-r202.genomic-reps.k31.zip'
loaded 13 total that matched ksize & molecule type
extracted 13 signatures from 1 file(s)
for given picklist, found 13 matches of 13 total
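
The speedup in the run above comes from matching the picklist against manifest rows and only then loading the matching signatures. A minimal sketch of that flow, under stated assumptions: the helper names, the row layout, and the fake loader are hypothetical, and the assumption that md5prefix8 means "compare the first 8 characters of the md5" is inferred from the column type name rather than taken from the sourmash source.

```python
import csv
import io


def load_md5_prefixes(csv_text, column, prefix_len=8):
    """Collect the distinct md5 prefixes found in one column of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column][:prefix_len] for row in reader}


def select_rows(manifest_rows, prefixes, prefix_len=8):
    """Keep only the manifest rows whose md5 prefix is in the picklist."""
    return [row for row in manifest_rows if row["md5"][:prefix_len] in prefixes]


def load_selected(rows, loader):
    """Load signature data only for the rows that survived selection."""
    return [loader(row["internal_location"]) for row in rows]


# Made-up example data standing in for a prefetch CSV and a manifest.
PREFETCH_CSV = "match_name,match_md5\ngenome A,0123456789abcdef\n"
MANIFEST_ROWS = [
    {"md5": "0123456789abcdef0123456789abcdef", "internal_location": "signatures/a.sig"},
    {"md5": "fedcba9876543210fedcba9876543210", "internal_location": "signatures/b.sig"},
]

loaded = []  # records which locations were actually opened


def fake_loader(location):
    """Stand-in for reading and parsing a signature from storage."""
    loaded.append(location)
    return f"<signature from {location}>"


prefixes = load_md5_prefixes(PREFETCH_CSV, "match_md5")
picked = select_rows(MANIFEST_ROWS, prefixes)
sigs = load_selected(picked, fake_loader)
```

Here only signatures/a.sig is ever passed to the loader; the non-matching entry is rejected from its metadata alone.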

@ctb ctb changed the base branch from latest to add/picklist_selectors June 13, 2021 16:03

codecov bot commented Jun 13, 2021

Codecov Report

Merging #1590 (39abe57) into add/picklist_selectors (5ac4671) will decrease coverage by 1.53%.
The diff coverage is 59.61%.

@@                    Coverage Diff                     @@
##           add/picklist_selectors    #1590      +/-   ##
==========================================================
- Coverage                   89.07%   87.53%   -1.54%     
==========================================================
  Files                          76       77       +1     
  Lines                        6735     6933     +198     
  Branches                     1209     1251      +42     
==========================================================
+ Hits                         5999     6069      +70     
- Misses                        519      637     +118     
- Partials                      217      227      +10     
Flag Coverage Δ
python 87.53% <59.61%> (-1.54%) ⬇️

Impacted Files Coverage Δ
src/sourmash/sig/__main__.py 86.96% <5.00%> (-3.49%) ⬇️
src/sourmash/index.py 82.04% <59.70%> (-17.00%) ⬇️
src/sourmash/cli/sig/manifest.py 80.00% <80.00%> (ø)
src/sourmash/sig/picklist.py 71.13% <92.85%> (-5.98%) ⬇️
src/sourmash/cli/sig/__init__.py 100.00% <100.00%> (ø)
src/sourmash/commands.py 86.25% <100.00%> (ø)
src/sourmash/lca/command_summarize.py 87.82% <100.00%> (ø)
src/sourmash/sourmash_args.py 93.71% <100.00%> (ø)
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ac4671...39abe57. Read the comment docs.

@ctb ctb changed the title [WIP] add manifests to support fast Index.select(...) and lazy loading [MRG] add manifests to support fast Index.select(...) and lazy loading Jun 23, 2021
ctb commented Jun 23, 2021

Ready for review @luizirber @bluegenes

@bluegenes bluegenes left a comment

looking good, really excited for this!

ctb commented Jun 23, 2021

two things I'm thinking of maybe adding, either to this or a new PR -

  • look into setting the default file permissions on SOURMASH-MANIFEST.csv so that it's a+r; see this link.
  • the zipfile is opened in uncompressed mode (zipfile.ZIP_STORED) and then the signatures are stored in it in compressed format (using sourmash.signature.save_signatures(..., compression=1)). This means that the manifest file is also stored in uncompressed mode, because we don't compress it before saving it in the zip file. I'm thinking we might support manifests with .gz filenames, either in addition to or in place of the current SOURMASH-MANIFEST.csv filename in zipfile collections.

update: added in #1633
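
For reference, both ideas can be sketched with the standard library alone. This is an illustration, not the code that landed in #1633: the add_manifest helper and the SOURMASH-MANIFEST.csv.gz member name are assumptions. It gzips the manifest before writing it into a ZIP_STORED archive and sets a+r style permissions (rw-r--r--) through ZipInfo.external_attr.

```python
import gzip
import io
import stat
import zipfile


def add_manifest(zf, csv_text, name="SOURMASH-MANIFEST.csv.gz"):
    """Write a gzip-compressed manifest into an open zipfile.

    Hypothetical helper; the default member name is an assumption.
    """
    info = zipfile.ZipInfo(name)
    # The upper 16 bits of external_attr carry the Unix mode:
    # regular file, readable by everyone, writable by the owner.
    info.external_attr = (stat.S_IFREG | 0o644) << 16
    # The zip layer stores the bytes as-is; gzip has already compressed them.
    info.compress_type = zipfile.ZIP_STORED
    zf.writestr(info, gzip.compress(csv_text.encode("utf-8")))


MANIFEST_CSV = "md5,ksize\naaaa1111,31\n"  # made-up manifest content

archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_STORED) as zf:
    add_manifest(zf, MANIFEST_CSV)

with zipfile.ZipFile(archive) as zf:
    stored = zf.getinfo("SOURMASH-MANIFEST.csv.gz")
    mode = (stored.external_attr >> 16) & 0o777
    recovered = gzip.decompress(zf.read(stored)).decode("utf-8")
```

Compressing the member itself keeps the zipfile in stored mode, consistent with how the signatures are already written.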

bluegenes commented Jun 23, 2021

am I right in assuming that with manifests, we can start doing terrible things like running a gather against all available databases (rather than only using databases with the right ksize/scaled) without measurably increasing search time, because only relevant databases/sigs within databases will actually be loaded?

ctb commented Jun 23, 2021 via email

… permissions in zip (#1633)

* w/zip, compress manifests, and set good default file permissions

* [MRG] alias `--nucleotide`, `--no-nucleotide` for moltype args. (#1632)

* add --nucleotide to moltype args

* test --nucleotide; update bad moltypes error msg

Co-authored-by: Tessa Pierce Ward <[email protected]>
@luizirber luizirber self-requested a review June 24, 2021 18:03
@ctb ctb merged commit 9dbd8b5 into latest Jun 24, 2021
@ctb ctb deleted the add/picklist_zf_manifests branch June 24, 2021 18:07
ctb commented Jun 24, 2021

🎉
