Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] implement a simple ZipFileLinearIndex class #1349

Merged
merged 95 commits into from
Apr 3, 2021
Merged

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Feb 25, 2021

This PR implements linear indices that are just zipped collections of signatures. Fixes #1320.

Example usage:

zip -r test.zip podar-ref/*.fa.sig
sourmash compare test.zip
sourmash search podar-ref/1.fa.sig test.zip

A few notes -

  • this is not an indexed database, in that search is linear..
  • there are no internal sourmash commands for adding signatures or saving zip files; use zip at the command line.
  • the file is tested to see if it's an SBT first.
  • unlike indexed databases, the signatures in the zipfile can be incompatible ksizes/moltypes; selectors work as normal.
  • in order to be loaded, the filename must end in .zip

I'm a little concerned at how easy this was, TBH... what am I missing? 😅

Questions:

  • do we want to support zipfile loading in directory traversal (e.g. sourmash sig describe .) and list-of-file inputs (--from-file)?

TODO:

  • add tests for command line usage
  • add tests for API usage, include force
  • update docs to describe new feature slash expand "collections of signatures" documentation vs "indexes"
  • evaluate performance on e.g. all of GTDB
  • look at DirectoryIndex issue re caching and/or expansion of Index API to support "this sig was loaded from this source, specifically" - NOTE, done over in [MRG] Rework the find functionality for Index classes #1392 currently.

@codecov
Copy link

codecov bot commented Feb 25, 2021

Codecov Report

Merging #1349 (6ce3ce5) into latest (1dc9426) will increase coverage by 0.11%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1349      +/-   ##
==========================================
+ Coverage   89.33%   89.44%   +0.11%     
==========================================
  Files         123      123              
  Lines       18913    19105     +192     
  Branches     1463     1471       +8     
==========================================
+ Hits        16895    17089     +194     
+ Misses       1785     1784       -1     
+ Partials      233      232       -1     
Flag Coverage Δ
python 94.59% <100.00%> (+0.08%) ⬆️
rust 67.37% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tests/test_sbt.py 100.00% <ø> (ø)
src/sourmash/index.py 92.10% <100.00%> (+1.33%) ⬆️
src/sourmash/sbt.py 83.89% <100.00%> (+0.23%) ⬆️
src/sourmash/sourmash_args.py 96.04% <100.00%> (+0.08%) ⬆️
tests/test_api.py 100.00% <100.00%> (ø)
tests/test_cmd_signature.py 100.00% <100.00%> (ø)
tests/test_index.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1dc9426...6ce3ce5. Read the comment docs.

@ctb ctb changed the title [WIP] implement a simple ZipFileLinearIndex class [MRG] implement a simple ZipFileLinearIndex class Feb 26, 2021
@ctb
Copy link
Contributor Author

ctb commented Feb 26, 2021

Interested in comments, @bluegenes @luizirber! I still have a few tests to write and some documentation to update, but curious what you think.

So far I really like the convenience of supporting everything in one file, and it's particularly nice that it supports incompatible signatures (unlike the indexed data types, for now).

@bluegenes
Copy link
Contributor

love this -- much easier to distribute one file than many, plus compression! A couple thoughts:

  • sourmash sig describe --> yes, would be nice to support usage, but output may not be as useful? Likely too many sigs for full output, and sourmash sig describe <zipfile> | head would no longer represent the parameters of all sigfiles, since it's possible to incorporate incompatible signatures. If enabled, maybe we could summarize the sig params (e.g. num sigs per ksize/alpha/scaled ) as tsv or tabbed terminal output? Or have a quick default option that gives info for first sig, and summarize option that gives full summary? Somewhat related to sourmash index describe (or similar) #1190.

  • loading via --from-file --> yes, please!

  • Future thoughts: Are you set on zipfile generation being external? If not --

    • Zipped output from search or gather(when outputting matches/unmatched) could be useful?
    • Would we want to enable building a sigs.zip file from a file with a list of sigs?
    • If sketching a bunch of sigs, could we write directly to a zip file? Not sure I would actually do this for non --singleton cases as it would eliminate parallelization, but just throwing it out there.

@ctb
Copy link
Contributor Author

ctb commented Feb 26, 2021

Love the comments, excellent brainstorming :). I will probably punt most of them to new issues as out of my desired scope for this PR, but I think they're great next steps!

Note that outputting '.gz' formats is just as good as a zip file when compression is desired; zip files are mostly nice for situations where you have a directory of things to distribute, and want to be able to update them etc (the random access case).

I like the idea of a sourmash sig summary command for collections, FWIW.

@bluegenes
Copy link
Contributor

Note that outputting '.gz' formats is just as good as a zip file when compression is desired; zip files are mostly nice for situations where you have a directory of things to distribute, and want to be able to update them etc (the random access case).

Can you expand on this a bit? I guess you mean that for compression, we could gzip the individual sigfiles, and then optionally zip the directory, which enables distribution/updating/etc. But I guess I haven't been thinking about compression much-- do we currently enable reading/writing gz sigfiles? Or is it just that gz is easier to enable for sig output? How do the pros (space) / cons (loading time??) shake out? Sry, naiive compression q's -- happy to drop this in a different issue if better :).

@luizirber
Copy link
Member

luizirber commented Feb 26, 2021

On the plus side, this is a simple implementation, with easy-to-use commands in the CLI, and covers an increasingly common use case.

But =]

  • It's putting together Index and Storage. Nothing necessarily wrong with that, but the benefit of going thru the Storage layer is being able to share it with other indices (especially SBTs at the moment).
  • This is an additional file format to support (on top of signatures, LCA, SBT, and probably file-of-files should be considered a format too). And the file generation is "outside" our control, so it's harder to version.
  • In [WIP] SBT scaffold #1201 I was adding a new version of the SBT (v7, I think?) that splits internal nodes and leaves/signatures. This could be used here (where signatures/ would be a dir inside the zip file with the signatures), and make this and the SBT zip formats compatible.
  • We already do quite a bit of adhoc checks to support Zip/JSON SBTs, and this PR adds another one (distinguish between SBT and Linear zip files)
  • There is a very old (and unfinished) PR for saving full sigs into the SBT [WIP] save full signature in SBT leaves #366. There was no Index or selectors at that point, but would be similar to this use case.

Questions:

  • are there any performance concerns with the way I'm doing this?

Not at first sight, but we will only know when we measure =]

do we want to support loading .sig.gz files from within zip files?

This already works transparently, because niffler opens the data (compressed or uncompressed) on the Rust side. No?


So, I'm +1 on this PR. I think it fragments a bit the codebase and makes it harder to maintain in the long run, but it is useful NOW, which is more important =]

@ctb
Copy link
Contributor Author

ctb commented Feb 27, 2021

some responses and thoughts -

Note that outputting '.gz' formats is just as good as a zip file when compression is desired; zip files are mostly nice for situations where you have a directory of things to distribute, and want to be able to update them etc (the random access case).

@bluegenes:

Can you expand on this a bit? I guess you mean that for compression, we could gzip the individual sigfiles, and then optionally zip the directory, which enables distribution/updating/etc. But I guess I haven't been thinking about compression much-- do we currently enable reading/writing gz sigfiles? Or is it just that gz is easier to enable for sig output? How do the pros (space) / cons (loading time??) shake out? Sry, naiive compression q's -- happy to drop this in a different issue if better :).

sourmash can load multiple signatures from .sig.gz files just fine. So we could output matches etc in .sig.gz format it all we wanted was compression.

So the question becomes, what does a .zip file bring that goes beyond compression?

The main advantages that I see -

  • the ability to list and manipulate the file contents from outside sourmash using a variety of tools
  • storing non .sig files, e.g. files that contain taxonomy info per adding lineage manipulation & taxonomy reporting in more places in sourmash? #969 (comment), or maybe manifests (the way SBTs have the .sbt.json "directory" files)
  • faithfully representing directory structures and sensible filenames (which can be good or bad, but it's nice to have the option)
  • a fairly good intuitive explanation for "what is this file?" - it's merely a list of signature files that you put in a zip file.

That is, zip files are a package of compressed files and directories, not a compressed package of signatures.

@luizirber:

  • It's putting together Index and Storage. Nothing necessarily wrong with that, but the benefit of going thru the Storage layer is being able to share it with other indices (especially SBTs at the moment).

This is an excellent point, and one I want to dig into. Are you think maybe that LinearIndex could be stored in a ZipStorage, or an lca.json file could be stored on IPFS, potentially?

  • This is an additional file format to support (on top of signatures, LCA, SBT, and probably file-of-files should be considered a format too). And the file generation is "outside" our control, so it's harder to version.
  • We already do quite a bit of adhoc checks to support Zip/JSON SBTs, and this PR adds another one (distinguish between SBT and Linear zip files)

I'm not too concerned about the versioning, since zipfile format is pretty stable, is supported by Python, and presumably is supported by Rust.

But your point about this being Yet Another file format to support is well taken. How many more if statements do we want in sourmash_args? 😆

  • In [WIP] SBT scaffold #1201 I was adding a new version of the SBT (v7, I think?) that splits internal nodes and leaves/signatures. This could be used here (where signatures/ would be a dir inside the zip file with the signatures), and make this and the SBT zip formats compatible.

So this brings up a key question: do we want to require that there be a manifest? A required manifest can add loading overhead and maintenance overhead (if we update the things in the zip file, we need to update the manifest).

In the greyhound use case we simply want to be able to iterate across the signatures as quickly as possible, and I think the current SBT format is ill suited for that.

k, I'll take a look.

do we want to support loading .sig.gz files from within zip files?

This already works transparently, because niffler opens the data (compressed or uncompressed) on the Rust side. No?

Yes, it should, but we need to recognize .sig.gz files in the internal logic of the Zipfile loading.

So, I'm +1 on this PR. I think it fragments a bit the codebase and makes it harder to maintain in the long run, but it is useful NOW, which is more important =]

This matches my hot take (once I'd finished getting it working) - good functionality, but not that cleanly integrated into sourmash. There's no immediate hurry, so let's keep talking.

So a follow-on question:

What about having some logic where the Storage tells us what kind of file it is? Then we can implement ZipStorage logic to tell us if it's an SBT or something else.

@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2021

Wrote a bit about how a DirectoryIndex might work in terms of supporting signature origin tracking, because I'm wondering if we could expand the Storage API in sbt_storage.py to support a wider range of retrieval approaches (and then maybe build DirectoryIndex on top of ZipStorage as a replacement for this ZipFileLinearIndex class).

tl;dr I am wondering if we should provide a URL designator for signature origins?

In doing so, I came across this interesting discussion about how to write URLs for files in zip archives (from frictionlessdata that does a nice breakdown of some issues with origin URLs.

Which led me to this issue on bdbags 🤣 so it's all getting very complicated!

@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2021

while I'm weaving a web of issues this morning...

  • when thinking about manifests #1352, a nice clear use case for zip files emerged: the manifest can be part of the collection.
  • would be interesting to trial-implement a SQL storage (perhaps in sqlite?) for signature collections.

@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2021

further brain breaking fun thoughts:

aren't indexes separate from collections? indexes are indices ON collections of sketches...

  • collections support dumb traversal, along with underlying signature/sketch storage and retrieval
  • Index objects provide fast queries of various types (search, gather)

thinking down this road a bit further,

  • collections would support specific storage types (file system, zip, ipfs, redis, sql) - we could provide a URL schema that is externally resolvable, at least by other commands within sourmash
  • manifests would be information about collections, optionally provided
  • indexes would be separate mixins that support better querying

there is a slightly messy interplay with sketch types and selectors and tagging -

  • indexes are generally going to be on a particular subset of compatible sketches within a collection;
  • so a search using an index would be able to look at a subset of your collection;
  • and you might want to refine the search results by selecting on tags, or taxonomy, or something, AFTER the search is done;
  • you run into the (far distant future) problem that there maybe an index on some subset of a collection, and you want to use it for searching those, but not for the things not included in the index;

but this all seems like buying ourselves conceptual trouble for the future that might not be that interesting in practice...

note we reframe sourmash index as generating an index on a collection, e.g. an LCA/revindex index on a collection of signatures. Of course some of the indices (LCA currently) store the signatures internally, others don't need to but may (SBTs).

This further opens up the idea of commands to generate manifests on collections separately from indices.

@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2021

Here's some simple code to load a pile o' signatures into a zipfile:

#! /usr/bin/env python
import sys
import zipfile
import sourmash
import argparse


def main():
    p = argparse.ArgumentParser()
    p.add_argument('zipfile')
    p.add_argument('signatures', nargs='+')
    args = p.parse_args()

    zf = zipfile.ZipFile(args.zipfile, 'w', zipfile.ZIP_DEFLATED)

    n = 0
    for filename in args.signatures:
        print(f"reading signatures from '{filename}'")
        for sig in sourmash.load_file_as_signatures(filename):
            md5 = 'signatures/' + sig.md5sum() + '.sig'
            sigstr = sourmash.save_signatures([sig])
            zf.writestr(md5, sigstr)
            n += 1

    print(f"wrote {n} signatures to '{args.zipfile}'")

    return 0


if __name__ == '__main__':
    sys.exit(main())

@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2021

more idle thinking, would the results of prefetch be ...another index? or is that too query-specific? hmm. probably. But they could be a collection.

...perhaps with a new kind of index, a summary sketch that contains aggregate hashes plus their abundances.

or maybe that's just an lca/revindex index.

which leads me to think that the output of commands like prefetch and sourmash gather --save-matches could usefully be an index like LCA/revindex instead of a zipfile - outputting an index no longer restricts downstream usage, since all of those outputs will be treated the same by sourmash signature search/manipulation commands.

@bluegenes
Copy link
Contributor

Could the result of prefetch be an alternate manifest or a list of names that can be used to select signatures to load from the existing collection/index?

Would be really nice to avoid saving a matches.sig intermediate file, if it's not too hard to just load the relevant sigs from a collection.

@ctb ctb changed the title [MRG] implement a simple ZipFileLinearIndex class [WIP] implement a simple ZipFileLinearIndex class Mar 5, 2021
@ctb ctb mentioned this pull request Apr 1, 2021
6 tasks
@ctb
Copy link
Contributor Author

ctb commented Apr 3, 2021

With #1406 and #1420 being merged, loading of databases and selection of compatible signatures has improved muchly; while there are still many to be improved, I think this PR is reasonably well scoped and fully baked.

I'll take care of creating new issues based on the conversations above before merge, but for now I am happy to say: ready for review! pls review at your leisure, @bluegenes! (because I know you're enthusiastic about this feature :)

@ctb ctb changed the title [WIP] implement a simple ZipFileLinearIndex class [MRG] implement a simple ZipFileLinearIndex class Apr 3, 2021
@ctb
Copy link
Contributor Author

ctb commented Apr 3, 2021

(the test failures appear to be related to something else going on with the docs; I'll look into this separately)

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good to me, just a few questions /comments!

doc/command-line.md Outdated Show resolved Hide resolved
tests/test_index.py Show resolved Hide resolved
tests/test_index.py Show resolved Hide resolved
src/sourmash/index.py Show resolved Hide resolved
src/sourmash/index.py Show resolved Hide resolved
@bluegenes
Copy link
Contributor

bluegenes commented Apr 26, 2021

Here's some simple code to load a pile o' signatures into a zipfile:

#! /usr/bin/env python
import sys
import zipfile
import sourmash
import argparse


def main():
    p = argparse.ArgumentParser()
    p.add_argument('zipfile')
    p.add_argument('signatures', nargs='+')
    args = p.parse_args()

    zf = zipfile.ZipFile(args.zipfile, 'w')

    n = 0
    for filename in args.signatures:
        print(f"reading signatures from '{filename}'")
        for sig in sourmash.load_file_as_signatures(filename):
            md5 = 'signatures/' + sig.md5sum() + '.sig'
            sigstr = sourmash.save_signatures([sig])
            zf.writestr(md5, sigstr)
            n += 1

    print(f"wrote {n} signatures to '{args.zipfile}'")

    return 0


if __name__ == '__main__':
    sys.exit(main())

FYI this may have issues with duplicated md5sums, see #1483 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

distributing zip files of signatures, alone.
3 participants