how to support lazy/streaming load with standard sourmash functions: standalone manifests #3023

Closed
ctb opened this issue Feb 21, 2024 · 0 comments · Fixed by #3027

ctb commented Feb 21, 2024

@AnneliektH asked me yesterday how to best provide a list of metagenome sketches to mgmanysearch (see https://github.com/sourmash-bio/sourmash_plugin_containment_search/). I realized I wasn't 100% sure of the answer, despite having written this:

so the question becomes, how do you provide collections of large metagenomes to manysearch and fastmultigather in a single filename?
And the answer is: manifests. Manifests are a sourmash filetype that contains information about sketches without containing the actual sketch content, and they can be used as "catalogs" of sketch content.

(Part of my confusion was that the text above describes loading through the Rust functionality, not through the standard Python loading functions.)

mgmanysearch uses standard sourmash loading functions, so I thought an investigation would be useful and lead to some additional sourmash documentation too!

tl;dr don't use pathlists, use manifests.

the script

I wrote the following Python script:

#! /usr/bin/env python
# Time each stage of loading sketches through the standard sourmash Python API.
import sys
import time

import sourmash

# Stage 1: open the collection named on the command line.
print(f'opening {sys.argv[1]}')
sys.stdout.flush()
mark = time.time()
idx = sourmash.load_file_as_index(sys.argv[1])
print(f'{time.time() - mark:.3f}s')
sys.stdout.flush()

# Stage 2: restrict the index to k=21 sketches.
print(f'selecting {sys.argv[1]}')
sys.stdout.flush()
mark = time.time()
idx = idx.select(ksize=21)
print(f'{time.time() - mark:.3f}s')
sys.stdout.flush()

print("starting...")
sys.stdout.flush()

# Stage 3: iterate over the signatures, timing how long each one takes to arrive.
mark = time.time()
for ss in idx.signatures():
    print(f'loaded {ss.name}')
    print(f'{time.time() - mark:.3f}s')
    sys.stdout.flush()
    mark = time.time()

the execution

and then ran it on a pathlist containing a list of filenames:

opening pathlist.txt
23.715s
selecting pathlist.txt
0.000s
starting...
loaded 139_2
0.000s
loaded 139_1
0.000s
loaded 139_3
0.000s
loaded 139_4
0.000s

and on a manifest generated with sourmash sig collect $(cat pathlist.txt) -o mf.csv -F csv:

opening mf.csv
0.009s
selecting mf.csv
0.000s
starting...
loaded 139_1
4.404s
loaded 139_2
6.663s
loaded 139_3
5.751s
loaded 139_4
6.798s
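
As a quick sanity check on the two runs, the timings can be summed. This is only arithmetic on the numbers printed above; nothing here is measured anew.

```python
# Timings copied from the two runs above, in seconds.
pathlist_open = 23.715
pathlist_per_sketch = [0.000, 0.000, 0.000, 0.000]

manifest_open = 0.009
manifest_per_sketch = [4.404, 6.663, 5.751, 6.798]

pathlist_total = pathlist_open + sum(pathlist_per_sketch)
manifest_total = manifest_open + sum(manifest_per_sketch)

print(f"pathlist total: {pathlist_total:.3f}s")  # 23.715s
print(f"manifest total: {manifest_total:.3f}s")  # 23.625s
```

The totals are nearly identical, so the manifest does not make loading cheaper overall; it moves the cost from the open step to the point where each sketch is actually requested.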

results

When using pathlists, all sketches are loaded at once at the beginning (the 23.7s spent in load_file_as_index above), consuming All The Memory.

When using manifests, sketches are loaded on demand as the loop reaches them (roughly 4-7s each above), not consuming All The Memory.
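
The difference between the two behaviours can be illustrated with a toy model. This is a minimal sketch, not sourmash's implementation: `fake_load`, `EagerIndex`, `LazyIndex`, and the `location` column are all hypothetical names made up for the illustration, and the "sketches" are just strings.

```python
import csv
import io


def fake_load(path, log):
    """Stand-in for an expensive sketch load; records that it happened."""
    log.append(path)
    return f"sketch:{path}"


class EagerIndex:
    """Pathlist-style: every sketch is loaded when the index is opened."""

    def __init__(self, paths, log):
        self.sketches = [fake_load(p, log) for p in paths]

    def signatures(self):
        yield from self.sketches


class LazyIndex:
    """Manifest-style: opening reads only the catalog rows."""

    def __init__(self, manifest_csv, log):
        self.rows = list(csv.DictReader(io.StringIO(manifest_csv)))
        self.log = log

    def signatures(self):
        # Each sketch is loaded only when the consumer asks for it.
        for row in self.rows:
            yield fake_load(row["location"], self.log)


paths = ["a.sig.gz", "b.sig.gz", "c.sig.gz"]
manifest_csv = "location,name\n" + "".join(f"{p},{p[0]}\n" for p in paths)

eager_log = []
eager = EagerIndex(paths, eager_log)
print(len(eager_log))  # 3: everything was loaded before iteration began

lazy_log = []
lazy = LazyIndex(manifest_csv, lazy_log)
print(len(lazy_log))  # 0: opening touched only the catalog
first = next(lazy.signatures())
print(len(lazy_log))  # 1: one sketch loaded, on demand
```

In the lazy case a consumer that processes and discards one sketch at a time never needs to hold the whole collection, which is the property the timings above point to.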

other thoughts

This is another reason to use .zip files to store sketches, instead of sig.gz files; sig collect will need to load the actual sketches in sig.gz files in order to build the manifest, while the manifest is already available in .zip files.
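
A rough way to check whether a given zip file already carries a manifest is to look at its member list. The sketch below assumes the embedded manifest is stored under the name `SOURMASH-MANIFEST.csv`; that name is an assumption here and should be checked against your own files. The helper `has_embedded_manifest` is hypothetical, and the demo runs on a throwaway in-memory zip rather than a real sketch collection.

```python
import io
import zipfile

# Assumed member name for the embedded manifest; verify against a real file.
MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def has_embedded_manifest(zip_source):
    """Return True if the zip lists a manifest member, without reading any sketches."""
    with zipfile.ZipFile(zip_source) as zf:
        return MANIFEST_NAME in zf.namelist()


# Demo: build a placeholder zip in memory and inspect it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(MANIFEST_NAME, "placeholder\n")
    zf.writestr("signatures/example.sig.gz", b"")
buf.seek(0)

print(has_embedded_manifest(buf))
```

Listing the members reads only the zip's central directory, which is why a catalog stored this way is cheap to get at compared with opening each sig.gz file.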

tl;dr

  • if you have a bunch of big metagenomes to search using (e.g.) mgmanysearch,
  • and you want to make them into a list to search,
  • store them in zip files,
  • and use sig collect to build a manifest across some or all of them,
  • and then use those manifests.
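
The recipe above can be scripted. This sketch only assembles the `sourmash sig collect ... -o mf.csv -F csv` command line used earlier in this issue; the helper name `sig_collect_command` and the example filenames are hypothetical, and actually running the command requires sourmash to be installed.

```python
import shlex


def sig_collect_command(sketch_files, manifest_out="mf.csv"):
    """Build the argv for `sourmash sig collect <files> -o <manifest> -F csv`."""
    if not sketch_files:
        raise ValueError("need at least one sketch file")
    return ["sourmash", "sig", "collect", *sketch_files,
            "-o", manifest_out, "-F", "csv"]


# Hypothetical zip files holding the metagenome sketches.
cmd = sig_collect_command(["sample1.zip", "sample2.zip"])
print(shlex.join(cmd))

# To run it for real:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The resulting mf.csv is then what gets passed to the search tool in place of a pathlist.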

TODO: verify that sig collect loads things on the command line progressively 😅

Related issues:

@ctb ctb changed the title how to support streaming load with standard sourmash functions how to support streaming load with standard sourmash functions: standalone manifests Feb 22, 2024
@ctb ctb changed the title how to support streaming load with standard sourmash functions: standalone manifests how to support lazy/streaming load with standard sourmash functions: standalone manifests Mar 5, 2024
@ctb ctb closed this as completed in #3027 Mar 20, 2024
ctb added a commit that referenced this issue Mar 20, 2024
This PR:
* fixes a minor nit in `sourmash sig collect` output where it said
"loaded 0 signatures"
* updates a lot of the documentation around standalone manifests to
encourage their use
* in tandem, modifies docs to discourage loading from
pathlists/from-files and directory hierarchies

TODO:
- [x] look at TODO item re directories in sig collect
- [x] think about adding
#3023 information into
docs about lazy loading; maybe in the advanced databases document?
- [x] update `sig manifest` docs to point out that they do not generate
standalone manifests
- [x] revisit branchwater plugin documentation, to either make issues
or make changes
- [x] update `sig check` and `sig collect` to tell people to expand
their paths ref #3039
- [x] update docs more to recommend against pathlists and directories
per #3040

Related issues:
* sourmash-bio/sourmash_plugin_branchwater#235
* Fixes #3048
* Fixes #3009 by
recommending `sig collect` and `sig check` instead of `sig manifest` for
making standalone manifests
* #3053
* Fixes #3023
* Fixes #3039
* Fixes #3040

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tessa Pierce Ward <[email protected]>