parallelizing SRA search via snakemake #1664

Open
ctb opened this issue Jul 12, 2021 · 7 comments

@ctb
Contributor

ctb commented Jul 12, 2021

So I did a thing... https://github.com/ctb/2021-sourmash-greymake

Building off some conversations with @bluegenes, and inspired a bit by the work on manifests of manifests #1652, I roughed out a simple system for search parallelism in Python. Basically a (very) poor version of greyhound #1226 :).

The basic idea is:

  • take a large collection of signatures, e.g. wort-sra
  • zip them into many (but not too many) collections of, say, 100-10000 signatures each => many large zip files
  • build a manifest of manifests across this entire collection (handled by 2021-sourmash-mom)
  • for any given set of search parameters (k-mer size, moltype, picklist), create picklists for each zip file and save them (handled by mom-select-to-picklists.py)
  • then, use snakemake (in the 2021-sourmash-greymake repo) to run one search process per zip file.
  • combine search results ex post facto.
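
As a rough illustration of the steps above (not the actual greymake code), here is a minimal Python sketch of batching signature files into zip collections, running one search task per zip, and combining the hits afterwards. All function names, the `batch-NNNNN.zip` naming, and the name-matching stand-in for the search are hypothetical; a real job would call sourmash against each zip, and snakemake would run the jobs as separate processes rather than as threads.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def chunked(items, size):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_zip_collections(sig_paths, outdir, batch_size=1000):
    """Pack signature files into numbered zip files, `batch_size` per zip."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    zip_paths = []
    for n, batch in enumerate(chunked(sorted(sig_paths), batch_size)):
        zip_path = outdir / f"batch-{n:05d}.zip"  # hypothetical naming scheme
        with zipfile.ZipFile(zip_path, "w") as zf:
            for sig in batch:
                zf.write(sig, arcname=Path(sig).name)
        zip_paths.append(zip_path)
    return zip_paths


def search_one_zip(zip_path, wanted_names):
    """Placeholder search: report zip members whose names are in the picklist.

    Assumption: this stands in for a real sourmash search of one zip file.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return [(str(zip_path), name) for name in zf.namelist()
                if name in wanted_names]


def search_all(zip_paths, wanted_names, workers=4):
    """Run one search task per zip, then concatenate the per-zip results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_zip = pool.map(lambda zp: search_one_zip(zp, wanted_names), zip_paths)
        # Combine results after the fact, preserving zip order.
        return [hit for hits in per_zip for hit in hits]
```

Each per-zip task only needs its own zip and its own picklist, which is what makes the work easy to hand to a workflow manager: the combine step is a plain concatenation.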

Thoughts and questions:

  • the snakemake approach used here can build on top of a Rust-based parallel search of databases, too; this is potentially a way of making use of more than one node at a time.
  • since the Rust side can only read from sig files, MAGsearch can't make use of the zip files, which means we'd be keeping an extra copy of all the data around, an estimated 10 TB.
@ctb
Contributor Author

ctb commented Jul 12, 2021

(current size of wort-sra directory: 9.4T)

@luizirber
Member

Nice! snakemake was the first version of mag search, so everything old is new again =]

(by that time we didn't have good support like now for doing parallel searches, so curious to see where the greymake goes)

@ctb
Contributor Author

ctb commented Jul 12, 2021

> Nice! snakemake was the first version of mag search, so everything old is new again =]

😆

> (by that time we didn't have good support like now for doing parallel searches, so curious to see where the greymake goes)

It's meant to be a code-lite proof of concept on top of other deeper infrastructure, just like 99% of everything I do :). Looking forward to making it more robust and performant by refactoring sourmash underneath!

@ctb
Contributor Author

ctb commented Mar 24, 2022

misc thought: we could pretty easily use the abspath manifests in #1891 to build sra_search siglist files, allowing picklists etc - and do this all in Snakemake.

This mostly provides a way to interconnect our management of all of these millions of files with the same underlying set of catalogs/manifests, which is nice, but not game changing.

Although one nice feature would be to be able to subselect a set of SRA records based on their metadata, as we are beginning to explore over in https://github.com/dib-lab/2022-sra-gather.
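
To make the siglist idea concrete, here is a minimal sketch of turning a manifest CSV into a file listing one signature path per line, filtered by k-mer size and molecule type. It assumes the manifest has `internal_location`, `ksize`, and `moltype` columns, that `internal_location` holds an absolute path, and that the search tool accepts a plain one-path-per-line list; the function name is hypothetical and none of this is taken from the sra_search code.

```python
import csv


def manifest_to_siglist(manifest_csv, siglist_out, ksize, moltype):
    """Write one signature path per line for rows matching ksize and moltype."""
    seen = set()
    paths = []
    with open(manifest_csv, newline="") as fp:
        # Skip leading comment lines (e.g. a version header) before the CSV header.
        rows = csv.DictReader(line for line in fp if not line.startswith("#"))
        for row in rows:
            if int(row["ksize"]) != ksize or row["moltype"] != moltype:
                continue
            path = row["internal_location"]  # assumed to be an absolute path
            if path not in seen:  # one file may hold several matching sketches
                seen.add(path)
                paths.append(path)
    with open(siglist_out, "w") as out:
        out.write("".join(p + "\n" for p in paths))
    return paths
```

A picklist or metadata filter would slot in as one more condition in the row loop, so the same manifest can drive several differently scoped searches.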

@luizirber
Member

Food for thought: snakemake supports Rust rules, so it can also be used to wire the Rust parts of https://github.com/sourmash-bio/sra_search while using all the other nice snakemake features too.

@ctb
Contributor Author

ctb commented Apr 22, 2022

TIL about rust-script https://rust-script.org/

@ctb
Contributor Author

ctb commented Apr 22, 2022

This could be a nice thing to provide for sourmash more generally. Neat stuff!
