
speeding up MAGsearch, and/or reducing disk load for loading many .sig.gz files #1858

Open
ctb opened this issue Mar 2, 2022 · 6 comments

@ctb

ctb commented Mar 2, 2022

From slack:

titus:
hi @luizirber the MAGsearch snakefile that I inherited from Tessa uses 36 threads. Any particular reason not to push that higher?

luizirber:
nope, you can go higher. Just need to get it scheduled into the queue
(and think about how hard you hit the shared storage 😬 )
(I think I was running with 24 locally, but I have the data in HDDs and it tremendously hammered the poor disks)
the shared storage should handle it better (more disks, data spread around), but it does need to pull the 5TB somehow...
(silly index idea: make a linear index of Nodegraphs, one per sig, and only load the data from sigs if they are above a threshold?)
(most of the time is spent on parsing JSON right now...)


A few misc thoughts here -

  • I think the linear index of nodegraphs idea is equivalent to a linear index of single-node SBTs?
  • a different generically useful idea would be to support a mapping between a manifest and a nodegraph, so that if a nodegraph has sufficient containment, all the attached manifest signatures are loaded and searched.
  • the challenge with using rayon and rust in the sra search code is that you need whatever file formats there are to be loadable in rust; parallelizing SRA search via snakemake #1664 is a way of doing this all in Python but is probably not that fast.
  • right now, the SRA signature files contain sketches for ksizes 21, 31, and 51 in a single file, which means the JSON for all three sketches is parsed on every load. this is unnecessary for any given search, and disk I/O would presumably be lessened if we could use a zipfile format to load just the sketch for a single ksize; this is being discussed in what's required for Rust-based search parallelism (aka greyhound)? #1752.
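A rough sketch of that last point, assuming a hypothetical per-ksize zip layout (the `sketch.k{ksize}.json` member name is invented for illustration, not an agreed-on format): only the requested member is decompressed, so the other two sketches never reach the JSON parser.

```python
import io
import json
import zipfile

# Build a toy .sig.zip with one JSON member per ksize (hypothetical layout).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for ksize in (21, 31, 51):
        zf.writestr(f"sketch.k{ksize}.json",
                    json.dumps({"ksize": ksize, "mins": []}))

def load_ksize(zip_bytes: bytes, ksize: int) -> dict:
    """Decompress and parse only the member for the requested ksize."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return json.loads(zf.read(f"sketch.k{ksize}.json"))
```

Since zip members can be read individually, a single-ksize search pays for one member's decompression instead of parsing all three sketches.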
@camillescott

re: your last point, JSON can be stream-parsed with libraries like ijson. I do this in goetia to avoid keeping loads of sketches in memory: https://github.com/camillescott/goetia/blob/master/goetia/signatures.py#L119

note that efficiency gains here would be dependent on the order of the sketches in the signature.
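To make the stream-and-stop idea concrete, here is a minimal stdlib-only sketch (a real implementation would use ijson as suggested above; the JSON layout and the `iter_sketches` helper are simplified stand-ins, not the actual sourmash .sig format):

```python
import json

def iter_sketches(sig_json: str):
    """Yield sketch dicts one at a time from the "signatures" array,
    parsing lazily instead of materializing the whole document.
    (Illustrative: assumes the first "signatures" key is the sketch array.)"""
    dec = json.JSONDecoder()
    idx = sig_json.index('"signatures"')
    idx = sig_json.index("[", idx) + 1          # step inside the array
    while True:
        while sig_json[idx] in " \t\r\n,":      # skip separators
            idx += 1
        if sig_json[idx] == "]":                # end of the array
            return
        sketch, idx = dec.raw_decode(sig_json, idx)
        yield sketch

def first_sketch_with_ksize(sig_json: str, ksize: int):
    for sketch in iter_sketches(sig_json):
        if sketch.get("ksize") == ksize:
            return sketch                        # later sketches never parsed
    return None
```

As noted above, the win depends on sketch order: stopping early only helps if the desired ksize appears early in the array.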

@ctb

ctb commented Mar 4, 2022

random thought - I wonder if SqliteIndex #1808 would be an effective way to store and search SRA sketches?

@luizirber

random thought - I wonder if SqliteIndex #1808 would be an effective way to store and search SRA sketches?

I think it might work for each sig (a sqlite DB per sig), but for all sketches I think we will hit limitations in sqlite (more info: https://www.sqlite.org/limits.html)

But the current compressed JSON approach is also not scaling very well when multiple sketches are present per signature, especially if we store HLL or other sketch types in the sig...
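For a sense of what the sqlite route could look like, here is a toy per-hash table with a single containment query (the schema and names are invented for illustration; the parameter-count issue in the comment is the sort of ceiling the limits page documents, e.g. SQLITE_MAX_VARIABLE_NUMBER):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hashes (sig_id TEXT, hash INTEGER)")
con.execute("CREATE INDEX idx_hash ON hashes (hash)")
# Two toy sketches; real sketches would hold many thousands of hashes.
con.executemany("INSERT INTO hashes VALUES (?, ?)",
                [("SRR_a", h) for h in (1, 2, 3, 4)]
                + [("SRR_b", h) for h in (3, 4, 5, 6)])

def containment(query_hashes):
    """Estimate containment of the query in every stored sketch at once.
    Each query hash is a bound parameter, so very large queries would hit
    SQLITE_MAX_VARIABLE_NUMBER and need batching or a temp table instead."""
    marks = ",".join("?" * len(query_hashes))
    rows = con.execute(
        f"SELECT sig_id, COUNT(*) FROM hashes WHERE hash IN ({marks}) "
        "GROUP BY sig_id", list(query_hashes)).fetchall()
    return {sig: n / len(query_hashes) for sig, n in rows}
```

One index lookup per query hash replaces JSON-parsing every signature, which is where the load-time win would come from.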

@luizirber

re: your last point, JSON can be stream-parsed with libraries like ijson. I do this in goetia to avoid keeping loads of sketches in memory: https://github.com/camillescott/goetia/blob/master/goetia/signatures.py#L119

note that efficiency gains here would be dependent on the order of the sketches in the signature.

(sourmash used ijson before the oxidation, what is old is new again =])

@luizirber

  • a different generically useful idea would be to support a mapping between a manifest and an nodegraph, so that if a nodegraph has sufficient containment, all the attached manifest signatures are loaded and searched.

This is what I did in sourmash-bio/sra_search@617f246; I'm currently building a Nodegraph cache for all 600k+ SRA metagenome sigs (trending toward ~30 GB in the end, with a 1 MB Nodegraph per SRA dataset)
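A toy version of that prefilter idea, with a tiny pure-Python Bloom filter standing in for a real Nodegraph (sizes, hash counts, and the 50% threshold are all illustrative, not the sra_search settings):

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter standing in for a sourmash Nodegraph."""
    def __init__(self, nbits=8192, nhashes=2):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, item):
        # Derive nhashes bit positions from seeded blake2b digests.
        for seed in range(self.nhashes):
            h = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def passes_prefilter(bloom, query_hashes, threshold=0.5):
    """Only load and search the full signature if the (over-)estimated
    containment in the Bloom filter clears the threshold."""
    found = sum(1 for h in query_hashes if h in bloom)
    return found / len(query_hashes) >= threshold
```

Because a Bloom filter never produces false negatives, a signature that would truly pass the threshold is never skipped; false positives just cost one unnecessary full load.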

@ctb

ctb commented Mar 5, 2022

random thought - I wonder if SqliteIndex #1808 would be an effective way to store and search SRA sketches?

I think it might work for each sig (a sqlite DB per sig), but for all sketches I think we will hit limitations in sqlite (more info: https://www.sqlite.org/limits.html)

yes, each sig, or maybe all three ksizes in one. load time is really minimal with SqliteIndex and the query time seems good. I'll look into it.
