
speeding up MAGsearch, and/or reducing disk load for loading many .sig.gz files #1858

Open
ctb opened this issue Mar 2, 2022 · 6 comments

@ctb

ctb commented Mar 2, 2022

From slack:

titus:
hi @luizirber the MAGsearch snakefile that I inherited from Tessa uses 36 threads. Any particular reason not to push that higher?

luizirber:
nope, you can go higher. Just need to get it scheduled into the queue
(and think about how hard you hit the shared storage 😬 )
(I think I was running with 24 locally, but I have the data in HDDs and it tremendously hammered the poor disks)
the shared storage should handle it better (more disks, data spread around), but it does need to pull the 5TB somehow...
(silly index idea: make a linear index of Nodegraphs, one per sig, and only load the data from sigs if they are above a threshold?)
(most of the time is spent on parsing JSON right now...)


A few misc thoughts here -

  • I think the linear index of nodegraphs idea is equivalent to a linear index of single-node SBTs?
  • a different generically useful idea would be to support a mapping between a manifest and a nodegraph, so that if a nodegraph has sufficient containment, all the attached manifest signatures are loaded and searched.
  • the challenge with using rayon and rust in the sra search code is that you need whatever file formats there are to be loadable in rust; parallelizing SRA search via snakemake #1664 is a way of doing this all in Python but is probably not that fast.
  • right now, the SRA signature files contain sketches for ksizes 21, 31, and 51 in a single file, which means the JSON for all three sketches is parsed on every load. this is unnecessary for any given search, and disk I/O would presumably be lessened if we could use a zipfile format to load just the sketch for a single ksize; this is being discussed in what's required for Rust-based search parallelism (aka greyhound)? #1752.
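A rough sketch of that last point, assuming a hypothetical per-ksize zip layout (the `sketch.k{ksize}.json` member name is invented for illustration, not an agreed-on format): only the requested member is decompressed, so the other two sketches never reach the JSON parser.

```python
import io
import json
import zipfile

# Build a toy .sig.zip with one JSON member per ksize (hypothetical layout).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for ksize in (21, 31, 51):
        zf.writestr(f"sketch.k{ksize}.json",
                    json.dumps({"ksize": ksize, "mins": []}))

def load_ksize(zip_bytes: bytes, ksize: int) -> dict:
    """Decompress and parse only the member for the requested ksize."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return json.loads(zf.read(f"sketch.k{ksize}.json"))
```

Since zip members can be read individually, a single-ksize search pays for one member's decompression instead of parsing all three sketches.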
@camillescott

re: your last point, JSON can be stream-parsed with libraries like ijson. I do this in goetia to avoid keeping loads of sketches in memory: https://github.com/camillescott/goetia/blob/master/goetia/signatures.py#L119

note that efficiency gains here would be dependent on the order of the sketches in the signature.
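To make the stream-and-stop idea concrete, here is a minimal stdlib-only sketch (a real implementation would use ijson as suggested above; the JSON layout and the `iter_sketches` helper are simplified stand-ins, not the actual sourmash .sig format):

```python
import json

def iter_sketches(sig_json: str):
    """Yield sketch dicts one at a time from the "signatures" array,
    parsing lazily instead of materializing the whole document.
    (Illustrative: assumes the first "signatures" key is the sketch array.)"""
    dec = json.JSONDecoder()
    idx = sig_json.index('"signatures"')
    idx = sig_json.index("[", idx) + 1          # step inside the array
    while True:
        while sig_json[idx] in " \t\r\n,":      # skip separators
            idx += 1
        if sig_json[idx] == "]":                # end of the array
            return
        sketch, idx = dec.raw_decode(sig_json, idx)
        yield sketch

def first_sketch_with_ksize(sig_json: str, ksize: int):
    for sketch in iter_sketches(sig_json):
        if sketch.get("ksize") == ksize:
            return sketch                        # later sketches never parsed
    return None
```

As noted above, the win depends on sketch order: stopping early only helps if the desired ksize appears early in the array.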

@ctb

ctb commented Mar 4, 2022

random thought - I wonder if SqliteIndex #1808 would be an effective way to store and search SRA sketches?

@luizirber

random thought - I wonder if SqliteIndex #1808 would be an effective way to store and search SRA sketches?

I think it might work for each sig (a sqlite DB per sig), but for all sketches I think we will hit limitations in sqlite (more info: https://www.sqlite.org/limits.html)

But the current compressed JSON approach is also not scaling very well when multiple sketches are present per signature, especially if we store HLL or other sketch types in the sig...
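For a sense of what the sqlite route could look like, here is a toy per-hash table with a single containment query (the schema and names are invented for illustration; the parameter-count issue in the comment is the sort of ceiling the limits page documents, e.g. SQLITE_MAX_VARIABLE_NUMBER):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hashes (sig_id TEXT, hash INTEGER)")
con.execute("CREATE INDEX idx_hash ON hashes (hash)")
# Two toy sketches; real sketches would hold many thousands of hashes.
con.executemany("INSERT INTO hashes VALUES (?, ?)",
                [("SRR_a", h) for h in (1, 2, 3, 4)]
                + [("SRR_b", h) for h in (3, 4, 5, 6)])

def containment(query_hashes):
    """Estimate containment of the query in every stored sketch at once.
    Each query hash is a bound parameter, so very large queries would hit
    SQLITE_MAX_VARIABLE_NUMBER and need batching or a temp table instead."""
    marks = ",".join("?" * len(query_hashes))
    rows = con.execute(
        f"SELECT sig_id, COUNT(*) FROM hashes WHERE hash IN ({marks}) "
        "GROUP BY sig_id", list(query_hashes)).fetchall()
    return {sig: n / len(query_hashes) for sig, n in rows}
```

One index lookup per query hash replaces JSON-parsing every signature, which is where the load-time win would come from.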

@luizirber

re: your last point, JSON can be stream-parsed with libraries like ijson. I do this in goetia to avoid keeping loads of sketches in memory: https://github.com/camillescott/goetia/blob/master/goetia/signatures.py#L119

note that efficiency gains here would be dependent on the order of the sketches in the signature.

(sourmash used ijson before the oxidation, what is old is new again =])

@luizirber

  • a different generically useful idea would be to support a mapping between a manifest and an nodegraph, so that if a nodegraph has sufficient containment, all the attached manifest signatures are loaded and searched.

This is what I did in sourmash-bio/sra_search@617f246; I'm currently building a Nodegraph cache for all 600k+ SRA metagenome sigs (trending toward ~30 GB in the end, with a 1 MB Nodegraph per SRA dataset)
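A toy version of that prefilter idea, with a tiny pure-Python Bloom filter standing in for a real Nodegraph (sizes, hash counts, and the 50% threshold are all illustrative, not the sra_search settings):

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter standing in for a sourmash Nodegraph."""
    def __init__(self, nbits=8192, nhashes=2):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, item):
        # Derive nhashes bit positions from seeded blake2b digests.
        for seed in range(self.nhashes):
            h = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def passes_prefilter(bloom, query_hashes, threshold=0.5):
    """Only load and search the full signature if the (over-)estimated
    containment in the Bloom filter clears the threshold."""
    found = sum(1 for h in query_hashes if h in bloom)
    return found / len(query_hashes) >= threshold
```

Because a Bloom filter never produces false negatives, a signature that would truly pass the threshold is never skipped; false positives just cost one unnecessary full load.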

@ctb

ctb commented Mar 5, 2022

random thought - I wonder if SqliteIndex #1808 would be an effective way to store and search SRA sketches?

I think it might work for each sig (a sqlite DB per sig), but for all sketches I think we will hit limitations in sqlite (more info: https://www.sqlite.org/limits.html)

yes, each sig, or maybe all three ksizes in one. load time is really minimal with SqliteIndex and the query time seems good. I'll look into it.
