speeding up MAGsearch, and/or reducing disk load for loading many .sig.gz files #1858
re: your last point, JSON can be stream-parsed with libraries like ijson. Note that efficiency gains here would depend on the order of the sketches in the signature.
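To illustrate the early-exit benefit of stream parsing: ijson yields items as they are decoded, so a search can stop once the sketch it needs has been seen. As a self-contained stand-in (ijson is third-party), this sketch gets the same incremental behavior from the stdlib with `json.JSONDecoder.raw_decode`, pulling objects out of a top-level array one at a time. The two-sketch document here is hypothetical, not a real sourmash signature.

```python
import json

def iter_top_level_objects(text):
    """Yield the elements of a top-level JSON array one at a time.

    raw_decode parses a single value starting at an offset, so later
    elements are never decoded if the caller breaks out early (ijson
    does the same without holding the whole text in memory)."""
    dec = json.JSONDecoder()
    i = text.index("[") + 1
    while True:
        # skip whitespace and the commas between elements
        while i < len(text) and text[i] in " \t\r\n,":
            i += 1
        if i >= len(text) or text[i] == "]":
            return
        obj, i = dec.raw_decode(text, i)
        yield obj

# hypothetical signature file with two sketches
doc = '[{"ksize": 21}, {"ksize": 31}]'
found = None
for sketch in iter_top_level_objects(doc):
    if sketch["ksize"] == 21:
        found = sketch
        break  # early exit: the ksize=31 sketch is never decoded
```

If the wanted sketch is serialized first, only a fraction of the document is ever parsed — which is why the ordering of sketches in the signature matters.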
random thought - I wonder if SqliteIndex #1808 would be an effective way to store and search SRA sketches?
I think it might work for each sig (a sqlite DB per sig), but for all sketches together I think we will hit limitations in sqlite (more info: https://www.sqlite.org/limits.html). But the current compressed-JSON approach is also not scaling very well when multiple sketches are present per signature, especially if we store HLL or other sketch types in the sig...
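For a sense of what a sqlite-backed sketch store buys you: hashes become rows, and a containment-style lookup becomes an indexed `IN` query instead of parsing JSON. The schema below is a hypothetical minimal layout for illustration only, not the actual SqliteIndex schema from #1808.

```python
import sqlite3

# Hypothetical minimal schema, just to show the shape of the idea:
# one row per (sketch, hash), with an index on the hash value.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sketches (id INTEGER PRIMARY KEY, name TEXT, ksize INTEGER)")
con.execute("CREATE TABLE hashes (sketch_id INTEGER, hashval INTEGER)")
con.execute("CREATE INDEX idx_hashval ON hashes (hashval)")

con.execute("INSERT INTO sketches VALUES (1, 'SRR000001', 31)")
con.executemany("INSERT INTO hashes VALUES (1, ?)",
                [(h,) for h in (11, 42, 77)])

# containment-style query: how many query hashes appear in each sketch?
query_hashes = (42, 77, 99)
placeholders = ",".join("?" * len(query_hashes))
rows = con.execute(
    f"SELECT sketch_id, COUNT(*) FROM hashes "
    f"WHERE hashval IN ({placeholders}) GROUP BY sketch_id",
    query_hashes,
).fetchall()
# rows -> [(1, 2)]: sketch 1 contains 2 of the 3 query hashes
```

The scaling worry above is real, though: with 600k+ datasets and potentially millions of hashes each, the row count and per-database limits in sqlite become the constraint, which is why a DB per sig is more plausible than one DB for everything.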
(sourmash used ijson before the oxidation, what is old is new again =])
This is what I did in sourmash-bio/sra_search@617f246, currently building a Nodegraph cache for all 600k+ SRA metagenome sigs (trending toward ~30GB in the end, with a 1MB Nodegraph per SRA dataset).
yes, each sig, or maybe all three ksizes in one. load time is really minimal with
From slack:
titus:
hi @luizirber the MAGsearch snakefile that I inherited from Tessa uses 36 threads. Any particular reason not to push that higher?
luizirber:
nope, you can go higher. Just need to get it scheduled into the queue
(and think about how hard you hit the shared storage 😬 )
(I think I was running with 24 locally, but I have the data in HDDs and it tremendously hammered the poor disks)
the shared storage should handle it better (more disks, data spread around), but it does need to pull the 5TB somehow...
(silly index idea: make a linear index of Nodegraphs, one per sig, and only load the data from sigs if they are above a threshold?)
(most of the time is spent on parsing JSON right now...)
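The "linear index of Nodegraphs" idea above is a prefilter: keep a small, cheap presence structure per sig in memory, and only pay the JSON-parsing cost for sigs whose estimated containment clears the threshold. A Nodegraph is khmer's Bloom-filter-style structure; the toy Bloom filter below is a hypothetical stand-in used only to show the prefilter logic, with `TinyBloom` and `maybe_load` as made-up names.

```python
import hashlib

class TinyBloom:
    """Toy stand-in for a Nodegraph: a bit array with k hash functions.
    No false negatives, small false-positive rate."""

    def __init__(self, size=8192, nhashes=3):
        self.size = size
        self.nhashes = nhashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        # derive k positions from seeded blake2b digests
        for seed in range(self.nhashes):
            h = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "little") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def maybe_load(query_hashes, bloom, threshold=0.5):
    """Prefilter: estimate containment against the in-memory filter;
    only if it clears the threshold do we parse the full .sig.gz."""
    hits = sum(1 for h in query_hashes if h in bloom)
    return hits / len(query_hashes) >= threshold

# usage sketch: one filter per sig, built once, scanned linearly
bloom = TinyBloom()
for h in (11, 42, 77):          # hashes from one (hypothetical) sig
    bloom.add(h)
decision = maybe_load((42, 77, 99), bloom)  # 2/3 >= 0.5 -> load the sig
```

Since Bloom filters never produce false negatives, the prefilter can only over-admit sigs, never skip a true match above the threshold — which matches the tradeoff implied by the ~30GB cache: spend RAM on cheap filters to avoid re-parsing 5TB of JSON.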
A few misc thoughts here -