rework database construction and release process to use manifests #1652

Closed
ctb opened this issue Jul 3, 2021 · 10 comments · Fixed by #1907

Comments

@ctb
Contributor

ctb commented Jul 3, 2021

Our latest database release is pretty nice, but life is also getting much more complicated ;). The process @bluegenes (mostly) and I are using to build/release GTDB looks something like this:

  • get latest GTDB release spreadsheet
  • find DNA signatures in wort, as available (tessa)
  • build protein signatures as needed, and also build DNA signatures that aren't in wort (tessa)
  • construct new .zip collections from those signatures (tessa)
  • build SBTs and LCA databases, along with catalogs

With sourmash 4.2.0, we can now start using picklists with sourmash sig cat to construct the zipfile collections, and manifests are automatically produced from that point on. Future improvements such as lazy signature loading using manifests/manifests-of-manifests can also make the actual disk I/O etc much simpler when selecting from large collections.


Separately, @luizirber has a different database building process that builds the "genbank microbial" databases, based (I think) mostly on the wort output as well as the assembly_report file.

This is all getting to be a lot to manage, and partly as a result we haven't produced a new genbank microbial database in a while.

I chatted briefly with tessa about the idea of starting to use manifests as a starting point for building databases.

The basic idea goes something like this -

  • produce manifests for existing directories of signatures (e.g. all of wort, protein databases, custom "patch update" directories, etc)
  • build some kind of custom script that takes in a list of directories + manifests for files underneath them and builds a zip file
  • have this script do some pre-scanning so that we can say "we want " and it will quickly tell us which signatures are missing

@ctb
Contributor Author

ctb commented Jul 5, 2021

I've been working on this on and off, as part of #1654 and #1619 and the scripts in #1641 (comment). I think the ManifestOfManifests that I prototyped in #1619 might be the right approach - we would have it manage a sqlite database that contained the file locations and their last-scanned mtime, and then update that as we go.

I attach an early draft script that does the file finding and the manifest reloading/updating, but it doesn't actually update the database.

update-sqlite3-mom-dirs.py.txt

Other than fleshing out the ManifestOfManifests class, I think the main feature to add here would be something that let you add manifests from a small list of files to the database, without having to rescan the entire directory, which is going to be quite slow for large collections...
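A minimal sketch of the mtime-tracking idea, to make the two operations concrete: a full directory rescan, and a cheap update from a short list of files. The table name `files`, its columns, and the function names are assumptions for illustration only; this is not the attached draft script and not the `ManifestOfManifests` class.

```python
import os
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a table of file locations and last-scanned mtimes.

    The schema is hypothetical, chosen only for this sketch.
    """
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(location TEXT PRIMARY KEY, mtime REAL NOT NULL)"
    )
    return db


def find_changed(db, dirpath):
    """Walk dirpath and return (new, updated) file lists relative to the table."""
    known = dict(db.execute("SELECT location, mtime FROM files"))
    new, updated = [], []
    for root, _dirs, names in os.walk(dirpath):
        for name in names:
            loc = os.path.join(root, name)
            if loc not in known:
                new.append(loc)
            elif os.path.getmtime(loc) > known[loc]:
                updated.append(loc)
    return sorted(new), sorted(updated)


def record(db, locations):
    """Store the current mtime for just these files, without a directory walk."""
    rows = [(loc, os.path.getmtime(loc)) for loc in locations]
    db.executemany(
        "INSERT OR REPLACE INTO files (location, mtime) VALUES (?, ?)", rows
    )
    db.commit()
```

`find_changed` is the slow path that touches every file; `record` is the fast path for a small, known list of files.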

@ctb
Contributor Author

ctb commented Jul 6, 2021

(come to think of it, this is an excellent situation where plugins might come in handy. Most people using sourmash will probably not be constructing databases with hundreds of thousands of files!)

@ctb
Contributor Author

ctb commented Jul 6, 2021

In chatting with @bluegenes, we broke it down into two different issues:

  • scanning large directories for new files
  • scanning existing files to see if they've been updated

The latter is pretty straightforward, but the former is going to be pretty slow.

Tessa suggested that we chunk the signatures into groups of (let's say) 20k, and store each chunk in a zipfile or directory. That seems pretty workable - 50 such files would be a million sigs! - but we'd need some infrastructure around that, too...
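The chunking itself is simple to sketch. The following only plans which signature file would go into which zipfile; the `chunk-NNNN.zip` naming and the function names are made up for illustration, and nothing is written to disk.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def plan_chunks(sig_paths, size=20_000, prefix="chunk"):
    """Map a (hypothetical) zip filename to the signature paths it would hold."""
    return {
        f"{prefix}-{i:04d}.zip": chunk
        for i, chunk in enumerate(chunked(sorted(sig_paths), size))
    }
```

Sorting the paths first keeps the plan stable between runs, so a given signature lands in the same chunk as long as the input list does not change.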

@ctb
Contributor Author

ctb commented Jul 7, 2021

More conversations with @bluegenes - I think our first attempt to improve database construction will end with:

  • collecting wort files into chunks of ~10k signature files, in zip files;
  • building a manifest-of-manifests (MoMs) for wort based on those files, and keeping it updated;
  • building MoMs for ad hoc additions
  • writing a script that, given a list of MoMs and a picklist of identifiers, tells you (a) whether all the identifiers are found and then if they are, (b) gives you the MoMs and indices that load those locations and lets you do whatever (e.g. create a specific zipfile collection).
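A rough sketch of that last step, under loud assumptions: each MoM is a sqlite database with a `manifest` table holding `name`, `container`, and `internal_location` columns (a hypothetical schema), and identifiers are matched on the first word of the signature name with any `.version` suffix removed, applied to both sides. None of this is taken from a real script.

```python
import sqlite3
from collections import defaultdict


def ident_prefix(name):
    """First space-delimited word of `name`, with any '.version' suffix dropped."""
    return name.split(" ", 1)[0].split(".", 1)[0]


def locate(dbs, picklist):
    """Return (found, missing) for a picklist across several MoM databases.

    found   maps container file -> list of internal locations to load
    missing is the sorted list of normalized picklist values with no match
    """
    wanted = {ident_prefix(value) for value in picklist}
    found = defaultdict(list)
    seen = set()
    for db in dbs:
        query = "SELECT name, container, internal_location FROM manifest"
        for name, container, internal in db.execute(query):
            key = ident_prefix(name)
            if key in wanted:
                seen.add(key)
                found[container].append(internal)
    return dict(found), sorted(wanted - seen)
```

An empty `missing` list answers (a); the `found` mapping is the input for (b), since it groups locations by the file that has to be opened.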

@ctb
Contributor Author

ctb commented Jul 9, 2021

🎉 well, that was easy! https://github.com/ctb/2021-sourmash-mom

@ctb
Contributor Author

ctb commented Jul 10, 2021

OK, got it all working, it seems? Some of the output numbers are incorrect so I'll fix that :)

tl;dr ~1 minute and ~1 GB to get my grubby little paws on all the GTDB genome signatures for RS202.

I loaded all of the signatures from /group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs into 120 chunked zipfiles, each containing 10k accessions/30k signatures, under 2021-sourmash-mom/wort-genomes.zips/. (292 GB currently.)

Using the Manifest Of Manifests codebase I then created manifests-of-manifests (MoMs or moms) containing the combined manifests of all the zip files, as well as a (much) smaller collection of signatures that @bluegenes created to round out things that wort didn't have.

% ./create-mom.py wort-genomes.zips.db wort-genomes.zips/
% ./create-mom.py tessa.db tessa.sigs/

This produced two sqlite databases that are not terribly large:

-rw-r--r-- 1 ctbrown ctbrown  44K Jul 10 07:22 tessa.db
-r--r--r-- 1 ctbrown ctbrown 804M Jul 10 07:13 wort-genomes.zips.db

and then I grabbed the latest set of GTDB accessions:

% gunzip -c /group/ctbrowngrp/gtdb/gtdb-rs202.metadata.csv.gz | csvtk cut -f accession | cut -c 4- > gtdb-rs202.idents.csv

(and then had to unmangle the column header, but whatever).
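The header gets mangled because `cut -c 4-` drops the first three characters of every line, header included. A small Python sketch that strips the prefix from data rows only is below; it assumes a comma-separated file with an `accession` column and a three-character prefix on each accession, and the function name and defaults are illustrative.

```python
import csv
import gzip


def write_idents(metadata_gz, out_csv, column="accession", prefix_len=3):
    """Copy one column to a new CSV, trimming a fixed-length prefix from each value.

    The header row is written unchanged, so no cleanup is needed afterwards.
    Returns the number of data rows written.
    """
    n = 0
    with gzip.open(metadata_gz, "rt", newline="") as src, \
            open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow([column])
        for row in reader:
            writer.writerow([row[column][prefix_len:]])
            n += 1
    return n
```

Called as `write_idents("gtdb-rs202.metadata.csv.gz", "gtdb-rs202.idents.csv")`, this would produce a picklist file whose header is still `accession`.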

Finally, I asked for all matching signatures across all mom databases (in this case, I didn't actually extract them, as that would have taken an hour or two :).

% /usr/bin/time -v ./mom-extract-sigs.py --picklist gtdb-rs202.idents.csv:accession:identprefix wort-genomes.zips.db tessa.db
picking column 'accession' of type 'identprefix' from 'gtdb-rs202.idents.csv'
loaded 258406 distinct values into picklist.
Loading MoM sqlite database wort-genomes.zips.db...
wort-genomes.zips.db contains 3617967 rows total. Selecting ksize/moltype/picklist...
...776310 matches remaining for 'wort-genomes.zips.db' (50.6s)
Loading MoM sqlite database tessa.db...
tessa.db contains 201 rows total. Selecting ksize/moltype/picklist...
...201 matches remaining for 'tessa.db' (0.0s)
---
loaded 776511 rows total from 2 databases.
for given picklist, found 258406 matches to 258406 distinct values
There are 201 distinct rows across all MoMs.
No output options; exiting.
        Command being timed: "./mom-extract-sigs.py --picklist gtdb-rs202.idents.csv:accession:identprefix wort-genomes.zips.db tessa.db"
        User time (seconds): 42.93
        System time (seconds): 12.92
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:55.57
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1122912
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 640557
        Voluntary context switches: 1379
        Involuntary context switches: 309667
        Swaps: 0
        File system inputs: 3292944
        File system outputs: 1894776
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ctb
Contributor Author

ctb commented Jul 12, 2021

One mildly neat realization coming out of #1664 is that for this kind of manifest stuff, the size of the underlying data doesn't matter - we have about the same number of signatures for the SRA as we do for genbank genomes, so all of the manifest stuff will work just fine. It's only the actual search that will be slower for the SRA data because it's so much bigger than the genbank genomes.

@ctb
Contributor Author

ctb commented Jul 14, 2021

trying out using the NCBI assembly_summary.txt files, ref sourmash-bio/databases#7, it all seems pretty straightforward --

csvtk cut -f 1 -t assembly_summary.txt > idents.csv
...
./mom-extract-sigs.py -k 31 --dna --picklist ../genbank_build/idents.csv:ident:ident wort-genomes.zips.db  \
    --save-unmatched=../genbank_build/xxx.csv

which gave

picking column 'ident' of type 'ident' from '../genbank_build/idents.csv'
loaded 390 distinct values into picklist.
Loading MoM sqlite database wort-genomes.zips.db...
wort-genomes.zips.db contains 3617967 rows total. Running select......
...355 matches remaining for 'wort-genomes.zips.db' (12.1s)
---
loaded 355 rows total from 1 databases.
Wrote 35 unmatched values from picklist to '../genbank_build/xxx.csv'
for given picklist, found 355 matches to 390 distinct values
WARNING: 35 missing picklist values.
There are 355 distinct rows across all MoMs.
No output options; exiting.

note the added feature,

Wrote 35 unmatched values from picklist to '../genbank_build/xxx.csv'

which will be important for automation :)

@ctb
Contributor Author

ctb commented Jul 14, 2021

all of genbank => 88k missing signatures from wort, it seems.

% ./mom-extract-sigs.py --picklist ../genbank_build/gb.idents.csv:ident:identprefix -k 31 --save-unmatched ../genbank_build/gb.nomatch.csv wort-genomes.zips.db
picking column 'ident' of type 'identprefix' from '../genbank_build/gb.idents.csv'
loaded 1033057 distinct values into picklist.
Loading MoM sqlite database wort-genomes.zips.db...
wort-genomes.zips.db contains 3617967 rows total. Running select......
...947203 matches remaining for 'wort-genomes.zips.db' (13.8s)
---
loaded 947203 rows total from 1 databases.
Wrote 88439 unmatched values from picklist to '../genbank_build/gb.nomatch.csv'
for given picklist, found 944618 matches to 1033057 distinct values
WARNING: 88439 missing picklist values.
There are 947203 distinct rows across all MoMs.
No output options; exiting.

@ctb
Contributor Author

ctb commented Mar 30, 2022

closed by #1907 which combined with #1891 to make it very straightforward to build databases out of wort!
