[MRG] Building databases from assembly_summary.txt #11

Closed
wants to merge 7 commits into from


luizirber
Member

@luizirber luizirber commented Jul 18, 2020

Fixes #7

  • Using signatures calculated with wort. If a signature is not available, fall back to calculating it by figuring out the download URL and streaming the data.
  • The assembly_summary.txt is downloaded for each GenBank/RefSeq domain and used to create a catalog: a file listing the path where the signature for each accession is located.
  • SBT and LCA index rules read the catalog with --from-file. They currently use an awkward syntax to extract the first signature and pass it to index (passing the others with --from-file), because index requires one signature...
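
The first-signature workaround in the last bullet can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the catalog is assumed to be a plain-text file with one signature path per line, and `split_catalog` is a hypothetical helper name.

```python
from pathlib import Path
import tempfile


def split_catalog(catalog_path):
    """Return (first_signature, remaining_signatures) from a catalog file.

    Assumes one signature path per line; blank lines are ignored.
    """
    lines = Path(catalog_path).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    if not paths:
        raise ValueError(f"catalog {catalog_path} lists no signatures")
    return paths[0], paths[1:]


# Demonstration with a throwaway catalog (the paths are made up).
with tempfile.TemporaryDirectory() as tmp:
    catalog = Path(tmp) / "catalog.txt"
    catalog.write_text("sigs/A.sig\nsigs/B.sig\n\nsigs/C.sig\n")
    first, rest = split_catalog(catalog)
    print(first)  # the signature handed directly to index
    print(rest)   # the signatures that would be listed via --from-file
```

Splitting in one place like this keeps the index rule itself simple; if index accepted a catalog with no positional signature, the helper would no longer be needed.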

TODO

  • generate a taxid4index file...
  • and add to .sbt.zip (so I can use it with gather-to-opal)
  • add back compute rule
  • probably use a checkpoint to calculate missing signatures from the sig_collection dir?
  • LCA index rule
  • generate lineages (need to download NCBI taxonomy too)
  • decide about skipping or using older version of assembly if current data is missing

@luizirber
Member Author

luizirber commented Jul 19, 2020

Decision time! Some accessions in assembly_summary.txt are listed in GenBank but don't have data available yet. For example, https://www.ncbi.nlm.nih.gov/assembly/GCA_013096725.2

What to do in these cases?

  • Use a previous version, if available? For GCA_013096725.2 it would work, for GCA_901007655.1 it wouldn't.
  • Skip it, and then the database reflects "data from assembly_summary.txt that was available for download at a specific date"?
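
The two options can be compared with a small sketch. Everything here is an assumption for illustration: `candidate_accessions` and `resolve` are hypothetical helpers, and the `available` set is a stand-in for a real availability check against NCBI, chosen to mirror the two accessions mentioned above.

```python
import re

# Accession = prefix (GCA/GCF), nine digits, a dot, and a version number.
ACCESSION_RE = re.compile(r"^(GC[AF]_\d{9})\.(\d+)$")


def candidate_accessions(accession, use_previous=True):
    """List accessions to try, newest version first.

    With use_previous=False only the requested version is returned,
    which corresponds to the "skip it" policy.
    """
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"unrecognized accession: {accession}")
    base, version = match.group(1), int(match.group(2))
    if not use_previous:
        return [accession]
    return [f"{base}.{v}" for v in range(version, 0, -1)]


def resolve(accession, is_available, use_previous=True):
    """Return the first candidate that has data, or None to skip the entry."""
    for candidate in candidate_accessions(accession, use_previous):
        if is_available(candidate):
            return candidate
    return None


# Hypothetical availability data: only the older version has sequence data.
available = {"GCA_013096725.1"}

print(resolve("GCA_013096725.2", available.__contains__))
print(resolve("GCA_901007655.1", available.__contains__))
print(resolve("GCA_013096725.2", available.__contains__, use_previous=False))
```

With the fallback enabled the database would contain an accession that differs from the one in assembly_summary.txt, so the resolved accession would probably need to be recorded alongside the requested one.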

@ctb
Contributor

ctb commented Jul 20, 2020

is there a "quick test" target that I can run?

@luizirber
Member Author

snakemake -s Snakefile.assembly -j1 with most of the config.assembly.yml commented out should be fast and test both SBT and LCA creation.

A minimal config.assembly.yml for farm:

domains:
- archaea
sig_store: /group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs
db_ksizes:
- 21
db:
- refseq
bfsize:
- 1e6
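
To see why this config is fast, it helps to count the targets it implies. The sketch below is not taken from Snakefile.assembly; it assumes the SBT outputs are the cross product of the config lists and follow the `{db}-{domain}-x{bfsize}-k{ksize}.sbt.zip` naming seen in the database listings in this thread. `sbt_targets` is a hypothetical helper, and `bfsize` is kept as a string so it appears verbatim in the file name.

```python
from itertools import product

# The minimal config from above, written as a Python dict.
config = {
    "domains": ["archaea"],
    "db_ksizes": [21],
    "db": ["refseq"],
    "bfsize": ["1e6"],
}


def sbt_targets(config):
    """Build one SBT file name per (db, domain, bfsize, ksize) combination."""
    return [
        f"{db}-{domain}-x{bfsize}-k{ksize}.sbt.zip"
        for db, domain, bfsize, ksize in product(
            config["db"], config["domains"], config["bfsize"], config["db_ksizes"]
        )
    ]


targets = sbt_targets(config)
print(targets)
```

Under these assumptions the minimal config yields a single SBT, while a config with 2 databases, 5 domains, 3 Bloom filter sizes and 3 k-sizes would yield 90.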

@luizirber
Member Author

refseq-bacteria databases took ~3 hours and 105GB of RAM to build. Sounds like this could be optimized (but not now).

Starting genbank-bacteria now, which is 3.5x larger than refseq...

@luizirber
Member Author

Databases calculated. They are at ~irber/sourmash_databases/outputs in farm

SBT

total 618G

GenBank

100M genbank-archaea-x1e4-k21.sbt.zip
100M genbank-archaea-x1e4-k31.sbt.zip
100M genbank-archaea-x1e4-k51.sbt.zip
196M genbank-archaea-x1e5-k21.sbt.zip
196M genbank-archaea-x1e5-k31.sbt.zip
196M genbank-archaea-x1e5-k51.sbt.zip
475M genbank-archaea-x1e6-k21.sbt.zip
476M genbank-archaea-x1e6-k31.sbt.zip
477M genbank-archaea-x1e6-k51.sbt.zip

 25G genbank-bacteria-x1e4-k21.sbt.zip
 24G genbank-bacteria-x1e4-k31.sbt.zip
 24G genbank-bacteria-x1e4-k51.sbt.zip
 40G genbank-bacteria-x1e5-k21.sbt.zip
 40G genbank-bacteria-x1e5-k31.sbt.zip
 41G genbank-bacteria-x1e5-k51.sbt.zip
 83G genbank-bacteria-x1e6-k21.sbt.zip
 89G genbank-bacteria-x1e6-k31.sbt.zip
 91G genbank-bacteria-x1e6-k51.sbt.zip

1.6G genbank-fungi-x1e4-k21.sbt.zip
1.6G genbank-fungi-x1e4-k31.sbt.zip
1.7G genbank-fungi-x1e4-k51.sbt.zip
1.8G genbank-fungi-x1e5-k21.sbt.zip
1.9G genbank-fungi-x1e5-k31.sbt.zip
1.9G genbank-fungi-x1e5-k51.sbt.zip
3.2G genbank-fungi-x1e6-k21.sbt.zip
3.3G genbank-fungi-x1e6-k31.sbt.zip
3.3G genbank-fungi-x1e6-k51.sbt.zip

303M genbank-protozoa-x1e4-k21.sbt.zip
316M genbank-protozoa-x1e4-k31.sbt.zip
327M genbank-protozoa-x1e4-k51.sbt.zip
336M genbank-protozoa-x1e5-k21.sbt.zip
348M genbank-protozoa-x1e5-k31.sbt.zip
359M genbank-protozoa-x1e5-k51.sbt.zip
561M genbank-protozoa-x1e6-k21.sbt.zip
577M genbank-protozoa-x1e6-k31.sbt.zip
589M genbank-protozoa-x1e6-k51.sbt.zip

 77M genbank-viral-x1e4-k21.sbt.zip
 78M genbank-viral-x1e4-k31.sbt.zip
 78M genbank-viral-x1e4-k51.sbt.zip
137M genbank-viral-x1e5-k21.sbt.zip
138M genbank-viral-x1e5-k31.sbt.zip
140M genbank-viral-x1e5-k51.sbt.zip
266M genbank-viral-x1e6-k21.sbt.zip
269M genbank-viral-x1e6-k31.sbt.zip
272M genbank-viral-x1e6-k51.sbt.zip

RefSeq

 31M refseq-archaea-x1e4-k21.sbt.zip
 31M refseq-archaea-x1e4-k31.sbt.zip
 32M refseq-archaea-x1e4-k51.sbt.zip
 56M refseq-archaea-x1e5-k21.sbt.zip
 56M refseq-archaea-x1e5-k31.sbt.zip
 56M refseq-archaea-x1e5-k51.sbt.zip
136M refseq-archaea-x1e6-k21.sbt.zip
136M refseq-archaea-x1e6-k31.sbt.zip
137M refseq-archaea-x1e6-k51.sbt.zip

7.1G refseq-bacteria-x1e4-k21.sbt.zip
7.1G refseq-bacteria-x1e4-k31.sbt.zip
7.1G refseq-bacteria-x1e4-k51.sbt.zip
 12G refseq-bacteria-x1e5-k21.sbt.zip
 12G refseq-bacteria-x1e5-k31.sbt.zip
 12G refseq-bacteria-x1e5-k51.sbt.zip
 27G refseq-bacteria-x1e6-k21.sbt.zip
 27G refseq-bacteria-x1e6-k31.sbt.zip
 27G refseq-bacteria-x1e6-k51.sbt.zip

 72M refseq-fungi-x1e4-k21.sbt.zip
 73M refseq-fungi-x1e4-k31.sbt.zip
 74M refseq-fungi-x1e4-k51.sbt.zip
 85M refseq-fungi-x1e5-k21.sbt.zip
 86M refseq-fungi-x1e5-k31.sbt.zip
 87M refseq-fungi-x1e5-k51.sbt.zip
159M refseq-fungi-x1e6-k21.sbt.zip
160M refseq-fungi-x1e6-k31.sbt.zip
161M refseq-fungi-x1e6-k51.sbt.zip

 22M refseq-protozoa-x1e4-k21.sbt.zip
 22M refseq-protozoa-x1e4-k31.sbt.zip
 23M refseq-protozoa-x1e4-k51.sbt.zip
 25M refseq-protozoa-x1e5-k21.sbt.zip
 26M refseq-protozoa-x1e5-k31.sbt.zip
 27M refseq-protozoa-x1e5-k51.sbt.zip
 46M refseq-protozoa-x1e6-k21.sbt.zip
 47M refseq-protozoa-x1e6-k31.sbt.zip
 48M refseq-protozoa-x1e6-k51.sbt.zip

 21M refseq-viral-x1e4-k21.sbt.zip
 21M refseq-viral-x1e4-k31.sbt.zip
 21M refseq-viral-x1e4-k51.sbt.zip
 38M refseq-viral-x1e5-k21.sbt.zip
 38M refseq-viral-x1e5-k31.sbt.zip
 39M refseq-viral-x1e5-k51.sbt.zip
 74M refseq-viral-x1e6-k21.sbt.zip
 74M refseq-viral-x1e6-k31.sbt.zip
 75M refseq-viral-x1e6-k51.sbt.zip

LCA

total 2.3G

GenBank

5.0M genbank-archaea-k21-scaled10k.lca.json.gz
5.3M genbank-archaea-k31-scaled10k.lca.json.gz
5.5M genbank-archaea-k51-scaled10k.lca.json.gz

447M genbank-bacteria-k21-scaled10k.lca.json.gz
503M genbank-bacteria-k31-scaled10k.lca.json.gz
566M genbank-bacteria-k51-scaled10k.lca.json.gz

 66M genbank-fungi-k21-scaled10k.lca.json.gz
 73M genbank-fungi-k31-scaled10k.lca.json.gz
 81M genbank-fungi-k51-scaled10k.lca.json.gz

 17M genbank-protozoa-k21-scaled10k.lca.json.gz
 19M genbank-protozoa-k31-scaled10k.lca.json.gz
 21M genbank-protozoa-k51-scaled10k.lca.json.gz

1.8M genbank-viral-k21-scaled10k.lca.json.gz
1.8M genbank-viral-k31-scaled10k.lca.json.gz
1.9M genbank-viral-k51-scaled10k.lca.json.gz

RefSeq

1.6M refseq-archaea-k21-scaled10k.lca.json.gz
1.8M refseq-archaea-k31-scaled10k.lca.json.gz
1.9M refseq-archaea-k51-scaled10k.lca.json.gz

147M refseq-bacteria-k21-scaled10k.lca.json.gz
161M refseq-bacteria-k31-scaled10k.lca.json.gz
176M refseq-bacteria-k51-scaled10k.lca.json.gz

6.9M refseq-fungi-k21-scaled10k.lca.json.gz
7.2M refseq-fungi-k31-scaled10k.lca.json.gz
7.5M refseq-fungi-k51-scaled10k.lca.json.gz

2.1M refseq-protozoa-k21-scaled10k.lca.json.gz
2.2M refseq-protozoa-k31-scaled10k.lca.json.gz
2.4M refseq-protozoa-k51-scaled10k.lca.json.gz

745K refseq-viral-k21-scaled10k.lca.json.gz
756K refseq-viral-k31-scaled10k.lca.json.gz
768K refseq-viral-k51-scaled10k.lca.json.gz

@luizirber luizirber changed the title from "[WIP] Building databases from assembly_summary.txt" to "[MRG] Building databases from assembly_summary.txt" Jul 27, 2020
@luizirber
Member Author

Some comments:

  • The archaea, fungi, protozoa and viral LCA databases are so small that we could actually use scaled=1000 (or maybe even scaled=100 for viral?)
  • The bacteria x1e6 SBTs are... quite large. We can put them in S3, but it will be expensive...
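
A rough way to gauge the first point: with scaled=S a signature keeps the hashes below 2^64/S, so the number of retained hashes grows roughly in proportion to 1/S. The sketch below assumes that the compressed LCA size scales linearly with the number of hashes, which is only an approximation (lineage metadata and compression do not scale that way). The inputs are the GenBank k=31 scaled10k sizes from the listing above, and `estimated_size` is a hypothetical helper.

```python
def estimated_size(size_at_old_scaled, new_scaled, old_scaled=10_000):
    """Estimate database size after changing scaled, assuming size ~ 1/scaled."""
    return size_at_old_scaled * old_scaled / new_scaled


# GenBank LCA sizes at k=31, scaled=10k, in MB (from the listing above).
lca_k31_mb = {
    "archaea": 5.3,
    "fungi": 73,
    "protozoa": 19,
    "viral": 1.8,
}

for domain, size in lca_k31_mb.items():
    print(f"{domain}: ~{estimated_size(size, 1000):.0f} MB at scaled=1000")

print(f"viral: ~{estimated_size(lca_k31_mb['viral'], 100):.0f} MB at scaled=100")
```

Under this linear assumption each of these databases would remain smaller than the current GenBank bacteria LCA database at scaled=1000, though fungi would come out at a similar order of magnitude.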

@bluegenes
Contributor

hadn't looked through this PR carefully @luizirber! The PR I put in (#15) has some bits for enabling multiple alphabets that could just be integrated here.

@ctb
Contributor

ctb commented Apr 2, 2022

closing b/c we've completely changed up database building (for the better!) with sourmash-bio/sourmash#1885 et al.

@ctb ctb closed this Apr 2, 2022
Development

Successfully merging this pull request may close these issues.

Building genbank/refseq databases from assembly_summary.txt
3 participants