[MRG] Building databases from assembly_summary.txt #11
Decision time! There are accessions listed in the What to do in these cases?
|
is there a "quick test" target that I can run? |
A minimal config:
domains:
  - archaea
sig_store: /group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs
db_ksizes:
  - 21
db:
  - refseq
bfsize:
  - 1e6
|
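As a rough illustration of what a config like this asks for, the sketch below enumerates one build target per combination of `db`, `domains`, `db_ksizes` and `bfsize`. The keys and values are copied from the config above; the `expand_targets` helper and the idea that the workflow takes a plain cross product are assumptions, not something this PR confirms.

```python
from itertools import product

# Values copied from the minimal config above, written as a plain dict.
config = {
    "domains": ["archaea"],
    "sig_store": "/group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs",
    "db_ksizes": [21],
    "db": ["refseq"],
    "bfsize": [1e6],
}


def expand_targets(cfg):
    """Return one (db, domain, ksize, bfsize) tuple per requested combination.

    Hypothetical helper: assumes the build is a cross product of these lists.
    """
    return list(
        product(cfg["db"], cfg["domains"], cfg["db_ksizes"], cfg["bfsize"])
    )


targets = expand_targets(config)
for db, domain, ksize, bfsize in targets:
    print(db, domain, ksize, bfsize)
```

With a single value per key, as in the quick-test config, this yields exactly one target, which is what keeps the test run small.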
Starting |
Databases calculated. They are at
SBT (total 618G)
- GenBank
- RefSeq
LCA (total 2.3G)
- GenBank
- RefSeq
|
Some comments:
|
hadn't looked through this PR carefully @luizirber! The PR I put in (#15) has some bits for enabling multiple alphabets that could just be integrated here. |
closing b/c we've completely changed up database building (for the better!) with sourmash-bio/sourmash#1885 et al. |
Fixes #7
- wort. If the signature is not available, fall back to calculate it by figuring out the download URL and streaming the data.
- assembly_summary.txt is downloaded for each GenBank/RefSeq domain, and used to create a catalog, a file with paths to where signatures for each accession are located.
- --from-file, currently using a weird syntax to extract the first signature and pass to index (and pass the others with --from-file) because index requires one signature...
TODO
- taxid4index file
- ....sbt.zip (so I can use it with gather-to-opal)
- compute rule
- checkpoint to calculate missing signatures from the sig_collection dir?
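The catalog step described in this PR can be sketched roughly as follows: read `assembly_summary.txt`, look for an existing signature for each accession, and otherwise record the download URL to stream from. This is a minimal sketch, not the code in the PR. The `<accession>.sig` file naming, the function names and the `("sig" | "download", location)` catalog layout are all assumptions; the `assembly_accession` and `ftp_path` columns and the `_genomic.fna.gz` suffix follow the NCBI file layout.

```python
from pathlib import Path


def parse_assembly_summary(text):
    """Yield (accession, ftp_path) pairs from assembly_summary.txt content."""
    header = None
    for line in text.splitlines():
        if line.startswith("#"):
            # The column names live on a comment line; find it by content.
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        yield row["assembly_accession"], row.get("ftp_path", "na")


def genome_url(ftp_path):
    """Derive the genomic FASTA URL from ftp_path, or None if it is missing."""
    if not ftp_path or ftp_path == "na":
        return None
    root = ftp_path.rstrip("/")
    basename = root.rsplit("/", 1)[-1]
    return f"{root}/{basename}_genomic.fna.gz"


def build_catalog(summary_text, sig_store):
    """Map each accession to an existing signature or a download URL.

    Assumes signatures are stored as <sig_store>/<accession>.sig
    (hypothetical naming).
    """
    catalog = {}
    for accession, ftp_path in parse_assembly_summary(summary_text):
        sig_path = Path(sig_store) / f"{accession}.sig"
        if sig_path.exists():
            catalog[accession] = ("sig", str(sig_path))
        else:
            # Fallback: the signature has to be computed from streamed data.
            catalog[accession] = ("download", genome_url(ftp_path))
    return catalog
```

Accessions whose `ftp_path` is `na` end up as `("download", None)`, which makes the unresolved cases easy to list before deciding how to handle them.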