Here's an idea for how to parallelize the building of crux dbs. All worker types, plus the scheduler and combiner, are separate Docker images deployed with Kubernetes.
**scheduler**: has a Redis database, issues jobs to workers, and keeps track of job state.
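A minimal sketch of what the scheduler's queue bookkeeping could look like, assuming the redis-py client; the key layout (`queue:<worker-type>`, `job:<id>`) and the `shard` field in the payload are illustrative, not a fixed design.

```python
# Minimal sketch of the scheduler's Redis job queue, using redis-py.
# Key names and the job payload shape are illustrative assumptions.
import json
import redis

r = redis.Redis(host="scheduler", port=6379)

def enqueue(worker_type: str, payload: dict) -> str:
    """Record a job as pending and push it onto that worker type's queue."""
    job_id = f"{worker_type}:{payload['shard']}"
    r.hset(f"job:{job_id}", mapping={"state": "pending",
                                     "payload": json.dumps(payload)})
    r.lpush(f"queue:{worker_type}", job_id)
    return job_id

def claim(worker_type: str) -> tuple[str, dict]:
    """Worker side: block until a job arrives, then mark it running."""
    _, raw = r.brpop(f"queue:{worker_type}")
    job_id = raw.decode()
    r.hset(f"job:{job_id}", "state", "running")
    return job_id, json.loads(r.hget(f"job:{job_id}", "payload"))
```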
**obi-downloader**: workers are assigned a list of SRA accessions, download them in parallel, convert each to fasta with fasterq-dump, build an OBITools database from the fastas, and store both the fastas and the OBITools database in a Ceph folder.
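Roughly what one obi-downloader job could look like, assuming sra-tools (`prefetch`, and `fasterq-dump` with `--fasta`, which needs a recent release) and OBITools' `obiconvert` on PATH; the Ceph mount point and taxdump path are hypothetical.

```python
# Sketch of one obi-downloader job. Paths are hypothetical; fasterq-dump's
# --fasta flag assumes a recent sra-tools release.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CEPH = Path("/mnt/ceph/crux")          # hypothetical Ceph mount
TAXDUMP = CEPH / "taxdump"             # obiconvert needs an NCBI taxonomy

def fetch_one(acc: str) -> Path:
    """Download one accession and convert it to fasta."""
    outdir = CEPH / "fasta"
    subprocess.run(["prefetch", acc], check=True)
    # Assumes one output file per accession; paired-end runs would
    # produce <acc>_1.fasta / <acc>_2.fasta instead.
    subprocess.run(["fasterq-dump", acc, "--fasta",
                    "--outdir", str(outdir)], check=True)
    return outdir / f"{acc}.fasta"

def run_job(accessions: list[str]) -> None:
    with ThreadPoolExecutor(max_workers=8) as pool:
        fastas = list(pool.map(fetch_one, accessions))
    # Build one ecoPCR-compatible OBITools database for the whole batch.
    db_prefix = CEPH / "obidb" / f"batch_{accessions[0]}"
    subprocess.run(["obiconvert", f"--ecopcrdb-output={db_prefix}",
                    "-t", str(TAXDUMP), *map(str, fastas)], check=True)
```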
**ecopcr**: workers are assigned a set of OBITools databases and a set of primers, run ecoPCR against each database with those primers, and store the results in a Ceph folder.
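The ecopcr worker is mostly a loop over database × primer pair, something like the following; the error tolerance (`-e`) and amplicon length bounds (`-l`/`-L`) are placeholder values, not recommendations.

```python
# Sketch of an ecopcr job; -e/-l/-L values are placeholders, and CEPH
# is the same hypothetical mount as above.
import subprocess
from pathlib import Path

CEPH = Path("/mnt/ceph/crux")

def run_job(db_prefixes: list[str], primers: list[tuple[str, str]]) -> None:
    outdir = CEPH / "ecopcr"
    for db in db_prefixes:
        for fwd, rev in primers:
            out = outdir / f"{Path(db).name}__{fwd}_{rev}.ecopcr"
            with open(out, "w") as fh:
                # ecoPCR writes its hits to stdout; capture one file
                # per database/primer-pair combination.
                subprocess.run(["ecoPCR", "-d", db, "-e", "3",
                                "-l", "50", "-L", "500", fwd, rev],
                               stdout=fh, check=True)
```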
**blastn**: workers are given a set of ecoPCR output queries and a set of BLAST databases to query against, run blastn in parallel, and store the results in a Ceph folder.
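One blastn job might fan query files out across processes against a single NT chunk, along these lines. It assumes the chunk has already been unpacked from its tarball, and the custom `-outfmt` (subject accession plus aligned subject sequence) is a choice made here so the combiner sketch below has sequences to work with.

```python
# Sketch of a blastn job against one extracted NT chunk. The outfmt
# columns (saccver, sseq) are an assumption to feed the combiner step.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

CEPH = Path("/mnt/ceph/crux")

def blast_one(query: str, nt_chunk: str) -> Path:
    out = CEPH / "blast" / f"{Path(query).stem}__{nt_chunk}.tsv"
    subprocess.run(["blastn", "-query", query,
                    "-db", str(CEPH / "nt" / nt_chunk),
                    "-outfmt", "6 saccver sseq",
                    "-out", str(out), "-num_threads", "4"], check=True)
    return out

def run_job(queries: list[str], nt_chunk: str) -> None:
    # Each blastn call is multithreaded via -num_threads; the pool
    # just keeps several query files in flight at once.
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(blast_one, queries, [nt_chunk] * len(queries)))
```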
**combiner**: takes all of the BLAST results from Ceph, combines them (including deduplication), and builds a bowtie2 database, which is the final output stored in Ceph.
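A sketch of the combiner, under the assumption that the blastn step wrote two-column TSVs (subject accession, aligned subject sequence) as above; deduplicating on accession alone is a simplification.

```python
# Sketch of the combiner: merge + dedupe blast hits, then build the
# final bowtie2 index. Deduping on accession alone is a simplification.
import subprocess
from pathlib import Path

CEPH = Path("/mnt/ceph/crux")

def run() -> None:
    seen: set[str] = set()
    combined = CEPH / "final" / "combined.fasta"
    with open(combined, "w") as out:
        for tsv in sorted((CEPH / "blast").glob("*.tsv")):
            with open(tsv) as fh:
                for line in fh:
                    acc, seq = line.rstrip("\n").split("\t")[:2]
                    if acc in seen:        # skip duplicates across shards
                        continue
                    seen.add(acc)
                    out.write(f">{acc}\n{seq}\n")
    subprocess.run(["bowtie2-build", str(combined),
                    str(CEPH / "final" / "crux_db")], check=True)
```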
There are approximately 1.2 million SRA accessions for WGS projects, and ~64 NT chunks (nt.00.tar.gz, etc.). So a blastn worker, for example, will receive some subset of the 1.2 million accessions plus an assignment to BLAST against one of the 64 NT chunks.
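For a feel of the job counts, a sharding scheme like the one below, with an illustrative batch size of 500 accessions, would produce 2,400 batches × 64 chunks = 153,600 blastn jobs.

```python
# Illustrative sharding: cross accession batches with NT chunks.
# batch_size=500 is arbitrary; 1.2M / 500 = 2,400 batches, so
# 2,400 * 64 = 153,600 blastn jobs at this granularity.
def shard(accessions: list[str], batch_size: int = 500):
    for i in range(0, len(accessions), batch_size):
        batch = accessions[i:i + batch_size]
        for chunk in range(64):
            yield {"accessions": batch, "nt_chunk": f"nt.{chunk:02d}"}
```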